Pandas provides several ways to make joins (merges) between DataFrames. Below are examples of common types of joins: inner, outer, left, and right joins.

Pandas join types
Photo by NaturEye Conservation on Pexels.com

Pandas: Various Join Types and Examples

1. Inner Join (Default Join)

This join returns only the rows that have matching values in both DataFrames.

import pandas as pd

# Example DataFrames
df1 = pd.DataFrame({
'id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charlie', 'David']
})

df2 = pd.DataFrame({
'id': [3, 4, 5, 6],
'age': [23, 34, 45, 56]
})

# Inner Join (only rows with matching 'id' in both DataFrames)


result = pd.merge(df1, df2, on='id', how='inner')
print(result)

Output:

   id     name  age
0 3 Charlie 23
1 4 David 34

2. Left Join

A left join returns all rows from the left DataFrame and only the matching rows from the right DataFrame. Unmatched rows in the right DataFrame will have NaN.

# Left Join (all rows from df1, matched rows from df2)
result = pd.merge(df1, df2, on='id', how='left')
print(result)

Output:

   id     name   age
0 1 Alice NaN
1 2 Bob NaN
2 3 Charlie 23.0
3 4 David 34.0

3. Right Join

A right join returns all rows from the right DataFrame and only the matching rows from the left DataFrame. Unmatched rows in the left DataFrame will have NaN.

# Right Join (all rows from df2, matched rows from df1)
result = pd.merge(df1, df2, on='id', how='right')
print(result)

Output:

    id     name   age
0 3 Charlie 23.0
1 4 David 34.0
2 5 NaN 45.0
3 6 NaN 56.0

4. Outer Join

An outer join returns all rows when there is a match in either left or right DataFrame. Missing values are filled with NaN.

# Outer Join (all rows from both DataFrames)
result = pd.merge(df1, df2, on='id', how='outer')
print(result)

Output:

    id     name   age
0 1 Alice NaN
1 2 Bob NaN
2 3 Charlie 23.0
3 4 David 34.0
4 5 NaN 45.0
5 6 NaN 56.0

5. Joining on Different Column Names

If the column names are different in both DataFrames, you can specify left_on and right_on.

df1 = pd.DataFrame({
'emp_id': [1, 2, 3],
'name': ['Alice', 'Bob', 'Charlie']
})

df2 = pd.DataFrame({
'id': [1, 2, 4],
'salary': [5000, 6000, 7000]
})

# Merge on different column names
result = pd.merge(df1, df2, left_on='emp_id', right_on='id', how='inner')
print(result)

Output:

     emp_id   name  id  salary
0 1 Alice 1 5000
1 2 Bob 2 6000

6. Merging on Multiple Columns

You can merge on multiple columns by passing a list to on, left_on, and right_on.

df1 = pd.DataFrame({
'first_name': ['John', 'Jane', 'Jim'],
'last_name': ['Doe', 'Doe', 'Beam'],
'age': [28, 32, 45]
})

df2 = pd.DataFrame({
'first_name': ['John', 'Jane', 'Joe'],
'last_name': ['Doe', 'Doe', 'Black'],
'salary': [10000, 12000, 15000]
})

# Merge on multiple columns
result = pd.merge(df1, df2, on=['first_name', 'last_name'], how='inner')
print(result)

Output:

     first_name last_name  age  salary
0 John Doe 28 10000
1 Jane Doe 32 12000

7. Suffixes in Case of Overlapping Column Names

When merging DataFrames with overlapping column names, you can add suffixes to distinguish them.

df1 = pd.DataFrame({
'id': [1, 2, 3],
'name': ['Alice', 'Bob', 'Charlie'],
'age': [25, 30, 35]
})

df2 = pd.DataFrame({
'id': [1, 2, 3],
'name': ['Alice', 'Bob', 'Charlie'],
'salary': [5000, 6000, 7000]
})

# Merge with overlapping column names
result = pd.merge(df1, df2, on='id', how='inner', suffixes=('_left', '_right'))
print(result)

Output:

    id     name_left   age     name_right  salary
0 1 Alice 25 Alice 5000
1 2 Bob 30 Bob 6000
2 3 Charlie 35 Charlie 7000

Summary of Join Types:

  • Inner Join: Returns rows with matching values in both DataFrames.
  • Left Join: Returns all rows from the left DataFrame, with matching rows from the right.
  • Right Join: Returns all rows from the right DataFrame, with matching rows from the left.
  • Outer Join: Returns all rows when there is a match in either DataFrame.

Pandas does not have built-in functions specifically named for semi and anti-joins. These types of joins can still be performed using a combination of merge(), isin(), and filtering. Here’s how you can implement both:

1. Semi Join

A semi join returns the rows from the left DataFrame where there is a match in the right DataFrame. Hence, it only includes columns from the left DataFrame.

Example:

import pandas as pd

# Example DataFrames
df1 = pd.DataFrame({
'id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charlie', 'David']
})

df2 = pd.DataFrame({
'id': [3, 4, 5, 6],
'age': [23, 34, 45, 56]
})

# Semi Join (rows from df1 where 'id' is in df2)
semi_join_result = df1[df1['id'].isin(df2['id'])]
print(semi_join_result)

Output:

   id     name
2 3 Charlie
3 4 David

2. Anti Join

An anti join returns the rows from the left DataFrame where there is no match in the right DataFrame.

Example:

# Anti Join (rows from df1 where 'id' is not in df2)
anti_join_result = df1[~df1['id'].isin(df2['id'])]
print(anti_join_result)

Output:

   id   name
0 1 Alice
1 2 Bob

Explanation

  • Semi Join: Uses isin() to filter rows in df1 where id exists in df2. This method does not include columns from df2.
  • Anti Join: Uses ~isin() to filter rows in df1 where id does not exist in df2.

Summary of Semi and Anti Joins

  • Semi Join: Returns rows from the left DataFrame that have a match in the right DataFrame. It only includes columns from the left DataFrame.
  • Anti Join: Returns rows from the left DataFrame that do not have a match in the right DataFrame.

These operations are useful for filtering data based on the presence or absence of matches in another DataFrame. They can be easily performed using isin() in Pandas.

Beyond these, there are a few other join concepts and variations that might be useful:

1. Cross Join (Cartesian Product)

A cross join returns the Cartesian product of the two DataFrames. This means every row of the first DataFrame is paired with every row of the second DataFrame. In Pandas, this can be achieved using merge() with no common columns or using pd.merge() with the how='cross' choice (available in Pandas 1.2.0+).

Example:

import pandas as pd

# Example DataFrames
df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'B': ['X', 'Y']})

# Cross Join using 'how=cross' (Pandas 1.2.0+)
cross_join_result = pd.merge(df1, df2, how='cross')
print(cross_join_result)

Output:

   A  B
0 1 X
1 1 Y
2 2 X
3 2 Y

If you’re using an older version of Pandas, you can achieve the same result with another method:

# Cross Join using alternative method (if Pandas < 1.2.0)
df1['key'] = 1
df2['key'] = 1
cross_join_result = pd.merge(df1, df2, on='key').drop('key', axis=1)
print(cross_join_result)

2. Self Join

A self join is simply a join of a DataFrame with itself. This is useful for hierarchical data or when comparing rows within the same table.

Example:

# Example DataFrame
df = pd.DataFrame({
'id': [1, 2, 3],
'manager_id': [None, 1, 2],
'name': ['Alice', 'Bob', 'Charlie']
})

# Self Join to find employee and their manager's name
self_join_result = pd.merge(df, df, left_on='manager_id', right_on='id', suffixes=('_emp', '_mgr'))
print(self_join_result)

Output:

     id_emp  manager_id name_emp  id_mgr  name_mgr
1 2 1.0 Bob 1 Alice
2 3 2.0 Charlie 2 Bob

3. Equi Join

An equi join is a specific case of inner join where the join condition is based on equality (e.g., matching rows where column_a == column_b). In practice, this is the same as an inner join using merge() on a key.

Example:

This is akin to an inner join, so you use it like this:

# Using merge() which is essentially an equi join
equi_join_result = pd.merge(df1, df2, on='id', how='inner')

4. Natural Join

A natural join automatically joins two DataFrames based on all columns with the same names and values. Pandas doesn’t have a built-in method for a natural join. You can simulate it by merging the intersection of column names.

Example:

# Example DataFrames
df1 = pd.DataFrame({
'id': [1, 2, 3],
'name': ['Alice', 'Bob', 'Charlie']
})

df2 = pd.DataFrame({
'id': [1, 2, 3],
'age': [25, 30, 35],
'name': ['Alice', 'Bob', 'Charlie'] # Common column
})

# Natural Join (merging on intersection of column names)
common_columns = list(set(df1.columns) & set(df2.columns))
natural_join_result = pd.merge(df1, df2, on=common_columns)
print(natural_join_result)

Output:

    id     name  age
0 1 Alice 25
1 2 Bob 30
2 3 Charlie 35

Summary of Extra Joins:

  • Cross Join: Returns the Cartesian product of the two DataFrames.
  • Self Join: Joins a DataFrame with itself.
  • Equi Join: A type of inner join that uses equality to match columns.
  • Natural Join: Joins on all columns with the same name and value.

These extra join types and variations offer more flexibility in how you combine and analyze data in Pandas.