Merge, Join, and Concatenate DataFrames using Pandas

When working with data in Python, the Pandas library provides powerful tools for manipulating and analyzing data. One common task is combining multiple DataFrames. Pandas offers several functions to accomplish this: merge, join, and concatenate. In this blog, we’ll explore these functions, their differences, and when to use each.

1. Merge

The merge function is used to combine two DataFrames based on one or more common columns. It’s similar to SQL joins. You can specify different types of joins: inner, outer, left, and right.

Syntax:

Python
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False)
  • left: DataFrame.
  • right: DataFrame.
  • how: Type of merge to be performed (‘left’, ‘right’, ‘outer’, ‘inner’).
  • on: Column(s) to join on. Must be found in both DataFrames.
  • left_on: Column(s) from the left DataFrame to use as keys.
  • right_on: Column(s) from the right DataFrame to use as keys.
  • left_index: Use the index from the left DataFrame as the join key.
  • right_index: Use the index from the right DataFrame as the join key.

Example:

Python
import pandas as pd

df1 = pd.DataFrame({
    'key': ['A', 'B', 'C', 'D'],
    'value1': [1, 2, 3, 4]
})

df2 = pd.DataFrame({
    'key': ['B', 'D', 'E', 'F'],
    'value2': [5, 6, 7, 8]
})

merged_df = pd.merge(df1, df2, on='key', how='inner')
print(merged_df)

Output:

Markdown
  key  value1  value2
0   B       2       5
1   D       4       6

In this example, we have two DataFrames, df1 and df2, each with a column named key. DataFrame df1 has additional columns key and value1, while df2 has columns key and value2. We use the pd.merge() function to merge these DataFrames on the key column using an inner join. This means that only the rows with matching values in the key column from both DataFrames will be included in the result. The resulting DataFrame, merged_df, contains the key column, along with value1 from df1 and value2 from df2 for the rows where key is ‘B’ and ‘D’.


2. Join

The join function is used to combine two DataFrames based on their indexes. It’s a convenient way of combining DataFrames with a database-style join operation.

Syntax:

Python
left.join(right, on=None, how='left', lsuffix='', rsuffix='', sort=False)
  • on: Column or index level names to join on.
  • how: Type of join (‘left’, ‘right’, ‘outer’, ‘inner’).
  • lsuffix: Suffix to use from left DataFrame in case of overlapping column names.
  • rsuffix: Suffix to use from right DataFrame in case of overlapping column names.
  • sort: Order result DataFrame lexicographically by the join key.

Example:

Python
df3 = pd.DataFrame({
    'value1': [1, 2, 3, 4]
}, index=['A', 'B', 'C', 'D'])

df4 = pd.DataFrame({
    'value2': [5, 6, 7, 8]
}, index=['B', 'D', 'E', 'F'])

joined_df = df3.join(df4, how='inner')
print(joined_df)

Output:

Markdown
   value1  value2
B       2       5
D       4       6

Here, we have two DataFrames, df3 and df4, with indexes ‘A’, ‘B’, ‘C’, ‘D’ for df3 and ‘B’, ‘D’, ‘E’, ‘F’ for df4. df3 contains a column value1 and df4 contains a column value2. We use the join() method to combine these DataFrames based on their indexes with an inner join. The resulting DataFrame, joined_df, includes only the rows where the indexes match in both DataFrames, which are ‘B’ and ‘D’. The columns value1 from df3 and value2 from df4 are preserved.


3. Concatenate

The concat function is used to concatenate DataFrames along a particular axis (rows or columns). It can also handle hierarchical indexing (MultiIndex) on the concatenation axis.

Syntax:

Python
pd.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False)
  • objs: A sequence or mapping of Series or DataFrame objects.
  • axis: The axis to concatenate along (0 for rows, 1 for columns).
  • join: How to handle indexes on other axes (‘outer’, ‘inner’).
  • ignore_index: If True, do not use the index values along the concatenation axis.
  • keys: Sequence to use to create hierarchical index.
  • levels: Specific levels (unique values) to use for constructing a MultiIndex.
  • names: Names for the levels in the resulting hierarchical index.
  • verify_integrity: Check for duplicate indexes.
  • sort: Sort non-concatenation axis if it is not already aligned.

Example:

Python
df5 = pd.DataFrame({
    'A': ['A0', 'A1', 'A2'],
    'B': ['B0', 'B1', 'B2']
})

df6 = pd.DataFrame({
    'A': ['A3', 'A4', 'A5'],
    'B': ['B3', 'B4', 'B5']
})

concatenated_df = pd.concat([df5, df6], axis=0)
print(concatenated_df)

Output:

Markdown
    A   B
0  A0  B0
1  A1  B1
2  A2  B2
0  A3  B3
1  A4  B4
2  A5  B5

In this example, we have two DataFrames, df5 and df6, each with columns A and B. df5 has values A0, A1, A2 in column A and B0, B1, B2 in column B. Similarly, df6 has values A3, A4, A5 in column A and B3, B4, B5 in column B. We use the pd.concat() function to concatenate these DataFrames along the rows (axis=0). This means the rows of df6 are added below the rows of df5. The resulting DataFrame, concatenated_df, contains all rows from both DataFrames, preserving the order and including the columns A and B.


Conclusion
  • Use merge when you need to combine DataFrames based on one or more keys.
  • Use join when you need to combine DataFrames based on their indexes.
  • Use concat when you need to concatenate DataFrames along a particular axis.

Understanding these methods will help you efficiently manage and manipulate your data, making your data analysis tasks easier and more effective. Happy coding!

Also Explore:

Leave a Comment