When working with data in Python, the Pandas library provides powerful tools for manipulating and analyzing data. One common task is combining multiple DataFrames. Pandas offers several functions to accomplish this: merge
, join
, and concatenate
. In this blog, we’ll explore these functions, their differences, and when to use each.
1. Merge
The merge
function is used to combine two DataFrames based on one or more common columns. It’s similar to SQL joins. You can specify different types of joins: inner, outer, left, and right.
Syntax:
pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False)
left
: DataFrame.right
: DataFrame.how
: Type of merge to be performed (‘left’, ‘right’, ‘outer’, ‘inner’).on
: Column(s) to join on. Must be found in both DataFrames.left_on
: Column(s) from the left DataFrame to use as keys.right_on
: Column(s) from the right DataFrame to use as keys.left_index
: Use the index from the left DataFrame as the join key.right_index
: Use the index from the right DataFrame as the join key.
Example:
import pandas as pd
df1 = pd.DataFrame({
'key': ['A', 'B', 'C', 'D'],
'value1': [1, 2, 3, 4]
})
df2 = pd.DataFrame({
'key': ['B', 'D', 'E', 'F'],
'value2': [5, 6, 7, 8]
})
merged_df = pd.merge(df1, df2, on='key', how='inner')
print(merged_df)
Output:
key value1 value2
0 B 2 5
1 D 4 6
In this example, we have two DataFrames, df1
and df2
, each with a column named key
. DataFrame df1
has additional columns key
and value1
, while df2
has columns key
and value2
. We use the pd.merge()
function to merge these DataFrames on the key
column using an inner join. This means that only the rows with matching values in the key
column from both DataFrames will be included in the result. The resulting DataFrame, merged_df
, contains the key
column, along with value1
from df1
and value2
from df2
for the rows where key
is ‘B’ and ‘D’.
2. Join
The join
function is used to combine two DataFrames based on their indexes. It’s a convenient way of combining DataFrames with a database-style join operation.
Syntax:
left.join(right, on=None, how='left', lsuffix='', rsuffix='', sort=False)
on
: Column or index level names to join on.how
: Type of join (‘left’, ‘right’, ‘outer’, ‘inner’).lsuffix
: Suffix to use from left DataFrame in case of overlapping column names.rsuffix
: Suffix to use from right DataFrame in case of overlapping column names.sort
: Order result DataFrame lexicographically by the join key.
Example:
df3 = pd.DataFrame({
'value1': [1, 2, 3, 4]
}, index=['A', 'B', 'C', 'D'])
df4 = pd.DataFrame({
'value2': [5, 6, 7, 8]
}, index=['B', 'D', 'E', 'F'])
joined_df = df3.join(df4, how='inner')
print(joined_df)
Output:
value1 value2
B 2 5
D 4 6
Here, we have two DataFrames, df3
and df4
, with indexes ‘A’, ‘B’, ‘C’, ‘D’ for df3
and ‘B’, ‘D’, ‘E’, ‘F’ for df4
. df3
contains a column value1
and df4
contains a column value2
. We use the join()
method to combine these DataFrames based on their indexes with an inner join. The resulting DataFrame, joined_df
, includes only the rows where the indexes match in both DataFrames, which are ‘B’ and ‘D’. The columns value1
from df3
and value2
from df4
are preserved.
3. Concatenate
The concat
function is used to concatenate DataFrames along a particular axis (rows or columns). It can also handle hierarchical indexing (MultiIndex) on the concatenation axis.
Syntax:
pd.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False)
objs
: A sequence or mapping of Series or DataFrame objects.axis
: The axis to concatenate along (0 for rows, 1 for columns).join
: How to handle indexes on other axes (‘outer’, ‘inner’).ignore_index
: If True, do not use the index values along the concatenation axis.keys
: Sequence to use to create hierarchical index.levels
: Specific levels (unique values) to use for constructing a MultiIndex.names
: Names for the levels in the resulting hierarchical index.verify_integrity
: Check for duplicate indexes.sort
: Sort non-concatenation axis if it is not already aligned.
Example:
df5 = pd.DataFrame({
'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']
})
df6 = pd.DataFrame({
'A': ['A3', 'A4', 'A5'],
'B': ['B3', 'B4', 'B5']
})
concatenated_df = pd.concat([df5, df6], axis=0)
print(concatenated_df)
Output:
A B
0 A0 B0
1 A1 B1
2 A2 B2
0 A3 B3
1 A4 B4
2 A5 B5
In this example, we have two DataFrames, df5
and df6
, each with columns A
and B
. df5
has values A0
, A1
, A2
in column A
and B0
, B1
, B2
in column B
. Similarly, df6
has values A3
, A4
, A5
in column A
and B3
, B4
, B5
in column B
. We use the pd.concat()
function to concatenate these DataFrames along the rows (axis=0). This means the rows of df6
are added below the rows of df5
. The resulting DataFrame, concatenated_df
, contains all rows from both DataFrames, preserving the order and including the columns A
and B
.
Conclusion
- Use
merge
when you need to combine DataFrames based on one or more keys. - Use
join
when you need to combine DataFrames based on their indexes. - Use
concat
when you need to concatenate DataFrames along a particular axis.
Understanding these methods will help you efficiently manage and manipulate your data, making your data analysis tasks easier and more effective. Happy coding!
Also Explore: