Python provides us with the pandas dataframes to handle tabular data. In this article, we will discuss how we can merge two dataframes in python.
How to Merge Two DataFrames in Python?
Suppose that we have a dataframe that contains the names of some students, their roll number, and the class they have opted to study as shown below.
Class Roll Name
0 1 11 Aditya
1 1 12 Chris
2 1 13 Sam
3 1 14 Joel
4 1 15 Tom
5 1 16 Samantha
6 1 17 Pablo
7 1 20 Tina
8 1 24 Amy
9 1 30 Justin
10 1 31 Karl
Also, we have a dataframe that contains the roll number of the students and the marks obtained by them as shown below.
Roll Marks Grade
0 11 85 A
1 12 95 A
2 13 75 B
3 14 75 B
4 16 78 B
5 15 55 C
6 19 75 B
7 20 72 B
8 24 92 A
9 25 95 A
Now, we have to merge the given dataframes such that the resultant dataframe contains the name, roll number, grade, class, and the corresponding marks for each student as shown below.
Roll Marks Grade Class Name
0 11 85 A 1 Aditya
1 12 95 A 1 Chris
2 13 75 B 1 Sam
3 14 75 B 1 Joel
4 16 78 B 1 Samantha
5 15 55 C 1 Tom
6 20 72 B 1 Tina
7 24 92 A 1 Amy
To obtain the output, we will use the merge()
method defined in the pandas module. The syntax of the merge()
method is as follows.
pd.merge(df1,df2, on)
Here,
df1
denotes the first dataframe.- The parameter
df2
denotes the second dataframe that is to be merged. - The parameter
‘on’
is the column name of the dataframes that is used to compare the columns of the given dataframes. Rows from two dataframes are merged together if they have the same value in the column corresponding to the ‘on’ parameter.
The merge()
method also takes several other parameters. You can have a look at them in this documentation.
To merge the given dataframes, we will invoke the merge()
method first dataframe as the first input argument. Subsequently, we will pass the dataframe containing the marks as the second input argument to the merge()
method. To the ‘on’
parameter, we will pass the ‘Roll’
column name. In this way, the rows corresponding to the same roll numbers from the given dataframes will be merged and the resultant dataframe will be produced as shown below.
import numpy as np
import pandas as pd
df1=pd.read_csv("grade_with_roll1.csv")
print("First dataframe is:")
print(df1)
df2=pd.read_csv("name_data1.csv")
print("second dataframe is:")
print(df2)
df3=pd.merge(df1,df2,left_on="Roll", right_on="Roll")
print("Merged dataframe is:")
print(df3)
Output:
First dataframe is:
Roll Marks Grade
0 11 85 A
1 12 95 A
2 13 75 B
3 14 75 B
4 16 78 B
5 15 55 C
6 19 75 B
7 20 72 B
8 24 92 A
9 25 95 A
second dataframe is:
Class Roll Name
0 1 11 Aditya
1 1 12 Chris
2 1 13 Sam
3 1 14 Joel
4 1 15 Tom
5 1 16 Samantha
6 1 17 Pablo
7 1 20 Tina
8 1 24 Amy
9 1 30 Justin
10 1 31 Karl
Merged dataframe is:
Roll Marks Grade Class Name
0 11 85 A 1 Aditya
1 12 95 A 1 Chris
2 13 75 B 1 Sam
3 14 75 B 1 Joel
4 16 78 B 1 Samantha
5 15 55 C 1 Tom
6 20 72 B 1 Tina
7 24 92 A 1 Amy
If the second dataframe has rows that do not correspond to any row in the first dataframe, those rows are omitted from the result. Similarly, if the first dataframe has rows that don’t correspond to the second dataframe, those rows will be omitted from the result. You can observe this in the above example.
Merge DataFrames Using Outer Join in Python
To include the omitted rows, we can use the parameter ‘how’
in the merge()
method. The ‘how’
parameter has the default value ‘inner’
. Due to this, only those rows are included in the resultant dataframe that are present in both the input dataframes. To included the omitted dataframes, you can pass the value ‘outer’
to the ‘how’
parameter as shown below.
import numpy as np
import pandas as pd
df1=pd.read_csv("grade_with_roll1.csv")
print("First dataframe is:")
print(df1)
df2=pd.read_csv("name_data1.csv")
print("second dataframe is:")
print(df2)
df3=pd.merge(df1,df2,how="outer",left_on="Roll", right_on="Roll",suffixes=("_left","_right"))
print("Merged dataframe is:")
print(df3)
Output:
First dataframe is:
Roll Marks Grade
0 11 85 A
1 12 95 A
2 13 75 B
3 14 75 B
4 16 78 B
5 15 55 C
6 19 75 B
7 20 72 B
8 24 92 A
9 25 95 A
second dataframe is:
Class Roll Name
0 1 11 Aditya
1 1 12 Chris
2 1 13 Sam
3 1 14 Joel
4 1 15 Tom
5 1 16 Samantha
6 1 17 Pablo
7 1 20 Tina
8 1 24 Amy
9 1 30 Justin
10 1 31 Karl
Merged dataframe is:
Roll Marks Grade Class Name
0 11 85.0 A 1.0 Aditya
1 12 95.0 A 1.0 Chris
2 13 75.0 B 1.0 Sam
3 14 75.0 B 1.0 Joel
4 16 78.0 B 1.0 Samantha
5 15 55.0 C 1.0 Tom
6 19 75.0 B NaN NaN
7 20 72.0 B 1.0 Tina
8 24 92.0 A 1.0 Amy
9 25 95.0 A NaN NaN
10 17 NaN NaN 1.0 Pablo
11 30 NaN NaN 1.0 Justin
12 31 NaN NaN 1.0 Karl
Suggested Reading: If you are into machine learning, you can read this article on regression in machine learning. You might also like this article on k-means clustering with numerical example.
Merge DataFrames Using Left and Right Join in Python
If you want to include omitted rows only from the first dataframe, you can pass the value ‘left’
to the parameter ‘how’
.
import numpy as np
import pandas as pd
df1=pd.read_csv("grade_with_roll1.csv")
print("First dataframe is:")
print(df1)
df2=pd.read_csv("name_data1.csv")
print("second dataframe is:")
print(df2)
df3=pd.merge(df1,df2,how="left",left_on="Roll", right_on="Roll",suffixes=("_left","_right"))
print("Merged dataframe is:")
print(df3)
Output:
First dataframe is:
Roll Marks Grade
0 11 85 A
1 12 95 A
2 13 75 B
3 14 75 B
4 16 78 B
5 15 55 C
6 19 75 B
7 20 72 B
8 24 92 A
9 25 95 A
second dataframe is:
Class Roll Name
0 1 11 Aditya
1 1 12 Chris
2 1 13 Sam
3 1 14 Joel
4 1 15 Tom
5 1 16 Samantha
6 1 17 Pablo
7 1 20 Tina
8 1 24 Amy
9 1 30 Justin
10 1 31 Karl
Merged dataframe is:
Roll Marks Grade Class Name
0 11 85 A 1.0 Aditya
1 12 95 A 1.0 Chris
2 13 75 B 1.0 Sam
3 14 75 B 1.0 Joel
4 16 78 B 1.0 Samantha
5 15 55 C 1.0 Tom
6 19 75 B NaN NaN
7 20 72 B 1.0 Tina
8 24 92 A 1.0 Amy
9 25 95 A NaN NaN
Similarly, if you want to include omitted rows only from the second dataframe, you can pass the value ‘right’
to the ‘how’
parameter as shown below.
import numpy as np
import pandas as pd
df1=pd.read_csv("grade_with_roll1.csv")
print("First dataframe is:")
print(df1)
df2=pd.read_csv("name_data1.csv")
print("second dataframe is:")
print(df2)
df3=pd.merge(df1,df2,how="right",left_on="Roll", right_on="Roll",suffixes=("_left","_right"))
print("Merged dataframe is:")
print(df3)
Output:
First dataframe is:
Roll Marks Grade
0 11 85 A
1 12 95 A
2 13 75 B
3 14 75 B
4 16 78 B
5 15 55 C
6 19 75 B
7 20 72 B
8 24 92 A
9 25 95 A
second dataframe is:
Class Roll Name
0 1 11 Aditya
1 1 12 Chris
2 1 13 Sam
3 1 14 Joel
4 1 15 Tom
5 1 16 Samantha
6 1 17 Pablo
7 1 20 Tina
8 1 24 Amy
9 1 30 Justin
10 1 31 Karl
Merged dataframe is:
Roll Marks Grade Class Name
0 11 85.0 A 1 Aditya
1 12 95.0 A 1 Chris
2 13 75.0 B 1 Sam
3 14 75.0 B 1 Joel
4 15 55.0 C 1 Tom
5 16 78.0 B 1 Samantha
6 17 NaN NaN 1 Pablo
7 20 72.0 B 1 Tina
8 24 92.0 A 1 Amy
9 30 NaN NaN 1 Justin
10 31 NaN NaN 1 Karl
Conclusion
In this article, we have discussed how to merge two dataframes in python using the merge()
method. To learn more about python programming, you can read this article on dictionary comprehension in python. You might also like this article on list comprehension in python.
Recommended Python Training
Course: Python 3 For Beginners
Over 15 hours of video content with guided instruction for beginners. Learn how to create real world applications and master the basics.