Pandas dataframes are used to handle tabular data in Python. The data sometimes contains duplicate values, which might be undesired. In this article, we will discuss different ways to drop duplicate rows from a pandas dataframe using the drop_duplicates() method.
The drop_duplicates() Method
The drop_duplicates() method is used to drop duplicate rows from a pandas dataframe. It has the following syntax.
DataFrame.drop_duplicates(subset=None, *, keep='first', inplace=False, ignore_index=False)
Here,
- The subset parameter specifies which columns are compared to decide whether two rows are duplicates. By default, subset is set to None, so the values from all the columns are compared. To compare rows by a single column, you can pass the column name to subset. To compare rows by two or more columns, you can pass a list of column names.
- The keep parameter decides which of the duplicate rows, if any, are kept in the output dataframe. To drop all the duplicate rows except the first occurrence, set keep to "first", which is its default value. To drop all the duplicate rows except the last occurrence, set keep to "last". To drop every row that has a duplicate, set keep to False.
- The inplace parameter decides whether we get a new dataframe after the drop operation or modify the original dataframe. When inplace is set to False, which is its default value, the original dataframe isn't changed and the drop_duplicates() method returns the modified dataframe after execution. To alter the original dataframe, you can set inplace to True.
- The ignore_index parameter controls the index of the result. When rows are dropped from a dataframe, the remaining rows keep their original index labels, so the index has gaps. To relabel the rows from 0 to (length of dataframe)-1, you can set ignore_index to True. Its default value is False.
After execution, the drop_duplicates() method returns a dataframe if the inplace parameter is set to False. Otherwise, it returns None.
Drop Duplicate Rows From a Pandas Dataframe
To drop duplicate rows from a pandas dataframe, you can invoke the drop_duplicates() method on the dataframe. After execution, it returns a dataframe containing all the unique rows. You can observe this in the following example.
import pandas as pd
df=pd.read_csv("grade2.csv")
print("The dataframe is:")
print(df)
df=df.drop_duplicates()
print("After dropping duplicates:")
print(df)
Output:
The dataframe is:
Class Roll Name Marks Grade
0 2 27 Harsh 55 C
1 2 23 Clara 78 B
2 3 33 Tina 82 A
3 3 34 Amy 88 A
4 3 15 Prashant 78 B
5 3 27 Aditya 55 C
6 3 34 Amy 88 A
7 3 23 Radheshyam 78 B
8 3 11 Bobby 50 D
9 2 27 Harsh 55 C
10 3 15 Lokesh 88 A
After dropping duplicates:
Class Roll Name Marks Grade
0 2 27 Harsh 55 C
1 2 23 Clara 78 B
2 3 33 Tina 82 A
3 3 34 Amy 88 A
4 3 15 Prashant 78 B
5 3 27 Aditya 55 C
7 3 23 Radheshyam 78 B
8 3 11 Bobby 50 D
10 3 15 Lokesh 88 A
In the above example, we have an input dataframe containing the Class, Roll, Name, Marks, and Grade of some students. As you can observe, the input dataframe contains some duplicate rows. The rows at index 0 and 9 are the same. Similarly, the rows at index 3 and 6 are the same. After execution of the drop_duplicates() method, we get a pandas dataframe in which all the rows are unique. Because keep defaults to "first", the later occurrences at indexes 6 and 9 are dropped, while the first occurrences at indexes 0 and 3 are kept.
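The file grade2.csv is not included with the article, so the example above cannot be run as is. The following sketch rebuilds the same dataframe in code from the values printed in the output above. The variable names columns, rows, and unique_rows are chosen here for illustration.

```python
import pandas as pd

# Values copied from the printed input dataframe in the example above.
columns = ["Class", "Roll", "Name", "Marks", "Grade"]
rows = [
    (2, 27, "Harsh", 55, "C"),
    (2, 23, "Clara", 78, "B"),
    (3, 33, "Tina", 82, "A"),
    (3, 34, "Amy", 88, "A"),
    (3, 15, "Prashant", 78, "B"),
    (3, 27, "Aditya", 55, "C"),
    (3, 34, "Amy", 88, "A"),
    (3, 23, "Radheshyam", 78, "B"),
    (3, 11, "Bobby", 50, "D"),
    (2, 27, "Harsh", 55, "C"),
    (3, 15, "Lokesh", 88, "A"),
]
df = pd.DataFrame(rows, columns=columns)

# Drop rows that repeat an earlier row in every column.
unique_rows = df.drop_duplicates()
print(unique_rows)
```

The surviving index labels are 0, 1, 2, 3, 4, 5, 7, 8, and 10, which matches the output shown above.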
Drop All Duplicate Rows From a Pandas Dataframe
In the above example, one entry from each set of duplicate rows is preserved. If you want to delete all the duplicate rows from the dataframe, you can set the keep parameter to False in the drop_duplicates() method. After this, all the rows having duplicate values will be deleted. You can observe this in the following example.
import pandas as pd
df=pd.read_csv("grade2.csv")
print("The dataframe is:")
print(df)
df=df.drop_duplicates(keep=False)
print("After dropping duplicates:")
print(df)
Output:
The dataframe is:
Class Roll Name Marks Grade
0 2 27 Harsh 55 C
1 2 23 Clara 78 B
2 3 33 Tina 82 A
3 3 34 Amy 88 A
4 3 15 Prashant 78 B
5 3 27 Aditya 55 C
6 3 34 Amy 88 A
7 3 23 Radheshyam 78 B
8 3 11 Bobby 50 D
9 2 27 Harsh 55 C
10 3 15 Lokesh 88 A
After dropping duplicates:
Class Roll Name Marks Grade
1 2 23 Clara 78 B
2 3 33 Tina 82 A
4 3 15 Prashant 78 B
5 3 27 Aditya 55 C
7 3 23 Radheshyam 78 B
8 3 11 Bobby 50 D
10 3 15 Lokesh 88 A
In this example, you can observe that the rows at index 0 and 9 are the same. Similarly, the rows at index 3 and 6 are the same. When we set the keep parameter to False in the drop_duplicates() method, all the rows that have duplicate values, i.e. the rows at index 0, 3, 6, and 9, are dropped from the input dataframe.
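The keep parameter also accepts "last", which is described in the syntax section but not demonstrated above. The following sketch compares the three settings side by side. The five-row dataframe and the variable names are hypothetical and are used only to keep the example short.

```python
import pandas as pd

# Hypothetical sample data: rows 0 and 4 are identical, and so are rows 2 and 3.
df = pd.DataFrame({
    "Roll": [27, 23, 34, 34, 27],
    "Name": ["Harsh", "Clara", "Amy", "Amy", "Harsh"],
})

kept_first = df.drop_duplicates(keep="first")  # keep the first occurrence
kept_last = df.drop_duplicates(keep="last")    # keep the last occurrence
kept_none = df.drop_duplicates(keep=False)     # drop every duplicated row

print(list(kept_first.index))  # [0, 1, 2]
print(list(kept_last.index))   # [1, 3, 4]
print(list(kept_none.index))   # [1]
```

With "first" and "last" the same distinct rows survive; what changes is which index labels they carry. With False, only the rows that never repeat remain.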
Drop Duplicate Rows Inplace From a Pandas Dataframe
By default, the drop_duplicates() method returns a new dataframe. If you want to alter the original dataframe instead of creating a new one, you can set the inplace parameter to True in the drop_duplicates() method as shown below.
import pandas as pd
df=pd.read_csv("grade2.csv")
print("The dataframe is:")
print(df)
df.drop_duplicates(keep=False,inplace=True)
print("After dropping duplicates:")
print(df)
Output:
The dataframe is:
Class Roll Name Marks Grade
0 2 27 Harsh 55 C
1 2 23 Clara 78 B
2 3 33 Tina 82 A
3 3 34 Amy 88 A
4 3 15 Prashant 78 B
5 3 27 Aditya 55 C
6 3 34 Amy 88 A
7 3 23 Radheshyam 78 B
8 3 11 Bobby 50 D
9 2 27 Harsh 55 C
10 3 15 Lokesh 88 A
After dropping duplicates:
Class Roll Name Marks Grade
1 2 23 Clara 78 B
2 3 33 Tina 82 A
4 3 15 Prashant 78 B
5 3 27 Aditya 55 C
7 3 23 Radheshyam 78 B
8 3 11 Bobby 50 D
10 3 15 Lokesh 88 A
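In this example, we have set the inplace parameter to True in the drop_duplicates() method. Hence, the drop_duplicates() method modifies the input dataframe instead of creating a new one. Here, the drop_duplicates() method returns None, so its result should not be assigned back to df. Note that the call also sets keep to False, which is why all the rows having duplicates are removed rather than only the later occurrences.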
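The two styles, reassigning the returned dataframe and modifying the dataframe in place, should leave you with the same rows. The following sketch checks this on a small hypothetical dataframe. The names original, reassigned, and modified are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data with one repeated row (index 2 repeats index 0).
original = pd.DataFrame({
    "Name": ["Amy", "Bobby", "Amy"],
    "Marks": [88, 50, 88],
})

# Style 1: keep the original untouched and use the returned dataframe.
reassigned = original.drop_duplicates()

# Style 2: modify a copy in place. The call is used as a statement;
# its return value is deliberately not assigned to anything.
modified = original.copy()
modified.drop_duplicates(inplace=True)

print(len(original))                # 3
print(reassigned.equals(modified))  # True
```

A common mistake is writing df = df.drop_duplicates(inplace=True), which replaces df with the method's return value instead of the deduplicated dataframe.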
Drop Rows Having Duplicate Values in Specific Columns
By default, the drop_duplicates() method compares all the columns to check for duplicate rows. If you want to compare the rows for duplicate values on the basis of specific columns, you can use the subset parameter in the drop_duplicates() method.
The subset parameter takes a single column name or a list of column names as its input argument. After this, the drop_duplicates() method compares the rows only based on the specified columns. You can observe this in the following example.
import pandas as pd
df=pd.read_csv("grade2.csv")
print("The dataframe is:")
print(df)
df.drop_duplicates(subset=["Class","Roll"],inplace=True)
print("After dropping duplicates:")
print(df)
Output:
The dataframe is:
Class Roll Name Marks Grade
0 2 27 Harsh 55 C
1 2 23 Clara 78 B
2 3 33 Tina 82 A
3 3 34 Amy 88 A
4 3 15 Prashant 78 B
5 3 27 Aditya 55 C
6 3 34 Amy 88 A
7 3 23 Radheshyam 78 B
8 3 11 Bobby 50 D
9 2 27 Harsh 55 C
10 3 15 Lokesh 88 A
After dropping duplicates:
Class Roll Name Marks Grade
0 2 27 Harsh 55 C
1 2 23 Clara 78 B
2 3 33 Tina 82 A
3 3 34 Amy 88 A
4 3 15 Prashant 78 B
5 3 27 Aditya 55 C
7 3 23 Radheshyam 78 B
8 3 11 Bobby 50 D
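In this example, we have passed the Python list ["Class", "Roll"] to the subset parameter in the drop_duplicates() method. Hence, the duplicate rows are decided on the basis of these two columns only. As a result, rows having the same values in the Class and Roll columns are considered duplicates, and every occurrence after the first is dropped from the dataframe. For instance, the row at index 10 (Lokesh) is dropped because it has the same Class and Roll as the row at index 4 (Prashant), even though the other columns differ.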
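The subset parameter can be combined with keep, and it also accepts a single column name instead of a list. The following sketch shows both on a small hypothetical dataframe. The data and the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: rows 3 and 4 share Class and Roll,
# and rows 0 and 2 share Roll only.
df = pd.DataFrame({
    "Class": [2, 2, 3, 3, 3],
    "Roll": [27, 23, 27, 15, 15],
    "Name": ["Harsh", "Clara", "Aditya", "Prashant", "Lokesh"],
})

# A single column name: rows are duplicates if Roll alone matches.
by_roll = df.drop_duplicates(subset="Roll")

# A list of columns: rows are duplicates only if Class and Roll both match.
by_class_roll = df.drop_duplicates(subset=["Class", "Roll"])

# Same comparison, but keep the last occurrence instead of the first.
by_class_roll_last = df.drop_duplicates(subset=["Class", "Roll"], keep="last")

print(list(by_roll.index))             # [0, 1, 3]
print(list(by_class_roll.index))       # [0, 1, 2, 3]
print(list(by_class_roll_last.index))  # [0, 1, 2, 4]
```

Comparing on fewer columns treats more rows as duplicates, so it is worth checking that the chosen columns really identify a record before dropping anything.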
Conclusion
In this article, we have discussed different ways to drop duplicate rows from a dataframe using the drop_duplicates() method.
To know more about the pandas module, you can read this article on how to sort a pandas dataframe. You might also like this article on how to drop columns from a pandas dataframe.
I hope you enjoyed reading this article. Stay tuned for more informative articles.
Happy Learning!