When using a pyspark dataframe, we sometimes need to select unique rows or unique values from a particular column. In this article, we will discuss how to select distinct rows or values in a column of a pyspark dataframe using three different ways.
Select Distinct Rows From PySpark DataFrame
You can select distinct rows from a pyspark dataframe in three ways.
- Using the distinct() method
- Using the dropDuplicates() method
- Using a SQL statement
Let us discuss each method one at a time.
PySpark Select Distinct Rows Using The distinct() Method
The distinct() method, when invoked on a pyspark dataframe, returns all the unique rows in the dataframe. Hence, we can directly invoke the distinct() method on a pyspark dataframe to select unique rows.
To observe this, we will first read a csv file named sample_csv_file.csv into a pyspark dataframe. After reading the csv file, you can invoke the distinct() method on the dataframe to get the distinct rows, as shown below.
import pyspark.sql as ps

spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("selectdistinct_example") \
    .getOrCreate()

dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()

dfs = dfs.distinct()
print("The dataframe with distinct rows is:")
dfs.show()

spark.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| 78| 85| 82|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Joel| 45| 75| 87|
| Agatha| 76| 93| 83|
| Sam| 99| 98| 95|
| Aditya| 45| 98| 71|
+-------+-----+-------+---------+
The dataframe with distinct rows is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Joel| 45| 75| 87|
| Aditya| 45| 89| 71|
| Chris| 78| 85| 82|
|Katrina| 49| 47| 83|
| Sam| 99| 98| 95|
| Agatha| 76| 93| 83|
| Aditya| 45| 98| 71|
+-------+-----+-------+---------+
In the above example, you can observe that the original dataframe contains eight rows. After using the distinct() method, we get seven distinct rows from the dataframe.
Select Distinct Rows Using The dropDuplicates() Method
The dropDuplicates() method works in a similar manner to the distinct() method. In fact, when invoked without arguments, dropDuplicates() is equivalent to distinct(). When we invoke the dropDuplicates() method on a dataframe, it returns the unique rows in the dataframe, as shown below.
import pyspark.sql as ps

spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("selectdistinct_example") \
    .getOrCreate()

dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()

dfs = dfs.dropDuplicates()
print("The dataframe with distinct rows is:")
dfs.show()

spark.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| 78| 85| 82|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Joel| 45| 75| 87|
| Agatha| 76| 93| 83|
| Sam| 99| 98| 95|
| Aditya| 45| 98| 71|
+-------+-----+-------+---------+
The dataframe with distinct rows is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Joel| 45| 75| 87|
| Aditya| 45| 89| 71|
| Chris| 78| 85| 82|
|Katrina| 49| 47| 83|
| Sam| 99| 98| 95|
| Agatha| 76| 93| 83|
| Aditya| 45| 98| 71|
+-------+-----+-------+---------+
In this example, you can observe that we have used the dropDuplicates() method instead of the distinct() method, and we get the same result.
Distinct Rows From PySpark DataFrame Using SQL
We can also select distinct rows in a pyspark dataframe using SQL syntax. For this, we will use the following steps.
- First, we need to create a temporary view of the dataframe using the createOrReplaceTempView() method. The createOrReplaceTempView() method takes the desired name of the temporary view as its argument and registers the dataframe under that name.
- Once we have the temporary view, we can execute a SELECT DISTINCT SQL statement on the view to select unique rows. For this, we create a string containing the SQL statement and pass it to the sql() method of the SparkSession object.
- After execution of the sql() method, we get the output dataframe with distinct rows.
After executing the above statements, we can get the pyspark dataframe with distinct rows as shown in the following example.
import pyspark.sql as ps

spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("selectdistinct_example") \
    .getOrCreate()

dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()

dfs.createOrReplaceTempView("df_sql")
dfs = spark.sql("SELECT DISTINCT * FROM df_sql")
print("The dataframe with distinct rows is:")
dfs.show()

spark.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| 78| 85| 82|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Joel| 45| 75| 87|
| Agatha| 76| 93| 83|
| Sam| 99| 98| 95|
| Aditya| 45| 98| 71|
+-------+-----+-------+---------+
The dataframe with distinct rows is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Joel| 45| 75| 87|
| Aditya| 45| 89| 71|
| Chris| 78| 85| 82|
|Katrina| 49| 47| 83|
| Sam| 99| 98| 95|
| Agatha| 76| 93| 83|
| Aditya| 45| 98| 71|
+-------+-----+-------+---------+
Select Distinct Rows Based on Multiple Columns in PySpark DataFrame
In the previous examples, we have selected unique rows based on all the columns. However, we can also use specific columns to decide the uniqueness of the rows.
To select distinct rows based on multiple columns, we can pass a list of the column names that should decide the uniqueness of the rows to the dropDuplicates() method. After execution, the dropDuplicates() method returns a dataframe in which the combination of values in the specified columns is unique in each row. You can observe this in the following example.
import pyspark.sql as ps

spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("selectdistinct_example") \
    .getOrCreate()

dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()

dfs = dfs.dropDuplicates(["Name", "Maths"])
print("The dataframe with distinct rows is:")
dfs.show()

spark.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| 78| 85| 82|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Joel| 45| 75| 87|
| Agatha| 76| 93| 83|
| Sam| 99| 98| 95|
| Aditya| 45| 98| 71|
+-------+-----+-------+---------+
The dataframe with distinct rows is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Agatha| 76| 93| 83|
| Chris| 78| 85| 82|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Sam| 99| 98| 95|
+-------+-----+-------+---------+
In this example, we first read a csv file to create a pyspark dataframe. Then, we used the dropDuplicates() method to select distinct rows having unique values in the Name and Maths columns. For this, we passed the list ["Name", "Maths"] to the dropDuplicates() method. In the output, you can observe that the pyspark dataframe contains all the columns. However, the combination of the Name and Maths columns is unique in each row.
PySpark Select Unique Values in A Column
To select distinct values from one column in a pyspark dataframe, we first need to select the particular column using the select() method. Then, we can get the distinct values from the column using the distinct() method, as shown below.
import pyspark.sql as ps

spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("selectdistinct_example") \
    .getOrCreate()

dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()

dfs = dfs.select("Name").distinct()
print("The distinct values in the column are:")
dfs.show()

spark.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| 78| 85| 82|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Joel| 45| 75| 87|
| Agatha| 76| 93| 83|
| Sam| 99| 98| 95|
| Aditya| 45| 98| 71|
+-------+-----+-------+---------+
The distinct values in the column are:
+-------+
| Name|
+-------+
|Katrina|
| Chris|
| Agatha|
| Sam|
| Joel|
| Aditya|
+-------+
In this example, we first selected the Name column using the select() method. Then, we invoked the distinct() method on the selected column to get all the unique values.
Instead of the distinct() method, you can use the dropDuplicates() method to select unique values from a column in a pyspark dataframe, as shown in the following example.
import pyspark.sql as ps

spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("selectdistinct_example") \
    .getOrCreate()

dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()

dfs = dfs.select("Name").dropDuplicates()
print("The distinct values in the column are:")
dfs.show()

spark.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| 78| 85| 82|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Joel| 45| 75| 87|
| Agatha| 76| 93| 83|
| Sam| 99| 98| 95|
| Aditya| 45| 98| 71|
+-------+-----+-------+---------+
The distinct values in the column are:
+-------+
| Name|
+-------+
|Katrina|
| Chris|
| Agatha|
| Sam|
| Joel|
| Aditya|
+-------+
We can also use the SQL SELECT DISTINCT statement to select distinct values from a column in a pyspark dataframe as shown below.
import pyspark.sql as ps

spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("selectdistinct_example") \
    .getOrCreate()

dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()

dfs.createOrReplaceTempView("df_sql")
dfs = spark.sql("SELECT DISTINCT Name FROM df_sql")
print("The distinct values in the column are:")
dfs.show()

spark.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| 78| 85| 82|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Joel| 45| 75| 87|
| Agatha| 76| 93| 83|
| Sam| 99| 98| 95|
| Aditya| 45| 98| 71|
+-------+-----+-------+---------+
The distinct values in the column are:
+-------+
| Name|
+-------+
|Katrina|
| Chris|
| Agatha|
| Sam|
| Joel|
| Aditya|
+-------+
PySpark Select Distinct Values From Multiple Columns
To select distinct values from multiple columns, we will first select the desired columns using the select() method. After this, we will use the distinct() method to get the unique value combinations from the selected columns, as shown below.
import pyspark.sql as ps

spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("selectdistinct_example") \
    .getOrCreate()

dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()

dfs = dfs.select("Name", "Maths").distinct()
print("The distinct values in the columns are:")
dfs.show()

spark.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| 78| 85| 82|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Joel| 45| 75| 87|
| Agatha| 76| 93| 83|
| Sam| 99| 98| 95|
| Aditya| 45| 98| 71|
+-------+-----+-------+---------+
The distinct values in the columns are:
+-------+-----+
| Name|Maths|
+-------+-----+
| Aditya| 45|
|Katrina| 49|
| Agatha| 76|
| Chris| 78|
| Sam| 99|
| Joel| 45|
+-------+-----+
In this example, we have selected distinct values from the Name and Maths columns. For this, we first selected both these columns from the original dataframe using the select() method. Then, we used the distinct() method to select the distinct value combinations in the columns.
Instead of the distinct() method, you can also use the dropDuplicates() method to select distinct values from multiple columns, as shown below.
import pyspark.sql as ps

spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("selectdistinct_example") \
    .getOrCreate()

dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()

dfs = dfs.select("Name", "Maths").dropDuplicates()
print("The distinct values in the columns are:")
dfs.show()

spark.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| 78| 85| 82|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Joel| 45| 75| 87|
| Agatha| 76| 93| 83|
| Sam| 99| 98| 95|
| Aditya| 45| 98| 71|
+-------+-----+-------+---------+
The distinct values in the columns are:
+-------+-----+
| Name|Maths|
+-------+-----+
| Aditya| 45|
|Katrina| 49|
| Agatha| 76|
| Chris| 78|
| Sam| 99|
| Joel| 45|
+-------+-----+
You can also use the SQL SELECT DISTINCT statement with column names to select unique values from multiple columns as shown in the following example.
import pyspark.sql as ps

spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("selectdistinct_example") \
    .getOrCreate()

dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()

dfs.createOrReplaceTempView("df_sql")
dfs = spark.sql("SELECT DISTINCT Name, Maths FROM df_sql")
print("The distinct values in the columns are:")
dfs.show()

spark.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| 78| 85| 82|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Joel| 45| 75| 87|
| Agatha| 76| 93| 83|
| Sam| 99| 98| 95|
| Aditya| 45| 98| 71|
+-------+-----+-------+---------+
The distinct values in the columns are:
+-------+-----+
| Name|Maths|
+-------+-----+
| Aditya| 45|
|Katrina| 49|
| Agatha| 76|
| Chris| 78|
| Sam| 99|
| Joel| 45|
+-------+-----+
Conclusion
In this article, we discussed different ways to select distinct rows or values from a PySpark dataframe. To learn more about PySpark, you can read this article on how to sort a dataframe in pyspark. You might also like this article on how to fill nan values in a pandas dataframe.
I hope you enjoyed reading this article. Stay tuned for more informative articles.
Happy Learning!