While handling data in pyspark, we often need to count the distinct values in one or more columns of a dataframe. In this article, we will discuss different ways to count distinct values in one or multiple columns in a pyspark dataframe.
Pyspark Count Rows in a DataFrame
The count() method counts the number of rows in a pyspark dataframe. When we invoke the count() method on a dataframe, it returns the number of rows in the dataframe, as shown below.
import pyspark.sql as ps

spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("countdistinct_example") \
    .getOrCreate()

dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()

num_rows = dfs.count()
print("The number of rows in the dataframe are:")
print(num_rows)

spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| null| 85| 82|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Joel| 45| 75| 87|
| Agatha| 77| 76| 93|
| Sam| 99| 62| 95|
| Aditya| 65| 89| 71|
+-------+-----+-------+---------+
The number of rows in the dataframe are:
8
In this example, we first read a CSV file into a pyspark dataframe. Then, we used the count() method to find the number of rows in the dataframe. As there are eight rows in the data, the count() method returns 8.
Count Distinct Rows in a PySpark DataFrame
To count distinct rows in a dataframe, we will first use the distinct() method to select the distinct rows in the pyspark dataframe. Then, we can use the count() method to count the unique rows, as shown in the following example.
import pyspark.sql as ps

spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("countdistinct_example") \
    .getOrCreate()

dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()

num_rows = dfs.distinct().count()
print("The number of distinct rows in the dataframe are:")
print(num_rows)

spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| null| 85| 82|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Joel| 45| 75| 87|
| Agatha| 77| 76| 93|
| Sam| 99| 62| 95|
| Aditya| 65| 89| 71|
+-------+-----+-------+---------+
The number of distinct rows in the dataframe are:
7
The dataframe that we created from the CSV file has duplicate rows. Hence, when we invoke the distinct() method on the pyspark dataframe, the duplicate rows are dropped. After this, when we invoke the count() method on the output of the distinct() method, we get the number of distinct rows in the given pyspark dataframe.
Instead of the distinct() method, you can also use the dropDuplicates() method to count distinct rows in a dataframe, as shown below.
import pyspark.sql as ps

spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("countdistinct_example") \
    .getOrCreate()

dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()

num_rows = dfs.dropDuplicates().count()
print("The number of distinct rows in the dataframe are:")
print(num_rows)

spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| null| 85| 82|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Joel| 45| 75| 87|
| Agatha| 77| 76| 93|
| Sam| 99| 62| 95|
| Aditya| 65| 89| 71|
+-------+-----+-------+---------+
The number of distinct rows in the dataframe are:
7
The dropDuplicates() method, when invoked on a pyspark dataframe, drops all the duplicate rows. Hence, when we invoke the count() method on the dataframe returned by dropDuplicates(), we get the count of distinct rows in the dataframe.
Pyspark Count Values in a Column
To count the values in a column of a pyspark dataframe, we can use the select() method together with the count() method. The select() method takes one or more column names as input and returns a dataframe containing only the specified columns. We will first select the particular column by passing its name to the select() method. Next, we will use the count() method to count the number of values in the selected column, as shown in the following example.
import pyspark.sql as ps

spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("countdistinct_example") \
    .getOrCreate()

dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()

num_rows = dfs.select("Name").count()
print("The number of values in the Name column are:")
print(num_rows)

spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| null| 85| 82|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Joel| 45| 75| 87|
| Agatha| 77| 76| 93|
| Sam| 99| 62| 95|
| Aditya| 65| 89| 71|
+-------+-----+-------+---------+
The number of values in the Name column are:
8
By default, a column has the same number of values as there are rows in the dataframe, so this example is not very useful on its own. However, we can combine the select() method with the distinct() method to count the distinct values in a column of the pyspark dataframe.
Count Distinct Values in a Column in PySpark DataFrame
To count distinct values in a column in a pyspark dataframe, we will use the following steps.
- First, we will select the particular column from the dataframe using the select() method.
- Next, we will use the distinct() method to get a column with distinct values.
- Finally, we will use the count() method to count the distinct values in the column.
You can observe this in the following example.
import pyspark.sql as ps

spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("countdistinct_example") \
    .getOrCreate()

dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()

num_rows = dfs.select("Name").distinct().count()
print("The number of distinct values in the Name column are:")
print(num_rows)

spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| null| 85| 82|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Joel| 45| 75| 87|
| Agatha| 77| 76| 93|
| Sam| 99| 62| 95|
| Aditya| 65| 89| 71|
+-------+-----+-------+---------+
The number of distinct values in the Name column are:
6
In the above example, we first selected the Name column using the select() method. Then, we used the distinct() method to get the distinct values in the Name column. Finally, we used the count() method to count the distinct values in the specified column.
Instead of the distinct() method, you can also use the dropDuplicates() method to count distinct values in a column, as shown below.
import pyspark.sql as ps

spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("countdistinct_example") \
    .getOrCreate()

dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()

num_rows = dfs.select("Name").dropDuplicates().count()
print("The number of distinct values in the Name column are:")
print(num_rows)

spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| null| 85| 82|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Joel| 45| 75| 87|
| Agatha| 77| 76| 93|
| Sam| 99| 62| 95|
| Aditya| 65| 89| 71|
+-------+-----+-------+---------+
The number of distinct values in the Name column are:
6
PySpark Count Distinct Multiple Columns
To count the number of distinct values in multiple columns, we will use the following steps.
- First, we will select the specified columns using the select() method.
- Next, we will use the distinct() method to find the distinct combinations of values in the given columns.
- Finally, we will use the count() method to get the count of distinct values.
You can observe this in the following example.
import pyspark.sql as ps

spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("countdistinct_example") \
    .getOrCreate()

dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()

num_rows = dfs.select("Name", "Maths").distinct().count()
print("The number of distinct values in the Name and Maths column are:")
print(num_rows)

spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| null| 85| 82|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Joel| 45| 75| 87|
| Agatha| 77| 76| 93|
| Sam| 99| 62| 95|
| Aditya| 65| 89| 71|
+-------+-----+-------+---------+
The number of distinct values in the Name and Maths column are:
7
In this example, we have counted the distinct values in the Name and Maths columns. For this, we first selected both columns using the select() method. Next, we used the distinct() method to drop duplicate pairs from the two columns. Finally, we used the count() method to count the distinct values in multiple columns of the given pyspark dataframe.
Instead of the distinct() method, you can also use the dropDuplicates() method in the previous approach to count distinct values in multiple columns of a pyspark dataframe.
import pyspark.sql as ps

spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("countdistinct_example") \
    .getOrCreate()

dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()

num_rows = dfs.select("Name", "Maths").dropDuplicates().count()
print("The number of distinct values in the Name and Maths column are:")
print(num_rows)

spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| null| 85| 82|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Joel| 45| 75| 87|
| Agatha| 77| 76| 93|
| Sam| 99| 62| 95|
| Aditya| 65| 89| 71|
+-------+-----+-------+---------+
The number of distinct values in the Name and Maths column are:
7
In the above example, we can avoid using the select() method altogether when finding distinct counts for multiple columns. Instead, we can invoke the dropDuplicates() method directly on the original dataframe, passing a list of the column names for which we want distinct values to its subset parameter. This returns the original dataframe with all of its columns, but with only one row for each unique combination of values in the columns passed to the dropDuplicates() method.
After getting this dataframe, you can use the count() method to find the number of distinct values in multiple columns, as shown below.
import pyspark.sql as ps

spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("countdistinct_example") \
    .getOrCreate()

dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()

num_rows = dfs.dropDuplicates(subset=["Name", "Maths"]).count()
print("The number of distinct values in the Name and Maths column are:")
print(num_rows)

spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| null| 85| 82|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Joel| 45| 75| 87|
| Agatha| 77| 76| 93|
| Sam| 99| 62| 95|
| Aditya| 65| 89| 71|
+-------+-----+-------+---------+
The number of distinct values in the Name and Maths column are:
7
In the output, you can observe that we get the same result as in the previous example, even though we haven't used the select() method.
Count Unique Values in Columns Using the countDistinct() Function
The countDistinct() function is defined in the pyspark.sql.functions module. It is often used with the groupBy() method to count distinct values in different subsets of a pyspark dataframe. However, we can also use the countDistinct() function to count distinct values in one or multiple columns of the whole dataframe.
To count the number of distinct values in a column in pyspark using the countDistinct() function, we will use the agg() method, passing the countDistinct() function to it as input. We will also pass the name of the column for which we want to count the distinct values to the countDistinct() function. After execution of the agg() method, we get a dataframe containing the count of unique values in the given column. You can observe this in the following example.
import pyspark.sql as ps
from pyspark.sql.functions import countDistinct

spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("countdistinct_example") \
    .getOrCreate()

dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()

count_df = dfs.agg(countDistinct("Name"))
print("The number of distinct values in the Name column are:")
count_df.show()

spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| null| 85| 82|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Joel| 45| 75| 87|
| Agatha| 77| 76| 93|
| Sam| 99| 62| 95|
| Aditya| 65| 89| 71|
+-------+-----+-------+---------+
The number of distinct values in the Name column are:
+-----------+
|count(Name)|
+-----------+
| 6|
+-----------+
If you want to count unique values in multiple columns using the countDistinct() function, you can pass the required column names to it as shown below.
import pyspark.sql as ps
from pyspark.sql.functions import countDistinct

spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("countdistinct_example") \
    .getOrCreate()

dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()

count_df = dfs.agg(countDistinct("Name", "Maths"))
print("The number of distinct values in the Name and Maths column are:")
count_df.show()

spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| null| 85| 82|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Joel| 45| 75| 87|
| Agatha| 77| 76| 93|
| Sam| 99| 62| 95|
| Aditya| 65| 89| 71|
+-------+-----+-------+---------+
The number of distinct values in the Name and Maths column are:
+------------------+
|count(Name, Maths)|
+------------------+
| 6|
+------------------+
In this output, you can observe that the countDistinct() function gives the count of distinct values in the Name and Maths columns as 6, whereas the previous example gave 7. This is because the countDistinct() function ignores rows that contain null values in the given columns.
Conclusion
In this article, we discussed different ways to count distinct values in one or multiple columns of a pyspark dataframe. To learn more about pyspark, you can read this article on how to sort a pyspark dataframe. You might also like this article on tuple unpacking in Python.
I hope you enjoyed reading this article. Stay tuned for more informative articles.
Happy Learning!