Missing values in tabular data are a common problem. When we load tabular data with missing values into a pyspark dataframe, the empty values are replaced with null values. In this article, we will discuss how to count rows with null values in a given pyspark dataframe.
Count Rows With Null Values in a Column in PySpark DataFrame
To count rows with null values in a column in a pyspark dataframe, we can use the following approaches.
- Using the filter() method and the isNull() method with the count() method
- Using the where() method and the isNull() method with the count() method
- Using the SQL IS NULL clause with the COUNT() function
Let us discuss all these approaches one by one.
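All the examples in this article read a file named StudentMarks.csv. If you want to follow along without that file, the following sketch recreates the same data and writes it to StudentMarks.csv. The file is not part of the original article, so treat the schema and values as a reconstruction taken from the dataframe outputs shown below.
import csv
# Recreate the sample data used throughout this article
# (values copied from the dataframe outputs shown below).
header = ["Student", "Physics", "Chemistry", "Math", "Biology", "Arts"]
rows = [
    ["Aditya", 92, 76, 95, 73, 91],
    ["Chris", 95, None, 79, 71, 93],
    ["Sam", None, 62, 75, 95, 63],
    ["Harry", 68, 92, 69, 66, 98],
    ["Golu", 74, None, 96, 76, 64],
    ["Joel", 99, 79, None, None, 61],
    ["Tom", None, 94, 61, 65, 69],
    ["Harsh", 98, 99, 93, 95, 91],
    ["Clara", 93, None, 78, None, 71],
    ["Tina", 99, 76, 78, 94, 95],
]
with open("StudentMarks.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    # csv writes None as an empty field, which spark.read.csv
    # with inferSchema=True loads as null.
    writer.writerows(rows)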
Count Rows With Null Values Using The filter() Method
To count rows with null values in a particular column in a pyspark dataframe, we will first invoke the isNull() method on the given column. The isNull() method will return a masked column having True and False values. We will pass the mask column object returned by the isNull() method to the filter() method. After this, we will get the rows with null values in the given column.
Once we get the dataframe with rows having null values in the specified column, we will use the count() method to get the count of rows with null values.
You can observe this in the following example.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("count_null_example") \
    .getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter(dfs.Physics.isNull())
print("Rows at which there are null values in Physics Column:")
dfs.show()
print("Count of the rows with null values in Physics column is:")
print(dfs.count())
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows at which there are null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Sam| null| 62| 75| 95| 63|
| Tom| null| 94| 61| 65| 69|
+-------+-------+---------+----+-------+----+
Count of the rows with null values in Physics column is:
2
In the above example, we first read a csv file into a pyspark dataframe. Then, we used the isNull() method and the filter() method to filter rows with null values in the Physics column. In the output, you can observe that the final dataframe returned by the filter() method contains only those rows in which the Physics column is null. Hence, the count() method returns the value 2.
Count Rows With Null Values Using the where() Method
Instead of the filter() method, you can also use the where() method to count rows with null values in a given column. The where() method also takes the mask returned by the isNull() method and returns a dataframe containing the rows where the mask column contains True.
We can invoke the count() method on the dataframe returned by the where() method to get the number of rows with null values as shown below.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("count_null_example") \
    .getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.where(dfs.Physics.isNull())
print("Rows at which there are null values in Physics Column:")
dfs.show()
print("Count of the rows with null values in Physics column is:")
print(dfs.count())
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows at which there are null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Sam| null| 62| 75| 95| 63|
| Tom| null| 94| 61| 65| 69|
+-------+-------+---------+----+-------+----+
Count of the rows with null values in Physics column is:
2
In this example, we have used the where() method instead of the filter() method. However, you can observe that the output remains the same.
Instead of the syntax used in the above examples, you can use the col() function with the isNull() method to create the mask containing True and False values. The col() function is defined in the pyspark.sql.functions module. It takes a column name as its input argument and returns a column object containing all the values in the given column of a pyspark dataframe.
We can invoke the isNull() method on the column object to select rows with null values from the pyspark dataframe. Then, we can use the count() method to get the count of rows in the dataframe as shown below.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("count_null_example") \
    .getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.where(col("Physics").isNull())
print("Rows at which there are null values in Physics Column:")
dfs.show()
print("Count of the rows with null values in Physics column is:")
print(dfs.count())
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows at which there are null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Sam| null| 62| 75| 95| 63|
| Tom| null| 94| 61| 65| 69|
+-------+-------+---------+----+-------+----+
Count of the rows with null values in Physics column is:
2
In this example, we have used the col() function to access the Physics column of the dataframe instead of using the name of the dataframe. You can observe that the number of rows with null values in the Physics column still remains 2.
Get the Number of Rows With Null Values Using SQL Syntax
We can also use Spark SQL to get the number of rows with null values from a pyspark dataframe. For this, we will first create a view of the input dataframe using the createOrReplaceTempView() method. The createOrReplaceTempView() method, when invoked on a pyspark dataframe, takes the name of the view as its input argument and creates a temporary view of the dataframe.
After getting the view of the dataframe, we can use the SQL SELECT statement with the IS NULL clause and the COUNT(*) function to count rows with null values in a given column in the pyspark dataframe. For this, we can execute the SQL query using the sql() function as shown below.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("count_null_example") \
    .getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs.createOrReplaceTempView("df_sql")
new_df=spark.sql("SELECT COUNT(*) FROM df_sql WHERE Physics IS NULL")
print("Count of the rows with null values in Physics column is:")
new_df.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Count of the rows with null values in Physics column is:
+--------+
|count(1)|
+--------+
| 2|
+--------+
We first created a view of the input dataframe in the above example. This is because we cannot execute SQL statements directly on a pyspark dataframe. Then, we used the sql() function to execute the SQL query for retrieving the number of rows in which the Physics column is null. The sql() function takes the SQL query as its input and returns the output dataframe containing the count.
Instead of creating a view of the dataframe and executing the query with the sql() function, you can directly pass the expression from the WHERE clause of the SQL statement to the filter() method. Then, you can invoke the count() method on the dataframe returned by the filter() method to get the number of rows in which the given column has null values as shown below.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("count_null_example") \
    .getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter("Physics IS NULL")
print("Rows at which there are null values in Physics Column:")
dfs.show()
print("Count of the rows with null values in Physics column is:")
print(dfs.count())
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows at which there are null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Sam| null| 62| 75| 95| 63|
| Tom| null| 94| 61| 65| 69|
+-------+-------+---------+----+-------+----+
Count of the rows with null values in Physics column is:
2
In this example, we have passed the string "Physics IS NULL" to the filter() method. The filter() method treats the string as an expression from the WHERE clause of an SQL statement and returns the output dataframe in which the Physics column contains only null values. Then, we get the count of the output dataframe using the count() method and print it.
Get the Number of Rows With Not Null Values in a Column
We can use the approaches used for counting rows with null values to count rows with not null values. The only difference is that we need to use the isNotNull() method in place of the isNull() method.
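Equivalently, because every row is either null or not null in a given column, the not-null count is just the total row count minus the null count. A minimal sketch, assuming the same dfs dataframe used in the examples above:
# Not-null count as total rows minus null rows; assumes the dfs
# dataframe loaded from StudentMarks.csv in the earlier examples.
total_rows = dfs.count()
null_rows = dfs.filter(dfs.Physics.isNull()).count()
print(total_rows - null_rows)  # same as dfs.filter(dfs.Physics.isNotNull()).count()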
Count Rows With Not Null Values Using The filter() Method
To get the number of rows with no null values in a particular column in a pyspark dataframe, we will first invoke the isNotNull() method on the given column. The isNotNull() method will return a masked column containing True and False values. Next, we will pass the mask column object returned by the isNotNull() method to the filter() method. After this, we will get the dataframe with rows having no null values in the given column. Finally, we will use the count() method to get the number of rows in the dataframe returned by the filter() method.
You can observe this in the following example.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("count_null_example") \
    .getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter(dfs.Physics.isNotNull())
print("Rows at which there are not null values in Physics Column:")
dfs.show()
print("Count of the rows with not null values in Physics column is:")
print(dfs.count())
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows at which there are not null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Count of the rows with not null values in Physics column is:
8
In this example, we used the isNotNull() method with the filter() method and the count() method to get the number of rows with not null values from the pyspark dataframe.
Get the Number of Rows With Not Null Values Using the where() Method
Instead of the filter() method, you can also use the where() method to count rows with not null values in a given column. The where() method also takes the mask returned by the isNotNull() method and returns a dataframe containing the rows where the mask column contains True. We can use the count() method on the dataframe returned by the where() method to get the number of rows in which the specified column is not null.
You can observe this in the following example.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("count_null_example") \
    .getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.where(dfs.Physics.isNotNull())
print("Rows at which there are not null values in Physics Column:")
dfs.show()
print("Count of the rows with not null values in Physics column is:")
print(dfs.count())
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows at which there are not null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Count of the rows with not null values in Physics column is:
8
In this example, we used the where() method instead of the filter() method. However, the output remains the same.
Instead of the syntax used in the above examples, you can also use the col() function with the isNotNull() method to count rows with not null values in a pyspark dataframe as shown below.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("count_null_example") \
    .getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.where(col("Physics").isNotNull())
print("Rows at which there are not null values in Physics Column:")
dfs.show()
print("Count of the rows with not null values in Physics column is:")
print(dfs.count())
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows at which there are not null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Count of the rows with not null values in Physics column is:
8
Get the Number of Rows With Not Null Values Using the dropna() Method
We can also use the dropna() method to get the number of rows with not null values in a single column of a pyspark dataframe. For this, we can invoke the dropna() method on the pyspark dataframe and pass the column name to the subset parameter. After execution of the dropna() method, we will get a dataframe with rows having no null values in the specified column. Then, we can use the count() method to get the number of rows with no null values in the given column.
You can observe this in the following example.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("count_null_example") \
    .getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.dropna(subset="Physics")
print("Rows at which there are not null values in Physics Column:")
dfs.show()
print("Count of the rows with not null values in Physics column is:")
print(dfs.count())
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows at which there are not null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Count of the rows with not null values in Physics column is:
8
In this example, we invoked the dropna() method on the input pyspark dataframe. Here, we have passed the column name Physics as input to the subset parameter of the dropna() method. Due to this, all the rows in which the Physics column contains null values are dropped from the dataframe. Thus, the dataframe returned by the dropna() method contains only the rows with no null values in the Physics column. Finally, we use the count() method to get the number of rows in the dataframe returned by the dropna() method.
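Though we used a single column here, the dropna() method also accepts a list of column names in the subset parameter, together with a how parameter ("any" drops a row if any listed column is null, "all" only if all of them are). A minimal sketch, assuming the same dfs dataframe as above:
# Rows that are not null in both Physics and Chemistry; with
# how="any", a row is dropped if either listed column is null.
both_not_null = dfs.dropna(how="any", subset=["Physics", "Chemistry"])
print(both_not_null.count())  # 5 for the sample data in this article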
Count Rows With Not Null Values using SQL in a PySpark DataFrame
We can also use pyspark SQL syntax to get the number of rows with no null values in a column from a pyspark dataframe. For this, we will use the following steps.
- First, we will create a view of the input dataframe using the createOrReplaceTempView() method. The createOrReplaceTempView() method, when invoked on a pyspark dataframe, takes the name of the view as its input argument and creates a temporary view of the dataframe.
- After getting the view of the dataframe, we can use the SQL SELECT statement with the IS NOT NULL clause and the COUNT(*) function to get the count of rows with not null values in the given column of the pyspark dataframe. For this, we can execute the SQL query using the sql() function.
You can observe this in the following example.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("count_null_example") \
    .getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs.createOrReplaceTempView("df_sql")
new_df=spark.sql("SELECT COUNT(*) FROM df_sql WHERE Physics IS NOT NULL")
print("Count of the rows with not null values in Physics column is:")
new_df.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Count of the rows with not null values in Physics column is:
+--------+
|count(1)|
+--------+
| 8|
+--------+
In this example, we have used the sql() function to execute the SQL statement for obtaining the number of rows with not null values in the Physics column.
Instead of using the above approach, you can directly pass the expression from the WHERE clause of the SQL statement to the filter() method as shown below.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("count_null_example") \
    .getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter("Physics IS NOT NULL")
print("Rows at which there are not null values in Physics Column:")
dfs.show()
print("Count of the rows with not null values in Physics column is:")
print(dfs.count())
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows at which there are not null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Count of the rows with not null values in Physics column is:
8
In this example, we passed the string "Physics IS NOT NULL" to the filter() method instead of executing the entire SQL query. However, the output remains the same.
Count Rows with Null Values in Multiple Columns in a DataFrame
To count rows with null values in multiple columns, we can use the conditional operators along with the isNull() method inside the filter() and where() methods. Then, we can use the count() method on the dataframe returned by the filter() method or the where() method to get the number of rows with null values. For this, we will use the following steps.
- We will first invoke the isNull() method on all the required columns.
- Next, we will enclose the masked columns returned by the isNull() method inside parentheses and combine them with conditional operators like AND (&) and OR (|) to create a conditional statement.
- After this, we will pass the conditional statement to the filter() method. The filter() method will return a dataframe containing the rows with null values in the specified columns.
- Finally, we will use the count() method to get the number of rows in the dataframe returned by the filter() method.
After executing the above statements, we can count rows with null values in multiple columns in a pyspark dataframe. You can observe this in the following example.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("count_null_example") \
    .getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter((dfs.Physics.isNull()) & (dfs.Chemistry.isNull()))
print("Rows with null values in Physics and Chemistry Columns are:")
dfs.show()
print("Count of the rows with null values in Physics and Chemistry column is:")
print(dfs.count())
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows with null values in Physics and Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
+-------+-------+---------+----+-------+----+
Count of the rows with null values in Physics and Chemistry column is:
0
In this example, we have passed the conditions (dfs.Physics.isNull()) and (dfs.Chemistry.isNull()), combined with the & operator, to the filter() method to filter rows with null values in both columns, and then used the count() method. In the output, you can observe that we get an empty dataframe because there are no rows where both the Physics and Chemistry columns are null. Hence, the count is 0.
Observe that each condition is enclosed in parentheses. With plain method calls like isNull() the parentheses are optional, but they are a good habit: the & operator binds more tightly than comparison operators in Python, so a comparison-based condition written without parentheses runs into an error.
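To make the precedence issue concrete, here is a hypothetical comparison-based filter (marks above 90 in both subjects, not part of the original article); the unparenthesized form fails while the parenthesized one works:
# & binds tighter than >, so the first expression is parsed as the
# chained comparison dfs.Physics > (90 & dfs.Chemistry) > 90; Python
# then needs to convert a Column to bool, which raises an error.
# dfs.filter(dfs.Physics > 90 & dfs.Chemistry > 90)    # raises an error
dfs.filter((dfs.Physics > 90) & (dfs.Chemistry > 90))  # works as intended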
In a similar manner, you can use the OR conditional operator (|) to count rows in which the Physics column or the Chemistry column contains null values as shown below.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("count_null_example") \
    .getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter((dfs.Physics.isNull())|(dfs.Chemistry.isNull()))
print("Rows with null values in Physics or Chemistry Columns are:")
dfs.show()
print("Count of the rows with null values in Physics or Chemistry column is:")
print(dfs.count())
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows with null values in Physics or Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Golu| 74| null| 96| 76| 64|
| Tom| null| 94| 61| 65| 69|
| Clara| 93| null| 78| null| 71|
+-------+-------+---------+----+-------+----+
Count of the rows with null values in Physics or Chemistry column is:
5
In the above example, we have used the filter() method with the conditions to filter rows with null values in the Physics or Chemistry column from the pyspark dataframe. Then, we used the count() method to get the number of rows in the dataframe returned by the filter() method.
Instead of the filter() method, you can also use the where() method in the previous example as shown below.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("count_null_example") \
    .getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.where((dfs.Physics.isNull())|(dfs.Chemistry.isNull()))
print("Rows with null values in Physics or Chemistry Columns are:")
dfs.show()
print("Count of the rows with null values in Physics or Chemistry column is:")
print(dfs.count())
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows with null values in Physics or Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Golu| 74| null| 96| 76| 64|
| Tom| null| 94| 61| 65| 69|
| Clara| 93| null| 78| null| 71|
+-------+-------+---------+----+-------+----+
Count of the rows with null values in Physics or Chemistry column is:
5
We can also use multiple conditional statements inside SQL syntax to get the number of rows with null values in multiple columns in a pyspark dataframe. For this, you can use the IS NULL clause with conditional operators in the WHERE clause of the SQL statement as shown below.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("count_null_example") \
    .getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs.createOrReplaceTempView("df_sql")
new_df=spark.sql("SELECT COUNT(*) FROM df_sql WHERE Physics IS NULL or Chemistry IS NULL")
print("Count of the rows with null values in Physics or Chemistry column is:")
new_df.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Count of the rows with null values in Physics or Chemistry column is:
+--------+
|count(1)|
+--------+
| 5|
+--------+
Instead of using the SQL statement, we can also pass the conditional expression from the WHERE clause to the filter() method as shown below.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("count_null_example") \
    .getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter("Physics IS NULL OR Chemistry IS NULL")
print("Rows with null values in Physics or Chemistry Columns are:")
dfs.show()
print("Count of the rows with null values in Physics or Chemistry column is:")
print(dfs.count())
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows with null values in Physics or Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Golu| 74| null| 96| 76| 64|
| Tom| null| 94| 61| 65| 69|
| Clara| 93| null| 78| null| 71|
+-------+-------+---------+----+-------+----+
Count of the rows with null values in Physics or Chemistry column is:
5
Suggested reading: PySpark vs Pandas
Get the Number of Rows With Not Null Values in Multiple Columns
To count rows with no null values in multiple columns, we can use the conditional operators along with the isNotNull() method inside the filter() and where() methods. For this, we will use the following steps.
- We will first invoke the isNotNull() method on all the required columns.
- Next, we will enclose the masked columns returned by the isNotNull() method inside parentheses and combine them with conditional operators like AND (&) and OR (|) to create a conditional statement.
- After this, we will pass the conditional statement to the filter() method. The filter() method will return a dataframe containing the rows with not null values in the specified columns.
- Finally, we will use the count() method to get the count of rows with not null values.
After executing the above statements, we can count rows with not null values in multiple columns in a pyspark dataframe. You can observe this in the following example.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("count_null_example") \
    .getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter((dfs.Physics.isNotNull()) & (dfs.Chemistry.isNotNull()))
print("Rows with not null values in Physics and Chemistry Columns are:")
dfs.show()
print("Count of the rows with not null values in Physics and Chemistry column is:")
print(dfs.count())
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows with not null values in Physics and Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Harry| 68| 92| 69| 66| 98|
| Joel| 99| 79|null| null| 61|
| Harsh| 98| 99| 93| 95| 91|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Count of the rows with not null values in Physics and Chemistry column is:
5
Instead of the filter() method, you can also use the where() method in the previous example as shown below.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("count_null_example") \
    .getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.where((dfs.Physics.isNotNull()) & (dfs.Chemistry.isNotNull()))
print("Rows with not null values in Physics and Chemistry Columns are:")
dfs.show()
print("Count of the rows with not null values in Physics and Chemistry column is:")
print(dfs.count())
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows with not null values in Physics and Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Harry| 68| 92| 69| 66| 98|
| Joel| 99| 79|null| null| 61|
| Harsh| 98| 99| 93| 95| 91|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Count of the rows with not null values in Physics and Chemistry column is:
5
We can also use multiple conditional statements inside the SQL WHERE clause with the COUNT() function to get the number of rows with not null values in multiple columns in a pyspark dataframe as shown below.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("count_null_example") \
    .getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs.createOrReplaceTempView("df_sql")
new_df=spark.sql("SELECT COUNT(*) FROM df_sql WHERE Physics IS NOT NULL AND Chemistry IS NOT NULL")
print("Count of the rows with not null values in Physics or Chemistry column is:")
new_df.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Count of the rows with not null values in Physics and Chemistry column is:
+--------+
|count(1)|
+--------+
| 5|
+--------+
Instead of using the SQL statement, we can also pass the conditional expression from the WHERE clause to the filter() method as shown below.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("count_null_example") \
    .getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter("Physics IS NOT NULL AND Chemistry IS NOT NULL")
print("Rows with not null values in Physics and Chemistry Columns are:")
dfs.show()
print("Count of the rows with not null values in Physics or Chemistry column is:")
print(dfs.count())
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows with not null values in Physics and Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Harry| 68| 92| 69| 66| 98|
| Joel| 99| 79|null| null| 61|
| Harsh| 98| 99| 93| 95| 91|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Count of the rows with not null values in Physics and Chemistry column is:
5
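Before wrapping up, note that if you need the null count for every column at once, you can compute all of them in a single pass instead of filtering once per column. A minimal sketch, assuming the same dfs dataframe as in the examples above:
from pyspark.sql.functions import col, count, when
# Null count per column in one aggregation; when(col(c).isNull(), 1)
# yields 1 for null entries and null otherwise, and count() skips nulls.
null_counts = dfs.select(
    [count(when(col(c).isNull(), 1)).alias(c) for c in dfs.columns]
)
null_counts.show()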
Conclusion
In this article, we discussed different ways to count the number of rows with null values in a pyspark dataframe. To learn more about pyspark dataframes, you can read this article on how to sort a pyspark dataframe. You might also like this article on how to select distinct rows from a pyspark dataframe.
I hope you enjoyed reading this article. Stay tuned for more informative articles.
Happy Learning!