Missing values in tabular data are a common problem. When we load tabular data with missing values into a pyspark dataframe, the empty values are replaced with null values. In this article, we will discuss how to select rows with null values in a given pyspark dataframe.
The isNull() Method in PySpark
The isNull() method is used to check for null values in a pyspark dataframe column. When we invoke the isNull() method on a dataframe column, it returns a masked column containing True and False values. Here, the values in the mask are set to True at the positions where no values are present. Otherwise, the values in the mask are set to False. The length of the mask column is equal to the number of rows in the pyspark dataframe. We can use the isNull() method with the filter() or where() method to filter rows with null values from a pyspark dataframe.
The isNotNull() Method in PySpark
The isNotNull() method is the negation of the isNull() method. It is used to check for not null values in pyspark. If we invoke the isNotNull() method on a dataframe column, it also returns a mask containing True and False values. Here, the values in the mask are set to False at the positions where no values are present. Otherwise, the values in the mask are set to True. Again, the length of the mask column is equal to the number of rows in the pyspark dataframe. We can use the isNotNull() method with the filter() or where() method to select rows with not null values from a pyspark dataframe.
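For instance, the following minimal sketch (it assumes a dataframe named dfs that has a Physics column, like the StudentMarks.csv data used in the examples below) selects the raw boolean masks so you can see what isNull() and isNotNull() return before any filtering is done.

# A minimal sketch: inspect the boolean masks produced by isNull() and isNotNull().
# Assumes `dfs` is a pyspark dataframe with a "Physics" column.
dfs.select(
    dfs.Physics,
    dfs.Physics.isNull().alias("physics_is_null"),        # True where Physics is null
    dfs.Physics.isNotNull().alias("physics_is_not_null")  # True where Physics has a value
).show()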
Select Rows With Null Values in a Column in PySpark DataFrame
To select rows with null values in a column in a pyspark dataframe, we can use the following approaches.
- Using the filter() method and the isNull() method
- Using the where() method and the isNull() method
- Using the SQL IS NULL statement
Let us discuss all these approaches one by one.
Select Rows With Null Values Using The filter() Method
To filter rows with null values in a particular column in a pyspark dataframe, we will first invoke the isNull()
method on the given column. The isNull()
method will return a masked column having True and False values. We will pass the mask column object returned by the isNull()
method to the filter()
method. After this, we will get the rows with null values in the given column.
You can observe this in the following example.
import pyspark.sql as ps

# Create a local SparkSession for this example.
spark = ps.SparkSession.builder \
        .master("local[*]") \
        .appName("selectnull_example") \
        .getOrCreate()

# Read the CSV file; empty cells are loaded as null values.
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()

# Keep only the rows where the Physics column contains null.
dfs=dfs.filter(dfs.Physics.isNull())
print("Rows at which there are null values in Physics Column:")
dfs.show()

spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows at which there are null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Sam| null| 62| 75| 95| 63|
| Tom| null| 94| 61| 65| 69|
+-------+-------+---------+----+-------+----+
In the above example, we first read a csv file into a pyspark dataframe. Then, we used the isNull() method and the filter() method to filter rows with null values in the Physics column. In the output, you can observe that the final dataframe contains only those rows in which the Physics column is null.
Filter Rows With Null Values using the where() Method
Instead of the filter()
method, you can also use the where()
method to filter rows with null values in a given column. The where()
method also takes the mask returned by the isNull()
method and returns a dataframe containing the rows where the mask column contains True.
You can observe this in the following example.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("selectnull_example") \
.getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.where(dfs.Physics.isNull())
print("Rows at which there are null values in Physics Column:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows at which there are null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Sam| null| 62| 75| 95| 63|
| Tom| null| 94| 61| 65| 69|
+-------+-------+---------+----+-------+----+
In this example, we have used the where()
method instead of the filter()
method. However, you can observe that the output remains the same.
Instead of the syntax used in the above examples, you can also use the col() function with the isNull() method to create the mask containing True and False values. The col() function is defined in the pyspark.sql.functions module. It takes a column name as its input argument and returns a Column object that refers to the given column of the pyspark dataframe. We can invoke the isNull() method on this column object to select rows with null values from the pyspark dataframe as shown below.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("selectnull_example") \
.getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.where(col("Physics").isNull())
print("Rows at which there are null values in Physics Column:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows at which there are null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Sam| null| 62| 75| 95| 63|
| Tom| null| 94| 61| 65| 69|
+-------+-------+---------+----+-------+----+
In this example, we have used the col()
function to access the Physics
column of the dataframe instead of using the name of the dataframe.
Select Rows With Null Values Using SQL syntax
We can also use spark SQL to filter rows with null values from a pyspark dataframe. For this, we will first create a view of the input dataframe using the createOrReplaceTempView() method. The createOrReplaceTempView() method, when invoked on a pyspark dataframe, takes the name of the view as its input argument and creates a temporary view of the dataframe.
After getting the view of the dataframe, we can use the SQL SELECT statement with the IS NULL clause to select rows with null values in a given column in the pyspark dataframe. For this, we can execute the SQL query using the sql() function as shown below.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("selectnull_example") \
.getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs.createOrReplaceTempView("df_sql")
new_df=spark.sql("SELECT * FROM df_sql WHERE Physics IS NULL")
print("Rows at which there are null values in Physics Column:")
new_df.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows at which there are null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Sam| null| 62| 75| 95| 63|
| Tom| null| 94| 61| 65| 69|
+-------+-------+---------+----+-------+----+
In the above example, we first created a view of the input dataframe. This is because we cannot execute SQL statements on a pyspark dataframe. Then, we used the sql()
function to execute the SQL query for retrieving rows in which the Physics
column is Null. The sql()
function takes the SQL query as its input and returns the output dataframe.
Instead of creating the view of the dataframe and executing the sql() function, you can directly pass the condition from the WHERE clause of the SQL statement to the filter() method as shown below.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("selectnull_example") \
.getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter("Physics IS NULL")
print("Rows at which there are null values in Physics Column:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows at which there are null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Sam| null| 62| 75| 95| 63|
| Tom| null| 94| 61| 65| 69|
+-------+-------+---------+----+-------+----+
In this example, we have passed the "Physics IS NULL" string to the filter() method. Hence, the filter() method treats the string as an SQL WHERE clause condition and returns the output dataframe.
Select Rows With Not Null Values in a Column
We can use the approaches used for selecting rows with null values to filter rows with not null values. The only difference is that we need to use the isNotNull()
method in place of the isNull()
method.
Filter Rows With Not Null Values Using The filter() Method
To select rows with not null values in a particular column in a pyspark dataframe, we will first invoke the isNotNull()
method on the given column. The isNotNull()
method will return a masked column containing True and False values. Next, we will pass the mask column object returned by the isNotNull()
method to the filter()
method. After this, we will get the rows with not null values in the given column.
You can observe this in the following example.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("selectnull_example") \
.getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter(dfs.Physics.isNotNull())
print("Rows at which there are not null values in Physics Column:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows at which there are not null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
In this example, we used the isNotNull()
method with the filter()
method to select rows with not null values from the pyspark dataframe.
Select Rows With Not Null Values Using the where() Method
Instead of the filter() method, you can also use the where()
method to filter rows with not null values in a given column. The where()
method also takes the mask returned by the isNotNull()
method and returns a dataframe containing the rows where the mask column contains True.
You can observe this in the following example.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("selectnull_example") \
.getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.where(dfs.Physics.isNotNull())
print("Rows at which there are not null values in Physics Column:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows at which there are not null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
In this example, we used the where()
method instead of the filter()
method. However, the output remains the same.
Instead of the syntax used in the above examples, you can also use the col()
function with the isNotNull()
method to filter rows with not null values in a pyspark dataframe as shown below.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("selectnull_example") \
.getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.where(col("Physics").isNotNull())
print("Rows at which there are not null values in Physics Column:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows at which there are not null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Select Rows With Not Null Values Using the dropna() Method
We can also use the dropna()
method to select rows with not null values in a single column from a pyspark dataframe. For this, we can invoke the dropna()
method on the pyspark dataframe and pass the column name as input to the subset
parameter. After execution of the dropna()
method, we will get rows with not null values in the specified column.
You can observe this in the following example.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("selectnull_example") \
.getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.dropna(subset="Physics")
print("Rows at which there are not null values in Physics Column:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows at which there are not null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
In this example, we invoked the dropna()
method on the input pyspark dataframe. Here, we have passed the column name Physics
as input to the subset
parameter in the dropna()
method. Due to this, all the rows in which the Physics
column contains null values are dropped from the dataframe. Thus, we get the required rows with no null values in the Physics
column in the dataframe.
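Note that dropna() can also accept a list of column names for the subset parameter. As a minimal sketch (assuming the same dfs dataframe loaded from StudentMarks.csv), the following keeps only the rows in which neither the Physics nor the Chemistry column is null, because the default how="any" drops a row if any of the listed columns contains a null value.

# A minimal sketch, assuming `dfs` was loaded from StudentMarks.csv as above.
# With a list passed to `subset`, a row is dropped if ANY of the listed columns
# is null (the default how="any"), so only rows with values in both columns remain.
dfs_not_null = dfs.dropna(how="any", subset=["Physics", "Chemistry"])
dfs_not_null.show()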
Filter Rows With Not Null Values using SQL From a PySpark DataFrame
We can also use pyspark SQL syntax to filter rows with no null values in a column from a pyspark dataframe. For this, we will use the following steps.
- First, we will create a view of the input dataframe using the createOrReplaceTempView() method. The createOrReplaceTempView() method, when invoked on a pyspark dataframe, takes the name of the view as its input argument and creates a temporary view of the dataframe.
- After getting the view of the dataframe, we can use the SQL SELECT statement with the IS NOT NULL clause to select rows with not null values in the given column in the pyspark dataframe. For this, we can execute the SQL query using the sql() function.
You can observe this in the following example.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("selectnull_example") \
.getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs.createOrReplaceTempView("df_sql")
new_df=spark.sql("SELECT * FROM df_sql WHERE Physics IS NOT NULL")
print("Rows at which there are not null values in Physics Column:")
new_df.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows at which there are not null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
In this example, we have used the sql() function to execute the SQL statement for selecting rows with not null values in the Physics column.
Instead of using the above approach, you can directly pass the condition from the WHERE clause of the SQL statement to the filter() method as shown below.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("selectnull_example") \
.getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter("Physics IS NOT NULL")
print("Rows at which there are not null values in Physics Column:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows at which there are not null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
In this example, we passed the string "Physics IS NOT NULL"
to the filter()
method instead of executing the entire SQL query. However, the output remains the same.
Select Rows with Null Values in Multiple columns in a DataFrame
To select rows with Null values in Multiple columns, we can use the conditional operators along with the isNull()
method inside the filter()
and where()
method. For this, we will use the following steps.
- We will first invoke the isNull() method on all the required columns.
- Next, we will enclose the masked columns returned by the isNull() method inside parentheses and combine them with conditional operators like AND (&) and OR (|) to create a conditional statement.
- Finally, we will pass the conditional statement to the filter() method.
After executing the above statements, we can select rows with null values in multiple columns in a pyspark dataframe. You can observe this in the following example.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("selectnull_example") \
.getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter((dfs.Physics.isNull())& (dfs.Chemistry.isNull()))
print("Rows with null values in Physics and Chemistry Columns are:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows with null values in Physics and Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
+-------+-------+---------+----+-------+----+
In this example, we have passed the conditions (dfs.Physics.isNull()) and (dfs.Chemistry.isNull()), combined with the & operator, to the filter() method to filter rows with null values in both of these columns. Observe that the output of each isNull() call is enclosed inside parentheses so that the & operator executes successfully. If we don't use parentheses here, the program will run into an error due to operator precedence.
In the output, you can observe that we get an empty dataframe because there are no rows in which both the Physics and Chemistry columns are null.
In a similar manner, you can use the OR conditional operator (|) to filter rows in which the Physics
column or Chemistry
column contains null values as shown below.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("selectnull_example") \
.getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter((dfs.Physics.isNull())|(dfs.Chemistry.isNull()))
print("Rows with null values in Physics or Chemistry Columns are:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows with null values in Physics or Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Golu| 74| null| 96| 76| 64|
| Tom| null| 94| 61| 65| 69|
| Clara| 93| null| 78| null| 71|
+-------+-------+---------+----+-------+----+
In the above example, we have used the filter()
method with the conditions to filter rows with null values in Physics
or Chemistry
column from the pyspark dataframe.
Instead of the filter()
method, you can also use the where()
method in the previous example as shown below.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("selectnull_example") \
.getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.where((dfs.Physics.isNull())|(dfs.Chemistry.isNull()))
print("Rows with null values in Physics or Chemistry Columns are:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows with null values in Physics or Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Golu| 74| null| 96| 76| 64|
| Tom| null| 94| 61| 65| 69|
| Clara| 93| null| 78| null| 71|
+-------+-------+---------+----+-------+----+
We can also use multiple conditional statements inside SQL syntax to select rows with null values in multiple columns in a pyspark dataframe. For this, you can use the IS NULL clause with conditional operators in the WHERE clause of the SQL statement as shown below.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("selectnull_example") \
.getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs.createOrReplaceTempView("df_sql")
new_df=spark.sql("SELECT * FROM df_sql WHERE Physics IS NULL or Chemistry IS NULL")
print("Rows with null values in Physics or Chemistry Columns are:")
new_df.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows with null values in Physics or Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Golu| 74| null| 96| 76| 64|
| Tom| null| 94| 61| 65| 69|
| Clara| 93| null| 78| null| 71|
+-------+-------+---------+----+-------+----+
Instead of using the sql statement, we can also pass the conditional statement in the WHERE clause to the filter()
method as shown below.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("selectnull_example") \
.getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter("Physics IS NULL OR Chemistry IS NULL")
print("Rows with null values in Physics or Chemistry Columns are:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows with null values in Physics or Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Golu| 74| null| 96| 76| 64|
| Tom| null| 94| 61| 65| 69|
| Clara| 93| null| 78| null| 71|
+-------+-------+---------+----+-------+----+
Suggested reading: PySpark vs Pandas
Select Rows with Not Null Values in Multiple Columns
To select rows with not Null values in Multiple columns, we can use the conditional operators along with the isNotNull()
method inside the filter()
and where()
method. For this, we will use the following steps.
- We will first invoke the isNotNull() method on all the required columns.
- Next, we will enclose the masked columns returned by the isNotNull() method inside parentheses and combine them with conditional operators like AND (&) and OR (|) to create a conditional statement.
- Finally, we will pass the conditional statement to the filter() method.
After executing the above statements, we can select rows with not null values in multiple columns in a pyspark dataframe. You can observe this in the following example.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("selectnull_example") \
.getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter((dfs.Physics.isNotNull())& (dfs.Chemistry.isNotNull()))
print("Rows with not null values in Physics and Chemistry Columns are:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows with not null values in Physics and Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Harry| 68| 92| 69| 66| 98|
| Joel| 99| 79|null| null| 61|
| Harsh| 98| 99| 93| 95| 91|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Instead of the filter()
method, you can also use the where()
method in the previous example as shown below.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("selectnull_example") \
.getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.where((dfs.Physics.isNotNull())& (dfs.Chemistry.isNotNull()))
print("Rows with not null values in Physics and Chemistry Columns are:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows with not null values in Physics and Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Harry| 68| 92| 69| 66| 98|
| Joel| 99| 79|null| null| 61|
| Harsh| 98| 99| 93| 95| 91|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
We can also use multiple conditional statements inside the SQL WHERE clause to select rows with not null values in multiple columns in a pyspark dataframe as shown below.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("selectnull_example") \
.getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs.createOrReplaceTempView("df_sql")
new_df=spark.sql("SELECT * FROM df_sql WHERE Physics IS NOT NULL AND Chemistry IS NOT NULL")
print("Rows with not null values in Physics and Chemistry Columns are:")
new_df.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows with not null values in Physics and Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Harry| 68| 92| 69| 66| 98|
| Joel| 99| 79|null| null| 61|
| Harsh| 98| 99| 93| 95| 91|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Instead of using the sql statement, we can also pass the conditional statement in the WHERE clause to the filter()
method as shown below.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("selectnull_example") \
.getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter("Physics IS NOT NULL AND Chemistry IS NOT NULL")
print("Rows with notn null values in Physics and Chemistry Columns are:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows with not null values in Physics and Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Harry| 68| 92| 69| 66| 98|
| Joel| 99| 79|null| null| 61|
| Harsh| 98| 99| 93| 95| 91|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Conclusion
In this article, we discussed different ways to select rows with null values in a pyspark dataframe. To learn more about pyspark dataframes, you can read this article on how to sort a pyspark dataframe. You might also like this article on how to select distinct rows from a pyspark dataframe.
I hope you enjoyed reading this article. Stay tuned for more informative articles.
Happy Learning!