Missing values in tabular data are a common problem. When we load tabular data with missing values into a pyspark dataframe, the empty values are replaced with null values. In this article, we will discuss how to select rows with null values in a given pyspark dataframe.
The isNull() Method in PySpark
The isNull() method is used to check for null values in a pyspark dataframe column. When we invoke the isNull() method on a dataframe column, it returns a masked column containing True and False values. Here, the values in the mask are set to True at the positions where no values are present. Otherwise, the values in the mask are set to False. The length of the mask column is equal to the number of rows in the pyspark dataframe. We can use the isNull() method with the filter() or where() method to filter rows with null values from a pyspark dataframe.
The isNotNull() Method in PySpark
The isNotNull() method is the negation of the isNull() method. It is used to check for not null values in pyspark. If we invoke the isNotNull() method on a dataframe column, it also returns a mask containing True and False values. Here, the values in the mask are set to False at the positions where no values are present. Otherwise, the values in the mask are set to True. Again, the length of the mask column is equal to the number of rows in the pyspark dataframe. We can use the isNotNull() method with the filter() or where() method to select rows with not null values from a pyspark dataframe.
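For instance, the following minimal sketch (it assumes a dataframe named dfs that has a Physics column, like the StudentMarks.csv data used in the examples below) selects the raw boolean masks so you can see what isNull() and isNotNull() return before any filtering is done.

# A minimal sketch: inspect the boolean masks produced by isNull() and isNotNull().
# Assumes `dfs` is a pyspark dataframe with a "Physics" column.
dfs.select(
    dfs.Physics,
    dfs.Physics.isNull().alias("physics_is_null"),        # True where Physics is null
    dfs.Physics.isNotNull().alias("physics_is_not_null")  # True where Physics has a value
).show()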
Select Rows With Null Values in a Column in PySpark DataFrame
To select rows with null values in a column in a pyspark dataframe, we can use the following approaches.
- Using the filter() method and the isNull() method
- Using the where() method and the isNull() method
- Using the SQL IS NULL statement
Let us discuss all these approaches one by one.
Select Rows With Null Values Using The filter() Method
To filter rows with null values in a particular column in a pyspark dataframe, we will first invoke the isNull()
method on the given column. The isNull()
method will return a masked column having True and False values. We will pass the mask column object returned by the isNull()
method to the filter()
method. After this, we will get the rows with null values in the given column.
You can observe this in the following example.
import pyspark.sql as ps

# Create a local SparkSession for this example.
spark = ps.SparkSession.builder \
        .master("local[*]") \
        .appName("selectnull_example") \
        .getOrCreate()

# Read the CSV file; empty cells are loaded as null values.
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()

# Keep only the rows where the Physics column contains null.
dfs=dfs.filter(dfs.Physics.isNull())
print("Rows at which there are null values in Physics Column:")
dfs.show()

spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows at which there are null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Sam| null| 62| 75| 95| 63|
| Tom| null| 94| 61| 65| 69|
+-------+-------+---------+----+-------+----+
In the above example, we first read a csv file into a pyspark dataframe. Then, we used the isNull() method and the filter() method to filter rows with null values in the Physics column. In the output, you can observe that the final dataframe contains only those rows in which the Physics column is null.
Filter Rows With Null Values using the where() Method
Instead of the filter()
method, you can also use the where()
method to filter rows with null values in a given column. The where()
method also takes the mask returned by the isNull()
method and returns a dataframe containing the rows where the mask column contains True.
You can observe this in the following example.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("selectnull_example") \
.getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.where(dfs.Physics.isNull())
print("Rows at which there are null values in Physics Column:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows at which there are null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Sam| null| 62| 75| 95| 63|
| Tom| null| 94| 61| 65| 69|
+-------+-------+---------+----+-------+----+
In this example, we have used the where()
method instead of the filter()
method. However, you can observe that the output remains the same.
Instead of the syntax used in the above examples, you can also use the col() function with the isNull() method to create the mask containing True and False values. The col() function is defined in the pyspark.sql.functions module. It takes a column name as its input argument and returns a Column object that refers to the given column of the pyspark dataframe. We can invoke the isNull() method on this column object to select rows with null values from the pyspark dataframe as shown below.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("selectnull_example") \
.getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.where(col("Physics").isNull())
print("Rows at which there are null values in Physics Column:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows at which there are null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Sam| null| 62| 75| 95| 63|
| Tom| null| 94| 61| 65| 69|
+-------+-------+---------+----+-------+----+
In this example, we have used the col()
function to access the Physics
column of the dataframe instead of using the name of the dataframe.
Select Rows With Null Values Using SQL syntax
We can also use spark SQL to filter rows with null values from a pyspark dataframe. For this, we will first create a view of the input dataframe using the createOrReplaceTempView() method. The createOrReplaceTempView() method, when invoked on a pyspark dataframe, takes the name of the view as its input argument and creates a temporary view of the dataframe.
After getting the view of the dataframe, we can use the SQL SELECT statement with the IS NULL clause to select rows with null values in a given column in the pyspark dataframe. For this, we can execute the SQL query using the sql() function as shown below.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("selectnull_example") \
.getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs.createOrReplaceTempView("df_sql")
new_df=spark.sql("SELECT * FROM df_sql WHERE Physics IS NULL")
print("Rows at which there are null values in Physics Column:")
new_df.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows at which there are null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Sam| null| 62| 75| 95| 63|
| Tom| null| 94| 61| 65| 69|
+-------+-------+---------+----+-------+----+
In the above example, we first created a view of the input dataframe. This is because we cannot execute SQL statements on a pyspark dataframe. Then, we used the sql()
function to execute the SQL query for retrieving rows in which the Physics
column is Null. The sql()
function takes the SQL query as its input and returns the output dataframe.
Instead of creating the view of the dataframe and executing the sql() function, you can directly pass the condition from the WHERE clause of the SQL statement to the filter() method as shown below.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("selectnull_example") \
.getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter("Physics IS NULL")
print("Rows at which there are null values in Physics Column:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows at which there are null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Sam| null| 62| 75| 95| 63|
| Tom| null| 94| 61| 65| 69|
+-------+-------+---------+----+-------+----+
In this example, we have passed the "Physics IS NULL" string to the filter() method. Hence, the filter() method treats the string as an SQL WHERE clause condition and returns the output dataframe.
Select Rows With Not Null Values in a Column
We can use the approaches used for selecting rows with null values to filter rows with not null values. The only difference is that we need to use the isNotNull()
method in place of the isNull()
method.
Filter Rows With Not Null Values Using The filter() Method
To select rows with not null values in a particular column in a pyspark dataframe, we will first invoke the isNotNull()
method on the given column. The isNotNull()
method will return a masked column containing True and False values. Next, we will pass the mask column object returned by the isNotNull()
method to the filter()
method. After this, we will get the rows with not null values in the given column.
You can observe this in the following example.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("selectnull_example") \
.getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter(dfs.Physics.isNotNull())
print("Rows at which there are not null values in Physics Column:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows at which there are not null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
In this example, we used the isNotNull()
method with the filter()
method to select rows with not null values from the pyspark dataframe.
Select Rows With Not Null Values Using the where() Method
Instead of the filter() method, you can also use the where()
method to filter rows with not null values in a given column. The where()
method also takes the mask returned by the isNotNull()
method and returns a dataframe containing the rows where the mask column contains True.
You can observe this in the following example.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("selectnull_example") \
.getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.where(dfs.Physics.isNotNull())
print("Rows at which there are not null values in Physics Column:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows at which there are not null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
In this example, we used the where()
method instead of the filter()
method. However, the output remains the same.
Instead of the syntax used in the above examples, you can also use the col()
function with the isNotNull()
method to filter rows with not null values in a pyspark dataframe as shown below.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("selectnull_example") \
.getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.where(col("Physics").isNotNull())
print("Rows at which there are not null values in Physics Column:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows at which there are not null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Select Rows With Not Null Values Using the dropna() Method
We can also use the dropna()
method to select rows with not null values in a single column from a pyspark dataframe. For this, we can invoke the dropna()
method on the pyspark dataframe and pass the column name as input to the subset
parameter. After execution of the dropna()
method, we will get rows with not null values in the specified column.
You can observe this in the following example.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("selectnull_example") \
.getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.dropna(subset="Physics")
print("Rows at which there are not null values in Physics Column:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows at which there are not null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
In this example, we invoked the dropna()
method on the input pyspark dataframe. Here, we have passed the column name Physics
as input to the subset
parameter in the dropna()
method. Due to this, all the rows in which the Physics
column contains null values are dropped from the dataframe. Thus, we get the required rows with no null values in the Physics
column in the dataframe.
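Note that dropna() can also accept a list of column names for the subset parameter. As a minimal sketch (assuming the same dfs dataframe loaded from StudentMarks.csv), the following keeps only the rows in which neither the Physics nor the Chemistry column is null, because the default how="any" drops a row if any of the listed columns contains a null value.

# A minimal sketch, assuming `dfs` was loaded from StudentMarks.csv as above.
# With a list passed to `subset`, a row is dropped if ANY of the listed columns
# is null (the default how="any"), so only rows with values in both columns remain.
dfs_not_null = dfs.dropna(how="any", subset=["Physics", "Chemistry"])
dfs_not_null.show()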
Filter Rows With Not Null Values using SQL From a PySpark DataFrame
We can also use pyspark SQL syntax to filter rows with no null values in a column from a pyspark dataframe. For this, we will use the following steps.
- First, we will create a view of the input dataframe using the createOrReplaceTempView() method. The createOrReplaceTempView() method, when invoked on a pyspark dataframe, takes the name of the view as its input argument and creates a temporary view of the dataframe.
- After getting the view of the dataframe, we can use the SQL SELECT statement with the IS NOT NULL clause to select rows with not null values in the given column in the pyspark dataframe. For this, we can execute the SQL query using the sql() function.
You can observe this in the following example.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("selectnull_example") \
.getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs.createOrReplaceTempView("df_sql")
new_df=spark.sql("SELECT * FROM df_sql WHERE Physics IS NOT NULL")
print("Rows at which there are not null values in Physics Column:")
new_df.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows at which there are not null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
In this example, we have used the sql() function to execute the SQL statement for selecting rows with not null values in the Physics column.
Instead of using the above approach, you can directly pass the condition from the WHERE clause of the SQL statement to the filter() method as shown below.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("selectnull_example") \
.getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter("Physics IS NOT NULL")
print("Rows at which there are not null values in Physics Column:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows at which there are not null values in Physics Column:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
In this example, we passed the string "Physics IS NOT NULL"
to the filter()
method instead of executing the entire SQL query. However, the output remains the same.
Select Rows with Null Values in Multiple columns in a DataFrame
To select rows with Null values in Multiple columns, we can use the conditional operators along with the isNull()
method inside the filter()
and where()
method. For this, we will use the following steps.
- We will first invoke the isNull() method on all the required columns.
- Next, we will enclose the masked columns returned by the isNull() method inside parentheses and combine them with conditional operators like AND (&) and OR (|) to create a conditional statement.
- Finally, we will pass the conditional statement to the filter() method.
After executing the above statements, we can select rows with null values in multiple columns in a pyspark dataframe. You can observe this in the following example.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("selectnull_example") \
.getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter((dfs.Physics.isNull())& (dfs.Chemistry.isNull()))
print("Rows with null values in Physics and Chemistry Columns are:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows with null values in Physics and Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
+-------+-------+---------+----+-------+----+
In this example, we have passed the conditions (dfs.Physics.isNull()) and (dfs.Chemistry.isNull()), combined with the & operator, to the filter() method to filter rows with null values in both of these columns. Observe that the output of each isNull() call is enclosed inside parentheses so that the & operator executes successfully. If we don't use parentheses here, the program will run into an error due to operator precedence.
In the output, you can observe that we get an empty dataframe because there are no rows in which both the Physics and Chemistry columns are null.
In a similar manner, you can use the OR conditional operator (|) to filter rows in which the Physics
column or Chemistry
column contains null values as shown below.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("selectnull_example") \
.getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter((dfs.Physics.isNull())|(dfs.Chemistry.isNull()))
print("Rows with null values in Physics or Chemistry Columns are:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows with null values in Physics or Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Golu| 74| null| 96| 76| 64|
| Tom| null| 94| 61| 65| 69|
| Clara| 93| null| 78| null| 71|
+-------+-------+---------+----+-------+----+
In the above example, we have used the filter()
method with the conditions to filter rows with null values in Physics
or Chemistry
column from the pyspark dataframe.
Instead of the filter()
method, you can also use the where()
method in the previous example as shown below.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("selectnull_example") \
.getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.where((dfs.Physics.isNull())|(dfs.Chemistry.isNull()))
print("Rows with null values in Physics or Chemistry Columns are:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows with null values in Physics or Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Golu| 74| null| 96| 76| 64|
| Tom| null| 94| 61| 65| 69|
| Clara| 93| null| 78| null| 71|
+-------+-------+---------+----+-------+----+
We can also use multiple conditional statements inside SQL syntax to select rows with null values in multiple columns in a pyspark dataframe. For this, you can use the IS NULL clause with conditional operators in the WHERE clause of the SQL statement as shown below.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("selectnull_example") \
.getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs.createOrReplaceTempView("df_sql")
new_df=spark.sql("SELECT * FROM df_sql WHERE Physics IS NULL or Chemistry IS NULL")
print("Rows with null values in Physics or Chemistry Columns are:")
new_df.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows with null values in Physics or Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Golu| 74| null| 96| 76| 64|
| Tom| null| 94| 61| 65| 69|
| Clara| 93| null| 78| null| 71|
+-------+-------+---------+----+-------+----+
Instead of using the sql statement, we can also pass the conditional statement in the WHERE clause to the filter()
method as shown below.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("selectnull_example") \
.getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter("Physics IS NULL OR Chemistry IS NULL")
print("Rows with null values in Physics or Chemistry Columns are:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows with null values in Physics or Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Golu| 74| null| 96| 76| 64|
| Tom| null| 94| 61| 65| 69|
| Clara| 93| null| 78| null| 71|
+-------+-------+---------+----+-------+----+
Suggested reading: PySpark vs Pandas
Select Rows with Not Null Values in Multiple Columns
To select rows with not Null values in Multiple columns, we can use the conditional operators along with the isNotNull()
method inside the filter()
and where()
method. For this, we will use the following steps.
- We will first invoke the isNotNull() method on all the required columns.
- Next, we will enclose the masked columns returned by the isNotNull() method inside parentheses and combine them with conditional operators like AND (&) and OR (|) to create a conditional statement.
- Finally, we will pass the conditional statement to the filter() method.
After executing the above statements, we can select rows with not null values in multiple columns in a pyspark dataframe. You can observe this in the following example.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("selectnull_example") \
.getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter((dfs.Physics.isNotNull())& (dfs.Chemistry.isNotNull()))
print("Rows with not null values in Physics and Chemistry Columns are:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows with not null values in Physics and Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Harry| 68| 92| 69| 66| 98|
| Joel| 99| 79|null| null| 61|
| Harsh| 98| 99| 93| 95| 91|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Instead of the filter()
method, you can also use the where()
method in the previous example as shown below.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("selectnull_example") \
.getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.where((dfs.Physics.isNotNull())& (dfs.Chemistry.isNotNull()))
print("Rows with not null values in Physics and Chemistry Columns are:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows with not null values in Physics and Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Harry| 68| 92| 69| 66| 98|
| Joel| 99| 79|null| null| 61|
| Harsh| 98| 99| 93| 95| 91|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
We can also use multiple conditional statements inside the SQL WHERE clause to select rows with not null values in multiple columns in a pyspark dataframe as shown below.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("selectnull_example") \
.getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs.createOrReplaceTempView("df_sql")
new_df=spark.sql("SELECT * FROM df_sql WHERE Physics IS NOT NULL AND Chemistry IS NOT NULL")
print("Rows with not null values in Physics and Chemistry Columns are:")
new_df.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows with not null values in Physics and Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Harry| 68| 92| 69| 66| 98|
| Joel| 99| 79|null| null| 61|
| Harsh| 98| 99| 93| 95| 91|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Instead of using the sql statement, we can also pass the conditional statement in the WHERE clause to the filter()
method as shown below.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("selectnull_example") \
.getOrCreate()
dfs=spark.read.csv("StudentMarks.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.filter("Physics IS NOT NULL AND Chemistry IS NOT NULL")
print("Rows with notn null values in Physics and Chemistry Columns are:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Chris| 95| null| 79| 71| 93|
| Sam| null| 62| 75| 95| 63|
| Harry| 68| 92| 69| 66| 98|
| Golu| 74| null| 96| 76| 64|
| Joel| 99| 79|null| null| 61|
| Tom| null| 94| 61| 65| 69|
| Harsh| 98| 99| 93| 95| 91|
| Clara| 93| null| 78| null| 71|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Rows with not null values in Physics and Chemistry Columns are:
+-------+-------+---------+----+-------+----+
|Student|Physics|Chemistry|Math|Biology|Arts|
+-------+-------+---------+----+-------+----+
| Aditya| 92| 76| 95| 73| 91|
| Harry| 68| 92| 69| 66| 98|
| Joel| 99| 79|null| null| 61|
| Harsh| 98| 99| 93| 95| 91|
| Tina| 99| 76| 78| 94| 95|
+-------+-------+---------+----+-------+----+
Conclusion
In this article, we discussed different ways to select rows with null values in a pyspark dataframe. To learn more about pyspark dataframes, you can read this article on how to sort a pyspark dataframe. You might also like this article on how to select distinct rows from a pyspark dataframe.
I hope you enjoyed reading this article. Stay tuned for more informative articles.
Happy Learning!