While working with pyspark dataframes, we often need to order the rows according to one or multiple columns. In this article, we will discuss different ways to orderby a pyspark dataframe using the orderBy() method.
The pyspark orderBy() Method
The orderBy() method in pyspark is used to order the rows of a dataframe by one or multiple columns. It has the following syntax.
df.orderBy(*column_names, ascending=True)
Here,
- The parameter *column_names represents one or multiple columns by which we need to order the pyspark dataframe.
- The ascending parameter specifies whether we want to order the dataframe in ascending or descending order by the given column names. If you want to sort the dataframe by multiple columns, you can also pass a list of True and False values to specify the sort order for each column, as shown in the sketch below.
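For instance, assuming df is an existing pyspark dataframe with columns named 'Maths' and 'Physics' (hypothetical names chosen to match the sample data used later in this article), the calls might look like this:
# df is assumed to be an existing pyspark dataframe.
df.orderBy('Physics')  # single column, ascending by default
df.orderBy('Physics', ascending=False)  # single column, descending
df.orderBy('Maths', 'Physics', ascending=[True, False])  # Maths ascending, Physics descending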
Orderby PySpark DataFrame By Column Name
To orderby a pyspark dataframe by a given column name, we can use the orderBy() method as shown in the following example.
import pyspark.sql as ps

# Create a SparkSession running on the local machine.
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("orderby_example") \
    .getOrCreate()

# Read the sample csv file into a pyspark dataframe.
dfs = spark.read.csv("sample_csv_file.csv", header=True, inferSchema=True)
print("The input dataframe is:")
dfs.show()

# Order the rows by the Physics column (ascending by default).
dfs = dfs.orderBy('Physics')
print("The dataframe ordered by Physics column is:")
dfs.show()

spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| null| 85| 82|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Joel| 45| 75| 87|
| Agatha| 77| 76| 93|
| Sam| 99| 62| 95|
| Aditya| 65| 89| 71|
+-------+-----+-------+---------+
The dataframe ordered by Physics column is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
|Katrina| 49| 47| 83|
| Sam| 99| 62| 95|
| Joel| 45| 75| 87|
| Joel| 45| 75| 87|
| Agatha| 77| 76| 93|
| Chris| null| 85| 82|
| Aditya| 45| 89| 71|
| Aditya| 65| 89| 71|
+-------+-----+-------+---------+
In this example, we first created a SparkSession on our local machine. Then, we read a csv to create a pyspark dataframe. Next, we used the orderBy() method to order the dataframe by the 'Physics' column. In the output dataframe, you can observe that the rows are ordered in ascending order by the Physics column.
Instead of the above approach, you can also use the col() function to orderby the pyspark dataframe. The col() function is defined in the pyspark.sql.functions module. It takes a column name as its input argument and returns a column object. We can pass the column object to the orderBy() method to get the pyspark dataframe ordered by a given column. You can observe this in the following example.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("orderby_example") \
.getOrCreate()
dfs=spark.read.csv("sample_csv_file.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.orderBy(col('Physics'))
print("The dataframe ordered by Physics column is:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| null| 85| 82|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Joel| 45| 75| 87|
| Agatha| 77| 76| 93|
| Sam| 99| 62| 95|
| Aditya| 65| 89| 71|
+-------+-----+-------+---------+
The dataframe ordered by Physics column is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
|Katrina| 49| 47| 83|
| Sam| 99| 62| 95|
| Joel| 45| 75| 87|
| Joel| 45| 75| 87|
| Agatha| 77| 76| 93|
| Chris| null| 85| 82|
| Aditya| 45| 89| 71|
| Aditya| 65| 89| 71|
+-------+-----+-------+---------+
In this example, we used the col() function inside the orderBy() method to order the pyspark dataframe by the Physics column.
Pyspark Orderby DataFrame in Descending Order
To order a pyspark dataframe by a column in descending order, you can set the ascending parameter to False in the orderBy() method as shown below.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("orderby_example") \
.getOrCreate()
dfs=spark.read.csv("sample_csv_file.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.orderBy('Physics',ascending=False)
print("The dataframe ordered by Physics column in descending order is:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| null| 85| 82|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Joel| 45| 75| 87|
| Agatha| 77| 76| 93|
| Sam| 99| 62| 95|
| Aditya| 65| 89| 71|
+-------+-----+-------+---------+
The dataframe ordered by Physics column in descending order is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Aditya| 65| 89| 71|
| Chris| null| 85| 82|
| Agatha| 77| 76| 93|
| Joel| 45| 75| 87|
| Joel| 45| 75| 87|
| Sam| 99| 62| 95|
|Katrina| 49| 47| 83|
+-------+-----+-------+---------+
In the above example, we have set the ascending parameter to False in the orderBy() method. Hence, the output dataframe is ordered by the Physics column in descending order.
If you are using the col() function to order the pyspark dataframe, you can instead use the desc() method on the column object. When we invoke the desc() method on the column obtained using the col() function, the orderBy() method sorts the pyspark dataframe in descending order. You can observe this in the following example.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("orderby_example") \
.getOrCreate()
dfs=spark.read.csv("sample_csv_file.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.orderBy(col('Physics').desc())
print("The dataframe ordered by Physics column in descending order is:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| null| 85| 82|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Joel| 45| 75| 87|
| Agatha| 77| 76| 93|
| Sam| 99| 62| 95|
| Aditya| 65| 89| 71|
+-------+-----+-------+---------+
The dataframe ordered by Physics column in descending order is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Aditya| 65| 89| 71|
| Chris| null| 85| 82|
| Agatha| 77| 76| 93|
| Joel| 45| 75| 87|
| Joel| 45| 75| 87|
| Sam| 99| 62| 95|
|Katrina| 49| 47| 83|
+-------+-----+-------+---------+
Order PySpark DataFrame by Multiple Columns
To orderby a pyspark dataframe by multiple columns, you can pass all the column names to the orderBy() method as shown below.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("orderby_example") \
.getOrCreate()
dfs=spark.read.csv("sample_csv_file.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.orderBy('Maths','Physics')
print("The dataframe ordered by Maths and Physics column is:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| null| 85| 82|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Joel| 45| 75| 87|
| Agatha| 77| 76| 93|
| Sam| 99| 62| 95|
| Aditya| 65| 89| 71|
+-------+-----+-------+---------+
The dataframe ordered by Maths and Physics column is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Chris| null| 85| 82|
| Joel| 45| 75| 87|
| Joel| 45| 75| 87|
| Aditya| 45| 89| 71|
|Katrina| 49| 47| 83|
| Aditya| 65| 89| 71|
| Agatha| 77| 76| 93|
| Sam| 99| 62| 95|
+-------+-----+-------+---------+
In the above example, we passed the column names 'Maths' and 'Physics' to the orderBy() method. Hence, the output dataframe is first sorted by the Maths column. For the rows in which the Maths column has the same value, the order is decided using the Physics column.
By default, the orderBy() method sorts the pyspark dataframe in ascending order by all the given columns. To sort the dataframe in descending order by all the columns using the orderBy() method, you can set the ascending parameter to False as shown below.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("orderby_example") \
.getOrCreate()
dfs=spark.read.csv("sample_csv_file.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.orderBy('Maths','Physics',ascending=False)
print("The dataframe ordered by Maths and Physics column in descending order is:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| null| 85| 82|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Joel| 45| 75| 87|
| Agatha| 77| 76| 93|
| Sam| 99| 62| 95|
| Aditya| 65| 89| 71|
+-------+-----+-------+---------+
The dataframe ordered by Maths and Physics column in descending order is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Sam| 99| 62| 95|
| Agatha| 77| 76| 93|
| Aditya| 65| 89| 71|
|Katrina| 49| 47| 83|
| Aditya| 45| 89| 71|
| Joel| 45| 75| 87|
| Joel| 45| 75| 87|
| Chris| null| 85| 82|
+-------+-----+-------+---------+
If you want to change the sorting order for each column, you can pass a list of True and False values to the ascending parameter in the orderBy() method. Here, the number of boolean values should be equal to the number of column names passed to the orderBy() method. Each value in the list corresponds to the column at the same position in the parameter list.
If we want to order the pyspark dataframe in ascending order by the ith column name passed to the orderBy() method, the ith element in the list passed to the ascending parameter should be True. Similarly, if we want to order the dataframe in descending order by the jth column name, the jth element in the list should be False.
You can observe this in the following example.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("orderby_example") \
.getOrCreate()
dfs=spark.read.csv("sample_csv_file.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.orderBy('Maths','Physics',ascending=[True, False])
print("The dataframe ordered by Maths and Physics column is:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| null| 85| 82|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Joel| 45| 75| 87|
| Agatha| 77| 76| 93|
| Sam| 99| 62| 95|
| Aditya| 65| 89| 71|
+-------+-----+-------+---------+
The dataframe ordered by Maths and Physics column is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Chris| null| 85| 82|
| Aditya| 45| 89| 71|
| Joel| 45| 75| 87|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Aditya| 65| 89| 71|
| Agatha| 77| 76| 93|
| Sam| 99| 62| 95|
+-------+-----+-------+---------+
In the above example, we passed the list [True, False] to the ascending parameter in the orderBy() method. Hence, the output dataframe is sorted by the Maths column in ascending order. For the rows in which the Maths column has the same value, the rows are sorted in descending order by the Physics column.
Suppose you are using the col() function to orderby the pyspark dataframe. In that case, you can use the asc() and desc() methods on each column to sort the dataframe by that column in ascending or descending order respectively. You can observe this in the following example.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("orderby_example") \
.getOrCreate()
dfs=spark.read.csv("sample_csv_file.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.orderBy(col('Maths').asc(),col('Physics').desc())
print("The dataframe ordered by Maths and Physics column is:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| null| 85| 82|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Joel| 45| 75| 87|
| Agatha| 77| 76| 93|
| Sam| 99| 62| 95|
| Aditya| 65| 89| 71|
+-------+-----+-------+---------+
The dataframe ordered by Maths and Physics column is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Chris| null| 85| 82|
| Aditya| 45| 89| 71|
| Joel| 45| 75| 87|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Aditya| 65| 89| 71|
| Agatha| 77| 76| 93|
| Sam| 99| 62| 95|
+-------+-----+-------+---------+
In this example, we have invoked the asc() method on the Maths column and the desc() method on the Physics column. Hence, the output dataframe is sorted by the Maths column in ascending order. For the rows in which the Maths column has the same value, the rows are sorted in descending order by the Physics column.
Orderby PySpark DataFrame Nulls First
If there are null values present in the column by which we want to orderby a pyspark dataframe, the rows containing them are placed at the top of the ordered dataframe by default when we sort in ascending order.
You can observe this in the following example.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("orderby_example") \
.getOrCreate()
dfs=spark.read.csv("sample_csv_file.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.orderBy(col('Maths'))
print("The ordered dataframe is:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| null| 85| 82|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Joel| 45| 75| 87|
| Agatha| 77| 76| 93|
| Sam| null| 62| 95|
| Aditya| 65| 89| 71|
+-------+-----+-------+---------+
The ordered dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Chris| null| 85| 82|
| Sam| null| 62| 95|
| Joel| 45| 75| 87|
| Joel| 45| 75| 87|
| Aditya| 45| 89| 71|
|Katrina| 49| 47| 83|
| Aditya| 65| 89| 71|
| Agatha| 77| 76| 93|
+-------+-----+-------+---------+
In this example, the input dataframe contains two rows with null values in the Maths column. Hence, when we sort the dataframe by the Maths column in ascending order, the rows with null values are kept at the top of the output dataframe by default.
When we sort a pyspark dataframe by a column with null values in descending order, the rows with null values are placed at the bottom. You can observe this in the following example.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("orderby_example") \
.getOrCreate()
dfs=spark.read.csv("sample_csv_file.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.orderBy(col('Maths').desc())
print("The ordered dataframe is:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| null| 85| 82|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Joel| 45| 75| 87|
| Agatha| 77| 76| 93|
| Sam| null| 62| 95|
| Aditya| 65| 89| 71|
+-------+-----+-------+---------+
The ordered dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Agatha| 77| 76| 93|
| Aditya| 65| 89| 71|
|Katrina| 49| 47| 83|
| Aditya| 45| 89| 71|
| Joel| 45| 75| 87|
| Joel| 45| 75| 87|
| Chris| null| 85| 82|
| Sam| null| 62| 95|
+-------+-----+-------+---------+
To put the rows containing the null values first in the ordered dataframe, we can use the desc_nulls_first() method on the columns passed to the orderBy() method. After this, the dataframe will be ordered in descending order with the rows containing null values at the top. You can observe this in the following example.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("orderby_example") \
.getOrCreate()
dfs=spark.read.csv("sample_csv_file.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.orderBy(col('Maths').desc_nulls_first())
print("The ordered dataframe is:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| null| 85| 82|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Joel| 45| 75| 87|
| Agatha| 77| 76| 93|
| Sam| null| 62| 95|
| Aditya| 65| 89| 71|
+-------+-----+-------+---------+
The ordered dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Chris| null| 85| 82|
| Sam| null| 62| 95|
| Agatha| 77| 76| 93|
| Aditya| 65| 89| 71|
|Katrina| 49| 47| 83|
| Aditya| 45| 89| 71|
| Joel| 45| 75| 87|
| Joel| 45| 75| 87|
+-------+-----+-------+---------+
Here, we have used the desc_nulls_first() method on the Maths column. Hence, even though the dataframe is sorted in descending order, the rows with null values are kept at the top of the output dataframe.
If you want to sort the pyspark dataframe in ascending order and put the rows containing nulls at the top of the dataframe, you can use the asc_nulls_first() method inside the orderBy() method. However, using the asc_nulls_first() method is redundant, as the rows with null values are put at the top of the ordered dataframe by default when we sort in ascending order.
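For reference, here is a minimal sketch of this redundant call, assuming the same sample_csv_file.csv used throughout this article:
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder.master("local[*]").appName("orderby_example").getOrCreate()
dfs = spark.read.csv("sample_csv_file.csv", header=True, inferSchema=True)
# asc_nulls_first() makes the nulls-first behavior explicit; plain orderBy(col('Maths')) gives the same result.
dfs.orderBy(col('Maths').asc_nulls_first()).show()
spark.sparkContext.stop()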
Orderby PySpark DataFrame Nulls Last
If there are null values present in the column by which we want to orderby a pyspark dataframe, the rows containing them are placed at the bottom of the ordered dataframe by default when we order it in descending order. You can observe this in the following example.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("orderby_example") \
.getOrCreate()
dfs=spark.read.csv("sample_csv_file.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.orderBy(col('Maths').desc())
print("The ordered dataframe is:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| null| 85| 82|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Joel| 45| 75| 87|
| Agatha| 77| 76| 93|
| Sam| null| 62| 95|
| Aditya| 65| 89| 71|
+-------+-----+-------+---------+
The ordered dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Agatha| 77| 76| 93|
| Aditya| 65| 89| 71|
|Katrina| 49| 47| 83|
| Aditya| 45| 89| 71|
| Joel| 45| 75| 87|
| Joel| 45| 75| 87|
| Chris| null| 85| 82|
| Sam| null| 62| 95|
+-------+-----+-------+---------+
When we sort a pyspark dataframe by a column with null values in ascending order, the rows with null values are placed at the top. You can observe this in the following example.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("orderby_example") \
.getOrCreate()
dfs=spark.read.csv("sample_csv_file.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.orderBy(col('Maths'))
print("The ordered dataframe is:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| null| 85| 82|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Joel| 45| 75| 87|
| Agatha| 77| 76| 93|
| Sam| null| 62| 95|
| Aditya| 65| 89| 71|
+-------+-----+-------+---------+
The ordered dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Chris| null| 85| 82|
| Sam| null| 62| 95|
| Joel| 45| 75| 87|
| Joel| 45| 75| 87|
| Aditya| 45| 89| 71|
|Katrina| 49| 47| 83|
| Aditya| 65| 89| 71|
| Agatha| 77| 76| 93|
+-------+-----+-------+---------+
To put the rows containing the null values last in the ordered dataframe while sorting in ascending order, we can use the asc_nulls_last() method on the columns passed to the orderBy() method. After this, the dataframe will be ordered in ascending order with the rows containing null values at the bottom. You can observe this in the following example.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
.master("local[*]") \
.appName("orderby_example") \
.getOrCreate()
dfs=spark.read.csv("sample_csv_file.csv",header=True,inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs=dfs.orderBy(col('Maths').asc_nulls_last())
print("The ordered dataframe is:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| null| 85| 82|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Joel| 45| 75| 87|
| Agatha| 77| 76| 93|
| Sam| null| 62| 95|
| Aditya| 65| 89| 71|
+-------+-----+-------+---------+
The ordered dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Joel| 45| 75| 87|
| Joel| 45| 75| 87|
|Katrina| 49| 47| 83|
| Aditya| 65| 89| 71|
| Agatha| 77| 76| 93|
| Chris| null| 85| 82|
| Sam| null| 62| 95|
+-------+-----+-------+---------+
Here, we have used the asc_nulls_last() method on the Maths column. Hence, even though the dataframe is sorted in ascending order, the rows with null values are kept at the bottom of the output dataframe.
If you want to sort the pyspark dataframe in descending order and put the rows containing nulls at the bottom of the dataframe, you can use the desc_nulls_last() method inside the orderBy() method. However, using the desc_nulls_last() method is redundant, as the rows with null values are put at the bottom of the ordered dataframe by default when we sort in descending order.
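For reference, here is a minimal sketch of this redundant call, assuming the same sample_csv_file.csv used throughout this article:
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder.master("local[*]").appName("orderby_example").getOrCreate()
dfs = spark.read.csv("sample_csv_file.csv", header=True, inferSchema=True)
# desc_nulls_last() makes the nulls-last behavior explicit; plain col('Maths').desc() gives the same result.
dfs.orderBy(col('Maths').desc_nulls_last()).show()
spark.sparkContext.stop()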
Conclusion
In this article, we discussed how to sort a pyspark dataframe using the orderBy() method. To learn more about Python programming, you can read this article on how to select rows with null values in a pyspark dataframe. You might also like this article on list of lists in Python.
I hope you enjoyed reading this article. Stay tuned for more informative articles.
Happy Learning!