While working with pyspark dataframes, we often need to sort them according to one or multiple columns. In this article, we will discuss different ways to sort a pyspark dataframe using the sort() method.
The PySpark sort() Method
The sort() method in pyspark is used to sort a dataframe by one or multiple columns. It has the following syntax.
df.sort(*columns, ascending=True)
Here,
- The *columns parameter represents one or more columns by which we want to sort the dataframe.
- The ascending parameter specifies whether to sort the dataframe in ascending or descending order. If you are sorting the dataframe by multiple columns, you can also pass a list of True and False values to specify the sort order for each column individually.
PySpark Sort DataFrame by Column Name
To sort a pyspark dataframe by column name, we can use the sort() method as shown in the following example.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("sort_example") \
    .getOrCreate()
dfs = spark.read.csv("sample_csv_file.csv", header=True, inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs = dfs.sort('Maths')
print("The sorted dataframe is:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| 86| 85| 82|
| Joel| 45| 85| 87|
|Katrina| 49| 47| 83|
| Agatha| 76| 89| 83|
| Sam| 76| 98| 95|
+-------+-----+-------+---------+
The sorted dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Joel| 45| 85| 87|
| Aditya| 45| 89| 71|
|Katrina| 49| 47| 83|
| Sam| 76| 98| 95|
| Agatha| 76| 89| 83|
| Chris| 86| 85| 82|
+-------+-----+-------+---------+
In this example, we first created a SparkSession on our local machine. Then, we read a CSV file to create a pyspark dataframe. Once we get the dataframe, we sort it by the 'Maths' column. In the output, you can observe that the rows of the dataframe are sorted in ascending order by the Maths column.
Instead of the above approach, you can also use the col() function to sort the pyspark dataframe as shown below.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("sort_example") \
    .getOrCreate()
dfs = spark.read.csv("sample_csv_file.csv", header=True, inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs = dfs.sort(col('Maths'))
print("The sorted dataframe is:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| 86| 85| 82|
| Joel| 45| 85| 87|
|Katrina| 49| 47| 83|
| Agatha| 76| 89| 83|
| Sam| 76| 98| 95|
+-------+-----+-------+---------+
The sorted dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Joel| 45| 85| 87|
| Aditya| 45| 89| 71|
|Katrina| 49| 47| 83|
| Sam| 76| 98| 95|
| Agatha| 76| 89| 83|
| Chris| 86| 85| 82|
+-------+-----+-------+---------+
The col() function is used to select a dataframe column as a column object. In this example, we have used the col() function to select the Maths column of the dataframe. Then, the dataframe is sorted by the selected column using the sort() method.
Sort DataFrame in Descending Order
To sort a pyspark dataframe by a column in descending order, you can set the ascending parameter to False in the sort() method as shown below.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("sort_example") \
    .getOrCreate()
dfs = spark.read.csv("sample_csv_file.csv", header=True, inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs = dfs.sort('Maths', ascending=False)
print("The sorted dataframe is:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| 86| 85| 82|
| Joel| 45| 85| 87|
|Katrina| 49| 47| 83|
| Agatha| 76| 89| 83|
| Sam| 76| 98| 95|
+-------+-----+-------+---------+
The sorted dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Chris| 86| 85| 82|
| Agatha| 76| 89| 83|
| Sam| 76| 98| 95|
|Katrina| 49| 47| 83|
| Aditya| 45| 89| 71|
| Joel| 45| 85| 87|
+-------+-----+-------+---------+
In this example, we have set the ascending parameter to False in the sort() method. Hence, the dataframe is sorted by the Maths column in descending order.
If you are using the col() function to sort the pyspark dataframe, you can use the desc() method on the column object. When we invoke the desc() method on the column obtained using the col() function, the sort() method sorts the pyspark dataframe in descending order. You can observe this in the following example.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("sort_example") \
    .getOrCreate()
dfs = spark.read.csv("sample_csv_file.csv", header=True, inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs = dfs.sort(col('Maths').desc())
print("The sorted dataframe is:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| 86| 85| 82|
| Joel| 45| 85| 87|
|Katrina| 49| 47| 83|
| Agatha| 76| 89| 83|
| Sam| 76| 98| 95|
+-------+-----+-------+---------+
The sorted dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Chris| 86| 85| 82|
| Agatha| 76| 89| 83|
| Sam| 76| 98| 95|
|Katrina| 49| 47| 83|
| Aditya| 45| 89| 71|
| Joel| 45| 85| 87|
+-------+-----+-------+---------+
PySpark Sort DataFrame by Multiple Columns
To sort a pyspark dataframe by multiple columns, you can pass all the column names to the sort() method in sequence. Here, the dataframe is first sorted by the first column name given to the sort() method. If two rows have the same values in the first column, they are sorted by the values in the next column. This process is repeated until the rows are sorted by all the given columns. You can observe this in the following example.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("sort_example") \
    .getOrCreate()
dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()
dfs = dfs.sort('Maths', 'Physics')
print("The sorted dataframe is:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| 86| 85| 82|
| Joel| 45| 85| 87|
|Katrina| 49| 47| 83|
| Agatha| 76| 89| 83|
| Sam| 76| 98| 95|
+-------+-----+-------+---------+
The sorted dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Joel| 45| 85| 87|
| Aditya| 45| 89| 71|
|Katrina| 49| 47| 83|
| Agatha| 76| 89| 83|
| Sam| 76| 98| 95|
| Chris| 86| 85| 82|
+-------+-----+-------+---------+
In this example, we have sorted the pyspark dataframe by the Maths and Physics columns. During execution, the sort() method first sorts the rows of the dataframe by the Maths column, as it is given as the first input argument. The rows in which the Maths column has the same values are then sorted by the Physics column, as you can observe in the output above.
By default, the sort() method sorts the pyspark dataframe in ascending order. To sort the dataframe in descending order by all the columns using the sort() method, you can set the ascending parameter to False as shown below.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("sort_example") \
    .getOrCreate()
dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()
dfs = dfs.sort('Maths', 'Physics', ascending=False)
print("The sorted dataframe is:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| 86| 85| 82|
| Joel| 45| 85| 87|
|Katrina| 49| 47| 83|
| Agatha| 76| 89| 83|
| Sam| 76| 98| 95|
+-------+-----+-------+---------+
The sorted dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Chris| 86| 85| 82|
| Sam| 76| 98| 95|
| Agatha| 76| 89| 83|
|Katrina| 49| 47| 83|
| Aditya| 45| 89| 71|
| Joel| 45| 85| 87|
+-------+-----+-------+---------+
Sort DataFrame by Multiple Columns With Different Sorting Order
If you want to change the sorting order for each column, you can pass a list of True and False values to the ascending parameter in the sort() method. Here, the number of boolean values should be equal to the number of column names passed to the sort() method. Each value in the list corresponds to the column at the same position in the argument list.
If we want to sort the pyspark dataframe in ascending order by the ith column name passed to the sort() method, the ith element in the list passed to the ascending parameter should be True. Similarly, if we want to sort the dataframe in descending order by the jth column name passed to the sort() method, the jth element in the list should be False. You can observe this in the following example.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("sort_example") \
    .getOrCreate()
dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()
dfs = dfs.sort('Maths', 'Physics', ascending=[True, False])
print("The sorted dataframe is:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| 86| 85| 82|
| Joel| 45| 85| 87|
|Katrina| 49| 47| 83|
| Agatha| 76| 89| 83|
| Sam| 76| 98| 95|
+-------+-----+-------+---------+
The sorted dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Joel| 45| 85| 87|
|Katrina| 49| 47| 83|
| Sam| 76| 98| 95|
| Agatha| 76| 89| 83|
| Chris| 86| 85| 82|
+-------+-----+-------+---------+
In this example, we have passed the list [True, False] to the ascending parameter in the sort() method. Hence, the sort() method first sorts the dataframe by the Maths column in ascending order. For the rows where the Maths column contains equal values, the dataframe is sorted by the Physics column in descending order, as the second element in the list passed to the ascending parameter is False.
If you are using the col() function to sort the dataframe, you can use the asc() method and desc() method on each column to sort the dataframe by that column in ascending or descending order respectively.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("sort_example") \
    .getOrCreate()
dfs = spark.read.csv("sample_csv_file.csv", header=True, inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs = dfs.sort(col('Maths').asc(), col('Physics').desc())
print("The sorted dataframe is:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| 86| 85| 82|
| Joel| 45| 85| 87|
|Katrina| 49| 47| 83|
| Agatha| 76| 89| 83|
| Sam| 76| 98| 95|
+-------+-----+-------+---------+
The sorted dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Joel| 45| 85| 87|
|Katrina| 49| 47| 83|
| Sam| 76| 98| 95|
| Agatha| 76| 89| 83|
| Chris| 86| 85| 82|
+-------+-----+-------+---------+
In this example, we have used the asc() method on the Maths column object and the desc() method on the Physics column object. Hence, the sort() method first sorts the dataframe by the Maths column in ascending order. If two rows have the same values in the Maths column, they are sorted by the Physics column in descending order, as we have used the desc() method on the Physics column.
Sort PySpark DataFrame With Null Values
If the column by which we are sorting a pyspark dataframe contains null values, the rows with null values are placed at the top of the sorted dataframe by default. You can observe this in the following example.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("sort_example") \
    .getOrCreate()
dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()
dfs = dfs.sort(col('Maths'))
print("The sorted dataframe is:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| null| 85| 82|
| Joel| 45| null| 87|
|Katrina| 49| 47| 83|
| Agatha| 76| null| 83|
| Sam| null| 98| 95|
+-------+-----+-------+---------+
The sorted dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Sam| null| 98| 95|
| Chris| null| 85| 82|
| Aditya| 45| 89| 71|
| Joel| 45| null| 87|
|Katrina| 49| 47| 83|
| Agatha| 76| null| 83|
+-------+-----+-------+---------+
In this example, the Maths column contains null values in two rows. Hence, when we sort the pyspark dataframe by the Maths column in ascending order, the rows with null values are placed at the top of the output dataframe.
When we sort the dataframe in descending order, the null values are put at the bottom of the dataframe.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("sort_example") \
    .getOrCreate()
dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()
dfs = dfs.sort(col('Maths').desc())
print("The sorted dataframe is:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| null| 85| 82|
| Joel| 45| null| 87|
|Katrina| 49| 47| 83|
| Agatha| 76| null| 83|
| Sam| null| 98| 95|
+-------+-----+-------+---------+
The sorted dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Agatha| 76| null| 83|
|Katrina| 49| 47| 83|
| Aditya| 45| 89| 71|
| Joel| 45| null| 87|
| Chris| null| 85| 82|
| Sam| null| 98| 95|
+-------+-----+-------+---------+
Here, we have sorted the pyspark dataframe by the Maths column in descending order. Hence, the rows with null values in the Maths column are positioned at the bottom of the output dataframe.
PySpark Sort DataFrame Nulls Last
To place the rows containing null values at the end of the sorted dataframe, we can use the asc_nulls_last() method on the columns given to the sort() method. After this, the dataframe will be sorted in ascending order with the rows containing null values at the bottom of the dataframe. You can observe this in the following example.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("sort_example") \
    .getOrCreate()
dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()
dfs = dfs.sort(col('Maths').asc_nulls_last())
print("The sorted dataframe is:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| null| 85| 82|
| Joel| 45| null| 87|
|Katrina| 49| 47| 83|
| Agatha| 76| null| 83|
| Sam| null| 98| 95|
+-------+-----+-------+---------+
The sorted dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Joel| 45| null| 87|
| Aditya| 45| 89| 71|
|Katrina| 49| 47| 83|
| Agatha| 76| null| 83|
| Chris| null| 85| 82|
| Sam| null| 98| 95|
+-------+-----+-------+---------+
In this example, we have invoked the asc_nulls_last() method on the column object passed to the sort() method. Hence, even though we are sorting the dataframe in ascending order, the rows with null values are positioned at the bottom of the output dataframe.
Similarly, if you want to sort the pyspark dataframe in descending order and put the rows containing nulls at the bottom of the dataframe, you can use the desc_nulls_last() method in the sort() method as shown below.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("sort_example") \
    .getOrCreate()
dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()
dfs = dfs.sort(col('Maths').desc_nulls_last())
print("The sorted dataframe is:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| null| 85| 82|
| Joel| 45| null| 87|
|Katrina| 49| 47| 83|
| Agatha| 76| null| 83|
| Sam| null| 98| 95|
+-------+-----+-------+---------+
The sorted dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Agatha| 76| null| 83|
|Katrina| 49| 47| 83|
| Aditya| 45| 89| 71|
| Joel| 45| null| 87|
| Chris| null| 85| 82|
| Sam| null| 98| 95|
+-------+-----+-------+---------+
PySpark Sort DataFrame Nulls First
To sort a pyspark dataframe in descending order with null values at the top of the sorted dataframe, you can use the desc_nulls_first() method. When we invoke the desc_nulls_first() method on a column object, the sort() method returns the dataframe sorted in descending order with the null values at the top.
You can also use the asc_nulls_first() method to sort the dataframe in ascending order and place the rows containing null values at the top, which is the default behavior of the sort() method.
Conclusion
In this article, we discussed how to sort a pyspark dataframe using the sort() method. To learn more about python programming, you can read this article on Pyspark vs Pandas. You might also like this article on list of lists in Python.
I hope you enjoyed reading this article. Stay tuned for more informative articles.
Happy Learning!