While working with pyspark dataframes, we often need to sort them according to one or multiple columns. In this article, we will discuss different ways to sort a pyspark dataframe using the sort() method.
The PySpark sort() Method
The sort() method in pyspark is used to sort a dataframe by one or multiple columns. It has the following syntax.
df.sort(*columns, ascending=True)
Here,
- The *columns parameter represents one or more columns by which we want to sort the dataframe.
- The ascending parameter specifies whether to sort the dataframe in ascending or descending order. If you are sorting the dataframe by multiple columns, you can also pass a list of True and False values to specify the sort order for each column individually.
PySpark Sort DataFrame by Column Name
To sort a pyspark dataframe by column name, we can use the sort() method as shown in the following example.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("sort_example") \
    .getOrCreate()
dfs = spark.read.csv("sample_csv_file.csv", header=True, inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs = dfs.sort('Maths')
print("The sorted dataframe is:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| 86| 85| 82|
| Joel| 45| 85| 87|
|Katrina| 49| 47| 83|
| Agatha| 76| 89| 83|
| Sam| 76| 98| 95|
+-------+-----+-------+---------+
The sorted dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Joel| 45| 85| 87|
| Aditya| 45| 89| 71|
|Katrina| 49| 47| 83|
| Sam| 76| 98| 95|
| Agatha| 76| 89| 83|
| Chris| 86| 85| 82|
+-------+-----+-------+---------+
In this example, we first created a SparkSession on our local machine. Then, we read a CSV file to create a pyspark dataframe. Once we get the dataframe, we sort it by the 'Maths' column. In the output, you can observe that the rows of the dataframe are sorted in ascending order by the Maths column.
Instead of the above approach, you can also use the col() function to sort the pyspark dataframe as shown below.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("sort_example") \
    .getOrCreate()
dfs = spark.read.csv("sample_csv_file.csv", header=True, inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs = dfs.sort(col('Maths'))
print("The sorted dataframe is:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| 86| 85| 82|
| Joel| 45| 85| 87|
|Katrina| 49| 47| 83|
| Agatha| 76| 89| 83|
| Sam| 76| 98| 95|
+-------+-----+-------+---------+
The sorted dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Joel| 45| 85| 87|
| Aditya| 45| 89| 71|
|Katrina| 49| 47| 83|
| Sam| 76| 98| 95|
| Agatha| 76| 89| 83|
| Chris| 86| 85| 82|
+-------+-----+-------+---------+
The col() function is used to select a dataframe column as a column object. In this example, we have used the col() function to select the Maths column of the dataframe. Then, the dataframe is sorted by the selected column using the sort() method.
Sort DataFrame in Descending Order
To sort a pyspark dataframe by a column in descending order, you can set the ascending parameter to False in the sort() method as shown below.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("sort_example") \
    .getOrCreate()
dfs = spark.read.csv("sample_csv_file.csv", header=True, inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs = dfs.sort('Maths', ascending=False)
print("The sorted dataframe is:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| 86| 85| 82|
| Joel| 45| 85| 87|
|Katrina| 49| 47| 83|
| Agatha| 76| 89| 83|
| Sam| 76| 98| 95|
+-------+-----+-------+---------+
The sorted dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Chris| 86| 85| 82|
| Agatha| 76| 89| 83|
| Sam| 76| 98| 95|
|Katrina| 49| 47| 83|
| Aditya| 45| 89| 71|
| Joel| 45| 85| 87|
+-------+-----+-------+---------+
In this example, we have set the ascending parameter to False in the sort() method. Hence, the dataframe is sorted by the Maths column in descending order.
If you are using the col() function to sort the pyspark dataframe, you can use the desc() method on the column object. When we invoke the desc() method on the column obtained using the col() function, the sort() method sorts the pyspark dataframe in descending order. You can observe this in the following example.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("sort_example") \
    .getOrCreate()
dfs = spark.read.csv("sample_csv_file.csv", header=True, inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs = dfs.sort(col('Maths').desc())
print("The sorted dataframe is:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| 86| 85| 82|
| Joel| 45| 85| 87|
|Katrina| 49| 47| 83|
| Agatha| 76| 89| 83|
| Sam| 76| 98| 95|
+-------+-----+-------+---------+
The sorted dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Chris| 86| 85| 82|
| Agatha| 76| 89| 83|
| Sam| 76| 98| 95|
|Katrina| 49| 47| 83|
| Aditya| 45| 89| 71|
| Joel| 45| 85| 87|
+-------+-----+-------+---------+
PySpark Sort DataFrame by Multiple Columns
To sort a pyspark dataframe by multiple columns, you can pass all the column names to the sort() method in sequence. Here, the dataframe is first sorted by the first column name given to the sort() method. If two rows have the same values in the first column, they are sorted by the values in the next column. This process is repeated until the rows are sorted by all the given columns. You can observe this in the following example.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("sort_example") \
    .getOrCreate()
dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()
dfs = dfs.sort('Maths', 'Physics')
print("The sorted dataframe is:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| 86| 85| 82|
| Joel| 45| 85| 87|
|Katrina| 49| 47| 83|
| Agatha| 76| 89| 83|
| Sam| 76| 98| 95|
+-------+-----+-------+---------+
The sorted dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Joel| 45| 85| 87|
| Aditya| 45| 89| 71|
|Katrina| 49| 47| 83|
| Agatha| 76| 89| 83|
| Sam| 76| 98| 95|
| Chris| 86| 85| 82|
+-------+-----+-------+---------+
In this example, we have sorted the pyspark dataframe by the Maths and Physics columns. During execution, the sort() method first sorts the rows of the dataframe by the Maths column, as it is given as the first input argument. The rows in which the Maths column has the same values are then sorted by the Physics column, as you can observe in the output above.
By default, the sort() method sorts the pyspark dataframe in ascending order. To sort the dataframe in descending order by all the columns using the sort() method, you can set the ascending parameter to False as shown below.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("sort_example") \
    .getOrCreate()
dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()
dfs = dfs.sort('Maths', 'Physics', ascending=False)
print("The sorted dataframe is:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| 86| 85| 82|
| Joel| 45| 85| 87|
|Katrina| 49| 47| 83|
| Agatha| 76| 89| 83|
| Sam| 76| 98| 95|
+-------+-----+-------+---------+
The sorted dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Chris| 86| 85| 82|
| Sam| 76| 98| 95|
| Agatha| 76| 89| 83|
|Katrina| 49| 47| 83|
| Aditya| 45| 89| 71|
| Joel| 45| 85| 87|
+-------+-----+-------+---------+
Sort DataFrame by Multiple Columns With Different Sorting Order
If you want to change the sorting order for each column, you can pass a list of True and False values to the ascending parameter in the sort() method. Here, the number of boolean values should be equal to the number of column names passed to the sort() method. Each value in the list corresponds to the column at the same position in the argument list.
If we want to sort the pyspark dataframe in ascending order by the ith column name passed to the sort() method, the ith element in the list passed to the ascending parameter should be True. Similarly, if we want to sort the dataframe in descending order by the jth column name passed to the sort() method, the jth element in the list should be False. You can observe this in the following example.
import pyspark.sql as ps
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("sort_example") \
    .getOrCreate()
dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()
dfs = dfs.sort('Maths', 'Physics', ascending=[True, False])
print("The sorted dataframe is:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| 86| 85| 82|
| Joel| 45| 85| 87|
|Katrina| 49| 47| 83|
| Agatha| 76| 89| 83|
| Sam| 76| 98| 95|
+-------+-----+-------+---------+
The sorted dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Joel| 45| 85| 87|
|Katrina| 49| 47| 83|
| Sam| 76| 98| 95|
| Agatha| 76| 89| 83|
| Chris| 86| 85| 82|
+-------+-----+-------+---------+
In this example, we have passed the list [True, False] to the ascending parameter in the sort() method. Hence, the sort() method first sorts the dataframe by the Maths column in ascending order. For the rows where the Maths column contains equal values, the dataframe is sorted by the Physics column in descending order, as the second element in the list passed to the ascending parameter is False.
If you are using the col() function to sort the dataframe, you can use the asc() method and desc() method on each column to sort the dataframe by that column in ascending or descending order respectively.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("sort_example") \
    .getOrCreate()
dfs = spark.read.csv("sample_csv_file.csv", header=True, inferSchema=True)
print("The input dataframe is:")
dfs.show()
dfs = dfs.sort(col('Maths').asc(), col('Physics').desc())
print("The sorted dataframe is:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| 86| 85| 82|
| Joel| 45| 85| 87|
|Katrina| 49| 47| 83|
| Agatha| 76| 89| 83|
| Sam| 76| 98| 95|
+-------+-----+-------+---------+
The sorted dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Joel| 45| 85| 87|
|Katrina| 49| 47| 83|
| Sam| 76| 98| 95|
| Agatha| 76| 89| 83|
| Chris| 86| 85| 82|
+-------+-----+-------+---------+
In this example, we have used the asc() method on the Maths column object and the desc() method on the Physics column object. Hence, the sort() method first sorts the dataframe by the Maths column in ascending order. If two rows have the same values in the Maths column, they are sorted by the Physics column in descending order, as we have used the desc() method on the Physics column.
Sort PySpark DataFrame With Null Values
If the column by which we are sorting a pyspark dataframe contains null values, the rows with null values are placed at the top of the sorted dataframe by default. You can observe this in the following example.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("sort_example") \
    .getOrCreate()
dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()
dfs = dfs.sort(col('Maths'))
print("The sorted dataframe is:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| null| 85| 82|
| Joel| 45| null| 87|
|Katrina| 49| 47| 83|
| Agatha| 76| null| 83|
| Sam| null| 98| 95|
+-------+-----+-------+---------+
The sorted dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Sam| null| 98| 95|
| Chris| null| 85| 82|
| Aditya| 45| 89| 71|
| Joel| 45| null| 87|
|Katrina| 49| 47| 83|
| Agatha| 76| null| 83|
+-------+-----+-------+---------+
In this example, the Maths column contains null values in two rows. Hence, when we sort the pyspark dataframe by the Maths column in ascending order, the rows with null values are placed at the top of the output dataframe.
When we sort the dataframe in descending order, the null values are put at the bottom of the dataframe.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("sort_example") \
    .getOrCreate()
dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()
dfs = dfs.sort(col('Maths').desc())
print("The sorted dataframe is:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| null| 85| 82|
| Joel| 45| null| 87|
|Katrina| 49| 47| 83|
| Agatha| 76| null| 83|
| Sam| null| 98| 95|
+-------+-----+-------+---------+
The sorted dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Agatha| 76| null| 83|
|Katrina| 49| 47| 83|
| Aditya| 45| 89| 71|
| Joel| 45| null| 87|
| Chris| null| 85| 82|
| Sam| null| 98| 95|
+-------+-----+-------+---------+
Here, we have sorted the pyspark dataframe by the Maths column in descending order. Hence, the rows with null values in the Maths column are positioned at the bottom of the output dataframe.
PySpark Sort DataFrame Nulls Last
To place the rows containing null values at the end of the sorted dataframe, we can use the asc_nulls_last() method on the columns given to the sort() method. After this, the dataframe will be sorted in ascending order with the rows containing null values at the bottom of the dataframe. You can observe this in the following example.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("sort_example") \
    .getOrCreate()
dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()
dfs = dfs.sort(col('Maths').asc_nulls_last())
print("The sorted dataframe is:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| null| 85| 82|
| Joel| 45| null| 87|
|Katrina| 49| 47| 83|
| Agatha| 76| null| 83|
| Sam| null| 98| 95|
+-------+-----+-------+---------+
The sorted dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Joel| 45| null| 87|
| Aditya| 45| 89| 71|
|Katrina| 49| 47| 83|
| Agatha| 76| null| 83|
| Chris| null| 85| 82|
| Sam| null| 98| 95|
+-------+-----+-------+---------+
In this example, we have invoked the asc_nulls_last() method on the column object passed to the sort() method. Hence, even though we are sorting the dataframe in ascending order, the rows with null values are positioned at the bottom of the output dataframe.
Similarly, if you want to sort the pyspark dataframe in descending order and put the rows containing nulls at the bottom of the dataframe, you can use the desc_nulls_last() method in the sort() method as shown below.
import pyspark.sql as ps
from pyspark.sql.functions import col
spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("sort_example") \
    .getOrCreate()
dfs = spark.read.csv("sample_csv_file.csv", header=True)
print("The input dataframe is:")
dfs.show()
dfs = dfs.sort(col('Maths').desc_nulls_last())
print("The sorted dataframe is:")
dfs.show()
spark.sparkContext.stop()
Output:
The input dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Aditya| 45| 89| 71|
| Chris| null| 85| 82|
| Joel| 45| null| 87|
|Katrina| 49| 47| 83|
| Agatha| 76| null| 83|
| Sam| null| 98| 95|
+-------+-----+-------+---------+
The sorted dataframe is:
+-------+-----+-------+---------+
| Name|Maths|Physics|Chemistry|
+-------+-----+-------+---------+
| Agatha| 76| null| 83|
|Katrina| 49| 47| 83|
| Aditya| 45| 89| 71|
| Joel| 45| null| 87|
| Chris| null| 85| 82|
| Sam| null| 98| 95|
+-------+-----+-------+---------+
PySpark Sort DataFrame Nulls First
To sort a pyspark dataframe in descending order with null values at the top of the sorted dataframe, you can use the desc_nulls_first() method. When we invoke the desc_nulls_first() method on a column object, the sort() method returns the dataframe sorted in descending order with the null values at the top.
You can also use the asc_nulls_first() method to sort the dataframe in ascending order and place the rows containing null values at the top, which is the default behavior of the sort() method.
Conclusion
In this article, we discussed how to sort a pyspark dataframe using the sort() method. To learn more about python programming, you can read this article on Pyspark vs Pandas. You might also like this article on list of lists in Python.
I hope you enjoyed reading this article. Stay tuned for more informative articles.
Happy Learning!