We use pyspark dataframes to handle tabular data. Sometimes, we need to create empty pyspark dataframes. This article will discuss how to create an empty dataframe in Pyspark.
Create An Empty DataFrame With Column Names in PySpark
We need to perform three steps to create an empty pyspark dataframe with column names.
- First, we will create an empty RDD object.
- Next, we will define the schema for the dataframe using the column names and data types.
- Finally, we will convert the RDD to a dataframe using the schema.
Let us discuss all these steps one by one.
Create an Empty RDD in Pyspark
To create an empty dataframe in pyspark, we will first create an empty RDD. To create an empty RDD, you just need to call the emptyRDD() method on the sparkContext attribute of a spark session. After execution, the emptyRDD() method returns an empty RDD, as shown below.
import pyspark.sql as ps

spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("create_dataframe") \
    .getOrCreate()

# emptyRDD() returns an RDD with no elements
empty_rdd = spark.sparkContext.emptyRDD()
print("The empty RDD object is:")
print(empty_rdd)
Output:
The empty RDD object is:
EmptyRDD[1] at emptyRDD at NativeMethodAccessorImpl.java:0
Here, we have created an empty RDD object using the emptyRDD() method.
Define Schema For The DataFrame in PySpark
To define the schema for a pyspark dataframe, we use the StructType() and StructField() functions.
The StructField() function is used to define the name and data type of a particular column. It takes the column name as its first input argument and the data type of the column as its second input argument. To specify the data type of a column, we use StringType(), IntegerType(), FloatType(), DoubleType(), and the other types defined in the pyspark.sql.types module.
In the third input argument to the StructField() function, we pass True or False to specify whether the column can contain null values. If we set the third parameter to True, the column will allow null values; otherwise, it will not.
After defining the columns using the StructField() function, we can pass a list of the StructField objects to the StructType() function to create a schema for the dataframe, as shown below.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType
import pyspark.sql as ps

spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("create_dataframe") \
    .getOrCreate()

# One StructField per column: (name, data type, nullable)
list_of_cols = [StructField("Roll", IntegerType(), True),
                StructField("Name", StringType(), True),
                StructField("Percentage", FloatType(), True)]
schema = StructType(list_of_cols)
print("The schema is:")
print(schema)
Output:
The schema is:
StructType([StructField('Roll', IntegerType(), True), StructField('Name', StringType(), True), StructField('Percentage', FloatType(), True)])
In this example, we have defined the schema for a dataframe having three columns, i.e. Roll, Name, and Percentage.
Convert Empty RDD to PySpark DataFrame Using the Schema
Once we have the empty RDD and the schema, we can use the createDataFrame() function to create an empty pyspark dataframe with column names. The createDataFrame() function takes the empty RDD object as its first input argument and the schema as its second input argument. After execution, it returns an empty dataframe with column names, as shown below.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType
import pyspark.sql as ps

spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("create_dataframe") \
    .getOrCreate()

empty_rdd = spark.sparkContext.emptyRDD()
list_of_cols = [StructField("Roll", IntegerType(), True),
                StructField("Name", StringType(), True),
                StructField("Percentage", FloatType(), True)]
schema = StructType(list_of_cols)

# Combine the empty RDD and the schema into an empty dataframe
df = spark.createDataFrame(empty_rdd, schema=schema)
print("The empty dataframe is:")
df.show()
Output:
The empty dataframe is:
+----+----+----------+
|Roll|Name|Percentage|
+----+----+----------+
+----+----+----------+
In this example, you can observe that the createDataFrame() function takes an empty RDD object and the schema for the dataframe and returns an empty dataframe with the given column names.
Instead of the createDataFrame() function, you can also use the toDF() method to convert an empty RDD to an empty pyspark dataframe with column names. The toDF() method, when invoked on an empty RDD object, takes the schema as its input argument and returns an empty pyspark dataframe with column names. You can observe this in the following example.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType
import pyspark.sql as ps

spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("create_dataframe") \
    .getOrCreate()

empty_rdd = spark.sparkContext.emptyRDD()
list_of_cols = [StructField("Roll", IntegerType(), True),
                StructField("Name", StringType(), True),
                StructField("Percentage", FloatType(), True)]
schema = StructType(list_of_cols)

# toDF() converts the empty RDD to a dataframe using the schema
df = empty_rdd.toDF(schema=schema)
print("The empty dataframe is:")
df.show()
Output:
The empty dataframe is:
+----+----+----------+
|Roll|Name|Percentage|
+----+----+----------+
+----+----+----------+
In this example, we used the toDF() method instead of the createDataFrame() function to create an empty pyspark dataframe.
Create an Empty PySpark DataFrame Directly Using Schema
To create an empty dataframe directly from a schema, you can pass an empty list to the createDataFrame() function as its first input argument and the schema as its second input argument. After execution of the createDataFrame() function, you will get the empty dataframe, as shown below.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType
import pyspark.sql as ps

spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("create_dataframe") \
    .getOrCreate()

list_of_cols = [StructField("Roll", IntegerType(), True),
                StructField("Name", StringType(), True),
                StructField("Percentage", FloatType(), True)]
schema = StructType(list_of_cols)

# Pass an empty list instead of an empty RDD
df = spark.createDataFrame([], schema=schema)
print("The empty dataframe is:")
df.show()
Output:
The empty dataframe is:
+----+----+----------+
|Roll|Name|Percentage|
+----+----+----------+
+----+----+----------+
In the above code, you can observe that we haven't used an empty RDD object to create the empty dataframe. Instead, we directly passed an empty list and a schema to the createDataFrame() function to obtain the empty dataframe with column names.
Create an Empty Data Frame Without Column Names
We can also create empty dataframes without column names. For this, we can pass an empty StructType object to the functions discussed in the previous sections instead of a schema with column definitions.
To create an empty StructType object, we pass an empty list to the StructType() function. After this, we can use an empty RDD object or the createDataFrame() function directly to create an empty pyspark dataframe without columns.
Empty DataFrame Without Column Names Using The emptyRDD Object
You can pass an empty RDD object and an empty StructType object to the createDataFrame() function as input arguments to create an empty pyspark dataframe without column names, as shown below.
from pyspark.sql.types import StructType
import pyspark.sql as ps

spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("create_dataframe") \
    .getOrCreate()

empty_rdd = spark.sparkContext.emptyRDD()
# An empty StructType serves as a schema with no columns
empty_schema = StructType([])
df = spark.createDataFrame(empty_rdd, schema=empty_schema)
print("The empty dataframe is:")
df.show()
Output:
The empty dataframe is:
++
||
++
++
In the above example, we first created an empty schema using the StructType() function. Then, we passed the empty schema along with the empty RDD object to create an empty dataframe without column names.
Alternatively, you can invoke the toDF() method on the empty RDD object and pass the empty StructType object to it as an input argument. After executing the toDF() method, you will get an empty dataframe without column names, as shown in the following example.
from pyspark.sql.types import StructType
import pyspark.sql as ps

spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("create_dataframe") \
    .getOrCreate()

empty_rdd = spark.sparkContext.emptyRDD()
empty_schema = StructType([])
# toDF() with an empty schema gives a dataframe with no columns
df = empty_rdd.toDF(schema=empty_schema)
print("The empty dataframe is:")
df.show()
Output:
The empty dataframe is:
++
||
++
++
Empty DataFrame Without Column Names Using The createDataFrame() Function
To create an empty pyspark dataframe without column names using the createDataFrame() function, we will pass an empty list as the first input argument and the empty StructType object as the second input argument to the createDataFrame() function. After executing the function, we will get an empty dataframe without column names, as shown below.
from pyspark.sql.types import StructType
import pyspark.sql as ps

spark = ps.SparkSession.builder \
    .master("local[*]") \
    .appName("create_dataframe") \
    .getOrCreate()

# An empty list and an empty schema give a dataframe with no columns at all
empty_schema = StructType([])
df = spark.createDataFrame([], schema=empty_schema)
print("The empty dataframe is:")
df.show()
Output:
The empty dataframe is:
++
||
++
++
Conclusion
In this article, we discussed different ways to create an empty pyspark dataframe. To learn more about PySpark, you can read this article on pyspark vs pandas. You might also like this article on list of lists in Python.
I hope you enjoyed reading this article. Stay tuned for more informative articles.
Happy Learning!