Updated April 12, 2023
Introduction to PySpark structtype
PySpark StructType is a class, imported from pyspark.sql.types, that defines the structure (schema) of a data frame. It is a collection, or list, of StructField objects, and it can be passed in when a data frame is created so that the columns get the declared names and types.
Each StructField in the schema defines the name of a column, the data type of the column, and a Boolean flag that says whether the column may contain null values. StructType is itself a built-in data type, so it can also be used as the type of a column, which gives a nested structure.
In this article, we will look at how StructType is used in PySpark.
Syntax for PySpark structtype:
The syntax is as follows:
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
sch = StructType([StructField("Name", StringType(), True), StructField("ID", StringType(), True), StructField("Add", StringType(), True)])
- sch: The schema defined for the data frame to be created.
- StructType: The StructType class, which holds the list of fields.
- StructField: Defines one column of the schema: its name, its data type, and its nullable flag.
Working of structtype in Python
StructType is used to define the schema of a data frame in PySpark. It is a built-in data type, shown as struct<...> in schemas and query plans, and it holds a sequence of StructField objects (Seq[StructField] in the underlying Scala API). Each column of the schema is described by the parameters of a StructField.
The first parameter is the name of the column, the second is the data type to be used, and the last is a Boolean flag that says whether the column is nullable. Spark uses the schema when it plans queries over the data frame. A new column can be added to a schema by adding a new StructField to the StructType.
Let’s check the creation and working of StructType with some coding examples.
Examples of PySpark structtype
Let us see some Examples:
Let’s start by creating simple data in PySpark.
data1 = [{'Name':'Jhon','ID':21.528,'Add':'USA'},{'Name':'Joe','ID':3.69,'Add':'USA'},{'Name':'Tina','ID':2.48,'Add':'IND'},{'Name':'Jhon','ID':22.22, 'Add':'USA'},{'Name':'Joe','ID':5.33,'Add':'INA'}]
Sample data is created with Name, ID, and Add as the fields.
a = sc.parallelize(data1)
An RDD is created using sc.parallelize, where sc is the SparkContext (available by default in the PySpark shell).
b = spark.createDataFrame(a)
A data frame is created using spark.createDataFrame.
This creates the data frame with the columns Name, Add, and ID, with the types inferred from the data. Now let us create the schema explicitly with StructType.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType,StructField, StringType, IntegerType
These are the import statements needed for defining StructType and StructField.
sch = StructType([StructField("Name", StringType(), True), StructField("ID", StringType(), True), StructField("Add", StringType(), True)])
Each StructField in the StructType contains the name of the column, the data type of the column, and a Boolean value that says whether the column accepts null values.
Let’s create the data and build the data frame from it using the StructType.
data1 = [{'Name':'Jhon','ID':21.528,'Add':'USA'},{'Name':'Joe','ID':3.69,'Add':'USA'},{'Name':'Tina','ID':2.48,'Add':'IND'},{'Name':'Jhon','ID':22.22, 'Add':'USA'}]
b = spark.createDataFrame(data1,sch)
b.show()
b.printSchema()
This creates a data frame in PySpark using the Structtype.
We can also create a nested StructType, where a column is itself a StructType with its own fields. Let us create one with a coding example.
In the nested schema below, the Name column contains the fields f_name and l_name.
nes_Sch = StructType([StructField("Name",StructType([StructField("f_name",StringType(), True),StructField("l_name",StringType() , True)])),StructField("ID",StringType(),True),StructField("Add",StringType() , True)])
data1 = [(("John","cena"),"123","UK"),(("Singh","dd"),"234","IND")]
b = spark.createDataFrame(data1,nes_Sch)
b.show()
b.printSchema()
This shows how a nested schema is prepared by using a StructType as the data type of a field.
These are some examples of StructType in PySpark.
Note:
- PySpark StructType is a way of defining the schema used to create a data frame in PySpark.
- It contains a list of StructField objects that define the structure of the data frame.
- It removes the need for Spark to infer the schema from the data, because the column names and types are declared explicitly.
- The schema of an existing data frame is returned as a StructType, for example by df.schema.
- The schema can be written out in the code ahead of time or built at run time.
Conclusion
In this article, we saw how StructType is used in PySpark. Through the examples, we tried to understand how StructType defines the schema of a PySpark data frame and how it is used at the programming level.
We also saw the internal working and the advantages of StructType in a PySpark data frame and its usage for various programming purposes. The syntax and examples helped us understand it more precisely.