In both PySpark and Pandas, you can use PIVOT and UNPIVOT to reshape your data for analysis. Here’s how you can implement them.

PIVOT and UNPIVOT Examples in PySpark and Pandas

PIVOT in PySpark

In PySpark, PIVOT is done using the pivot() function, which is available on a DataFrame.

Example (PIVOT in PySpark):

Here we pivot on the Subject column, splitting its values out into separate columns.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Initialize Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

# Sample DataFrame
data = [
    ("John", "Math", 90),
    ("John", "Science", 85),
    ("Doe", "Math", 70),
    ("Doe", "Science", 80)
]

columns = ["Name", "Subject", "Score"]
df = spark.createDataFrame(data, columns)

# Pivot the DataFrame
# Pivot the DataFrame; first() returns the first non-null Score per group
pivot_df = df.groupBy("Name").pivot("Subject").agg(F.first("Score"))

# Show the pivoted DataFrame
pivot_df.show()

Output:

+----+----+-------+
|Name|Math|Science|
+----+----+-------+
| Doe|  70|     80|
|John|  90|     85|
+----+----+-------+

UNPIVOT in PySpark

In PySpark, there is no dedicated UNPIVOT function on older versions. However, you can achieve the same result with the stack() SQL function inside selectExpr().

Example (UNPIVOT in PySpark using selectExpr):

Here we unpivot the subject columns, merging them back into a single Subject column.

unpivot_df = pivot_df.selectExpr("Name", "stack(2, 'Math', Math, 'Science', Science) as (Subject, Score)")

unpivot_df.show()

Output:

+----+-------+-----+
|Name|Subject|Score|
+----+-------+-----+
| Doe|   Math|   70|
| Doe|Science|   80|
|John|   Math|   90|
|John|Science|   85|
+----+-------+-----+

PIVOT in Pandas

In Pandas, pivot() and pivot_table() are used to reshape the DataFrame.

Example (PIVOT in Pandas):

import pandas as pd

# Sample DataFrame
data = {'Name': ['John', 'John', 'Doe', 'Doe'],
        'Subject': ['Math', 'Science', 'Math', 'Science'],
        'Score': [90, 85, 70, 80]}

df = pd.DataFrame(data)

# Pivot the DataFrame
pivot_df = df.pivot(index='Name', columns='Subject', values='Score')

print(pivot_df)

Output:

Subject  Math  Science
Name
Doe        70       80
John       90       85

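Note that pivot() requires each index/column pair to be unique; if the same combination appears more than once, it raises a ValueError rather than guessing how to combine the duplicates. A small sketch of that failure mode, which is exactly what pivot_table() solves:

```python
import pandas as pd

# Duplicate (Name, Subject) pair: 'Alice' has two Math scores
dup = pd.DataFrame({'Name': ['Alice', 'Alice'],
                    'Subject': ['Math', 'Math'],
                    'Score': [85, 90]})

try:
    dup.pivot(index='Name', columns='Subject', values='Score')
except ValueError as e:
    # pivot() cannot reshape duplicate entries
    print("pivot() failed:", e)
```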
Example (Using pivot_table()):

# Sample DataFrame with duplicate combinations
data = {'Name': ['Alice', 'Alice', 'Bob', 'Bob', 'Alice'],
        'Subject': ['Math', 'Math', 'Math', 'Science', 'Science'],
        'Score': [85, 90, 95, 80, 100]}

df = pd.DataFrame(data)

# Pivot table with aggregation (mean as the default)
pivot_table_df = df.pivot_table(index='Name', columns='Subject', values='Score')

print(pivot_table_df)

Output:

Subject  Math  Science
Name
Alice    87.5    100.0
Bob      95.0     80.0
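The mean is only the default; pivot_table() takes an aggfunc parameter to control how duplicates are combined. A short sketch using the same data, taking the highest score per (Name, Subject) pair instead:

```python
import pandas as pd

data = {'Name': ['Alice', 'Alice', 'Bob', 'Bob', 'Alice'],
        'Subject': ['Math', 'Math', 'Math', 'Science', 'Science'],
        'Score': [85, 90, 95, 80, 100]}
df = pd.DataFrame(data)

# aggfunc='max' keeps the best score per (Name, Subject) combination
best = df.pivot_table(index='Name', columns='Subject',
                      values='Score', aggfunc='max')
print(best)
```

aggfunc also accepts other reducers such as 'sum' or 'count', or a list of them to get one column per aggregation.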

UNPIVOT in Pandas (Melt)

In Pandas, melt() is used to unpivot data.

Example (UNPIVOT in Pandas):

# Unpivot the DataFrame;
# reset_index() converts the 'Name' index back into a regular column
unpivot_df = pivot_df.reset_index().melt(id_vars='Name', var_name='Subject', value_name='Score')

print(unpivot_df)

Output:

   Name  Subject  Score
0   Doe     Math     70
1  John     Math     90
2   Doe  Science     80
3  John  Science     85

Summary:

  • PySpark:
    • Use groupBy() with pivot() for pivoting.
    • Use selectExpr() with stack() for unpivoting.
  • Pandas:
    • Use pivot() or pivot_table() for pivoting.
    • Use melt() for unpivoting.
  1. The first() function returns the first non-null value in each group.
  2. The reset_index() function converts the index back into a regular column.