In both PySpark and Pandas, you can use PIVOT and UNPIVOT to reshape your data for analysis. Here’s how you can implement them.

PIVOT and UNPIVOT Examples in PySpark and Pandas

PIVOT in PySpark

In PySpark, PIVOT is done using the pivot() function, which is available on a DataFrame.

Example (PIVOT in PySpark):

Here we pivot on the Subject column, splitting its values out into separate columns.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Initialize Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

# Sample DataFrame
data = [
    ("John", "Math", 90),
    ("John", "Science", 85),
    ("Doe", "Math", 70),
    ("Doe", "Science", 80)
]

columns = ["Name", "Subject", "Score"]
df = spark.createDataFrame(data, columns)

# Pivot the DataFrame
# Pivot the DataFrame; first() returns the first non-null Score per group
pivot_df = df.groupBy("Name").pivot("Subject").agg(F.first("Score"))

# Show the pivoted DataFrame
pivot_df.show()

Output:

+----+----+-------+
|Name|Math|Science|
+----+----+-------+
| Doe|  70|     80|
|John|  90|     85|
+----+----+-------+

UNPIVOT in PySpark

In PySpark, there is no dedicated UNPIVOT function on older versions. However, you can achieve the same result with the stack() SQL function inside selectExpr().

Example (UNPIVOT in PySpark using selectExpr):

Here we unpivot the subject columns, merging them back into a single Subject column.

unpivot_df = pivot_df.selectExpr("Name", "stack(2, 'Math', Math, 'Science', Science) as (Subject, Score)")

unpivot_df.show()

Output:

+----+-------+-----+
|Name|Subject|Score|
+----+-------+-----+
| Doe|   Math|   70|
| Doe|Science|   80|
|John|   Math|   90|
|John|Science|   85|
+----+-------+-----+

PIVOT in Pandas

In Pandas, pivot() and pivot_table() are used to reshape the DataFrame.

Example (PIVOT in Pandas):

import pandas as pd

# Sample DataFrame
data = {'Name': ['John', 'John', 'Doe', 'Doe'],
        'Subject': ['Math', 'Science', 'Math', 'Science'],
        'Score': [90, 85, 70, 80]}

df = pd.DataFrame(data)

# Pivot the DataFrame
pivot_df = df.pivot(index='Name', columns='Subject', values='Score')

print(pivot_df)

Output:

Subject  Math  Science
Name
Doe        70       80
John       90       85

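Note that pivot() requires each index/column pair to be unique; if the same combination appears more than once, it raises a ValueError rather than guessing how to combine the duplicates. A small sketch of that failure mode, which is exactly what pivot_table() solves:

```python
import pandas as pd

# Duplicate (Name, Subject) pair: 'Alice' has two Math scores
dup = pd.DataFrame({'Name': ['Alice', 'Alice'],
                    'Subject': ['Math', 'Math'],
                    'Score': [85, 90]})

try:
    dup.pivot(index='Name', columns='Subject', values='Score')
except ValueError as e:
    # pivot() cannot reshape duplicate entries
    print("pivot() failed:", e)
```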
Example (Using pivot_table()):

# Sample DataFrame with duplicate combinations
data = {'Name': ['Alice', 'Alice', 'Bob', 'Bob', 'Alice'],
        'Subject': ['Math', 'Math', 'Math', 'Science', 'Science'],
        'Score': [85, 90, 95, 80, 100]}

df = pd.DataFrame(data)

# Pivot table with aggregation (mean as the default)
pivot_table_df = df.pivot_table(index='Name', columns='Subject', values='Score')

print(pivot_table_df)

Output:

Subject  Math  Science
Name
Alice    87.5    100.0
Bob      95.0     80.0
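The mean is only the default; pivot_table() takes an aggfunc parameter to control how duplicates are combined. A short sketch using the same data, taking the highest score per (Name, Subject) pair instead:

```python
import pandas as pd

data = {'Name': ['Alice', 'Alice', 'Bob', 'Bob', 'Alice'],
        'Subject': ['Math', 'Math', 'Math', 'Science', 'Science'],
        'Score': [85, 90, 95, 80, 100]}
df = pd.DataFrame(data)

# aggfunc='max' keeps the best score per (Name, Subject) combination
best = df.pivot_table(index='Name', columns='Subject',
                      values='Score', aggfunc='max')
print(best)
```

aggfunc also accepts other reducers such as 'sum' or 'count', or a list of them to get one column per aggregation.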

UNPIVOT in Pandas (Melt)

In Pandas, melt() is used to unpivot data.

Example (UNPIVOT in Pandas):

# Unpivot the DataFrame;
# reset_index() converts the 'Name' index back into a regular column
unpivot_df = pivot_df.reset_index().melt(id_vars='Name', var_name='Subject', value_name='Score')

print(unpivot_df)

Output:

   Name  Subject  Score
0   Doe     Math     70
1  John     Math     90
2   Doe  Science     80
3  John  Science     85

Summary:

  • PySpark:
    • Use groupBy() with pivot() for pivoting.
    • Use selectExpr() with stack() for unpivoting.
  • Pandas:
    • Use pivot() or pivot_table() for pivoting.
    • Use melt() for unpivoting.
  1. The first() function returns the first non-null value in each group.
  2. The reset_index() function converts the index back into a regular column.