It is common to deal with JSON files. Often, you’ll need to combine two datasets based on a common key — just like you would with SQL JOINs. In this post, we’ll walk through joining two JSON files and to transform to columns using PySpark, with simple examples you can try immediately.

Here are the steps

  1. 🛠 Setup: What You Need
  2. 📂 Sample JSON Files
  3. 🔥 Step-by-Step Guide
    1. Initialize Spark Session
    2. Load the JSON Files as DataFrames
    3. Perform the Join
    4. Different Types of Joins
  4. 🧹 Bonus Tip: Handling Column Name Conflicts
  5. 📌 Final Notes

🛠 Setup: What You Need

Before we begin, make sure you have:

  • Apache Spark installed (or use Databricks, AWS Glue, etc.)
  • PySpark available in your environment

If not installed locally, you can do it via pip:

pip install pyspark

📂 Sample JSON Files

Let’s assume we have two JSON files:

File 1: employees.json

[
  {"emp_id": 1, "name": "Alice", "dept_id": 101},
  {"emp_id": 2, "name": "Bob", "dept_id": 102},
  {"emp_id": 3, "name": "Charlie", "dept_id": 101}
]

File 2: departments.json

[
  {"dept_id": 101, "dept_name": "HR"},
  {"dept_id": 102, "dept_name": "Engineering"},
  {"dept_id": 103, "dept_name": "Finance"}
]

We want to join these two files on dept_id.

🔥 Step-by-Step Guide

Initialize Spark Session

from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("Join JSON Example") \
    .getOrCreate()

Load the JSON Files as DataFrames

# Load employee data
employees_df = spark.read.json("path_to/employees.json")
# Load department data
departments_df = spark.read.json("path_to/departments.json")

You can also inspect the data:

employees_df.show()
departments_df.show()

Perform the Join

Here’s how you can join them using the dept_id column:

# Inner Join: Only matching dept_ids will be kept
joined_df = employees_df.join(departments_df, on="dept_id", how="inner")
joined_df.show()

Output:

The joined dataset result.

Different Types of Joins

You can also use other join types to transform JSON to columns.

  • Left Join: Keep all employees even if no matching department
employees_df.join(departments_df, on="dept_id", how="left").show()
  • Right Join: Keep all departments even if no matching employee
employees_df.join(departments_df, on="dept_id", how="right").show()
  • Outer Join: Keep everything from both sides
employees_df.join(departments_df, on="dept_id", how="outer").show()

🧹 Bonus Tip: Handling Column Name Conflicts

If the two JSONs have common column names (other than the key), you can rename columns before joining to avoid conflicts:

employees_df.join(departments_df, on="dept_id", how="outer").show()

📌 Final Notes

  • Always check the schemas of your JSON files before joining.
  • For larger files, consider partitioning and caching your DataFrames for better performance.
  • Use broadcast joins if one of the DataFrames is very small, for faster operations.