Joining Two JSON Files Using a Common Key in PySpark (With Examples)

It is common to deal with JSON files. Often, you’ll need to combine two datasets based on a common key — just like you would with SQL JOINs. In this post, we’ll walk through joining two JSON files and to transform to columns using PySpark, with simple examples you can try immediately.

Here are the steps

🛠 Setup: What You Need

Before we begin, make sure you have:

Apache Spark installed (or use Databricks, AWS Glue, etc.)
PySpark available in your environment

If not installed locally, you can do it via pip:

pip install pyspark

📂 Sample JSON Files

Let’s assume we have two JSON files:

File 1: employees.json

[
  {"emp_id": 1, "name": "Alice", "dept_id": 101},
  {"emp_id": 2, "name": "Bob", "dept_id": 102},
  {"emp_id": 3, "name": "Charlie", "dept_id": 101}
]

File 2: departments.json

[
  {"dept_id": 101, "dept_name": "HR"},
  {"dept_id": 102, "dept_name": "Engineering"},
  {"dept_id": 103, "dept_name": "Finance"}
]

We want to join these two files on dept_id.

🔥 Step-by-Step Guide

Initialize Spark Session

from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("Join JSON Example") \
    .getOrCreate()

Load the JSON Files as DataFrames

# Load employee data
employees_df = spark.read.json("path_to/employees.json")
# Load department data
departments_df = spark.read.json("path_to/departments.json")

You can also inspect the data:

employees_df.show()
departments_df.show()

Perform the Join

Here’s how you can join them using the dept_id column:

# Inner Join: Only matching dept_ids will be kept
joined_df = employees_df.join(departments_df, on="dept_id", how="inner")
joined_df.show()

✅ Output:

Different Types of Joins

You can also use other join types to transform JSON to columns.

Left Join: Keep all employees even if no matching department

employees_df.join(departments_df, on="dept_id", how="left").show()

Right Join: Keep all departments even if no matching employee

employees_df.join(departments_df, on="dept_id", how="right").show()

Outer Join: Keep everything from both sides

employees_df.join(departments_df, on="dept_id", how="outer").show()

🧹 Bonus Tip: Handling Column Name Conflicts

If the two JSONs have common column names (other than the key), you can rename columns before joining to avoid conflicts:

employees_df.join(departments_df, on="dept_id", how="outer").show()

📌 Final Notes

Always check the schemas of your JSON files before joining.
For larger files, consider partitioning and caching your DataFrames for better performance.
Use broadcast joins if one of the DataFrames is very small, for faster operations.

Srini

Data Engineer with deep AI and Generative AI expertise, crafting high-performance data pipelines in PySpark, Databricks, and SQL. Skilled in Python, AWS, and Linux—building scalable, cloud-native solutions for smart applications.