If you are coming from a traditional RDBMS background (Oracle, MySQL, PostgreSQL), you might expect SQL statements to behave the same way in PySpark SQL. However, many data engineers are surprised when a perfectly valid SQL query like:

DELETE FROM target_table
WHERE id IN (
    SELECT id
    FROM source_table
    WHERE status = 'INACTIVE'
);

either throws an error or silently deletes zero rows in Spark.

This blog explains why DELETE with WHERE IN (subquery) often does not work in PySpark SQL, what Spark actually supports, and the correct, production-ready alternatives you should use.


Understanding DELETE in Spark SQL

Spark SQL is not a traditional database engine. It was designed for distributed processing, not row-level mutations. As a result:

  • DELETE is not universally supported
  • Subqueries inside DELETE have strict limitations
  • Behavior depends on table format (Delta vs Parquet)

DELETE is supported only for:

  • Delta tables
  • Some Hive ACID tables

DELETE is NOT supported for:

  • Temporary views
  • DataFrames
  • Parquet / CSV / JSON tables

Common Mistake: Multiple WHERE Clauses

Many users try something like this:

DELETE FROM orders
WHERE country = 'IN'
WHERE order_status = 'CANCELLED';

❌ This is invalid SQL.
SQL allows only one WHERE clause.

Correct syntax

DELETE FROM orders
WHERE country = 'IN'
  AND order_status = 'CANCELLED';
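
To see what the combined predicate does, here is a tiny plain-Python illustration (the sample rows are invented for this post):

```python
# Illustrative only: which rows a combined
# WHERE country = 'IN' AND order_status = 'CANCELLED' would delete,
# simulated with plain Python.
orders = [
    {"id": 1, "country": "IN", "order_status": "CANCELLED"},
    {"id": 2, "country": "IN", "order_status": "SHIPPED"},
    {"id": 3, "country": "US", "order_status": "CANCELLED"},
]

def should_delete(row):
    # Both conditions must hold, exactly like WHERE ... AND ...
    return row["country"] == "IN" and row["order_status"] == "CANCELLED"

remaining = [r for r in orders if not should_delete(r)]
print([r["id"] for r in remaining])  # -> [2, 3]
```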

WHERE IN (Subquery) – Valid SQL but Limited in Spark

This query is logically correct SQL:

DELETE FROM target_table
WHERE id IN (
    SELECT id
    FROM source_table
    WHERE status = 'INACTIVE'
);
  • One WHERE belongs to the DELETE
  • Another WHERE belongs to the subquery

✔️ This is valid SQL
❌ But Spark SQL may not support it: subqueries in DELETE work only on newer Delta Lake versions, and non-Delta tables do not support DELETE at all


Why DELETE with Subqueries Fails in PySpark

1. Spark SQL limitations

Spark did not historically support subqueries inside DML statements (DELETE, UPDATE). Delta Lake added subquery support in DELETE only in later releases, so older runtimes raise an error.

2. Query optimizer restrictions

Spark prefers join-based execution, not nested subqueries, for distributed deletes.

3. Non-Delta tables

If target_table is not a Delta table, the query will fail silently or throw:

  • DELETE is not supported
  • Subqueries are not supported in DELETE

Check If Your Table Supports DELETE

Before running DELETE, always verify table format:

DESCRIBE DETAIL target_table;

Look for:

format: delta

If the format is not delta (and the table is not Hive ACID), DELETE will not work.
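
You can automate this check as well. The sketch below simulates the decision with a plain dict; in a real session you would read the format from spark.sql("DESCRIBE DETAIL target_table").first().asDict():

```python
# Illustrative sketch: decide whether SQL DELETE is safe based on the
# table format reported by DESCRIBE DETAIL. The dict below simulates
# one row of that result; in a real session you would fetch it with
#   detail = spark.sql("DESCRIBE DETAIL target_table").first().asDict()
detail = {"format": "parquet", "location": "s3://bucket/target_table"}

def supports_delete(detail_row):
    # Only Delta tables pass this check; Hive ACID tables are a
    # separate special case not covered here.
    return detail_row.get("format") == "delta"

print(supports_delete(detail))  # -> False
```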


Best Practice: Use MERGE INTO (Join-Based Delete)

Spark SQL has no DELETE ... USING syntax (that is PostgreSQL). The supported join-based delete in Spark is Delta Lake's MERGE INTO with a WHEN MATCHED ... THEN DELETE clause.

Recommended Solution (Delta Tables)

MERGE INTO target_table t
USING source_table s
ON t.id = s.id
WHEN MATCHED AND s.status = 'INACTIVE' THEN DELETE;

Why this works better

  • No subquery limitations
  • Optimized for distributed execution
  • Fully supported by Delta Lake
  • Easier to debug and maintain

Comparison: IN Subquery vs MERGE INTO

Approach            | Works in Spark | Performance | Recommendation
--------------------|----------------|-------------|-----------------
IN (subquery)       | ❌ Limited     | Poor        | ❌ Avoid
MERGE INTO (delete) | ✅ Yes         | High        | ✅ Best
Filter + overwrite  | ✅ Yes         | Medium      | ✅ For non-Delta

Alternative for Non-Delta Tables (Filter & Overwrite)

If your table is Parquet or Hive, use the DataFrame approach.

PySpark DataFrame Solution

target_df = spark.table("target_table")
source_df = (
    spark.table("source_table")
    .filter("status = 'INACTIVE'")
    .select("id")
)

# left_anti keeps only target rows whose id has no match in source_df
final_df = target_df.join(
    source_df,
    on="id",
    how="left_anti"
)

# Spark refuses to overwrite a table it is still reading from, so break
# the lineage first; checkpoint() is one common workaround (it requires
# a checkpoint directory to be configured).
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")
final_df = final_df.checkpoint()

final_df.write \
    .mode("overwrite") \
    .saveAsTable("target_table")

Explanation

  • left_anti keeps rows not present in source
  • Mimics DELETE behavior safely
  • Works in AWS Glue, EMR, Databricks
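
The left_anti semantics can be illustrated with plain Python (sample data invented here): only rows whose join key has no match on the right side survive.

```python
# Illustrative sketch of left_anti join semantics with plain Python.
target = [
    {"id": 1, "amount": 100},
    {"id": 2, "amount": 200},
    {"id": 3, "amount": 300},
]
# ids selected from source_table WHERE status = 'INACTIVE'
inactive_ids = {1, 3}

# left_anti: keep target rows whose id does NOT appear on the right side
final_rows = [row for row in target if row["id"] not in inactive_ids]
print([row["id"] for row in final_rows])  # -> [2]
```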

Real-World Example

Scenario

Delete all orders created by inactive users.

❌ Problematic Query

DELETE FROM orders
WHERE user_id IN (
    SELECT user_id
    FROM users
    WHERE active_flag = 'N'
);

✅ Correct Spark SQL Approach (Delta)

MERGE INTO orders o
USING users u
ON o.user_id = u.user_id
WHEN MATCHED AND u.active_flag = 'N' THEN DELETE;
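
To sanity-check the expected outcome, here is a plain-Python simulation of the scenario (sample data invented for illustration): an order is deleted when its user is flagged inactive.

```python
# Plain-Python simulation of the scenario (invented sample data):
# delete orders placed by inactive users.
orders = [
    {"order_id": 10, "user_id": 1},
    {"order_id": 11, "user_id": 2},
    {"order_id": 12, "user_id": 1},
]
users = [
    {"user_id": 1, "active_flag": "N"},
    {"user_id": 2, "active_flag": "Y"},
]

inactive_users = {u["user_id"] for u in users if u["active_flag"] == "N"}
remaining_orders = [o for o in orders if o["user_id"] not in inactive_users]
print([o["order_id"] for o in remaining_orders])  # -> [11]
```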

Common Errors & Fixes

Error                  | Cause               | Fix
-----------------------|---------------------|--------------------------
DELETE not supported   | Non-Delta table     | Convert to Delta
Deletes 0 rows         | Subquery limitation | Use MERGE INTO (delete)
Syntax error           | Multiple WHERE      | Combine with AND / OR
Works in DB, not Spark | Engine difference   | Use a join-based delete

Best Practices for Data Engineers

  • Always use Delta Lake if you need DELETE/UPDATE
  • Avoid IN (subquery) in Spark SQL
  • Prefer MERGE INTO ... WHEN MATCHED THEN DELETE for join-based deletes
  • For Glue/EMR → use DataFrame overwrite
  • Validate deletes in lower environments first

Conclusion

Although WHERE IN (subquery) is valid SQL, Spark SQL has clear limitations when it comes to DELETE operations. Relying on traditional SQL patterns can lead to silent failures or unexpected results.