If you are coming from a traditional RDBMS background (Oracle, MySQL, PostgreSQL), you might expect SQL statements to behave the same way in PySpark SQL. However, many data engineers are surprised when a perfectly valid SQL query like:

DELETE FROM target_table
WHERE id IN (
    SELECT id
    FROM source_table
    WHERE status = 'INACTIVE'
);

either throws an error or silently deletes zero rows in Spark.

This blog explains why DELETE with WHERE IN (subquery) often does not work in PySpark SQL, what Spark actually supports, and the correct, production-ready alternatives you should use.


Understanding DELETE in Spark SQL

Spark SQL is not a traditional database engine. It was designed for distributed processing, not row-level mutations. As a result:

  • DELETE is not universally supported
  • Subqueries inside DELETE have strict limitations
  • Behavior depends on table format (Delta vs Parquet)

DELETE is supported only for:

  • Delta tables
  • Some Hive ACID tables

DELETE is NOT supported for:

  • Temporary views
  • DataFrames
  • Parquet / CSV / JSON tables

Common Mistake: Multiple WHERE Clauses

Many users try something like this:

DELETE FROM orders
WHERE country = 'IN'
WHERE order_status = 'CANCELLED';

❌ This is invalid SQL.
SQL allows only one WHERE clause.

Correct syntax

DELETE FROM orders
WHERE country = 'IN'
  AND order_status = 'CANCELLED';
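
To see what the combined predicate does, here is a tiny plain-Python illustration (the sample rows are invented for this post):

```python
# Illustrative only: which rows a combined
# WHERE country = 'IN' AND order_status = 'CANCELLED' would delete,
# simulated with plain Python.
orders = [
    {"id": 1, "country": "IN", "order_status": "CANCELLED"},
    {"id": 2, "country": "IN", "order_status": "SHIPPED"},
    {"id": 3, "country": "US", "order_status": "CANCELLED"},
]

def should_delete(row):
    # Both conditions must hold, exactly like WHERE ... AND ...
    return row["country"] == "IN" and row["order_status"] == "CANCELLED"

remaining = [r for r in orders if not should_delete(r)]
print([r["id"] for r in remaining])  # -> [2, 3]
```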

WHERE IN (Subquery) – Valid SQL but Limited in Spark

This query is logically correct SQL:

DELETE FROM target_table
WHERE id IN (
    SELECT id
    FROM source_table
    WHERE status = 'INACTIVE'
);
  • One WHERE belongs to the DELETE
  • Another WHERE belongs to the subquery

✔️ This is valid SQL
❌ But Spark SQL may not support it: subqueries in DELETE work only on newer Delta Lake versions, and non-Delta tables do not support DELETE at all


Why DELETE with Subqueries Fails in PySpark

1. Spark SQL limitations

Spark did not historically support subqueries inside DML statements (DELETE, UPDATE). Delta Lake added subquery support in DELETE only in later releases, so older runtimes raise an error.

2. Query optimizer restrictions

Spark prefers join-based execution, not nested subqueries, for distributed deletes.

3. Non-Delta tables

If target_table is not a Delta table, the query will fail silently or throw:

  • DELETE is not supported
  • Subqueries are not supported in DELETE

Check If Your Table Supports DELETE

Before running DELETE, always verify table format:

DESCRIBE DETAIL target_table;

Look for:

format: delta

If the format is not delta (and the table is not Hive ACID), DELETE will not work.
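
You can automate this check as well. The sketch below simulates the decision with a plain dict; in a real session you would read the format from spark.sql("DESCRIBE DETAIL target_table").first().asDict():

```python
# Illustrative sketch: decide whether SQL DELETE is safe based on the
# table format reported by DESCRIBE DETAIL. The dict below simulates
# one row of that result; in a real session you would fetch it with
#   detail = spark.sql("DESCRIBE DETAIL target_table").first().asDict()
detail = {"format": "parquet", "location": "s3://bucket/target_table"}

def supports_delete(detail_row):
    # Only Delta tables pass this check; Hive ACID tables are a
    # separate special case not covered here.
    return detail_row.get("format") == "delta"

print(supports_delete(detail))  # -> False
```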


Best Practice: Use MERGE INTO (Join-Based Delete)

Spark SQL has no DELETE ... USING syntax (that is PostgreSQL). The supported join-based delete in Spark is Delta Lake's MERGE INTO with a WHEN MATCHED ... THEN DELETE clause.

Recommended Solution (Delta Tables)

MERGE INTO target_table t
USING source_table s
ON t.id = s.id
WHEN MATCHED AND s.status = 'INACTIVE' THEN DELETE;

Why this works better

  • No subquery limitations
  • Optimized for distributed execution
  • Fully supported by Delta Lake
  • Easier to debug and maintain

Comparison: IN Subquery vs MERGE INTO

Approach            | Works in Spark | Performance | Recommendation
--------------------|----------------|-------------|-----------------
IN (subquery)       | ❌ Limited     | Poor        | ❌ Avoid
MERGE INTO (delete) | ✅ Yes         | High        | ✅ Best
Filter + overwrite  | ✅ Yes         | Medium      | ✅ For non-Delta

Alternative for Non-Delta Tables (Filter & Overwrite)

If your table is Parquet or Hive, use the DataFrame approach.

PySpark DataFrame Solution

target_df = spark.table("target_table")
source_df = (
    spark.table("source_table")
    .filter("status = 'INACTIVE'")
    .select("id")
)

# left_anti keeps only target rows whose id has no match in source_df
final_df = target_df.join(
    source_df,
    on="id",
    how="left_anti"
)

# Spark refuses to overwrite a table it is still reading from, so break
# the lineage first; checkpoint() is one common workaround (it requires
# a checkpoint directory to be configured).
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")
final_df = final_df.checkpoint()

final_df.write \
    .mode("overwrite") \
    .saveAsTable("target_table")

Explanation

  • left_anti keeps rows not present in source
  • Mimics DELETE behavior safely
  • Works in AWS Glue, EMR, Databricks
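
The left_anti semantics can be illustrated with plain Python (sample data invented here): only rows whose join key has no match on the right side survive.

```python
# Illustrative sketch of left_anti join semantics with plain Python.
target = [
    {"id": 1, "amount": 100},
    {"id": 2, "amount": 200},
    {"id": 3, "amount": 300},
]
# ids selected from source_table WHERE status = 'INACTIVE'
inactive_ids = {1, 3}

# left_anti: keep target rows whose id does NOT appear on the right side
final_rows = [row for row in target if row["id"] not in inactive_ids]
print([row["id"] for row in final_rows])  # -> [2]
```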

Real-World Example

Scenario

Delete all orders created by inactive users.

❌ Problematic Query

DELETE FROM orders
WHERE user_id IN (
    SELECT user_id
    FROM users
    WHERE active_flag = 'N'
);

✅ Correct Spark SQL Approach (Delta)

MERGE INTO orders o
USING users u
ON o.user_id = u.user_id
WHEN MATCHED AND u.active_flag = 'N' THEN DELETE;
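
To sanity-check the expected outcome, here is a plain-Python simulation of the scenario (sample data invented for illustration): an order is deleted when its user is flagged inactive.

```python
# Plain-Python simulation of the scenario (invented sample data):
# delete orders placed by inactive users.
orders = [
    {"order_id": 10, "user_id": 1},
    {"order_id": 11, "user_id": 2},
    {"order_id": 12, "user_id": 1},
]
users = [
    {"user_id": 1, "active_flag": "N"},
    {"user_id": 2, "active_flag": "Y"},
]

inactive_users = {u["user_id"] for u in users if u["active_flag"] == "N"}
remaining_orders = [o for o in orders if o["user_id"] not in inactive_users]
print([o["order_id"] for o in remaining_orders])  # -> [11]
```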

Common Errors & Fixes

Error                  | Cause               | Fix
-----------------------|---------------------|--------------------------
DELETE not supported   | Non-Delta table     | Convert to Delta
Deletes 0 rows         | Subquery limitation | Use MERGE INTO (delete)
Syntax error           | Multiple WHERE      | Combine with AND / OR
Works in DB, not Spark | Engine difference   | Use a join-based delete

Best Practices for Data Engineers

  • Always use Delta Lake if you need DELETE/UPDATE
  • Avoid IN (subquery) in Spark SQL
  • Prefer MERGE INTO ... WHEN MATCHED THEN DELETE for join-based deletes
  • For Glue/EMR → use DataFrame overwrite
  • Validate deletes in lower environments first

Conclusion

Although WHERE IN (subquery) is valid SQL, Spark SQL has clear limitations when it comes to DELETE operations. Relying on traditional SQL patterns can lead to silent failures or unexpected results.