If you are coming from a traditional RDBMS background (Oracle, MySQL, PostgreSQL), you might expect SQL statements to behave the same way in PySpark SQL. However, many data engineers are surprised when a perfectly valid SQL query like:
```sql
DELETE FROM target_table
WHERE id IN (
    SELECT id
    FROM source_table
    WHERE status = 'INACTIVE'
);
```
either throws an error or silently deletes zero rows in Spark.
This blog explains why DELETE with WHERE IN (subquery) often does not work in PySpark SQL, what Spark actually supports, and the correct, production-ready alternatives you should use.
Understanding DELETE in Spark SQL
Spark SQL is not a traditional database engine. It was designed for distributed processing, not row-level mutations. As a result:
- DELETE is not universally supported
- Subqueries inside DELETE have strict limitations
- Behavior depends on the table format (Delta vs. Parquet)
DELETE is supported only for:
- Delta tables
- Some Hive ACID tables
DELETE is NOT supported for:
- Temporary views
- DataFrames
- Parquet / CSV / JSON tables
Common Mistake: Multiple WHERE Clauses
Many users try something like this:
```sql
DELETE FROM orders
WHERE country = 'IN'
WHERE order_status = 'CANCELLED';
```
❌ This is invalid SQL.
SQL allows only one WHERE clause.
Correct syntax
```sql
DELETE FROM orders
WHERE country = 'IN'
  AND order_status = 'CANCELLED';
```
WHERE IN (Subquery) – Valid SQL but Limited in Spark
This query is logically correct SQL:
```sql
DELETE FROM target_table
WHERE id IN (
    SELECT id
    FROM source_table
    WHERE status = 'INACTIVE'
);
```
- One WHERE belongs to the DELETE
- The other WHERE belongs to the subquery
✔️ This is valid SQL
❌ But Spark SQL may not support it
Why DELETE with Subqueries Fails in PySpark
1. Spark SQL limitations
Spark does not fully support subqueries inside DML operations (DELETE, UPDATE).
2. Query optimizer restrictions
Spark prefers join-based execution, not nested subqueries, for distributed deletes.
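You can see why the two forms are interchangeable by rewriting the subquery as a join by hand. Here is a minimal plain-Python sketch (no Spark needed; the rows are made up for illustration) showing that the IN-subquery filter and a join-based filter keep exactly the same surviving rows:

```python
# Toy rows standing in for target_table and source_table (illustrative data only).
target = [{"id": 1}, {"id": 2}, {"id": 3}]
source = [{"id": 2, "status": "INACTIVE"}, {"id": 3, "status": "ACTIVE"}]

# IN (subquery) form: collect inactive ids, then filter the target.
inactive_ids = {r["id"] for r in source if r["status"] == "INACTIVE"}
survivors_subquery = [r for r in target if r["id"] not in inactive_ids]

# Join form: keep target rows with no matching inactive source row.
survivors_join = [
    t for t in target
    if not any(s["id"] == t["id"] and s["status"] == "INACTIVE" for s in source)
]

assert survivors_subquery == survivors_join  # both keep ids 1 and 3
print(survivors_subquery)
```

A distributed engine can shuffle and execute the join form efficiently across partitions, which is why Spark steers you toward join-based deletes.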
3. Non-Delta tables
If target_table is not a Delta table, the query will fail silently or throw:
```
DELETE is not supported
Subqueries are not supported in DELETE
```
Check If Your Table Supports DELETE
Before running DELETE, always verify table format:
```sql
DESCRIBE DETAIL target_table;
```
Look for:
format: delta
If the format is not delta, DELETE will not work.
Best Practice: Use a Join-Based Delete (MERGE INTO)
Note that Spark SQL does not accept `DELETE FROM ... USING ...` — that is PostgreSQL syntax. The join-based delete Spark actually supports is `MERGE INTO ... WHEN MATCHED THEN DELETE`, which Delta Lake executes as an optimized distributed join.
Recommended Solution (Delta Tables)
```sql
MERGE INTO target_table t
USING (
    SELECT id
    FROM source_table
    WHERE status = 'INACTIVE'
) s
ON t.id = s.id
WHEN MATCHED THEN DELETE;
```
Why this works better
- No subquery limitations
- Optimized for distributed execution
- Fully supported by Delta Lake
- Easier to debug and maintain
Comparison: IN Subquery vs MERGE-Based Delete
| Approach | Works in Spark | Performance | Recommendation |
|---|---|---|---|
| IN (subquery) | ❌ Limited | Poor | ❌ Avoid |
| MERGE ... WHEN MATCHED THEN DELETE | ✅ Yes | High | ✅ Best |
| Filter + overwrite | ✅ Yes | Medium | ✅ For non-Delta |
Alternative for Non-Delta Tables (Filter & Overwrite)
If your table is Parquet or Hive, use the DataFrame approach.
PySpark DataFrame Solution
```python
target_df = spark.table("target_table")

source_df = (
    spark.table("source_table")
    .filter("status = 'INACTIVE'")
    .select("id")
)

# left_anti keeps only target rows with no matching id in source_df
final_df = target_df.join(source_df, on="id", how="left_anti")

# Break the lineage before writing: Spark refuses to overwrite a table
# that the same query plan is still reading from.
final_df = final_df.localCheckpoint()

final_df.write.mode("overwrite").saveAsTable("target_table")
```
Explanation
- left_anti keeps rows not present in the source
- Mimics DELETE behavior safely
- Works in AWS Glue, EMR, Databricks
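The left_anti semantics can be modeled in a few lines of plain Python (illustrative ids only, no Spark required): a left anti join returns the target rows whose key has no match in the source, carrying only the target's columns:

```python
def left_anti(target_rows, source_ids, key="id"):
    """Mimic df.join(other, on=key, how='left_anti'):
    keep target rows whose key is absent from the source."""
    ids = set(source_ids)
    return [row for row in target_rows if row[key] not in ids]

orders = [
    {"id": 10, "amount": 5.0},
    {"id": 11, "amount": 7.5},
    {"id": 12, "amount": 9.0},
]
inactive_ids = [11]

kept = left_anti(orders, inactive_ids)
print(kept)  # order 11 is "deleted"; 10 and 12 survive with all columns
```

Writing `kept` back over the original table is what turns this filter into a delete.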
Real-World Example
Scenario
Delete cancelled orders created by inactive users.
❌ Problematic Query
```sql
DELETE FROM orders
WHERE user_id IN (
    SELECT user_id
    FROM users
    WHERE active_flag = 'N'
);
```
✅ Correct Spark SQL Approach
```sql
MERGE INTO orders o
USING (
    SELECT user_id
    FROM users
    WHERE active_flag = 'N'
) u
ON o.user_id = u.user_id
WHEN MATCHED THEN DELETE;
```
Common Errors & Fixes
| Error | Cause | Fix |
|---|---|---|
| DELETE not supported | Non-Delta table | Convert to Delta |
| Deletes 0 rows | Subquery limitation | Use a MERGE-based delete |
| Syntax error | Multiple WHERE | Use AND / OR |
| Works in DB, not Spark | Engine difference | Join-based delete |
Best Practices for Data Engineers
- Always use Delta Lake if you need DELETE/UPDATE
- Avoid IN (subquery) in Spark SQL
- Prefer MERGE-based deletes
- For Glue/EMR → use the DataFrame overwrite pattern
- Validate deletes in lower environments first
Conclusion
Although WHERE IN (subquery) is valid SQL, Spark SQL has clear limitations when it comes to DELETE operations. Relying on traditional SQL patterns can lead to silent failures or unexpected results.