In PySpark, you can delete rows from a DataFrame using various methods depending on your criteria for deletion.


Delete Rows in PySpark: 5 Top Methods
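The snippets below assume an active SparkSession named spark and a sample DataFrame df with Name and Age columns, created roughly like this (the column names and data are purely illustrative):

# Assumed setup for the examples that follow
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delete-rows-demo").getOrCreate()

# Illustrative sample data: one row has a null Age, one is under 18
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", None), ("Carol", 15)],
    ["Name", "Age"],
)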

Here are some common approaches:

01. Filtering

Use the filter() function to create a new DataFrame excluding rows that meet specific conditions.

# Example: Delete rows where Age is null

df_filtered = df.filter(df.Age.isNotNull())
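Conditions can also be combined, so several deletion rules can be applied in one pass; for instance, using the assumed Age column from the setup above:

# Example: Delete rows where Age is null or under 18
df_filtered = df.filter(df.Age.isNotNull() & (df.Age >= 18))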

02. Where clause

The where() function is an alias for filter() and can be used in exactly the same way.

# Example: Delete rows where Age is null

df_filtered = df.where(df.Age.isNotNull())
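Both where() and filter() also accept a SQL-style string condition, which some people find more readable:

# Example: The same deletion expressed as a SQL string
df_filtered = df.where("Age IS NOT NULL")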

03. Drop rows with null values

Use the na.drop() function to drop rows containing any null or NaN values.

# Example: Delete rows with any null values

df_filtered = df.na.drop()
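na.drop() takes optional how, thresh, and subset arguments, so the null check can be restricted to particular columns; a small sketch using the assumed Age column:

# Example: Delete rows only when the Age column is null
df_filtered = df.na.drop(subset=["Age"])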

04. Drop rows based on specific conditions

Note that drop() removes columns rather than rows, so it cannot be given a row condition. To delete rows that match a condition, pass the negated condition to filter() (or where()).

# Example: Delete rows where Age is less than 18

df_filtered = df.filter(~(df.Age < 18))

05. SQL Expression

You can use SQL expressions for more complex conditions.

# Example: Delete rows where Age is less than 18 using SQL expression

df.createOrReplaceTempView("data")

df_filtered = spark.sql("SELECT * FROM data WHERE Age >= 18")
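Because this is ordinary Spark SQL, arbitrarily complex predicates work as well; a sketch using the assumed Name column from the setup above:

# Example: Delete rows where Age is under 18 or Name is null
df_filtered = spark.sql("SELECT * FROM data WHERE Age >= 18 AND Name IS NOT NULL")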

Conclusion

Choose the method that best suits your requirements for deleting rows. Remember that all of these operations return a new DataFrame with the specified rows removed; DataFrames are immutable, so the original stays unchanged and there is no true in-place deletion. If you want to keep working with the reduced data under the same name, simply reassign the filtered DataFrame to the original variable.
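For example:

# Example: Reassign so the original variable name points to the filtered data
df = df.filter(df.Age >= 18)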