In PySpark, you can delete rows from a DataFrame using various methods depending on your criteria for deletion.


Delete Rows in PySpark: 5 Top Methods
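The snippets below assume an active SparkSession named spark and a sample DataFrame df with Name and Age columns, created roughly like this (the column names and data are purely illustrative):

# Assumed setup for the examples that follow
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delete-rows-demo").getOrCreate()

# Illustrative sample data: one row has a null Age, one is under 18
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", None), ("Carol", 15)],
    ["Name", "Age"],
)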

Here are some common approaches:

01. Filtering

Use the filter() function to create a new DataFrame excluding rows that meet specific conditions.

# Example: Delete rows where Age is null

df_filtered = df.filter(df.Age.isNotNull())
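Conditions can also be combined, so several deletion rules can be applied in one pass; for instance, using the assumed Age column from the setup above:

# Example: Delete rows where Age is null or under 18
df_filtered = df.filter(df.Age.isNotNull() & (df.Age >= 18))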

02. Where clause

The where() function is an alias for filter() and can be used in exactly the same way.

# Example: Delete rows where Age is null

df_filtered = df.where(df.Age.isNotNull())
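Both where() and filter() also accept a SQL-style string condition, which some people find more readable:

# Example: The same deletion expressed as a SQL string
df_filtered = df.where("Age IS NOT NULL")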

03. Drop rows with null values

Use the na.drop() function to drop rows containing any null or NaN values.

# Example: Delete rows with any null values

df_filtered = df.na.drop()
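na.drop() takes optional how, thresh, and subset arguments, so the null check can be restricted to particular columns; a small sketch using the assumed Age column:

# Example: Delete rows only when the Age column is null
df_filtered = df.na.drop(subset=["Age"])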

04. Drop rows based on specific conditions

Note that drop() removes columns rather than rows, so it cannot be given a row condition. To delete rows that match a condition, pass the negated condition to filter() (or where()).

# Example: Delete rows where Age is less than 18

df_filtered = df.filter(~(df.Age < 18))

05. SQL Expression

You can use SQL expressions for more complex conditions.

# Example: Delete rows where Age is less than 18 using SQL expression

df.createOrReplaceTempView("data")

df_filtered = spark.sql("SELECT * FROM data WHERE Age >= 18")
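Because this is ordinary Spark SQL, arbitrarily complex predicates work as well; a sketch using the assumed Name column from the setup above:

# Example: Delete rows where Age is under 18 or Name is null
df_filtered = spark.sql("SELECT * FROM data WHERE Age >= 18 AND Name IS NOT NULL")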

Conclusion

Choose the method that best suits your requirements for deleting rows. Remember that all of these operations return a new DataFrame with the specified rows removed; DataFrames are immutable, so the original stays unchanged and there is no true in-place deletion. If you want to keep working with the reduced data under the same name, simply reassign the filtered DataFrame to the original variable.
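For example:

# Example: Reassign so the original variable name points to the filtered data
df = df.filter(df.Age >= 18)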