In PySpark, you can delete rows from a DataFrame in several ways, depending on the criteria that determine which rows to remove.

Delete Rows in PySpark: 5 Top Methods
Here are some common approaches:
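All of the snippets below assume an active SparkSession named spark and a DataFrame df with an Age column. A minimal sketch of that setup (the sample data is purely illustrative) could be:
# Example: Illustrative setup assumed by the snippets below
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("delete-rows-demo").getOrCreate()
df = spark.createDataFrame(
    [("Alice", 25), ("Bob", None), ("Carol", 15)],
    ["Name", "Age"],
)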
01. Filtering
Use the filter() function to create a new DataFrame that keeps only the rows satisfying a condition; rows that fail the condition are left out.
# Example: Delete rows where Age is null
df_filtered = df.filter(df.Age.isNotNull())
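filter() conditions can also be combined with the & (and), | (or), and ~ (not) operators if you need to delete rows on multiple criteria; for example, assuming the same Age column:
# Example: Delete rows where Age is null or Age is less than 18
df_filtered = df.filter(df.Age.isNotNull() & (df.Age >= 18))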
02. Where clause
The where() function is an alias for filter(), so the two can be used interchangeably.
# Example: Delete rows where Age is null
df_filtered = df.where(df.Age.isNotNull())
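where() (like filter()) also accepts a SQL expression string, which can be easier to read for simple conditions:
# Example: Delete rows where Age is null, using a SQL expression string
df_filtered = df.where("Age IS NOT NULL")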
03. Drop rows with null values
Use the na.drop() function to drop rows containing any null or NaN values.
# Example: Delete rows with any null values
df_filtered = df.na.drop()
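na.drop() also accepts optional how and subset parameters to control which rows are dropped; a short sketch, assuming Name and Age columns:
# Example: Delete rows only when the Age column is null
df_filtered = df.na.drop(subset=["Age"])
# Example: Delete rows only when both Name and Age are null
df_filtered = df.na.drop(how="all", subset=["Name", "Age"])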
04. Drop rows based on specific conditions
Note that drop() removes columns, not rows, so to delete rows matching a condition, pass the negated condition to filter() using the ~ operator.
# Example: Delete rows where Age is less than 18
df_filtered = df.filter(~(df.Age < 18))
05. SQL Expression
You can use SQL expressions for more complex conditions.
# Example: Delete rows where Age is less than 18 using SQL expression
df.createOrReplaceTempView("data")
df_filtered = spark.sql("SELECT * FROM data WHERE Age >= 18")
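The same temporary view can handle richer conditions by combining predicates in the WHERE clause; for instance:
# Example: Delete rows where Age is null or less than 18
df_filtered = spark.sql("SELECT * FROM data WHERE Age IS NOT NULL AND Age >= 18")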
Conclusion
Choose the method that best fits your criteria for deleting rows from the DataFrame. Remember, these operations return a new DataFrame with the specified rows removed; Spark DataFrames are immutable, so the original DataFrame itself stays unchanged. If you want to keep working under the original name, reassign the filtered DataFrame to the original variable.
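For instance, reassigning simply points the original variable name at the filtered result:
# Example: Reuse the original variable name for the filtered DataFrame
df = df.filter(df.Age >= 18)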






