In PySpark SQL, you can add conditions to your queries using the WHERE clause to filter rows based on specific criteria. Below are some examples of how to add these conditions.
PySpark SQL WHERE Clause: Using Conditions
PySpark is the Python API for Spark, and you can use it for data transformations. Once you have set up a PySpark session, you can follow along with the examples below. If you are a beginner, start with these simple queries before moving on to more complex conditions.

01. Simple Condition
Suppose you have a DataFrame named df, and you want to select rows where the ‘age’ column is greater than 30.
from pyspark.sql import SparkSession
#Initialize Spark session
spark = SparkSession.builder.appName("example").getOrCreate()
#Create DataFrame
data = [("Alice", 25), ("Bob", 35), ("Charlie", 30)]
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)
#Add a condition
result = df.where(df["age"] > 30)
result.show()
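If you prefer to express the same condition as an actual SQL query rather than through the DataFrame API, you can register the DataFrame as a temporary view and use spark.sql(). This is a minimal sketch; the view name "people" is just an illustrative choice.
#Register the DataFrame as a temporary view (the name "people" is arbitrary)
df.createOrReplaceTempView("people")
#Equivalent SQL query using a WHERE clause
result = spark.sql("SELECT name, age FROM people WHERE age > 30")
result.show()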
02. Multiple Conditions with AND
To apply multiple conditions with the AND operator (&), wrap each condition in parentheses; in Python, & binds more tightly than comparison operators, so the parentheses are required.
#Select rows where 'age' is greater than 25 and 'name' starts with 'A'
result = df.where((df["age"] > 25) & (df["name"].startswith("A")))
result.show()
Multiple Conditions with OR: Similarly, you can use the OR operator (|) to combine conditions.
#Select rows where 'age' is less than 30 or 'name' is 'Bob'
result = df.where((df["age"] < 30) | (df["name"] == "Bob"))
result.show()
IS NULL / IS NOT NULL Conditions: To filter rows based on NULL values:
#Select rows where 'age' is not NULL
result = df.where(df["age"].isNotNull())
result.show()
#Select rows where 'age' is NULL
result = df.where(df["age"].isNull())
result.show()
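Note that the sample data above has no NULL ages, so the isNull() filter returns an empty result. The sketch below builds a small DataFrame with a missing age purely for illustration; the data and variable names are assumptions, not part of the original example.
#Illustrative DataFrame with a missing age value
data_with_null = [("Alice", 25), ("Dave", None)]
df_null = spark.createDataFrame(data_with_null, ["name", "age"])
#Select rows where 'age' is NULL (returns the "Dave" row)
df_null.where(df_null["age"].isNull()).show()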
03. IN Condition
You can use the isin method to filter rows based on a list of values.
#Select rows where 'name' is in a list of names
name_list = ["Alice", "Charlie"]
result = df.where(df["name"].isin(name_list))
result.show()
04. LIKE Condition (Pattern Matching)
Use the like method for SQL-style pattern matching with the wildcards % (any sequence of characters) and _ (a single character).
#Select rows where 'name' starts with 'A'
result = df.where(df["name"].like("A%"))
result.show()
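The _ wildcard matches exactly one character. As a minimal sketch against the same df, the pattern below matches three-letter names starting with 'B':
#Select rows where 'name' starts with 'B' and is exactly three characters long
result = df.where(df["name"].like("B__"))
result.show()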
05. BETWEEN Condition
Filter rows within a range of values using the between method; both bounds are inclusive.
#Select rows where 'age' is between 25 and 35 (inclusive)
result = df.where(df["age"].between(25, 35))
result.show()
These are some common examples of how to add conditions in PySpark SQL queries. You can combine these conditions to create more complex filtering criteria as needed for your data analysis tasks.
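For example, the methods shown above can be chained with & and | into a single filter. This sketch reuses the sample df and name_list from the earlier examples:
#Select rows where 'age' is between 25 and 35 and 'name' is either in the list or starts with 'B'
result = df.where(df["age"].between(25, 35) & (df["name"].isin(name_list) | df["name"].like("B%")))
result.show()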