In PySpark SQL, you can add conditions to your queries with the WHERE clause (or the equivalent where() method on a DataFrame) to filter rows based on specific criteria. The examples below cover the most common kinds of conditions.

PySpark SQL WHERE Clause: Using Conditions

PySpark is the Python API for Spark, and you can use it for data transformations. Once you have set up a PySpark session, you can work through the examples below. If you are a beginner, try these simple queries first, before you move on to more complex conditions.


01. Simple Condition

Suppose you have a DataFrame named df, and you want to select rows where the ‘age’ column is greater than 30.

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

# Create a sample DataFrame
data = [("Alice", 25), ("Bob", 35), ("Charlie", 30)]
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)

# Select rows where 'age' is greater than 30
result = df.where(df["age"] > 30)
result.show()
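
Since this post is about PySpark SQL, the same filter can also be written as a SQL query. Here is a minimal sketch, assuming you first register the DataFrame as a temporary view (the view name "people" is an arbitrary choice for this example):

# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("people")

# The same condition expressed as a SQL WHERE clause
result = spark.sql("SELECT * FROM people WHERE age > 30")
result.show()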

02. Multiple Conditions with AND

To apply multiple conditions, combine them with the & (AND) operator and wrap each condition in parentheses; the parentheses are required because & has higher precedence than comparison operators in Python.

# Select rows where 'age' is greater than 25 and 'name' starts with 'A'
result = df.where((df["age"] > 25) & (df["name"].startswith("A")))
result.show()
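
As a side note, where() (an alias of filter()) also accepts a SQL expression string, so the same AND condition can be written without the & operator:

# The same AND condition written as a SQL expression string
result = df.where("age > 25 AND name LIKE 'A%'")
result.show()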

03. Multiple Conditions with OR

Similarly, you can combine conditions with the | (OR) operator.

# Select rows where 'age' is less than 30 or 'name' is 'Bob'
result = df.where((df["age"] < 30) | (df["name"] == "Bob"))
result.show()
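
If you prefer not to index the DataFrame, the same condition can be built with the col() helper from pyspark.sql.functions; a quick sketch:

from pyspark.sql.functions import col

# The same OR condition using col() column references
result = df.where((col("age") < 30) | (col("name") == "Bob"))
result.show()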

04. IS NULL / IS NOT NULL Conditions

To filter rows based on NULL values, use the isNull and isNotNull column methods.

# Select rows where 'age' is not NULL
result = df.where(df["age"].isNotNull())
result.show()

# Select rows where 'age' is NULL
result = df.where(df["age"].isNull())
result.show()
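
Note that the sample df above has no NULL ages, so the isNull() filter returns an empty result. To see both filters in action, here is a sketch with a second, hypothetical DataFrame that contains a missing value:

# A hypothetical DataFrame with a missing age (None becomes NULL in Spark)
data_with_nulls = [("Alice", 25), ("Bob", None), ("Charlie", 30)]
df_nulls = spark.createDataFrame(data_with_nulls, ["name", "age"])

# Keeps only the row with a NULL age
df_nulls.where(df_nulls["age"].isNull()).show()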

05. IN Condition

You can use the isin method to filter rows based on a list of values.

# Select rows where 'name' is in a list of names
name_list = ["Alice", "Charlie"]
result = df.where(df["name"].isin(name_list))
result.show()
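
To invert the condition, negate it with the ~ operator:

# Select rows where 'name' is NOT in the list
result = df.where(~df["name"].isin(name_list))
result.show()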

06. LIKE Condition (Pattern Matching)

Use the like method for pattern matching with the SQL wildcards % (any sequence of characters) and _ (exactly one character).

# Select rows where 'name' starts with 'A'
result = df.where(df["name"].like("A%"))
result.show()
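
The _ wildcard matches exactly one character. For example, this sketch matches three-letter names ending in 'ob', such as 'Bob':

# Select rows where 'name' is any single character followed by 'ob'
result = df.where(df["name"].like("_ob"))
result.show()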

07. BETWEEN Condition

Filter rows within a range of values using the between method; both bounds are included.

# Select rows where 'age' is between 25 and 35 (inclusive)
result = df.where(df["age"].between(25, 35))
result.show()
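
The equivalent SQL uses the BETWEEN keyword. A sketch, assuming the "people" temporary view registered earlier:

# BETWEEN is inclusive of both bounds in SQL as well
result = spark.sql("SELECT * FROM people WHERE age BETWEEN 25 AND 35")
result.show()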

These are some common ways to add conditions in PySpark SQL queries. You can combine them to build more complex filtering criteria for your data analysis tasks, as in the sketch below.
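
For instance, a minimal sketch combining several of the conditions shown above into a single filter:

# Combine BETWEEN, LIKE, and NOT NULL checks in one condition
result = df.where(
    df["age"].between(25, 35)
    & df["name"].like("%o%")
    & df["age"].isNotNull()
)
result.show()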
