In PySpark, you can write SQL queries using the Spark SQL module, which allows you to execute SQL queries on DataFrames. Here are several ways to write and execute SQL in PySpark.

01. Using the SQL Method

You can call the sql method on a SparkSession to run SQL against any registered view. Here’s an example:

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

# Create a DataFrame
data = [("Alice", 25),
        ("Bob", 30),
        ("Charlie", 35)]
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)

# Register the DataFrame as a temporary view
df.createOrReplaceTempView("people")

# Write and execute an SQL query
result = spark.sql("SELECT name, age FROM people WHERE age > 30")
result.show()
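One thing to note: a view registered with createOrReplaceTempView is scoped to the current SparkSession and disappears with it. If the view must be visible to other sessions in the same application, createOrReplaceGlobalTempView is the session-independent variant. A minimal sketch, reusing df from above (the people_global name is just for illustration):

# Register a global temporary view; it lives in the reserved global_temp database
df.createOrReplaceGlobalTempView("people_global")

# Query it with the global_temp qualifier
spark.sql("SELECT name, age FROM global_temp.people_global WHERE age > 30").show()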

02. Using the spark.sql Function

The spark.sql function is the same entry point shown above: it takes a query string and returns the result as a DataFrame. Assuming the people view registered in the previous example:

# Write and execute an SQL query using spark.sql
result = spark.sql("SELECT name, age FROM people WHERE age > 30")
result.show()
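Because spark.sql returns an ordinary DataFrame, the result can be chained straight into further DataFrame operations. A small sketch, again using the people view:

# spark.sql yields a DataFrame, so transformations chain onto the result
result = (spark.sql("SELECT name, age FROM people")
          .filter("age > 30")
          .orderBy("age"))
result.show()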

03. Using Inline SQL in DataFrame Operations

You can embed SQL expression strings directly in DataFrame operations such as filter, selectExpr, and agg. Here is a filter example (selectExpr and agg are sketched after it):

# Reuse the df from the first example; filter with an inline SQL expression
result = df.filter("age > 30")
result.show()
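The same SQL expression strings work in selectExpr and, via expr, inside aggregations. A minimal sketch using the df from above (the alias names are just for illustration):

from pyspark.sql.functions import expr

# selectExpr evaluates one SQL expression per output column
df.selectExpr("name", "age + 1 AS age_next_year").show()

# expr() turns an SQL expression into a Column, usable in agg
df.agg(expr("avg(age) AS avg_age")).show()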

04. Using Python Variables in SQL Queries

You can embed Python variables in SQL queries by interpolating them with f-strings or .format():

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

# Create a Python variable
min_age = 30

# Write and execute an SQL query with the variable interpolated
# (assumes the 'people' view registered earlier)
result = spark.sql(f"SELECT name, age FROM people WHERE age > {min_age}")
result.show()
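String interpolation is fine for trusted values, but it invites SQL injection when the variable comes from user input. Newer PySpark versions (3.4+) let you pass named parameters to spark.sql instead of splicing values into the string. A minimal sketch, assuming PySpark 3.4 or later:

# Named parameters are bound by the engine rather than interpolated into the query
result = spark.sql(
    "SELECT name, age FROM people WHERE age > :min_age",
    args={"min_age": min_age},
)
result.show()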

These are some common ways to write and execute SQL in PySpark. You can choose the method that best suits your requirements and coding style.
