In PySpark, the Spark SQL module lets you run SQL queries directly against DataFrames. Here are several ways to write and execute SQL in PySpark.

01. Using the spark.sql Method
You can call the sql method on a SparkSession to execute SQL queries against DataFrames registered as views. Here's an example:
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder.appName("example").getOrCreate()
# Create a DataFrame
data = [("Alice", 25),
        ("Bob", 30),
        ("Charlie", 35)]
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)
# Register the DataFrame as a temporary view
df.createOrReplaceTempView("people")
# Write and execute a SQL query against the view
result = spark.sql("SELECT name, age FROM people WHERE age > 30")
result.show()
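Because spark.sql returns an ordinary DataFrame, you can chain further DataFrame operations onto the query result. A minimal sketch, continuing from the people view registered above:
result = spark.sql("SELECT name, age FROM people")
# The query result is a regular DataFrame, so transformations chain onto it
older = result.filter("age > 30").orderBy("age")
older.show()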
02. Using Inline SQL Expressions in DataFrame Operations
You can embed SQL expressions directly in DataFrame operations such as filter, selectExpr, and agg. The example below uses filter; selectExpr and agg are sketched afterwards:
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder.appName("example").getOrCreate()
# Create a DataFrame
data = [("Alice", 25),
        ("Bob", 30),
        ("Charlie", 35)]
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)
# Use an inline SQL expression with filter
result = df.filter("age > 30")
result.show()
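As mentioned above, selectExpr and agg accept SQL expression strings too. A rough sketch, reusing the same df (the column aliases here are just for illustration):
from pyspark.sql.functions import expr
# selectExpr evaluates one SQL expression per output column
renamed = df.selectExpr("name", "age + 1 AS age_next_year")
renamed.show()
# expr() turns a SQL expression string into a Column, usable inside agg
stats = df.agg(expr("avg(age) AS avg_age"), expr("max(age) AS max_age"))
stats.show()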
03. Using Python Variables in SQL Queries
You can embed Python variables in SQL queries by interpolating them with f-strings or .format():
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder.appName("example").getOrCreate()
# Create a DataFrame and register it as a temporary view
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)
df.createOrReplaceTempView("people")
# Define a Python variable
min_age = 30
# Interpolate the variable into the SQL string with an f-string
result = spark.sql(f"SELECT name, age FROM people WHERE age > {min_age}")
result.show()
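Note that f-string interpolation builds the query by plain text substitution, so it is unsafe if the variable comes from untrusted input. If you are running Spark 3.4 or later, spark.sql also accepts an args mapping that binds named parameter markers; a sketch under that version assumption:
# Bind :min_age via the args mapping instead of interpolating the string
# (requires Spark 3.4+, where spark.sql gained the args parameter)
result = spark.sql(
    "SELECT name, age FROM people WHERE age > :min_age",
    args={"min_age": min_age},
)
result.show()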
These are some common ways to write and execute SQL in PySpark. You can choose the method that best suits your requirements and coding style.