Ten PySpark examples with answers to help you get started with Apache Spark using Python.

Initializing a Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()
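getOrCreate() returns the existing session if one is already running. For local experiments you can also pin the master explicitly and stop the session when finished; a minimal sketch:
spark = (SparkSession.builder
         .master("local[*]")   # run locally using all available cores
         .appName("example")
         .getOrCreate())
spark.stop()                   # release resources when done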
Loading Data from CSV
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.show()
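With inferSchema=True, Spark samples the file to guess each column's type, which is convenient but worth verifying:
df.printSchema()   # confirm the inferred types (e.g. age as integer) look right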
Filtering Data
filtered_df = df.filter(df["age"] > 25)
filtered_df.show()
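An equivalent style uses col() expressions, which compose cleanly when you combine conditions with & and | (the "Sales" value below is just an illustrative placeholder):
from pyspark.sql.functions import col
filtered_df = df.filter((col("age") > 25) & (col("department") == "Sales"))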
Grouping and Aggregating Data
grouped_df = df.groupBy("department").agg({"salary": "avg"})
grouped_df.show()
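The dictionary form produces a generated column name such as avg(salary). Using the functions API, you can name the result explicitly:
from pyspark.sql.functions import avg
grouped_df = df.groupBy("department").agg(avg("salary").alias("avg_salary"))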
Joining DataFrames
joined_df = df1.join(df2, "employee_id", "inner")
joined_df.show()
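This assumes two DataFrames, df1 and df2, that share an employee_id column. The third argument selects the join type; for example, a left join keeps unmatched rows from the left side:
left_df = df1.join(df2, "employee_id", "left")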
Writing Data to Parquet
df.write.parquet("output.parquet")
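By default the write fails if the output path already exists. You can choose a save mode and, optionally, partition the output; partitioning by department here is just an example:
df.write.mode("overwrite").partitionBy("department").parquet("output.parquet")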
Creating a User-Defined Function (UDF)
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
def custom_function(name):
    return "Hello, " + name
custom_udf = udf(custom_function, StringType())
df = df.withColumn("greeting", custom_udf(df["name"]))
df.show()
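The same UDF can be written with the decorator form (greet is just an illustrative name). Keep in mind that Python UDFs move data between the JVM and the Python interpreter, so built-in functions are usually faster:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def greet(name):
    return "Hello, " + name

df = df.withColumn("greeting", greet(df["name"]))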
Running SQL Queries
df.createOrReplaceTempView("employees")
result = spark.sql("SELECT * FROM employees WHERE age > 30")
result.show()
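Temporary views are scoped to the current Spark session; you can drop one explicitly when it is no longer needed:
spark.catalog.dropTempView("employees")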
Handling Missing Data
df = df.fillna(0, subset=["salary"])  # Replace missing (null) values in the "salary" column with 0
df.show()
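The complement of filling values is dropping the rows entirely; for example, to discard rows where salary is null:
df = df.dropna(subset=["salary"])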
Machine Learning with PySpark
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
A minimal sketch, using the numeric age column as the single feature and salary as the label:
assembler = VectorAssembler(inputCols=["age"], outputCol="features")  # combine feature columns into one vector
assembled_df = assembler.transform(df)
lr = LinearRegression(featuresCol="features", labelCol="salary")
model = lr.fit(assembled_df)
model.transform(assembled_df).select("salary", "prediction").show()
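In practice you would hold out data for evaluation; a quick sketch using a random split:
train_df, test_df = assembled_df.randomSplit([0.8, 0.2], seed=42)
model = LinearRegression(featuresCol="features", labelCol="salary").fit(train_df)
model.transform(test_df).select("salary", "prediction").show()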
These basic examples should help you get started with PySpark. Which features you reach for next will depend on your specific use case and data; PySpark offers many more advanced operations for data processing and analysis worth exploring.