Ten PySpark examples with answers to help you get started with Apache Spark using Python.

Initializing a Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("example").getOrCreate()
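getOrCreate() returns the existing session if one is already running. For local experiments you can also pin the master explicitly and stop the session when finished; a minimal sketch:
spark = (SparkSession.builder
         .master("local[*]")   # run locally using all available cores
         .appName("example")
         .getOrCreate())
spark.stop()                   # release resources when done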
Loading Data from CSV
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.show()
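With inferSchema=True, Spark samples the file to guess each column's type, which is convenient but worth verifying:
df.printSchema()   # confirm the inferred types (e.g. age as integer) look right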
Filtering Data
filtered_df = df.filter(df["age"] > 25)
filtered_df.show()
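An equivalent style uses col() expressions, which compose cleanly when you combine conditions with & and | (the "Sales" value below is just an illustrative placeholder):
from pyspark.sql.functions import col
filtered_df = df.filter((col("age") > 25) & (col("department") == "Sales"))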
Grouping and Aggregating Data
grouped_df = df.groupBy("department").agg({"salary": "avg"})
grouped_df.show()
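The dictionary form produces a generated column name such as avg(salary). Using the functions API, you can name the result explicitly:
from pyspark.sql.functions import avg
grouped_df = df.groupBy("department").agg(avg("salary").alias("avg_salary"))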
Joining DataFrames
joined_df = df1.join(df2, "employee_id", "inner")
joined_df.show()
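This assumes two DataFrames, df1 and df2, that share an employee_id column. The third argument selects the join type; for example, a left join keeps unmatched rows from the left side:
left_df = df1.join(df2, "employee_id", "left")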
Writing Data to Parquet
df.write.parquet("output.parquet")
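By default the write fails if the output path already exists. You can choose a save mode and, optionally, partition the output; partitioning by department here is just an example:
df.write.mode("overwrite").partitionBy("department").parquet("output.parquet")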
Creating a User-Defined Function (UDF)
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
def custom_function(name):
    return "Hello, " + name
custom_udf = udf(custom_function, StringType())
df = df.withColumn("greeting", custom_udf(df["name"]))
df.show()
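The same UDF can be written with the decorator form (greet is just an illustrative name). Keep in mind that Python UDFs move data between the JVM and the Python interpreter, so built-in functions are usually faster:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def greet(name):
    return "Hello, " + name

df = df.withColumn("greeting", greet(df["name"]))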
Running SQL Queries
df.createOrReplaceTempView("employees")
result = spark.sql("SELECT * FROM employees WHERE age > 30")
result.show()
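Temporary views are scoped to the current Spark session; you can drop one explicitly when it is no longer needed:
spark.catalog.dropTempView("employees")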
Handling Missing Data
df = df.fillna(0, subset=["salary"])  # Replace missing (null) values in the "salary" column with 0
df.show()
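The complement of filling values is dropping the rows entirely; for example, to discard rows where salary is null:
df = df.dropna(subset=["salary"])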
Machine Learning with PySpark
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
A minimal sketch, using the numeric age column as the single feature and salary as the label:
assembler = VectorAssembler(inputCols=["age"], outputCol="features")  # combine feature columns into one vector
assembled_df = assembler.transform(df)
lr = LinearRegression(featuresCol="features", labelCol="salary")
model = lr.fit(assembled_df)
model.transform(assembled_df).select("salary", "prediction").show()
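In practice you would hold out data for evaluation; a quick sketch using a random split:
train_df, test_df = assembled_df.randomSplit([0.8, 0.2], seed=42)
model = LinearRegression(featuresCol="features", labelCol="salary").fit(train_df)
model.transform(test_df).select("salary", "prediction").show()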
These basic examples should help you get started with PySpark. Which features you reach for next will depend on your specific use case and data; PySpark offers many more advanced operations for data processing and analysis worth exploring.