Here’s a beginner-friendly tutorial that covers the basics of PySpark.

On this page
- Installation and Setup
- Import PySpark modules and create a SparkSession
- Load Data into DataFrame
- Basic DataFrame Operations
- SQL Queries with PySpark
- Writing Data to Files
- Working with RDDs (optional)
Installation and Setup
Before proceeding, you need Apache Spark and PySpark installed on your machine. The official documentation and community tutorials both walk through the installation process.
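For a quick local setup, installing PySpark from PyPI is usually enough, assuming you already have Python and pip available (the package bundles a local copy of Spark that is fine for learning and experimentation):
pip install pyspark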
Import PySpark modules and create a SparkSession
In your script, begin by importing the required PySpark modules and creating a SparkSession. The SparkSession serves as the entry point for accessing Spark functionality.
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("PySpark Tutorial") \
    .getOrCreate()
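To confirm the session is up, you can print the Spark version it is running against:
print(spark.version)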
Load Data into DataFrame
PySpark uses DataFrames, which are distributed collections of data organized into named columns. You can load data from different sources such as CSV, JSON, Parquet, and more. To load a CSV file, you need to replace ‘path_to_your_data.csv’ with the actual path to your CSV file.
data_path = 'path_to_your_data.csv'
df = spark.read.csv(data_path, header=True, inferSchema=True)
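Other formats follow the same pattern; the paths below are placeholders for your own files:
df_json = spark.read.json('path_to_your_data.json')
df_parquet = spark.read.parquet('path_to_your_data.parquet')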
Basic DataFrame Operations
Now that you have the DataFrame, you can display its first few rows:
df.show()
By default, show() prints the first 20 rows. If you instead want the rows as Python objects, df.head(n) returns the first n rows as a list of Row objects rather than printing them.
Print the schema of the DataFrame
df.printSchema()
Get the number of rows in the DataFrame
num_rows = df.count()
print(f"Number of rows: {num_rows}")
Select specific columns from the DataFrame
selected_df = df.select("column_name1", "column_name2")
Filter data based on conditions
filtered_df = df.filter(df["column_name"] > 100)
Group by a column and perform aggregation
grouped_df = df.groupBy("group_column").agg({"agg_column": "sum"})
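Putting these pieces together, here is a small sketch of a chained transformation. The column names amount and category are hypothetical placeholders for columns in your own data:
from pyspark.sql import functions as F

summary_df = (
    df.filter(F.col("amount") > 100)               # keep rows where amount exceeds 100
      .groupBy("category")                         # group the remaining rows by category
      .agg(F.sum("amount").alias("total_amount"))  # sum amount within each group
)
summary_df.show()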
SQL Queries with PySpark
To run SQL-like queries on DataFrames in PySpark, you first need to create a temporary view for the DataFrame.
Create a temporary view for SQL queries
df.createOrReplaceTempView("temp_table")
Execute SQL queries
sql_result = spark.sql("SELECT column_name1, column_name2 FROM temp_table WHERE column_name1 > 100")
sql_result.show()
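Aggregations work the same way through SQL. As a sketch, reusing the placeholder column names from the earlier examples:
agg_result = spark.sql("SELECT group_column, SUM(agg_column) AS total FROM temp_table GROUP BY group_column")
agg_result.show()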
Writing Data to Files
You can save the DataFrame back to various file formats, such as CSV, Parquet, etc. Keep in mind that Spark writes each output as a directory containing one or more part files, not as a single file with the given name.
Write DataFrame to a CSV file
df.write.csv("output_data.csv", header=True, mode="overwrite")
Write DataFrame to a Parquet file
df.write.parquet("output_data.parquet", mode="overwrite")
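If you need a single output file for a small result, one common approach is to reduce the DataFrame to a single partition before writing. Only do this when the data comfortably fits on one machine:
df.coalesce(1).write.csv("output_single_csv", header=True, mode="overwrite")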
Working with RDDs (optional)
While DataFrames are recommended for most use cases, PySpark also supports RDDs (Resilient Distributed Datasets), the lower-level core data structure of Spark. RDDs are useful when you need fine-grained control over partitioning or are working with unstructured data that doesn't fit neatly into columns.
Create an RDD from a list
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
Perform transformations on RDD
squared_rdd = rdd.map(lambda x: x ** 2)
Filter elements in RDD
filtered_rdd = rdd.filter(lambda x: x % 2 == 0)
Reduce the elements of the RDD to a single value (reduce is an action, so it returns a plain number rather than an RDD)
total = rdd.reduce(lambda x, y: x + y)
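Transformations such as map and filter are lazy, so nothing is computed until an action runs. To inspect the results locally, collect them back to the driver:
print(squared_rdd.collect())   # [1, 4, 9, 16, 25]
print(filtered_rdd.collect())  # [2, 4]
print(total)                   # 15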
Conclusion
This tutorial is a beginner’s guide to PySpark. Once you become more comfortable, you can dive into advanced topics like Spark streaming, machine learning with Spark MLlib, handling large datasets, and deploying clusters. To enhance your learning, check out the official Apache Spark documentation and community resources.