In PySpark, many methods are directly available on DataFrame objects and other classes, so no separate import is needed. Here’s a cheat sheet of common PySpark methods.

Common PySpark Methods

1. DataFrame Methods

These methods are directly available on DataFrame objects (a short example follows the list):

  • show(): Displays the content of the DataFrame.
  • select(*cols): Selects a subset of columns.
  • filter(condition) or where(condition): Filters rows based on a condition.
  • groupBy(*cols): Groups the DataFrame using the specified columns.
  • orderBy(*cols) or sort(*cols): Orders rows by specified columns.
  • join(other, on=None, how=None): Joins two DataFrames.
  • withColumn(colName, col): Adds a new column or replaces an existing column.
  • drop(*cols): Drops the specified column(s).
  • distinct(): Returns a new DataFrame with distinct rows.
  • count(): Returns the number of rows in the DataFrame.
  • union(other): Returns a new DataFrame containing the union of rows from this DataFrame and another; columns are matched by position, not by name (use unionByName to match by name).
  • agg(*exprs): Aggregates over the entire DataFrame without groups (shorthand for df.groupBy().agg(*exprs)).
  • cache(): Caches the DataFrame for faster reuse (with the default MEMORY_AND_DISK storage level).
  • persist(storageLevel): Caches the DataFrame with the specified storage level.
  • collect(): Returns all rows to the driver as a list of Row objects; use with care on large DataFrames.
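
For instance, several of these methods chain together. Here is a minimal sketch, assuming a small hand-built DataFrame; the column names and sample rows are purely illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("df-methods-demo").getOrCreate()

# Illustrative sample data (names, ages, departments)
df = spark.createDataFrame(
    [("Alice", 34, "eng"), ("Bob", 45, "ops"), ("Cara", 29, "eng")],
    ["name", "age", "dept"],
)

# filter/select/withColumn are lazy transformations; show() triggers execution
(df.filter(F.col("age") > 30)
   .select("name", "age")
   .withColumn("age_next_year", F.col("age") + 1)
   .show())

# groupBy returns GroupedData; agg then computes one row per department
df.groupBy("dept").agg(F.avg("age").alias("avg_age")).show()

# orderBy, distinct, and count
print(df.orderBy("age").distinct().count())
```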

2. Spark Session Methods

These methods are directly available on the SparkSession object (see the sketch after the list):

  • createDataFrame(data, schema=None): Creates a DataFrame from an RDD, a list, or a pandas DataFrame.
  • read: Provides access to DataFrameReader for loading data (e.g., spark.read.csv("path")).
  • sql(sqlQuery): Executes a SQL query using Spark SQL.
  • stop(): Stops the current Spark session.
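
A short sketch of a typical session workflow; the view name "people" is arbitrary, and the commented-out file path is a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("session-demo").getOrCreate()

# createDataFrame from a local list of tuples
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])

# Register a temp view so spark.sql() can query it
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()

# spark.read returns a DataFrameReader (the path below is hypothetical)
# df2 = spark.read.csv("/data/input.csv", header=True)

spark.stop()
```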

3. RDD Methods

If you’re working with RDDs, these methods are directly available (a short example follows the list):

  • map(func): Returns a new RDD by applying a function to each element.
  • filter(func): Returns a new RDD containing only the elements that satisfy a predicate.
  • reduce(func): Aggregates the elements of an RDD using a specified function.
  • collect(): Returns all elements of the RDD to the driver as a list.
  • count(): Returns the number of elements in the RDD.
  • saveAsTextFile(path): Saves the RDD data to a text file at the specified path.
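
A minimal RDD sketch; the numbers are arbitrary sample data, and the output path in the final comment is a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext  # RDDs are created through the SparkContext

rdd = sc.parallelize([1, 2, 3, 4, 5])

squared = rdd.map(lambda x: x * x)            # lazy transformation
evens = squared.filter(lambda x: x % 2 == 0)  # also lazy

print(evens.collect())                 # [4, 16]
print(rdd.count())                     # 5
print(rdd.reduce(lambda a, b: a + b))  # 15

# Placeholder path; the target directory must not already exist:
# rdd.saveAsTextFile("/tmp/rdd-output")
```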

4. DataFrameReader and DataFrameWriter Methods

When you use spark.read or df.write, you can directly access the following (a brief sketch follows the list):

  • csv(path): Reads or writes a CSV file.
  • json(path): Reads or writes a JSON file.
  • parquet(path): Reads or writes a Parquet file.
  • format(source): Specifies the format of the data source.
  • option(key, value): Adds an option to DataFrameReader or DataFrameWriter.
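
A brief sketch of both sides of the I/O API; all file paths here are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("io-demo").getOrCreate()

# Reading: a format-specific shortcut vs. the generic format/option/load chain
df = spark.read.option("header", True).csv("/data/people.csv")
df2 = spark.read.format("json").load("/data/events.json")

# Writing: mode() controls what happens if the target path already exists
df.write.mode("overwrite").parquet("/data/people.parquet")
```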

Summary

These methods are attached to core PySpark classes like DataFrame, SparkSession, RDD, and DataFrameReader/DataFrameWriter, so they don’t require an explicit import. Standalone functions such as countDistinct, col, and lit, however, must be imported from pyspark.sql.functions, as in the sketch below.
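
For example, reusing the illustrative df with name and dept columns from the DataFrame sketch above:

```python
from pyspark.sql.functions import col, lit, countDistinct

# col() references a column; lit() wraps a literal value
df.select(col("name"), lit(1).alias("one")).show()

# agg() without groupBy aggregates over the whole DataFrame
df.agg(countDistinct("dept").alias("n_depts")).show()
```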