In PySpark, many methods are directly available on DataFrame objects and other classes, so no separate import is needed. Here’s a cheat sheet of common PySpark methods.

Common PySpark Methods

1. DataFrame Methods

These methods are directly available on DataFrame objects (a short example follows the list):

  • show(): Displays the content of the DataFrame.
  • select(*cols): Selects a subset of columns.
  • filter(condition) or where(condition): Filters rows based on a condition.
  • groupBy(*cols): Groups the DataFrame using the specified columns.
  • orderBy(*cols) or sort(*cols): Orders rows by specified columns.
  • join(other, on=None, how=None): Joins two DataFrames.
  • withColumn(colName, col): Adds a new column or replaces an existing column.
  • drop(*cols): Drops the specified column(s).
  • distinct(): Returns a new DataFrame with distinct rows.
  • count(): Returns the number of rows in the DataFrame.
  • union(other): Returns a new DataFrame containing the union of rows from this DataFrame and another; columns are matched by position, not by name (use unionByName to match by name).
  • agg(*exprs): Aggregates over the entire DataFrame without groups (shorthand for df.groupBy().agg(*exprs)).
  • cache(): Caches the DataFrame for faster reuse (with the default MEMORY_AND_DISK storage level).
  • persist(storageLevel): Caches the DataFrame with the specified storage level.
  • collect(): Returns all rows to the driver as a list of Row objects; use with care on large DataFrames.
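
For instance, several of these methods chain together. Here is a minimal sketch, assuming a small hand-built DataFrame; the column names and sample rows are purely illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("df-methods-demo").getOrCreate()

# Illustrative sample data (names, ages, departments)
df = spark.createDataFrame(
    [("Alice", 34, "eng"), ("Bob", 45, "ops"), ("Cara", 29, "eng")],
    ["name", "age", "dept"],
)

# filter/select/withColumn are lazy transformations; show() triggers execution
(df.filter(F.col("age") > 30)
   .select("name", "age")
   .withColumn("age_next_year", F.col("age") + 1)
   .show())

# groupBy returns GroupedData; agg then computes one row per department
df.groupBy("dept").agg(F.avg("age").alias("avg_age")).show()

# orderBy, distinct, and count
print(df.orderBy("age").distinct().count())
```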

2. Spark Session Methods

These methods are directly available on the SparkSession object (see the sketch after the list):

  • createDataFrame(data, schema=None): Creates a DataFrame from an RDD, a list, or a pandas DataFrame.
  • read: Provides access to DataFrameReader for loading data (e.g., spark.read.csv("path")).
  • sql(sqlQuery): Executes a SQL query using Spark SQL.
  • stop(): Stops the current Spark session.
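
A short sketch of a typical session workflow; the view name "people" is arbitrary, and the commented-out file path is a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("session-demo").getOrCreate()

# createDataFrame from a local list of tuples
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])

# Register a temp view so spark.sql() can query it
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()

# spark.read returns a DataFrameReader (the path below is hypothetical)
# df2 = spark.read.csv("/data/input.csv", header=True)

spark.stop()
```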

3. RDD Methods

If you’re working with RDDs, these methods are directly available (a short example follows the list):

  • map(func): Returns a new RDD by applying a function to each element.
  • filter(func): Returns a new RDD containing only the elements that satisfy a predicate.
  • reduce(func): Aggregates the elements of an RDD using a specified function.
  • collect(): Returns all elements of the RDD to the driver as a list.
  • count(): Returns the number of elements in the RDD.
  • saveAsTextFile(path): Saves the RDD data to a text file at the specified path.
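
A minimal RDD sketch; the numbers are arbitrary sample data, and the output path in the final comment is a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext  # RDDs are created through the SparkContext

rdd = sc.parallelize([1, 2, 3, 4, 5])

squared = rdd.map(lambda x: x * x)            # lazy transformation
evens = squared.filter(lambda x: x % 2 == 0)  # also lazy

print(evens.collect())                 # [4, 16]
print(rdd.count())                     # 5
print(rdd.reduce(lambda a, b: a + b))  # 15

# Placeholder path; the target directory must not already exist:
# rdd.saveAsTextFile("/tmp/rdd-output")
```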

4. DataFrameReader and DataFrameWriter Methods

When you use spark.read or df.write, you can directly access the following (a brief sketch follows the list):

  • csv(path): Reads or writes a CSV file.
  • json(path): Reads or writes a JSON file.
  • parquet(path): Reads or writes a Parquet file.
  • format(source): Specifies the format of the data source.
  • option(key, value): Adds an option to DataFrameReader or DataFrameWriter.
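
A brief sketch of both sides of the I/O API; all file paths here are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("io-demo").getOrCreate()

# Reading: a format-specific shortcut vs. the generic format/option/load chain
df = spark.read.option("header", True).csv("/data/people.csv")
df2 = spark.read.format("json").load("/data/events.json")

# Writing: mode() controls what happens if the target path already exists
df.write.mode("overwrite").parquet("/data/people.parquet")
```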

Summary

These methods are attached to core PySpark classes like DataFrame, SparkSession, RDD, and DataFrameReader/DataFrameWriter, so they don’t require an explicit import. Standalone functions such as countDistinct, col, and lit, however, must be imported from pyspark.sql.functions, as in the sketch below.
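
For example, reusing the illustrative df with name and dept columns from the DataFrame sketch above:

```python
from pyspark.sql.functions import col, lit, countDistinct

# col() references a column; lit() wraps a literal value
df.select(col("name"), lit(1).alias("one")).show()

# agg() without groupBy aggregates over the whole DataFrame
df.agg(countDistinct("dept").alias("n_depts")).show()
```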