In PySpark, many methods are directly available on DataFrame objects and other classes, so no separate import is needed. Here’s a cheat sheet of common PySpark methods.

1. DataFrame Methods
These methods are directly available on DataFrame objects:
- show(): Displays the content of the DataFrame.
- select(*cols): Selects a subset of columns.
- filter(condition) or where(condition): Filters rows based on a condition.
- groupBy(*cols): Groups the DataFrame using the specified columns.
- orderBy(*cols) or sort(*cols): Orders rows by the specified columns.
- join(other, on=None, how=None): Joins two DataFrames.
- withColumn(colName, col): Adds a new column or replaces an existing one.
- drop(*cols): Drops the specified column(s).
- distinct(): Returns a new DataFrame with distinct rows.
- count(): Returns the number of rows in the DataFrame.
- union(other): Returns a new DataFrame containing the union of rows from this DataFrame and another DataFrame.
- agg(*exprs): Computes aggregates over the whole DataFrame, or per group when called after groupBy().
- cache(): Caches the DataFrame in memory for quicker access.
- persist(storageLevel): Caches the DataFrame with the specified storage level.
- collect(): Returns all the rows as a list of Row objects.
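The sketch below exercises most of these methods on a tiny in-memory DataFrame; the data and the column names (name, dept, salary) are illustrative assumptions, not from any real dataset.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("df-methods-demo").getOrCreate()

# Hypothetical sample data for demonstration only
df = spark.createDataFrame(
    [("Alice", "eng", 100), ("Bob", "eng", 90), ("Cara", "ops", 80)],
    ["name", "dept", "salary"],
)

df.show()                                   # print the rows
high = df.filter(df.salary > 85)            # equivalent to df.where(...)
by_dept = df.groupBy("dept").agg(F.avg("salary").alias("avg_salary"))
ranked = by_dept.orderBy("avg_salary")      # equivalent to .sort(...)
joined = high.join(ranked, on="dept", how="inner")
enriched = (joined
            .withColumn("bonus", F.col("salary") * 0.1)  # add a column
            .drop("avg_salary"))                         # remove one
depts = df.select("dept").distinct()        # unique departments
doubled = df.union(df)                      # stack rows of two DataFrames
enriched.cache()                            # keep in memory for reuse
rows = enriched.collect()                   # list of Row objects on the driver
print(enriched.count(), depts.count(), doubled.count())
```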
2. Spark Session Methods
These methods are directly available on the SparkSession object:
- createDataFrame(data, schema=None): Creates a DataFrame from an RDD, a list, or a pandas DataFrame.
- read: Provides access to a DataFrameReader for loading data (e.g., spark.read.csv("path")).
- sql(sqlQuery): Executes a SQL query using Spark SQL.
- stop(): Stops the current Spark session.
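A minimal sketch of these entry points, assuming a local session; the view name "people" and the commented-out CSV path are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("session-demo").getOrCreate()

# createDataFrame(): build a DataFrame from a plain Python list
people = spark.createDataFrame([("Alice", 34), ("Bob", 29)], ["name", "age"])

# sql(): register a temp view, then query it with Spark SQL
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()

# read: the DataFrameReader entry point (the path is a placeholder)
# df = spark.read.csv("/path/to/file.csv", header=True, inferSchema=True)

spark.stop()  # release the session when finished
```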
3. RDD Methods
If you’re working with RDDs, these methods are directly available:
- map(func): Returns a new RDD by applying a function to each element.
- filter(func): Returns a new RDD containing only the elements that satisfy a predicate.
- reduce(func): Aggregates the elements of the RDD using the specified function.
- collect(): Returns all elements of the RDD as a list.
- count(): Returns the number of elements in the RDD.
- saveAsTextFile(path): Saves the RDD to a text file at the specified path.
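A hedged example of the RDD API, reached here through spark.sparkContext; the numbers are arbitrary sample data and the commented-out output path is a placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3, 4, 5])

squares = rdd.map(lambda x: x * x)          # transform each element
evens = squares.filter(lambda x: x % 2 == 0)
total = squares.reduce(lambda a, b: a + b)  # aggregate down to one value

print(evens.collect(), total, squares.count())
# squares.saveAsTextFile("/tmp/squares")    # output path is a placeholder
```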
4. DataFrameReader and DataFrameWriter Methods
When you use spark.read or df.write, you can directly access:
- csv(path): Reads or writes a CSV file.
- json(path): Reads or writes a JSON file.
- parquet(path): Reads or writes a Parquet file.
- format(source): Specifies the format of the data source.
- option(key, value): Adds an option to the DataFrameReader or DataFrameWriter.
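A sketch of the reader/writer chains, assuming placeholder paths under /tmp that you would replace with real ones before running.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("io-demo").getOrCreate()

# Reading: long-hand form with format()/option(), then the shorthand
df = (spark.read
      .format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("/tmp/input.csv"))
# df = spark.read.csv("/tmp/input.csv", header=True, inferSchema=True)

# Writing: the same chaining style on df.write
df.write.mode("overwrite").parquet("/tmp/output.parquet")
# df.write.format("json").mode("overwrite").save("/tmp/output")
```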
Summary
These methods are integral to PySpark classes like DataFrame, SparkSession, RDD, and DataFrameReader/Writer, so they don't require an explicit import. Functions such as countDistinct, col, and lit, on the other hand, must be imported from pyspark.sql.functions, as sketched below.
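A small hedged sketch of that explicit import in use; the column names "key" and "value" are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, countDistinct

spark = SparkSession.builder.appName("functions-demo").getOrCreate()
df = spark.createDataFrame([("a", 1), ("b", 1), ("b", 2)], ["key", "value"])

# countDistinct() is an aggregate; col() and lit() build column expressions
df.select(countDistinct("key").alias("distinct_keys")).show()
df.withColumn("source", lit("demo")).filter(col("value") > 1).show()
```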
