PySpark interviews often go beyond simple joins or filters. To truly impress, you need to show depth with both DataFrame methods and SQL functions. Below, we'll go over 10 tough PySpark methods and 10 advanced SQL functions, each with a simple example and its output.

🔟 10 Tough PySpark Methods (with Examples)

1. withColumn() – Add a new column


df = spark.createDataFrame([(1, "Alice")], ["id", "name"])
df.withColumn("greeting", df.name + " says hi").show()

Output:

+---+-----+-------------+
| id| name|     greeting|
+---+-----+-------------+
|  1|Alice|Alice says hi|
+---+-----+-------------+
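
A follow-up that often comes right after this one: adding a constant column requires lit(), because withColumn() expects a Column expression, not a bare Python value. A quick sketch (the "source" column name and its value are just placeholders):

from pyspark.sql.functions import lit

df.withColumn("source", lit("interview_prep")).show()  # lit() wraps a literal as a constant Column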

2. when() + otherwise() – Conditional logic

from pyspark.sql.functions import when

df = spark.createDataFrame([(1, 70), (2, 40)], ["id", "score"])
df.withColumn("result", when(df.score > 50, "Pass").otherwise("Fail")).show()

Output:

+---+-----+------+
| id|score|result|
+---+-----+------+
|  1|   70|  Pass|
|  2|   40|  Fail|
+---+-----+------+
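
when() calls can also be chained before the final otherwise() to handle several conditions; the first matching branch wins. A sketch using the same score column (the grade thresholds here are just illustrative):

from pyspark.sql.functions import when

df.withColumn(
    "grade",
    when(df.score >= 80, "A")   # checked first
    .when(df.score > 50, "B")   # checked next
    .otherwise("Fail")          # fallback for everything else
).show()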

3. explode() – Flatten lists

from pyspark.sql.functions import explode

df = spark.createDataFrame([(1, ["a", "b"]), (2, ["c"])], ["id", "letters"])
df.select("id", explode("letters").alias("letter")).show()

Output:

+---+------+
| id|letter|
+---+------+
|  1|     a|
|  1|     b|
|  2|     c|
+---+------+
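
One detail interviewers like to probe: explode() silently drops rows whose array is null or empty, while explode_outer() keeps them with a null value. A small sketch on a separate toy DataFrame (df2 is just a throwaway name):

from pyspark.sql.functions import explode_outer

df2 = spark.createDataFrame([(1, ["a", "b"]), (3, None)], ["id", "letters"])
# plain explode() would drop id=3 entirely; explode_outer() keeps it with letter = null
df2.select("id", explode_outer("letters").alias("letter")).show()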

4. selectExpr() – SQL-like column expressions

df = spark.createDataFrame([(1, 2)], ["x", "y"])
df.selectExpr("x + y as sum").show()

Output:

+---+
|sum|
+---+
|  3|
+---+

5. dropDuplicates() – Remove duplicates

df = spark.createDataFrame([(1, "A"), (1, "A"), (2, "B")], ["id", "val"])
df.dropDuplicates().show()

Output:

+---+---+
| id|val|
+---+---+
|  1|  A|
|  2|  B|
+---+---+
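
dropDuplicates() also accepts a subset of columns, which is the usual follow-up question ("what if you only want one row per id?"). A quick sketch:

# keep one row per id; which row survives is arbitrary if the other columns differ
df.dropDuplicates(["id"]).show()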

6. fillna() – Replace nulls

df = spark.createDataFrame([(1, None), (2, "B")], ["id", "val"])
df.fillna("N/A").show()

Output:

+---+---+
| id|val|
+---+---+
|  1|N/A|
|  2|  B|
+---+---+
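
fillna() also takes a dict, so you can target specific columns with different replacement values instead of filling everything at once. A minimal sketch:

# replace nulls only in the "val" column
df.fillna({"val": "N/A"}).show()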

7. pivot() – Reshape data

df = spark.createDataFrame([("A", 2022, 100), ("A", 2023, 200)], ["dept", "year", "rev"])
df.groupBy("dept").pivot("year").sum("rev").show()

Output:

+----+----+----+
|dept|2022|2023|
+----+----+----+
|   A| 100| 200|
+----+----+----+
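
A common performance follow-up: if you already know the pivot values, pass them explicitly so Spark can skip the extra job it otherwise runs to collect the distinct years. A sketch:

# listing the years up front avoids a separate pass over the data to discover them
df.groupBy("dept").pivot("year", [2022, 2023]).sum("rev").show()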

8. alias() – Rename for clarity

df = spark.createDataFrame([(1, "Alice")], ["id", "name"])
df.select(df.name.alias("employee")).show()

Output:

+--------+
|employee|
+--------+
|   Alice|
+--------+

9. repartition() – Redistribute partitions

df = spark.range(0, 100).repartition(5)
print(df.rdd.getNumPartitions()) # Output: 5
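
repartition() can also take columns (hash-partitioning the data, which helps before joins or writes), while coalesce() only merges partitions and avoids a full shuffle. A sketch, with partition counts chosen purely for illustration:

df = spark.range(0, 100).repartition(5, "id")  # full shuffle, hash-partitioned by id
df = df.coalesce(2)                            # narrow operation: merges partitions, no full shuffle
print(df.rdd.getNumPartitions())               # 2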

10. cache() – Speed up repeated reads

df = spark.range(1, 100000)
df.cache().count() # Data is cached in memory
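
Two details worth mentioning in an interview: cache() is lazy (the count() above is the action that actually materializes it), and cached data should be released when you are done with it. A quick sketch:

print(df.storageLevel)  # inspect how the data is cached (DataFrames default to memory-and-disk)
df.unpersist()          # free the cached data once it is no longer needed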

🔟 10 Tough PySpark SQL Functions (with Examples)

1. row_number() – Rank within groups

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

df = spark.createDataFrame([("A", 10), ("A", 20), ("B", 30)], ["grp", "val"])
window = Window.partitionBy("grp").orderBy("val")
df.withColumn("rnk", row_number().over(window)).show()

Output:

+----+---+---+
| grp|val|rnk|
+----+---+---+
|   A| 10|  1|
|   A| 20|  2|
|   B| 30|  1|
+----+---+---+
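
The classic interview use of row_number() is "top N per group": assign the rank, then filter on it. A sketch that keeps only the lowest val per grp, reusing the window above:

df.withColumn("rnk", row_number().over(window)).filter("rnk = 1").show()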

2. dense_rank() – Ranking with no gaps

from pyspark.sql.functions import dense_rank
df.withColumn("drank", dense_rank().over(window)).show()

3. lead() – Look ahead

from pyspark.sql.functions import lead

df.withColumn("next_val", lead("val").over(window)).show()

4. lag() – Look behind

from pyspark.sql.functions import lag

df.withColumn("prev_val", lag("val").over(window)).show()

5. coalesce() – First non-null value

from pyspark.sql.functions import coalesce

df = spark.createDataFrame([(None, "x"), ("a", "b")], ["col1", "col2"])
df.select(coalesce("col1", "col2").alias("first_non_null")).show()

Output:

+--------------+
|first_non_null|
+--------------+
| x|
| a|
+--------------+

6. collect_list() – Aggregate to list

from pyspark.sql.functions import collect_list

df.groupBy("grp").agg(collect_list("val")).show()

7. collect_set() – Unique values only

from pyspark.sql.functions import collect_set

df.groupBy("grp").agg(collect_set("val")).show()

8. size() – Count items in array

from pyspark.sql.functions import size

df = spark.createDataFrame([([1, 2],), ([3],)], ["nums"])
df.select(size("nums").alias("count")).show()

Output:

+-----+
|count|
+-----+
|    2|
|    1|
+-----+

9. array_contains() – Check for value in array

from pyspark.sql.functions import array_contains

df.select(array_contains("nums", 2).alias("has_2")).show()

Output:

+-----+
|has_2|
+-----+
| true|
|false|
+-----+

10. regexp_replace() – Clean text using regex

from pyspark.sql.functions import regexp_replace

df = spark.createDataFrame([("abc-123",)], ["raw"])
df.select(regexp_replace("raw", "-", "_").alias("cleaned")).show()

Output:

+--------+
| cleaned|
+--------+
| abc_123|
+--------+
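
The pattern argument is a real regular expression, not just a literal string, which is handy for text cleanup. A quick sketch that masks every run of digits (the pattern and replacement are just illustrative):

# "abc-123" becomes "abc-#"
df.select(regexp_replace("raw", "[0-9]+", "#").alias("masked")).show()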

✅ Final Thoughts

To sum up, mastering these functions not only boosts your interview chances but also your day-to-day productivity with PySpark. Although these may seem tricky at first, practicing with small examples — like the ones above — makes them second nature. So before your next interview, try these out in a notebook, and you’ll walk in with confidence!