PySpark interviews often go beyond simple joins or filters. To truly impress, you need to show depth with both DataFrame methods and SQL functions. Below, we'll go over 10 tough PySpark methods and 10 advanced SQL functions, each with a simple example and its output.

🔟 10 Tough PySpark Methods (with Examples)

1. withColumn() – Add a new column


df = spark.createDataFrame([(1, "Alice")], ["id", "name"])
df.withColumn("greeting", df.name + " says hi").show()

Output:

+---+-----+-------------+
| id| name|     greeting|
+---+-----+-------------+
|  1|Alice|Alice says hi|
+---+-----+-------------+
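
A follow-up that often comes right after this one: adding a constant column requires lit(), because withColumn() expects a Column expression, not a bare Python value. A quick sketch (the "source" column name and its value are just placeholders):

from pyspark.sql.functions import lit

df.withColumn("source", lit("interview_prep")).show()  # lit() wraps a literal as a constant Column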

2. when() + otherwise() – Conditional logic

from pyspark.sql.functions import when

df = spark.createDataFrame([(1, 70), (2, 40)], ["id", "score"])
df.withColumn("result", when(df.score > 50, "Pass").otherwise("Fail")).show()

Output:

+---+-----+------+
| id|score|result|
+---+-----+------+
|  1|   70|  Pass|
|  2|   40|  Fail|
+---+-----+------+
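
when() calls can also be chained before the final otherwise() to handle several conditions; the first matching branch wins. A sketch using the same score column (the grade thresholds here are just illustrative):

from pyspark.sql.functions import when

df.withColumn(
    "grade",
    when(df.score >= 80, "A")   # checked first
    .when(df.score > 50, "B")   # checked next
    .otherwise("Fail")          # fallback for everything else
).show()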

3. explode() – Flatten lists

from pyspark.sql.functions import explode

df = spark.createDataFrame([(1, ["a", "b"]), (2, ["c"])], ["id", "letters"])
df.select("id", explode("letters").alias("letter")).show()

Output:

+---+------+
| id|letter|
+---+------+
|  1|     a|
|  1|     b|
|  2|     c|
+---+------+
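
One detail interviewers like to probe: explode() silently drops rows whose array is null or empty, while explode_outer() keeps them with a null value. A small sketch on a separate toy DataFrame (df2 is just a throwaway name):

from pyspark.sql.functions import explode_outer

df2 = spark.createDataFrame([(1, ["a", "b"]), (3, None)], ["id", "letters"])
# plain explode() would drop id=3 entirely; explode_outer() keeps it with letter = null
df2.select("id", explode_outer("letters").alias("letter")).show()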

4. selectExpr() – SQL-like column expressions

df = spark.createDataFrame([(1, 2)], ["x", "y"])
df.selectExpr("x + y as sum").show()

Output:

+---+
|sum|
+---+
|  3|
+---+

5. dropDuplicates() – Remove duplicates

df = spark.createDataFrame([(1, "A"), (1, "A"), (2, "B")], ["id", "val"])
df.dropDuplicates().show()

Output:

+---+---+
| id|val|
+---+---+
|  1|  A|
|  2|  B|
+---+---+
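
dropDuplicates() also accepts a subset of columns, which is the usual follow-up question ("what if you only want one row per id?"). A quick sketch:

# keep one row per id; which row survives is arbitrary if the other columns differ
df.dropDuplicates(["id"]).show()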

6. fillna() – Replace nulls

df = spark.createDataFrame([(1, None), (2, "B")], ["id", "val"])
df.fillna("N/A").show()

Output:

+---+---+
| id|val|
+---+---+
|  1|N/A|
|  2|  B|
+---+---+
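
fillna() also takes a dict, so you can target specific columns with different replacement values instead of filling everything at once. A minimal sketch:

# replace nulls only in the "val" column
df.fillna({"val": "N/A"}).show()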

7. pivot() – Reshape data

df = spark.createDataFrame([("A", 2022, 100), ("A", 2023, 200)], ["dept", "year", "rev"])
df.groupBy("dept").pivot("year").sum("rev").show()

Output:

+----+----+----+
|dept|2022|2023|
+----+----+----+
|   A| 100| 200|
+----+----+----+
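
A common performance follow-up: if you already know the pivot values, pass them explicitly so Spark can skip the extra job it otherwise runs to collect the distinct years. A sketch:

# listing the years up front avoids a separate pass over the data to discover them
df.groupBy("dept").pivot("year", [2022, 2023]).sum("rev").show()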

8. alias() – Rename for clarity

df = spark.createDataFrame([(1, "Alice")], ["id", "name"])
df.select(df.name.alias("employee")).show()

Output:

+--------+
|employee|
+--------+
|   Alice|
+--------+

9. repartition() – Redistribute partitions

df = spark.range(0, 100).repartition(5)
print(df.rdd.getNumPartitions()) # Output: 5
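
repartition() can also take columns (hash-partitioning the data, which helps before joins or writes), while coalesce() only merges partitions and avoids a full shuffle. A sketch, with partition counts chosen purely for illustration:

df = spark.range(0, 100).repartition(5, "id")  # full shuffle, hash-partitioned by id
df = df.coalesce(2)                            # narrow operation: merges partitions, no full shuffle
print(df.rdd.getNumPartitions())               # 2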

10. cache() – Speed up repeated reads

df = spark.range(1, 100000)
df.cache().count() # Data is cached in memory
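
Two details worth mentioning in an interview: cache() is lazy (the count() above is the action that actually materializes it), and cached data should be released when you are done with it. A quick sketch:

print(df.storageLevel)  # inspect how the data is cached (DataFrames default to memory-and-disk)
df.unpersist()          # free the cached data once it is no longer needed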

🔟 10 Tough PySpark SQL Functions (with Examples)

1. row_number() – Rank within groups

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

df = spark.createDataFrame([("A", 10), ("A", 20), ("B", 30)], ["grp", "val"])
window = Window.partitionBy("grp").orderBy("val")
df.withColumn("rnk", row_number().over(window)).show()

Output:

+----+---+---+
| grp|val|rnk|
+----+---+---+
|   A| 10|  1|
|   A| 20|  2|
|   B| 30|  1|
+----+---+---+
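
The classic interview use of row_number() is "top N per group": assign the rank, then filter on it. A sketch that keeps only the lowest val per grp, reusing the window above:

df.withColumn("rnk", row_number().over(window)).filter("rnk = 1").show()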

2. dense_rank() – Ranking with no gaps

from pyspark.sql.functions import dense_rank
df.withColumn("drank", dense_rank().over(window)).show()

3. lead() – Look ahead

from pyspark.sql.functions import lead

df.withColumn("next_val", lead("val").over(window)).show()

4. lag() – Look behind

from pyspark.sql.functions import lag

df.withColumn("prev_val", lag("val").over(window)).show()

5. coalesce() – First non-null value

from pyspark.sql.functions import coalesce

df = spark.createDataFrame([(None, "x"), ("a", "b")], ["col1", "col2"])
df.select(coalesce("col1", "col2").alias("first_non_null")).show()

Output:

+--------------+
|first_non_null|
+--------------+
| x|
| a|
+--------------+

6. collect_list() – Aggregate to list

from pyspark.sql.functions import collect_list

df.groupBy("grp").agg(collect_list("val")).show()

7. collect_set() – Unique values only

from pyspark.sql.functions import collect_set

df.groupBy("grp").agg(collect_set("val")).show()

8. size() – Count items in array

from pyspark.sql.functions import size

df = spark.createDataFrame([([1, 2],), ([3],)], ["nums"])
df.select(size("nums").alias("count")).show()

Output:

+-----+
|count|
+-----+
|    2|
|    1|
+-----+

9. array_contains() – Check for value in array

from pyspark.sql.functions import array_contains

df.select(array_contains("nums", 2).alias("has_2")).show()

Output:

+-----+
|has_2|
+-----+
| true|
|false|
+-----+

10. regexp_replace() – Clean text using regex

from pyspark.sql.functions import regexp_replace

df = spark.createDataFrame([("abc-123",)], ["raw"])
df.select(regexp_replace("raw", "-", "_").alias("cleaned")).show()

Output:

+--------+
| cleaned|
+--------+
| abc_123|
+--------+
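
The pattern argument is a real regular expression, not just a literal string, which is handy for text cleanup. A quick sketch that masks every run of digits (the pattern and replacement are just illustrative):

# "abc-123" becomes "abc-#"
df.select(regexp_replace("raw", "[0-9]+", "#").alias("masked")).show()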

✅ Final Thoughts

To sum up, mastering these functions not only boosts your interview chances but also your day-to-day productivity with PySpark. Although these may seem tricky at first, practicing with small examples — like the ones above — makes them second nature. So before your next interview, try these out in a notebook, and you’ll walk in with confidence!