PySpark interviews often go beyond simple joins and filters. To truly impress, you need to show depth with both DataFrame methods and SQL functions. In this post, we’ll go over 10 tough PySpark methods and 10 advanced SQL functions — each with a simple example and its output.
🔟 10 Tough PySpark Methods (with Examples)
1. withColumn() – Add a new column
from pyspark.sql.functions import concat, lit
df = spark.createDataFrame([(1, "Alice")], ["id", "name"])
df.withColumn("greeting", concat(df.name, lit(" says hi"))).show()
Output:
+---+-----+-------------+
| id| name|     greeting|
+---+-----+-------------+
|  1|Alice|Alice says hi|
+---+-----+-------------+
2. when() + otherwise() – Conditional logic
from pyspark.sql.functions import when
df = spark.createDataFrame([(1, 70), (2, 40)], ["id", "score"])
df.withColumn("result", when(df.score > 50, "Pass").otherwise("Fail")).show()
Output:
+---+-----+------+
| id|score|result|
+---+-----+------+
|  1|   70|  Pass|
|  2|   40|  Fail|
+---+-----+------+
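A quick extension worth knowing: when() calls can be chained for multi-branch logic before the final otherwise(). A small sketch on the same DataFrame, with hypothetical grade cut-offs:
df.withColumn("grade",
    when(df.score >= 60, "A").when(df.score >= 50, "B").otherwise("C")).show()
# score 70 -> "A", score 40 -> "C"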
3. explode() – Flatten lists
from pyspark.sql.functions import explode
df = spark.createDataFrame([(1, ["a", "b"]), (2, ["c"])], ["id", "letters"])
df.select("id", explode("letters").alias("letter")).show()
Output:
+---+------+
| id|letter|
+---+------+
|  1|     a|
|  1|     b|
|  2|     c|
+---+------+
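One caveat: explode() silently drops rows whose array is null (or empty). If those rows should survive with a null letter, explode_outer() is the variant to reach for; a minimal sketch with a hypothetical extra row:
from pyspark.sql.functions import explode_outer
df2 = spark.createDataFrame([(1, ["a"]), (2, None)], ["id", "letters"])
df2.select("id", explode_outer("letters").alias("letter")).show()
# id=2 is kept with a null letter; plain explode() would drop that row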
4. selectExpr() – SQL-like column expressions
df = spark.createDataFrame([(1, 2)], ["x", "y"])
df.selectExpr("x + y as sum").show()
Output:
+---+
|sum|
+---+
|  3|
+---+
5. dropDuplicates() – Remove duplicates
df = spark.createDataFrame([(1, "A"), (1, "A"), (2, "B")], ["id", "val"])
df.dropDuplicates().show()
Output:
+---+---+
| id|val|
+---+---+
|  1|  A|
|  2|  B|
+---+---+
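A common follow-up: dropDuplicates() also accepts a list of columns, deduplicating on that subset only. A small sketch (note that which row survives per key is not guaranteed):
df2 = spark.createDataFrame([(1, "A"), (1, "B"), (2, "B")], ["id", "val"])
df2.dropDuplicates(["id"]).show()  # one row per id; the surviving "val" is arbitrary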
6. fillna() – Replace nulls
df = spark.createDataFrame([(1, None), (2, "B")], ["id", "val"])
df.fillna("N/A").show()
Output:
+---+-----+
| id|  val|
+---+-----+
|  1|  N/A|
|  2|    B|
+---+-----+
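fillna() also accepts a dict, so you can target specific columns with different replacements; a minimal sketch reusing the same DataFrame:
df.fillna({"val": "missing"}).show()  # only the "val" column is filled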
7. pivot() – Reshape data
df = spark.createDataFrame([("A", 2022, 100), ("A", 2023, 200)], ["dept", "year", "rev"])
df.groupBy("dept").pivot("year").sum("rev").show()
Output:
+----+----+----+
|dept|2022|2023|
+----+----+----+
|   A| 100| 200|
+----+----+----+
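A detail interviewers like: passing the expected values to pivot() explicitly spares Spark an extra pass over the data to discover them. A minimal sketch producing the same result:
df.groupBy("dept").pivot("year", [2022, 2023]).sum("rev").show()  # same output, one less scan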
8. alias() – Rename for clarity
df = spark.createDataFrame([(1, "Alice")], ["id", "name"])
df.select(df.name.alias("employee")).show()
Output:
+--------+
|employee|
+--------+
|   Alice|
+--------+
9. repartition() – Redistribute partitions
df = spark.range(0, 100).repartition(5)
print(df.rdd.getNumPartitions()) # Output: 5
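repartition() also accepts column arguments, hash-partitioning rows so that equal keys land in the same partition (handy before wide joins or aggregations); a minimal sketch:
spark.range(0, 100).repartition(4, "id").rdd.getNumPartitions()  # 4 partitions, rows hashed by id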
10. cache() – Speed up repeated reads
df = spark.range(1, 100000)
df.cache().count()  # the count() action materializes the cache in memory
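Caching only pays off when the DataFrame is reused across actions; a minimal sketch of the wider pattern, including releasing the memory afterwards:
df.filter(df.id % 2 == 0).count()  # served from the in-memory cache
df.filter(df.id % 2 == 1).count()  # so is this second action
df.unpersist()  # release the cached data when no longer needed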
🔟 10 Tough PySpark SQL Functions (with Examples)
1. row_number() – Rank within groups
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
df = spark.createDataFrame([("A", 10), ("A", 20), ("B", 30)], ["grp", "val"])
window = Window.partitionBy("grp").orderBy("val")
df.withColumn("rnk", row_number().over(window)).show()
Output:
+----+---+---+
| grp|val|rnk|
+----+---+---+
|   A| 10|  1|
|   A| 20|  2|
|   B| 30|  1|
+----+---+---+
2. dense_rank() – Ranking with no gaps
from pyspark.sql.functions import dense_rank
df.withColumn("drank", dense_rank().over(window)).show()
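With the data above there are no ties, so dense_rank() looks identical to row_number(). The "no gaps" behaviour only shows up with ties; a small sketch on a hypothetical tied dataset, contrasted with rank():
from pyspark.sql.functions import rank
tied = spark.createDataFrame([("A", 10), ("A", 10), ("A", 20)], ["grp", "val"])
w = Window.partitionBy("grp").orderBy("val")
tied.select("grp", "val", rank().over(w).alias("rank"),
            dense_rank().over(w).alias("drank")).show()
# rank gives 1, 1, 3 (gap after the tie); dense_rank gives 1, 1, 2 (no gap)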
3. lead() – Look ahead
from pyspark.sql.functions import lead
df.withColumn("next_val", lead("val").over(window)).show()
4. lag() – Look behind
from pyspark.sql.functions import lag
df.withColumn("prev_val", lag("val").over(window)).show()
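At the edges of each group there is no next or previous row, so lead() and lag() return null there unless you pass a default as the third argument; a minimal sketch reusing df and window from the row_number() example:
df.withColumn("next_val", lead("val", 1, 0).over(window)) \
  .withColumn("prev_val", lag("val", 1, 0).over(window)).show()
# the default (0 here) replaces the nulls at the first/last row of each group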
5. coalesce() – First non-null value
from pyspark.sql.functions import coalesce
nulls_df = spark.createDataFrame([(None, "x"), ("a", "b")], ["col1", "col2"])  # a new name, so the df from the row_number() example stays available for examples 6-7
nulls_df.select(coalesce("col1", "col2").alias("first_non_null")).show()
Output:
+--------------+
|first_non_null|
+--------------+
|             x|
|             a|
+--------------+
6. collect_list() – Aggregate to list
from pyspark.sql.functions import collect_list
df.groupBy("grp").agg(collect_list("val")).show()  # df is the (grp, val) DataFrame from the row_number() example
7. collect_set() – Unique values only
from pyspark.sql.functions import collect_set
df.groupBy("grp").agg(collect_set("val")).show()
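The two only differ when a group contains repeated values; a small sketch with hypothetical duplicate data makes the contrast visible:
dup = spark.createDataFrame([("A", 10), ("A", 10), ("A", 20)], ["grp", "val"])
dup.groupBy("grp").agg(collect_list("val"), collect_set("val")).show(truncate=False)
# collect_list keeps [10, 10, 20]; collect_set keeps only the distinct values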
8. size() – Count items in array
from pyspark.sql.functions import size
df = spark.createDataFrame([([1, 2],), ([3],)], ["nums"])
df.select(size("nums").alias("count")).show()
Output:
+-----+
|count|
+-----+
|    2|
|    1|
+-----+
9. array_contains() – Check for value in array
from pyspark.sql.functions import array_contains
df.select(array_contains("nums", 2).alias("has_2")).show()
Output:
+-----+
|has_2|
+-----+
| true|
|false|
+-----+
10. regexp_replace() – Clean text using regex
from pyspark.sql.functions import regexp_replace
df = spark.createDataFrame([("abc-123",)], ["raw"])
df.select(regexp_replace("raw", "-", "_").alias("cleaned")).show()
Output:
+--------+
| cleaned|
+--------+
| abc_123|
+--------+
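Since the pattern is a full regular expression, character classes work too; a small sketch stripping everything except the digits:
df.select(regexp_replace("raw", "[^0-9]", "").alias("digits_only")).show()  # -> 123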
✅ Final Thoughts
To sum up, mastering these functions not only boosts your interview chances but also your day-to-day productivity with PySpark. Although these may seem tricky at first, practicing with small examples — like the ones above — makes them second nature. So before your next interview, try these out in a notebook, and you’ll walk in with confidence!






