PySpark is constantly evolving, introducing powerful new features that simplify development, enhance performance, and unlock new analytical possibilities. Spark 3.4, and especially Spark 3.5, bring significant improvements. Let’s explore the most exciting ones!

1. Arrow-Optimized Python UDFs

One of the most impactful upgrades arrives with Arrow-optimized Python UDFs. When you enable it via:

spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", True)

or use @udf(..., useArrow=True), PySpark switches from traditional pickled UDFs to vectorized Arrow-based processing. This can deliver up to 2× speed improvements on modern CPUs due to efficient columnar data handling.

Practical Tip: Use this for any heavy UDF workloads—especially in data enrichment or feature engineering—where performance matters.
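
Here is a minimal sketch of both opt-in styles, assuming an active SparkSession named spark (the UDF shout is just an illustrative example):

from pyspark.sql.functions import udf

# Session-wide switch: run all Python UDFs through Arrow
spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", True)

# Per-UDF opt-in via the useArrow flag
@udf(returnType="string", useArrow=True)
def shout(s: str) -> str:
    return s.upper() if s is not None else None

df = spark.createDataFrame([("hello",), ("world",)], ["text"])
df.select(shout("text")).show()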

2. Python User-Defined Table Functions (UDTFs)

PySpark now supports Python UDTFs, a significant upgrade that lets you write Python functions that return whole tables rather than single scalar values. Defined in code and callable from both Python and SQL, UDTFs open a powerful door in PySpark development:

from pyspark.sql.functions import udtf

class MyHelloUDTF:
    def eval(self, *args):
        # Each yield emits one row of the output table
        yield "hello", "world"

# Wrap the handler class and declare the output schema
test_udtf = udtf(MyHelloUDTF, returnType="c1: string, c2: string")
test_udtf().show()

Use Case: Exploding complex JSONs, unpacking embedded arrays, or returning multiple rows from a single input—without complex transformations.
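
Because UDTFs are also callable from SQL, you can register the one above and query it directly; a brief sketch (the function name my_hello is arbitrary):

spark.udtf.register("my_hello", test_udtf)
spark.sql("SELECT * FROM my_hello()").show()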

3. Enhanced Testing API & Better Error Messages

Testing becomes much cleaner with the new PySpark testing utilities:

from pyspark.testing import assertDataFrameEqual, assertPandasOnSparkEqual, assertSchemaEqual

These tools provide color-coded, detailed diff output, helping you quickly spot schema or content mismatches during testing.
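
A quick usage sketch, assuming an active SparkSession named spark:

from pyspark.testing import assertDataFrameEqual

df_actual = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])
df_expected = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])

# Passes silently; on mismatch it raises with a detailed row-by-row diff
assertDataFrameEqual(df_actual, df_expected)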

Moreover, PySpark now maps errors to structured error classes and codes, improving debuggability and integration with external monitoring.
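
For example, exceptions now expose their error class programmatically; a sketch (the exact error class string depends on the Spark version):

from pyspark.errors import PySparkException

try:
    spark.sql("SELECT nonexistent_function(1)")
except PySparkException as e:
    print(e.getErrorClass())         # e.g. "UNRESOLVED_ROUTINE"
    print(e.getMessageParameters())  # structured details for tooling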

4. New Array Helper Functions

Spark 3.5 brings convenient array manipulation functions like array_append and array_prepend to PySpark—eliminating common UDF or SQL hackiness.

from pyspark.sql.functions import array_append, lit

df.withColumn("new_arr", array_append("arr_col", lit("new_val")))

Why it matters: Cleaner code, better readability, and faster development when working with nested or list-like data.
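
A runnable sketch showing both helpers, assuming an active SparkSession named spark:

from pyspark.sql.functions import array_append, array_prepend, lit

df = spark.createDataFrame([(["b", "c"],)], ["arr_col"])

(df.withColumn("appended", array_append("arr_col", lit("d")))
   .withColumn("prepended", array_prepend("arr_col", lit("a")))
   .show(truncate=False))
# appended: [b, c, d]    prepended: [a, b, c]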

5. Expanded SQL Function Support + IDENTIFIER Clause

Spark 3.5 brought 150 new SQL functions into the PySpark DataFrame API, covering functionality previously accessible only via raw SQL strings.

It also introduced the IDENTIFIER clause, which lets templated SQL queries inject table or column names safely, guarding against SQL injection risks.

Named argument support¹ is now available for PySpark SQL function calls, making complex multi-parameter functions significantly more readable.
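
A brief sketch of both features, assuming an active SparkSession named spark (the table name my_table is a placeholder):

# IDENTIFIER resolves a bound parameter as a table or column name
spark.sql("SELECT * FROM IDENTIFIER(:tbl)", args={"tbl": "my_table"}).show()

# Named arguments make multi-parameter SQL functions self-documenting
spark.sql("SELECT mask('AbCD123-@$#', lowerChar => 'q')").show()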

6. HyperLogLog Aggregations via Datasketches

Spark 3.5 adds high-performance HyperLogLog (HLL) approximate aggregation functions powered by Apache DataSketches, enabling efficient estimation of distinct counts at scale. Sketches can be persisted and merged, so counts can be combined across datasets without rescanning the raw data.

Ideal for: Real-time analytics, high cardinality data grouping, and memory-efficient summarization.
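
A minimal sketch, assuming an active SparkSession named spark:

from pyspark.sql.functions import hll_sketch_agg, hll_sketch_estimate

df = spark.createDataFrame([(1,), (2,), (2,), (3,)], ["value"])

# Aggregate values into an HLL sketch, then estimate the distinct count
(df.agg(hll_sketch_agg("value").alias("sketch"))
   .select(hll_sketch_estimate("sketch").alias("approx_distinct"))
   .show())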

7. Structured Streaming Enhancements

Spark Connect now supports Structured Streaming in Python, making streaming jobs more portable and runnable from thin clients. Spark 3.5 also introduces watermark propagation across operators and the dropDuplicatesWithinWatermark operation, making deduplication in event-time logic much smoother.
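
A sketch of the new dedup operation, where events is a hypothetical streaming DataFrame with id and eventTime columns:

# Drop rows whose id repeats within the 10-minute watermark delay
deduped = (events
    .withWatermark("eventTime", "10 minutes")
    .dropDuplicatesWithinWatermark(["id"]))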

8. Other Notable PySpark API Upgrades

Additional noteworthy features include:

  • .offset method on DataFrame for intuitive pagination (see the sketch after this list).
  • Enhanced dir(df) to list DataFrame columns directly.
  • Support for nested timestamp types and TimestampNTZType.
  • Runtime control over Python executables in UDFs.
  • assertDataFrameEqual utility, added in 3.5.0 (see section 3 above).
  • Support for fill_value in pandas-on-Spark Series.
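
For example, .offset pairs with .limit for simple pagination (assuming df has an id column to order by, which keeps pages deterministic):

# Skip the first 20 rows, then return the next 10 (page 3 of size 10)
page = df.orderBy("id").offset(20).limit(10)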

Summary Table

Feature                                Benefit
Arrow-optimized UDFs                   Faster UDF execution via vectorization
Python UDTFs                           Table-level function output flexibility
Testing APIs + error classes           Easier, more robust testing & debugging
Array helpers (append/prepend)         Cleaner list/array manipulation
SQL function expansion & IDENTIFIER    More Pythonic SQL, injection-safe queries
HLL aggregations                       Efficient approximate distinct counts
Streaming enhancements                 Better streaming logic and dedup support
API convenience methods                Developer productivity gains

Final Thoughts

These updates make PySpark more powerful and developer-friendly than ever. Whether you’re optimizing performance with Arrow UDFs, tackling streaming deduplication, enriching your testing workflows, or exploring UDTFs, Spark 3.4–3.5 sets a solid foundation.

  1. Named argument support allows you to pass parameters using explicit names. This makes your code more readable, easier to maintain, and less error-prone. ↩︎