PySpark is constantly evolving, introducing powerful new features that simplify development, enhance performance, and unlock new analytical possibilities. Spark 3.4 and, even more so, Spark 3.5 bring significant improvements. Let’s explore the most exciting ones!
1. Arrow-Optimized Python UDFs
One of the most impactful upgrades arrives with Arrow-optimized Python UDFs. When you enable it via:
```python
spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", True)
```
or use @udf(..., useArrow=True), PySpark switches from traditional pickled UDFs to vectorized Arrow-based processing. This can deliver up to 2× speed improvements on modern CPUs due to efficient columnar data handling.
Practical Tip: Use this for any heavy UDF workloads—especially in data enrichment or feature engineering—where performance matters.
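For illustration, here is a minimal sketch of an Arrow-optimized UDF. It assumes an active SparkSession named spark; the column names and the normalization logic are purely illustrative:

```python
from pyspark.sql.functions import udf, col

# Hypothetical enrichment UDF; Arrow serialization is enabled per-UDF via useArrow=True
@udf(returnType="string", useArrow=True)
def normalize_name(name):
    return name.strip().title() if name is not None else None

df = spark.createDataFrame([(" ada lovelace ",), ("GRACE HOPPER",)], ["raw_name"])
df.withColumn("name", normalize_name(col("raw_name"))).show()
```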
2. Python User-Defined Table Functions (UDTFs)
PySpark now supports Python UDTFs, a significant upgrade that lets you write Python functions that return entire tables rather than single values. Defined in code and callable both from Python and via SQL, UDTFs open a powerful door in PySpark development:
```python
from pyspark.sql.functions import udtf

class MyHelloUDTF:
    def eval(self, *args):
        yield "hello", "world"

test_udtf = udtf(MyHelloUDTF, returnType="c1: string, c2: string")
test_udtf().show()
```
Use Case: Exploding complex JSONs, unpacking embedded arrays, or returning multiple rows from a single input—without complex transformations.
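The same UDTF can also be exposed to SQL. As a rough sketch, assuming the test_udtf object from the snippet above and an active SparkSession named spark (the registered name is illustrative):

```python
# Register the UDTF so it can be called from SQL
spark.udtf.register("my_hello_udtf", test_udtf)
spark.sql("SELECT * FROM my_hello_udtf()").show()
```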
3. Enhanced Testing API & Better Error Messages
Testing becomes much cleaner with the new PySpark testing utilities:
```python
from pyspark.testing import assertDataFrameEqual, assertPandasOnSparkEqual, assertSchemaEqual
```
These tools provide color-coded, detailed diff output, helping quickly spot schema or content mismatches during testing.
Moreover, PySpark now maps errors to structured error classes and codes, improving debuggability and integration with external monitoring.
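As a small illustration of how this looks in a test, here is a sketch assuming an active SparkSession named spark; the data is made up:

```python
from pyspark.testing import assertDataFrameEqual, assertSchemaEqual

expected = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])
actual = spark.createDataFrame([("a", 1), ("b", 2)], ["key", "value"])

# Passes silently when equal; raises with a detailed, readable diff on mismatch
assertSchemaEqual(actual.schema, expected.schema)
assertDataFrameEqual(actual, expected)
```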
4. New Array Helper Functions
Spark 3.4 and 3.5 bring convenient array manipulation functions such as array_append and array_prepend to PySpark, eliminating the usual UDF or SQL workarounds.
```python
from pyspark.sql.functions import array_append, lit

df.withColumn("new_arr", array_append("arr_col", lit("new_val")))
```
Why it matters: Cleaner code, better readability, and faster development when working with nested or list-like data.
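array_prepend works the same way from the other end. A quick illustrative sketch, assuming an active SparkSession named spark (data and column names are made up):

```python
from pyspark.sql.functions import array_append, array_prepend, lit

df = spark.createDataFrame([(["b", "c"],)], ["arr_col"])
(df
 .withColumn("appended", array_append("arr_col", lit("d")))    # ["b", "c", "d"]
 .withColumn("prepended", array_prepend("arr_col", lit("a")))  # ["a", "b", "c"]
 .show(truncate=False))
```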
5. Expanded SQL Function Support + IDENTIFIER Clause
Spark 3.5 adds more than 150 new SQL functions to the PySpark DataFrame API, covering functions that were previously accessible only via raw SQL strings.
It also introduces the IDENTIFIER clause, which lets you inject table or column names into template-based SQL queries safely, guarding against SQL injection risks.
Named argument support is now available for PySpark SQL function calls, letting you pass parameters by explicit name. This makes complex multi-parameter functions significantly more readable and less error-prone.
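Here is a sketch of both ideas. The table name, parameter values, and the choice of the built-in mask function are illustrative, and an active SparkSession named spark is assumed:

```python
# IDENTIFIER with a parameterized query: the table name is bound as a parameter
# instead of being concatenated into the SQL string
table_name = "sales"  # hypothetical table
spark.sql("SELECT COUNT(*) FROM IDENTIFIER(:tbl)", args={"tbl": table_name}).show()

# Named arguments for SQL functions (Spark 3.5), e.g. with the built-in mask function
spark.sql("SELECT mask('AbCD123-@$#', lowerChar => 'q', digitChar => 'd')").show()
```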
6. HyperLogLog Aggregations via Datasketches
Spark 3.5 adds high-performance HyperLogLog (HLL) approximate aggregation functions powered by Apache DataSketches, enabling memory-efficient estimation of distinct counts at scale. The resulting sketches can be persisted and merged across datasets.
Ideal for: Real-time analytics, high cardinality data grouping, and memory-efficient summarization.
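A minimal sketch of approximate distinct counting with these functions; the data and column names are made up, and an active SparkSession named spark is assumed:

```python
from pyspark.sql.functions import hll_sketch_agg, hll_sketch_estimate

df = spark.createDataFrame(
    [(1, "alice"), (1, "bob"), (2, "alice"), (2, "alice")], ["group", "user"]
)

# Build an HLL sketch per group, then turn it into an approximate distinct count
(df.groupBy("group")
   .agg(hll_sketch_estimate(hll_sketch_agg("user")).alias("approx_distinct_users"))
   .show())
```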
7. Structured Streaming Enhancements
Spark Connect now supports Structured Streaming in Python, so streaming jobs can be driven from lightweight, remote clients. Spark 3.5 also introduces watermark propagation across operators and the dropDuplicatesWithinWatermark operation, making deduplication in event-time logic much smoother.
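As a rough sketch of dropDuplicatesWithinWatermark, using the built-in rate source so the example is self-contained; the key derivation and the 10-minute watermark are illustrative, and an active SparkSession named spark is assumed:

```python
from pyspark.sql.functions import col

# Synthetic stream: rate source with an artificial duplicate-prone key
events = (
    spark.readStream.format("rate").option("rowsPerSecond", 10).load()
    .withColumn("event_id", col("value") % 5)
    .withColumnRenamed("timestamp", "event_time")
)

# Drop duplicate event_ids that arrive within the watermark delay
deduped = (
    events
    .withWatermark("event_time", "10 minutes")
    .dropDuplicatesWithinWatermark(["event_id"])
)

query = deduped.writeStream.format("console").outputMode("append").start()
```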
8. Other Notable PySpark API Upgrades
Additional noteworthy features include:
- `.offset()` method on DataFrame, for intuitive pagination (see the sketch after this list).
- Enhanced `dir(df)` to list DataFrame columns directly.
- Support for nested timestamp types and `TimestampNTZType`.
- Runtime control over the Python executable used by UDFs.
- `assertDataFrameEqual` utility (mentioned above; added in 3.5.0).
- Support for `fill_value` in pandas-on-Spark Series.
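For instance, a hedged sketch of the offset-based pagination mentioned above, assuming an existing DataFrame df with a sortable id column (both names are illustrative):

```python
page_size, page = 20, 3

# Deterministic ordering matters for stable pagination
df.orderBy("id").offset(page * page_size).limit(page_size).show()
```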
Summary Table
| Feature | Benefit |
|---|---|
| Arrow-optimized UDFs | Faster UDF execution via vectorization |
| Python UDTFs | Table-level function output flexibility |
| Testing APIs + Error Classes | Easier, robust testing & debugging |
| Array helpers (append/prepend) | Cleaner list/array manipulation |
| SQL function expansion & IDENTIFIER | More Pythonic SQL and secure injection |
| HLL aggregations | Efficient approximate distinct counts |
| Streaming enhancements | Better streaming logic and dedup support |
| API convenience methods | Developer productivity gains |
Final Thoughts
These updates make PySpark more powerful and developer-friendly than ever. Whether you’re optimizing performance with Arrow UDFs, tackling streaming deduplication, enriching your testing workflows, or exploring UDTFs, Spark 3.4–3.5 sets a solid foundation.






