We can create user-defined functions (UDFs) in PySpark to apply custom Python logic to DataFrame columns. There are three steps to create a UDF in PySpark.

  1. Create a Python function
  2. Register the function as a UDF
  3. Apply the UDF to the DataFrame column(s)

Creating UDF in PySpark

In the following example, we convert the first letter of each word in the Name column to uppercase.

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Create a Spark session
spark = SparkSession.builder.appName("Capitalize Names UDF").getOrCreate()

# Sample data
data = [(1, "john jones"), (2, "tracey smith"), (3, "amy sanders")]
columns = ["Seqno", "Name"]

# Create a DataFrame
df = spark.createDataFrame(data, columns)

# Show the original data
df.show(truncate=False)

# Define a Python function to capitalize the first letter of each word
def capitalize_name(name):
    return " ".join([word.capitalize() for word in name.split()])

# Register the function as a UDF
capitalize_name_udf = udf(capitalize_name, StringType())

# Apply the UDF to the Name column
df_transformed = df.withColumn("Name", capitalize_name_udf("Name"))

# Show the result (truncate=False prints full, left-aligned values)
df_transformed.show(truncate=False)

Output

Before Converting
+-----+------------+
|Seqno|Name        |
+-----+------------+
|1    |john jones  |
|2    |tracey smith|
|3    |amy sanders |
+-----+------------+

After Converting
+-----+-------------+
|Seqno|Name         |
+-----+-------------+
|1    |John Jones   |
|2    |Tracey Smith |
|3    |Amy Sanders  |
+-----+-------------+