We can create user-defined functions (UDFs) in PySpark to apply custom Python logic to DataFrame columns. There are three steps to create a UDF in PySpark.

  1. Create a Python function
  2. Register the function as a UDF
  3. Apply the UDF to the DataFrame column(s)

Creating UDF in PySpark

In the following example, we convert the first letter of each word in the Name column to uppercase.

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Create a Spark session
spark = SparkSession.builder.appName("Capitalize Names UDF").getOrCreate()

# Sample data
data = [(1, "john jones"), (2, "tracey smith"), (3, "amy sanders")]
columns = ["Seqno", "Name"]

# Create a DataFrame
df = spark.createDataFrame(data, columns)

# Show the original data
df.show(truncate=False)

# Define a Python function to capitalize the first letter of each word
def capitalize_name(name):
    return " ".join([word.capitalize() for word in name.split()])

# Register the function as a UDF
capitalize_name_udf = udf(capitalize_name, StringType())

# Apply the UDF to the Name column
df_transformed = df.withColumn("Name", capitalize_name_udf("Name"))

# Show the result (truncate=False prints full, left-aligned values)
df_transformed.show(truncate=False)

Output

Before Converting
+-----+------------+
|Seqno|Name        |
+-----+------------+
|1    |john jones  |
|2    |tracey smith|
|3    |amy sanders |
+-----+------------+

After Converting
+-----+-------------+
|Seqno|Name         |
+-----+-------------+
|1    |John Jones   |
|2    |Tracey Smith |
|3    |Amy Sanders  |
+-----+-------------+