During a data engineer interview, a question came up about adding a new column at a specific position in a PySpark DataFrame. By default, withColumn() appends the new column at the end. The example below shows how to place it at a desired position instead.


Table of contents

  1. Adding a new column at a particular position in PySpark
  2. Conclusion

Adding a new column at a particular position in PySpark

In PySpark, withColumn() adds a new column to a DataFrame, but always as the last column. DataFrames are immutable, so you cannot insert a column into an existing DataFrame at an arbitrary position. Instead, you create a new DataFrame: add the column with withColumn(), then use select() to arrange the columns in the order you want.

Sample code

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("Adding Column Example") \
    .getOrCreate()

# Create sample DataFrame
data = [(1, "John", 30, 100), (2, "Alice", 25, 200), (3, "Bob", 35, 300)]
df = spark.createDataFrame(data, ["Id", "Name", "Age", "Sal"])

# Choose where the new column should go (0-based index into the column list).
# Position 1 places it right after "Id", that is, just before "Name".
position = 1

# Get existing columns
existing_columns = df.columns

# Print the existing columns
print(existing_columns)

# Index of the "Id" column (0-based)
print(existing_columns.index("Id"))


# Build the new column order with "Profession" inserted at the desired position
new_columns = existing_columns[:position] + ["Profession"] + existing_columns[position:]

# Add the column (withColumn appends it at the end), then reorder with select.
# The cast gives the all-null column a concrete string type instead of NullType.
new_df = df.withColumn("Profession", lit(None).cast("string")).select(*new_columns)

# Display the new DataFrame
new_df.show()

Output

['Id', 'Name', 'Age', 'Sal']
0
+---+----------+-----+---+---+
| Id|Profession| Name|Age|Sal|
+---+----------+-----+---+---+
|  1|      NULL| John| 30|100|
|  2|      NULL|Alice| 25|200|
|  3|      NULL|  Bob| 35|300|
+---+----------+-----+---+---+

Conclusion

This code adds a new column called “Profession” at index 1, which places it immediately after the “Id” column and before “Name”. Adjust the position variable to control where the new column is inserted; for example, position = 2 would place it after “Name”.
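
When the target spot is easier to describe by column name than by index (for example, "right after Name"), the position can be looked up from df.columns. The following is a minimal sketch and not part of the original example: insert_column_after is a hypothetical helper name, and it assumes the object passed in exposes columns, withColumn(), and select() the way a PySpark DataFrame does.

```python
def insert_column_after(df, name, column, after):
    """Return a new DataFrame with `column` added as `name` right after `after`.

    Hypothetical helper for illustration. `df` is expected to behave like a
    PySpark DataFrame; `column` is any Column expression, for example
    lit(None).cast("string").
    """
    cols = list(df.columns)

    # withColumn would silently replace an existing column, so reject duplicates.
    if name in cols:
        raise ValueError(f"Column {name!r} already exists")

    # list.index raises ValueError if the anchor column is missing.
    position = cols.index(after) + 1

    # Same slicing idea as in the sample code: head + new name + tail.
    new_order = cols[:position] + [name] + cols[position:]

    # withColumn appends at the end; select puts the columns in the new order.
    return df.withColumn(name, column).select(*new_order)


# Usage with the sample DataFrame from this post (requires a Spark session):
#   from pyspark.sql.functions import lit
#   new_df = insert_column_after(df, "Profession", lit(None).cast("string"), after="Name")
#   new_df.columns  # ['Id', 'Name', 'Profession', 'Age', 'Sal']
```

Looking up the anchor by name keeps the call correct even if columns are later added or removed ahead of it, which a hard-coded index would not.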