If you have a text file where each line is a single record with no delimiters, and you need to split each line into separate columns based on fixed widths (e.g., 10 characters each), you can do this in PySpark with the substring function and a select statement. Here's how.

PySpark Code: Reading a Text File & Writing It as Columns

Input file: p.txt

12345oooooooQQQQQQQQQPPPPPPPP

PySpark Code

from pyspark.sql import SparkSession
from pyspark.sql.functions import substring

spark = SparkSession.builder.appName("Name").getOrCreate()

df = spark.read.text("/content/p.txt")

df = df.select(
    substring(df.value, 1, 10).alias("col1"),       # characters 1-10
    substring(df.value, 11, 10).alias("col2"),      # characters 11-20
    substring(df.value, 21, 1000000).alias("col3")  # any sufficiently large length reads to the end of the line
)

df.show()

Output

+----------+----------+---------+
| col1| col2| col3|
+----------+----------+---------+
|12345ooooo|ooQQQQQQQQ|QPPPPPPPP|
+----------+----------+---------+

Conclusion

  • We first read the text file into a DataFrame using spark.read.text("/content/p.txt"). Then we use the substring function to split each line into columns of a fixed width of 10 characters.
  • We use a select statement to extract the substrings from each line and alias them as separate columns (e.g., "col1", "col2", etc.).
  • Finally, we show the resulting DataFrame, where each line has been split into separate columns based on the fixed widths. Adjust the substring positions, lengths, and alias names to match your own record layout; a generalized sketch follows this list.
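If your file has many fixed-width fields, writing one substring call per column gets repetitive. Below is a minimal sketch of a more general approach that builds the select list from a field layout. The layout tuples (name, start, length) shown here are placeholders for illustration, not part of the original example, so adapt them to your own file.

from pyspark.sql import SparkSession
from pyspark.sql.functions import substring

spark = SparkSession.builder.appName("FixedWidth").getOrCreate()

# Hypothetical layout: (column name, 1-based start position, length)
layout = [
    ("col1", 1, 10),
    ("col2", 11, 10),
    ("col3", 21, 1000000),  # large length => read to the end of the line
]

df = spark.read.text("/content/p.txt")

# Build one substring expression per field and select them all at once
df = df.select(
    *[substring(df.value, start, length).alias(name) for name, start, length in layout]
)

df.show()

Keeping the layout in a single list makes it easy to handle files with dozens of fixed-width fields and to load the field definitions from a config file if needed.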