If you have a text file in PySpark where each line is a single record with no delimiters, and you need to split each line into separate columns based on a fixed width (e.g., 10 characters), you can do this with the substring function inside a select statement. Here's how.
PySpark Code: Reading a Text File and Writing It as Columns
Input file: p.txt
12345oooooooQQQQQQQQQPPPPPPPP
PySpark Code
from pyspark.sql import SparkSession
from pyspark.sql.functions import substring

spark = SparkSession.builder.appName("Name").getOrCreate()

# Read the file as a single-column DataFrame; each line lands in the "value" column.
df = spark.read.text("/content/p.txt")

df = df.select(
    substring(df.value, 1, 10).alias("col1"),       # characters 1-10
    substring(df.value, 11, 10).alias("col2"),      # characters 11-20
    substring(df.value, 21, 1000000).alias("col3")  # any large length works here: it simply reads to the end of the line
)
df.show()
Output
+----------+----------+---------+
| col1| col2| col3|
+----------+----------+---------+
|12345ooooo|ooQQQQQQQQ|QPPPPPPPP|
+----------+----------+---------+
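If you prefer not to hard-code an arbitrarily large length for the last column, one alternative is to compute the remaining length from the line itself. The sketch below mirrors the example above (same file path and column names) and is just one way to do it, using Spark SQL's length() inside expr.

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr, substring

spark = SparkSession.builder.appName("Name").getOrCreate()
raw = spark.read.text("/content/p.txt")

# Same fixed-width split, but col3's length is derived from the line itself
# (length(value) - 20) instead of a hard-coded large number.
df = raw.select(
    substring(raw.value, 1, 10).alias("col1"),
    substring(raw.value, 11, 10).alias("col2"),
    expr("substring(value, 21, length(value) - 20)").alias("col3"),
)
df.show()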
Conclusion
- We first read the text file into a DataFrame using spark.read.text("/content/p.txt"). Then we use the substring function to split each line into individual columns based on a fixed width of 10 characters.
- We use a select statement to extract the substrings from each line and alias them as separate columns (e.g., "col1", "col2", etc.).
- Finally, we show the resulting DataFrame, where each line has been split into separate fixed-width columns. Adjust the substring positions, lengths, and alias names according to your specific layout; a generalized version for many fields is sketched below.
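If the layout has many fixed-width fields, writing one substring call per column gets tedious. Below is a minimal sketch that builds the select from a list of (name, start, width) specifications; the field list shown is a hypothetical example, not part of the original layout.

from pyspark.sql import SparkSession
from pyspark.sql.functions import substring

spark = SparkSession.builder.appName("Name").getOrCreate()

# Hypothetical field layout: (column name, 1-based start position, width).
fields = [("col1", 1, 10), ("col2", 11, 10), ("col3", 21, 9)]

raw = spark.read.text("/content/p.txt")

# Build one substring column per field spec and select them all at once.
df = raw.select(
    [substring(raw.value, start, width).alias(name) for name, start, width in fields]
)
df.show()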






