If you have a text file where each line is a single record with no delimiters, and you need to split each line into separate columns based on fixed widths (e.g., 10 characters each), you can do this in PySpark with the substring function and a select statement. Here's how.

PySpark Code: Reading a Text File & Writing It as Columns

Input file: p.txt

12345oooooooQQQQQQQQQPPPPPPPP

PySpark Code

from pyspark.sql import SparkSession
from pyspark.sql.functions import substring

spark = SparkSession.builder.appName("Name").getOrCreate()

df = spark.read.text("/content/p.txt")

df = df.select(
    substring(df.value, 1, 10).alias("col1"),       # characters 1-10
    substring(df.value, 11, 10).alias("col2"),      # characters 11-20
    substring(df.value, 21, 1000000).alias("col3")  # any sufficiently large length reads to the end of the line
)

df.show()

Output

+----------+----------+---------+
| col1| col2| col3|
+----------+----------+---------+
|12345ooooo|ooQQQQQQQQ|QPPPPPPPP|
+----------+----------+---------+

Conclusion

  • We first read the text file into a DataFrame using spark.read.text("/content/p.txt"). Then we use the substring function to split each line into columns of a fixed width of 10 characters.
  • We use a select statement to extract the substrings from each line and alias them as separate columns (e.g., "col1", "col2", etc.).
  • Finally, we show the resulting DataFrame, where each line has been split into separate columns based on the fixed widths. Adjust the substring positions, lengths, and alias names to match your own record layout; a generalized sketch follows this list.
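If your file has many fixed-width fields, writing one substring call per column gets repetitive. Below is a minimal sketch of a more general approach that builds the select list from a field layout. The layout tuples (name, start, length) shown here are placeholders for illustration, not part of the original example, so adapt them to your own file.

from pyspark.sql import SparkSession
from pyspark.sql.functions import substring

spark = SparkSession.builder.appName("FixedWidth").getOrCreate()

# Hypothetical layout: (column name, 1-based start position, length)
layout = [
    ("col1", 1, 10),
    ("col2", 11, 10),
    ("col3", 21, 1000000),  # large length => read to the end of the line
]

df = spark.read.text("/content/p.txt")

# Build one substring expression per field and select them all at once
df = df.select(
    *[substring(df.value, start, length).alias(name) for name, start, length in layout]
)

df.show()

Keeping the layout in a single list makes it easy to handle files with dozens of fixed-width fields and to load the field definitions from a config file if needed.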