How to Read a CSV File as Text: PySpark Code

Here’s PySpark code that shows how to read an input CSV file as text. I chose this topic because it comes up as an interview question. In PySpark, you can read a CSV file with a schema you supply, or without one (and apply your own schema later). The point here is to read every field of the CSV file as text.

PySpark Schema

According to Microsoft documentation, a schema defines the structure of the data. For a DataFrame, that means the name and data type of each field.
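As an illustration, a schema can be written as a DDL string that pairs each column name with a type. The sketch below is only an example: the column names and types are hypothetical, and `to_ddl` and `read_with_schema` are helper names made up for this post, not part of PySpark.

```python
# Hypothetical column names and types, chosen only for illustration.
columns = [
    ("customer_id", "INT"),
    ("first_name", "STRING"),
    ("subscription_date", "DATE"),
]


def to_ddl(columns):
    """Build a DDL schema string such as '`a` INT, `b` STRING'."""
    # Backticks keep column names with spaces or special characters valid.
    return ", ".join(f"`{name}` {dtype}" for name, dtype in columns)


schema_ddl = to_ddl(columns)
print(schema_ddl)


def read_with_schema(spark, path, ddl):
    """Read a CSV file using an explicit schema instead of the default one."""
    # DataFrameReader.schema accepts a DDL-formatted string or a StructType.
    return spark.read.schema(ddl).option("header", True).csv(path)
```

Calling `read_with_schema(spark, "path/to/file.csv", schema_ddl)` needs a running Spark session and a file whose columns match the schema.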

PySpark Code

The code below reads a CSV file with every column as text.

# Initialize the Spark session
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Test").getOrCreate()

# Read the CSV file. The idea is to set inferSchema to False so every column stays a string.
df = spark.read.option("inferSchema", False).option("header", True).csv("dbfs:/FileStore/shared_uploads/info@srinimf.com/customers_100.csv")

# Show the rows, then print the schema to confirm the column types
df.show()
df.printSchema()

Output

Infer Schema: True or False

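Per the Apache Spark documentation, inferSchema defaults to False, which means Spark does not inspect the data and applies a default schema. When you set it to True, PySpark makes an extra pass over the data and infers each column's type automatically. It’s a tricky point in interviews because the True and False settings are easy to mix up.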
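One way to see the difference is to read the same file twice and compare the column types. This is a minimal sketch that assumes an existing `spark` session; `compare_infer_schema` is a hypothetical helper name and the path in the usage lines is a placeholder.

```python
def compare_infer_schema(spark, path):
    """Return the column types of one CSV file read with inferSchema off and on."""
    results = {}
    for infer in (False, True):
        df = (
            spark.read
            .option("header", True)
            .option("inferSchema", infer)
            .csv(path)
        )
        # df.dtypes is a list of (column_name, type_name) pairs
        results[infer] = df.dtypes
    return results


# Usage (needs a running SparkSession and a real file):
# types = compare_infer_schema(spark, "path/to/file.csv")
# print(types[False])  # every column reported as 'string'
# print(types[True])   # numeric-looking columns may come back as 'int' or 'double'
```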

With inferSchema set to False, the default data type is StringType, so every CSV field is read as a string, that is, as text. This is how we read a CSV file as text.
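A quick way to confirm this is to check `df.dtypes`, which lists each column with its type name. The helper below is a sketch, not a PySpark function: `non_string_columns` is a made-up name, and the sample column names are hypothetical.

```python
def non_string_columns(dtypes):
    """Return the names of columns whose type is not 'string'.

    `dtypes` has the shape of DataFrame.dtypes: a list of (name, type) pairs.
    """
    return [name for name, dtype in dtypes if dtype != "string"]


# Shape of df.dtypes after a read with inferSchema=False (hypothetical columns)
sample_dtypes = [("Index", "string"), ("Customer Id", "string"), ("City", "string")]
print(non_string_columns(sample_dtypes))  # []

# With a real DataFrame:
# assert non_string_columns(df.dtypes) == []
```

An empty list means every column was read as text.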

Srini
