This post shows PySpark code that reads an input CSV file as text. I chose this topic because it comes up as an interview question. In PySpark, you can read a CSV file with a schema or without one (and apply your own schema later). The point here is to read every field of the CSV file as text.

Table of contents

  1. PySpark Schema
  2. PySpark Code
  3. Infer Schema True (or) False

PySpark Schema

According to Microsoft documentation, a schema defines the structure of the data: the name and data type of each field (column). In other words, it tells you what type each field holds.

PySpark Code

The code below shows how to read a CSV file as text.

# Initialize the Spark session
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Test").getOrCreate()

# Read the CSV file. The idea is to set inferSchema to False,
# so that every column is read as a string.
df = (
    spark.read
    .option("inferSchema", False)
    .option("header", True)
    .csv("dbfs:/FileStore/shared_uploads/info@srinimf.com/customers_100.csv")
)

df.show()         # display the first rows
df.printSchema()  # every column should be reported as string

Output

CSV Output

Infer Schema True (or) False

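Per the Apache Spark documentation, inferSchema defaults to False, which means Spark does not inspect the data to work out column types. When you set it to True, PySpark infers the schema automatically, at the cost of one extra pass over the data. It's a tricky point in interviews, since many people mix up what True and False do.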

With inferSchema set to False, every column gets the default data type, StringType, so all the CSV fields are read as strings, that is, as text. This is how you read a CSV file as text.
