Here’s PySpark code that shows how to read an input CSV file as text. I chose this topic because it comes up as an interview question. In PySpark, you can read a CSV file with a schema or without one (and apply your own schema later). The point here is to read every field of the CSV file as text.
PySpark Schema
According to Microsoft documentation, a schema defines the structure of the data. In other words, it tells you the name and data type of each field.
PySpark Code
The following code shows how to read a CSV file as text.
# Initialize the Spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Test").getOrCreate()
# Read the CSV file. The idea is to set inferSchema to False so that every column is read as a string.
df = spark.read.option("inferSchema", False).option("header", True).csv("dbfs:/FileStore/shared_uploads/info@srinimf.com/customers_100.csv")
# Show the first rows, then print the schema to confirm the column types
df.show()
df.printSchema()
Output
Because inferSchema is False, df.printSchema() reports every column as string.
Infer Schema: True or False
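Per the Apache Spark documentation, inferSchema defaults to False, which means Spark does not inspect the data and gives every column the default type. When you set it to True, PySpark infers the schema automatically, at the cost of an extra pass over the data. It’s a tricky point in interviews because the True and False settings are easy to mix up.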
With False, the default data type is StringType, so every CSV field is read as a string, that is, as text. This is how you read a CSV file as text.