How do we read multiple files in PySpark? Here’s an interview question asked at Infosys. There are two common approaches: the wholeTextFiles() method and the recursiveFileLookup option.

Table of contents
- Use wholeTextFiles() to read multiple files
- Use the recursiveFileLookup option to read multiple files
Use wholeTextFiles() to read multiple files
wholeTextFiles() reads a directory of text files into an RDD in which each file is represented as a (filename, content) pair. Each file is read as a whole, so the entire content of a file is treated as a single string. The result is an RDD of key-value pairs: the key is the file path and the value is the file content.
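For example, you can peek at the first (path, content) pair to see this structure. A minimal sketch, using the same Databricks directory as the full example below:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Reading Multiple Files").getOrCreate()
sc = spark.sparkContext

# Each element is a (file path, full file content) tuple
pairs = sc.wholeTextFiles("dbfs:/FileStore/shared_uploads/info@srinimf.com")
path, content = pairs.first()
print(path)           # the full dbfs:/ path of the first file
print(content[:100])  # first 100 characters of that file's text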
First, read all the file paths into an RDD. Next, collect the RDD keys (the file paths) into a list. Finally, loop over the list and union all the files into a single DataFrame.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Reading Multiple Files").getOrCreate()
sc = spark.sparkContext

# Read the directory as (path, content) pairs, then collect just the paths
files_rdd = sc.wholeTextFiles("dbfs:/FileStore/shared_uploads/info@srinimf.com")
file_paths = files_rdd.keys().collect()

# Read each file into a DataFrame and union them one by one
combined_df = None
for path in file_paths:
    df = spark.read.csv(path, header=True)
    if combined_df is None:
        combined_df = df
    else:
        combined_df = combined_df.union(df)

combined_df.show()
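As an aside, spark.read.csv() also accepts a list of paths, so the union loop can be skipped when every file shares the same schema. A minimal sketch reusing spark and file_paths from the block above:

# Assumes all files have identical columns; Spark reads them as one DataFrame
combined_df = spark.read.csv(file_paths, header=True)
combined_df.show()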
Use the recursiveFileLookup option to read multiple files
recursiveFileLookup is an option used with reader methods such as spark.read.text() and spark.read.csv() to enable reading files recursively from a directory and all of its subdirectories. It is set to true or false.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Reading Multiple Files").getOrCreate()

# recursiveFileLookup makes Spark descend into every subdirectory
combined_df = spark.read \
    .option("recursiveFileLookup", "true") \
    .csv("dbfs:/FileStore/shared_uploads/info@srinimf.com", header=True)

combined_df.show()
This option was introduced in Spark 3.0. You set it while reading, and Spark picks up all the files, even those in nested folders.
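If the directory mixes file types, recursiveFileLookup can be combined with the pathGlobFilter option to restrict which files are read. A minimal sketch, assuming you only want the *.csv files:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Reading Multiple Files").getOrCreate()

# pathGlobFilter keeps only files whose names match the glob pattern
csv_only_df = spark.read \
    .option("recursiveFileLookup", "true") \
    .option("pathGlobFilter", "*.csv") \
    .csv("dbfs:/FileStore/shared_uploads/info@srinimf.com", header=True)

csv_only_df.show()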