How do we read multiple files in PySpark? Here’s an interview question asked at Infosys. There are two common approaches: the wholeTextFiles() method and the recursiveFileLookup option.

Table of contents
- Use wholeTextFiles() to read multiple files
- Use the recursiveFileLookup option to read multiple files
Use wholeTextFiles() to read multiple files
wholeTextFiles() reads a directory of text files into an RDD in which each file is represented as a (filename, content) pair. Each file is read as a whole, so the entire content of a file is treated as a single string. The result is an RDD of key-value pairs: the key is the file path and the value is the file content.
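For example, you can peek at the first (path, content) pair to see this structure. A minimal sketch, using the same Databricks directory as the full example below:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Reading Multiple Files").getOrCreate()
sc = spark.sparkContext

# Each element is a (file path, full file content) tuple
pairs = sc.wholeTextFiles("dbfs:/FileStore/shared_uploads/info@srinimf.com")
path, content = pairs.first()
print(path)           # the full dbfs:/ path of the first file
print(content[:100])  # first 100 characters of that file's text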
First, read all the file paths into an RDD. Next, collect the RDD keys (the file paths) into a list. Finally, loop over the list and union all the files into a single DataFrame.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Reading Multiple Files").getOrCreate()
sc = spark.sparkContext

# Read the directory as (path, content) pairs, then collect just the paths
files_rdd = sc.wholeTextFiles("dbfs:/FileStore/shared_uploads/info@srinimf.com")
file_paths = files_rdd.keys().collect()

# Read each file into a DataFrame and union them one by one
combined_df = None
for path in file_paths:
    df = spark.read.csv(path, header=True)
    if combined_df is None:
        combined_df = df
    else:
        combined_df = combined_df.union(df)

combined_df.show()
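As an aside, spark.read.csv() also accepts a list of paths, so the union loop can be skipped when every file shares the same schema. A minimal sketch reusing spark and file_paths from the block above:

# Assumes all files have identical columns; Spark reads them as one DataFrame
combined_df = spark.read.csv(file_paths, header=True)
combined_df.show()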
Use the recursiveFileLookup option to read multiple files
recursiveFileLookup is an option used with reader methods such as spark.read.text() and spark.read.csv() to enable reading files recursively from a directory and all of its subdirectories. It is set to true or false.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Reading Multiple Files").getOrCreate()

# recursiveFileLookup makes Spark descend into every subdirectory
combined_df = spark.read \
    .option("recursiveFileLookup", "true") \
    .csv("dbfs:/FileStore/shared_uploads/info@srinimf.com", header=True)

combined_df.show()
This option was introduced in Spark 3.0. You set it while reading, and Spark picks up all the files, even those in nested folders.
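If the directory mixes file types, recursiveFileLookup can be combined with the pathGlobFilter option to restrict which files are read. A minimal sketch, assuming you only want the *.csv files:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Reading Multiple Files").getOrCreate()

# pathGlobFilter keeps only files whose names match the glob pattern
csv_only_df = spark.read \
    .option("recursiveFileLookup", "true") \
    .option("pathGlobFilter", "*.csv") \
    .csv("dbfs:/FileStore/shared_uploads/info@srinimf.com", header=True)

csv_only_df.show()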