In a recent Infogain interview, the following five questions were asked for the Data Engineer role. Here are the questions and answers for your review.


1. Partitioning Vs. Bucketing in PySpark?

In the interview, you can mention these points:

Partitioning:

  • It divides the data into smaller, manageable parts based on the values of a column.
  • For example, if you have sales data for the years 2019, 2020, and 2021, partitioning on the year column writes each year's data into its own partition (a separate directory), as shown in the sketch below.
  • PySpark can skip partitions that don't match the filter()/where() conditions (partition pruning), which speeds up processing.
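
A minimal sketch of writing and reading partitioned data; the paths and the year column are placeholders, not from the interview:

from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder \
    .appName("Partitioning Example") \
    .getOrCreate()
# Read the sales data (placeholder path and columns)
sales_df = spark.read.format("csv").option("header", "true").load("path_to_sales_data")
# Write the data partitioned by year: each year lands in its own directory
sales_df.write.format("parquet") \
    .partitionBy("year") \
    .save("path_to_partitioned_output")
# A filter on the partition column lets Spark skip the other partitions (partition pruning)
df_2020 = spark.read.parquet("path_to_partitioned_output").filter("year = 2020")
df_2020.show()
# Stop the SparkSession
spark.stop()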

Bucketing:

  • Bucketing distributes rows into a fixed number of buckets based on a hash of the bucketed column(s), and it requires the data to be grouped by those columns.
  • It improves performance for joins on large datasets because matching buckets can be joined with far less data movement.
  • Here is an example of writing a bucketed table:
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder \
    .appName("Bucketing Example") \
    .getOrCreate()
# Read the dataset
df = spark.read.format("csv").option("header", "true").load("path_to_input_data")
# Bucketing options
num_buckets = 10  # Specify the number of buckets
bucket_column = "customer_id"  # Specify the column to bucket on

# Bucket the dataset
df.write.format("parquet") \
    .bucketBy(num_buckets, bucket_column) \
    .saveAsTable("bucketed_table")
# Stop the SparkSession
spark.stop()
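
Note that bucketBy() only works together with saveAsTable(); writing with save() alone is not supported. The payoff comes at join time: when both sides are bucketed on the join key with the same number of buckets, Spark can often perform the join without shuffling either side. A hedged sketch with an active SparkSession; the table names bucketed_orders and bucketed_customers are illustrative assumptions:

# Both tables are assumed to be bucketed on customer_id with the same bucket count
orders_df = spark.table("bucketed_orders")
customers_df = spark.table("bucketed_customers")
# The join key matches the bucketed column, so the shuffle can be avoided
joined_df = orders_df.join(customers_df, "customer_id")
joined_df.show()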


2. How do we read petabyte-scale data in PySpark?

Normally, Spark’s operators spill data to disk when it does not fit in memory, which lets Spark run well on data of any size. For petabyte-scale inputs, read from a splittable format so the data is processed partition by partition, and persist reused intermediate results to memory and disk, as in the example below.

from pyspark.sql import SparkSession
from pyspark import StorageLevel
# Create a SparkSession
spark = SparkSession.builder \
    .appName("Persist Example") \
    .getOrCreate()
# Read the dataset
df = spark.read.format("csv").option("header", "true").load("path_to_input_data")
# Persist the DataFrame to memory, spilling to disk when it does not fit
persisted_df = df.persist(StorageLevel.MEMORY_AND_DISK)

# Perform some operations on the persisted DataFrame
result_df = persisted_df.filter(persisted_df["column"] > 10).groupBy("group_column").count()
# Show the result
result_df.show()
# Stop the SparkSession
spark.stop()

3. Explain the architecture of Delta Lake.

In the interview, we need to mention these points:

  • It supports both batch and streaming data.
  • It follows the Bronze, Silver, and Gold table (medallion) approach: Bronze tables hold raw data, Silver tables hold cleaned and transformed data, and Gold tables hold refined data ready for BI and reporting (see the sketch after this list).
  • It supports ACID transactions through a transaction log, so a Delta table behaves much like a table in a normal RDBMS.
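
A minimal sketch of the Bronze/Silver/Gold flow with Delta tables. The paths, column names, and transformations are placeholders, and it assumes an existing SparkSession named spark with Delta Lake available (as on Databricks, where spark is predefined):

# Bronze: ingest the raw data as-is
raw_df = spark.read.format("json").load("path_to_raw_events")
raw_df.write.format("delta").mode("append").save("/delta/bronze/events")

# Silver: clean and transform the Bronze data
bronze_df = spark.read.format("delta").load("/delta/bronze/events")
silver_df = bronze_df.dropDuplicates().filter("event_type IS NOT NULL")
silver_df.write.format("delta").mode("overwrite").save("/delta/silver/events")

# Gold: aggregate the Silver data for BI and reporting
gold_df = silver_df.groupBy("event_type").count()
gold_df.write.format("delta").mode("overwrite").save("/delta/gold/event_counts")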

4. How do we read a schema-less file in PySpark?

When we set inferSchema=True, Spark samples the data and defines the schema automatically. If we need a different schema, we must define it ourselves and pass it to the reader.

from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder \
    .appName("Read Schemaless CSV") \
    .getOrCreate()
# Read CSV file with schema inference
df = spark.read.csv("path_to_csv_file", header=True, inferSchema=True)
# Show the inferred schema
df.printSchema()
# Show the data
df.show()
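
If schema inference is not suitable (for example, to avoid the extra pass over a large file or to force specific types), the schema can be defined explicitly and passed to the reader. A sketch with placeholder column names, reusing the SparkSession above:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Define the schema explicitly (placeholder columns)
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])
# Read the CSV file with the explicit schema instead of inferring it
df_explicit = spark.read.csv("path_to_csv_file", header=True, schema=schema)
df_explicit.printSchema()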

“The world is a tragedy to those who feel, but a comedy to those who think.”

― Horace Walpole

5. How do we know when to increase partitions in PySpark on Databricks?

A good target partition size in PySpark is around 128 MB (Spark’s default for spark.sql.files.maxPartitionBytes). Here is one way to estimate the number of partitions:

from pyspark.sql import SparkSession
import math
# Create a SparkSession
spark = SparkSession.builder \
    .appName("Calculate Partitions Example") \
    .getOrCreate()
# Approximate input size in MB (replace with the actual file size)
file_size_mb = 1000
# Target partition size in MB
target_partition_size_mb = 128
# Get the default parallelism (cores available to the application)
default_parallelism = spark.sparkContext.defaultParallelism
# Aim for ~128 MB per partition, but keep at least one partition per available core
num_partitions = max(default_parallelism, math.ceil(file_size_mb / target_partition_size_mb))

print("Number of partitions:", num_partitions)
# Stop the SparkSession
spark.stop()
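
To act on the estimate, compare it with the DataFrame’s current partition count and repartition (or coalesce when reducing) before the session is stopped. A sketch assuming a DataFrame df read from the same input (df is an assumption, not part of the original snippet):

# Check how many partitions the DataFrame currently has
current_partitions = df.rdd.getNumPartitions()
# Increase partitions when there are too few for the data size
if current_partitions < num_partitions:
    df = df.repartition(num_partitions)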