These are questions and solutions from a Brillio interview.

SQL Questions
Q1). Write a query that outputs the time to reach the next station for each train.
| Train_id | Station | Time |
|---|---|---|
| 110 | San Francisco | 10:00:00 |
| 110 | Redwood City | 10:54:00 |
| 110 | Palo Alto | 11:02:00 |
| 110 | San Jose | 12:35:00 |
| 120 | San Francisco | 11:00:00 |
| 120 | Palo Alto | 12:49:00 |
| 120 | San Jose | 13:30:00 |
Solution
SELECT
    Train_id,
    Station,
    TIME_TO_SEC(
        TIMEDIFF(
            LEAD(Time) OVER (PARTITION BY Train_id ORDER BY Time),
            Time
        )
    ) / 60 AS time_to_next_station_minutes
FROM
    train_table;
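On the sample data above, the query should produce output along these lines (the last stop of each train has no next station, so the value is NULL):
| Train_id | Station | time_to_next_station_minutes |
|---|---|---|
| 110 | San Francisco | 54 |
| 110 | Redwood City | 8 |
| 110 | Palo Alto | 93 |
| 110 | San Jose | NULL |
| 120 | San Francisco | 109 |
| 120 | Palo Alto | 41 |
| 120 | San Jose | NULL |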
PySpark Questions
Q2). Differences between RDD and DataFrame
Here is the answer link.
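In short: an RDD is a low-level, schema-less collection of objects manipulated with functional transformations, while a DataFrame organizes data into named columns with a schema and is optimized by the Catalyst optimizer. A minimal hedged sketch contrasting the two APIs (data and column names are illustrative):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RddVsDataFrame").getOrCreate()
sc = spark.sparkContext

pairs = [("user1", 2), ("user2", 3)]

# RDD: low-level API, transformations written as Python functions on raw tuples
rdd = sc.parallelize(pairs)
rdd_result = rdd.map(lambda kv: (kv[0], kv[1] * 10)).collect()

# DataFrame: named columns with a schema, queries optimized by Catalyst
df = spark.createDataFrame(pairs, ["username", "score"])
df_result = df.selectExpr("username", "score * 10 AS score_x10").collect()

print(rdd_result)
print(df_result)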
Q3). How to configure a range of executors in PySpark. For instance, minimum, maximum, and default.
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.minExecutors=1 # Minimum number of executors to keep
spark.dynamicAllocation.maxExecutors=10 # Maximum number of executors to use
spark.dynamicAllocation.initialExecutors=2 # Initial number of executors to start with
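The same settings can also be applied when building the SparkSession in code (or passed via spark-submit --conf); a minimal sketch, with the app name being illustrative:
from pyspark.sql import SparkSession

# Dynamic allocation lets Spark scale executors between the configured bounds
spark = (
    SparkSession.builder
    .appName("DynamicAllocationExample")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.maxExecutors", "10")
    .config("spark.dynamicAllocation.initialExecutors", "2")
    .getOrCreate()
)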
Q4). How do we achieve fault tolerance in PySpark?
Fault tolerance in PySpark is an essential feature that allows applications to recover from failures and continue processing without data loss. PySpark achieves fault tolerance mainly through the following mechanisms:
1. RDDs (Resilient Distributed Datasets)
- Immutable Data Structure: RDDs are immutable, meaning once created, they cannot be altered. This allows Spark to keep track of the transformations applied to RDDs, which is essential for recovering lost data.
- Lineage: RDDs maintain a lineage graph that tracks the sequence of operations (transformations) used to create them. If a partition of an RDD is lost due to a failure, Spark can recompute that partition using the lineage information.
2. Checkpointing
- Persistent Storage: Checkpointing saves an RDD to a reliable storage system (like HDFS) to break the lineage and avoid excessive recomputation.
- Usage: This is particularly useful for long lineage chains or iterative algorithms, as it can significantly reduce recovery time.
3. Data Replication
- Spark can replicate data across different nodes. This is more relevant in scenarios using data storage systems like HDFS, where data is inherently replicated.
4. Task Resubmission
- If a task fails due to a node failure or a lost task, Spark automatically resubmits that task to another node, using the lineage information to recompute any lost data.
5. Automatic Retry Mechanism
- Spark automatically retries failed tasks a specified number of times, allowing transient issues to be resolved without manual intervention.
6. Structured Streaming
- For streaming applications, Spark provides a fault-tolerant mechanism by maintaining offsets in a checkpoint location. If a failure occurs, the streaming application can recover from the last checkpoint.
Example of Checkpointing in PySpark
Here’s a simple example of checkpointing in PySpark:
from pyspark import SparkContext
from pyspark import SparkConf
# Initialize Spark
conf = SparkConf().setAppName("FaultToleranceExample")
sc = SparkContext(conf=conf)
# Set checkpoint directory
sc.setCheckpointDir("hdfs://path/to/checkpoint/dir")
# Create an RDD
data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data)
# Transform the RDD (example: square each number)
transformed_rdd = rdd.map(lambda x: x ** 2)
# Checkpoint the RDD
transformed_rdd.checkpoint()
# Perform an action
result = transformed_rdd.collect()
print(result)
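For point 6 (Structured Streaming), a hedged sketch is shown below; the source, sink, and checkpoint path are illustrative. Offsets and state are persisted to the checkpoint location, so the query can resume from the last committed batch after a failure.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StreamingCheckpointExample").getOrCreate()

# Built-in "rate" test source that emits rows at a fixed rate
stream_df = (
    spark.readStream
    .format("rate")
    .option("rowsPerSecond", 10)
    .load()
)

# The checkpoint location stores offsets and state, enabling recovery after failure
query = (
    stream_df.writeStream
    .format("console")
    .outputMode("append")
    .option("checkpointLocation", "hdfs://path/to/streaming/checkpoint")
    .start()
)

query.awaitTermination()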
Q5). Write down the execution steps in PySpark to find the count of unique mobile numbers for each user.
username, mobile
------- ------
user1, 999999991:888888882
user3, 777777771
user2, 777777234:823232351
user5, 734452343:943433434:834323434
user1, 999999991:9994433777
To find the number of unique mobile numbers for each user in PySpark from the given dataset, follow these execution steps:
Step-by-Step Approach in PySpark
1. Create a Spark Session
First, you need to initialize a Spark session to work with PySpark.
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder.appName("UniqueMobileCount").getOrCreate()
2. Create the Initial DataFrame
Create a DataFrame from the raw data. You can either manually create it or load it from a CSV file.
# Sample data for the DataFrame
data = [
("user1", "999999991:888888882"),
("user3", "777777771"),
("user2", "777777234:823232351"),
("user5", "734452343:943433434:834323434"),
("user1", "999999991:9994433777")
]
# Create the DataFrame
columns = ["username", "mobile"]
df = spark.createDataFrame(data, columns)
3. Split Mobile Numbers
Since mobile numbers are separated by colons (:), you need to split the mobile column into individual numbers.
from pyspark.sql.functions import split, explode
# Split the mobile numbers by ':' and create an array of mobile numbers
df_split = df.withColumn("mobile_split", split(df["mobile"], ":"))
4. Explode Mobile Numbers into Rows
Use the explode() function to transform the array of mobile numbers into individual rows.
# Explode the array of mobile numbers into separate rows
df_exploded = df_split.withColumn("mobile_exploded", explode(df_split["mobile_split"]))
5. Drop Unnecessary Columns
After exploding the mobile_split column, you can drop it to keep only the exploded mobile numbers.
# Drop the unnecessary mobile_split column
df_cleaned = df_exploded.select("username", "mobile_exploded")
6. Remove Duplicates for Each User
To count unique mobile numbers for each user, use distinct() to remove duplicate mobile numbers per user.
# Remove duplicate mobile numbers for each user
df_distinct = df_cleaned.distinct()
7. Group by Username and Count Unique Mobiles
Now, group the data by username and count the distinct mobile numbers for each user.
from pyspark.sql.functions import countDistinct
# Group by username and count distinct mobile numbers
df_result = df_distinct.groupBy("username").agg(countDistinct("mobile_exploded").alias("unique_mobile_count"))
8. Show the Final Result
Finally, show the result which includes each username and the corresponding unique mobile number count.
# Show the result
df_result.show()
Full Code Example
from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode, countDistinct
# Step 1: Create Spark session
spark = SparkSession.builder.appName("UniqueMobileCount").getOrCreate()
# Step 2: Define data
data = [
("user1", "999999991:888888882"),
("user3", "777777771"),
("user2", "777777234:823232351"),
("user5", "734452343:943433434:834323434"),
("user1", "999999991:9994433777")
]
# Step 3: Create DataFrame
columns = ["username", "mobile"]
df = spark.createDataFrame(data, columns)
# Step 4: Split mobile numbers by ':'
df_split = df.withColumn("mobile_split", split(df["mobile"], ":"))
# Step 5: Explode the array of mobile numbers into separate rows
df_exploded = df_split.withColumn("mobile_exploded", explode(df_split["mobile_split"]))
# Step 6: Drop the unnecessary mobile_split column
df_cleaned = df_exploded.select("username", "mobile_exploded")
# Step 7: Remove duplicates
df_distinct = df_cleaned.distinct()
# Step 8: Group by username and count distinct mobile numbers
df_result = df_distinct.groupBy("username").agg(countDistinct("mobile_exploded").alias("unique_mobile_count"))
# Step 9: Show result
df_result.show()
Expected Output:
+--------+-------------------+
|username|unique_mobile_count|
+--------+-------------------+
|   user1|                  3|
|   user2|                  2|
|   user3|                  1|
|   user5|                  3|
+--------+-------------------+
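For reference, the same result can be produced more compactly by chaining the split, explode, and aggregation steps on the df DataFrame created above (the mobile_number column name is illustrative):
from pyspark.sql import functions as F

# Split, explode, then count distinct numbers per user in one chained expression
df_alt = (
    df.withColumn("mobile_number", F.explode(F.split("mobile", ":")))
      .groupBy("username")
      .agg(F.countDistinct("mobile_number").alias("unique_mobile_count"))
)
df_alt.show()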
AWS Questions
Q6). What is the policy we need to use to delete a file in S3 Folder after a retention period?
To automatically delete files in an Amazon S3 bucket after a specified retention period, you would use S3 Lifecycle Policies. These policies allow you to define actions such as expiration, which deletes objects after a certain number of days.
Here’s a step-by-step guide and an example of how to set up an S3 Lifecycle Policy to delete files after a retention period:
Steps to Create a Lifecycle Policy for Deleting Files
- Go to the S3 Console:
  - Sign in to the AWS Management Console.
  - Navigate to the S3 service.
- Select the Bucket:
  - Select the bucket where you want to apply the lifecycle rule.
- Open the Management Tab:
  - Go to the “Management” tab within the selected bucket.
- Create a Lifecycle Rule:
  - Click on the “Lifecycle rules” section and select “Create lifecycle rule”.
  - Provide a name for the rule (e.g., “Delete old files after 30 days”).
- Configure the Rule Actions:
  - Choose the rule scope: apply the rule to the entire bucket or to specific prefixes or tags.
  - Under lifecycle rule actions:
    - Select “Expiration”.
    - Choose “Expire current versions of objects”.
    - Set the retention period (e.g., “Delete objects 30 days after creation”).
- Save the Rule:
  - Review the rule and save it.
Example: JSON Policy for S3 Lifecycle Configuration
Here’s an example of an S3 Lifecycle configuration JSON that deletes objects after 30 days:
{
  "Rules": [
    {
      "ID": "DeleteOldFilesAfter30Days",
      "Prefix": "",
      "Status": "Enabled",
      "Expiration": {
        "Days": 30
      }
    }
  ]
}
An empty Prefix applies the rule to the entire bucket (set a prefix to limit it to a folder). This configuration automatically deletes objects in the S3 bucket that are more than 30 days old.
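The same rule can also be applied programmatically with boto3; a hedged sketch, where the bucket name is a placeholder:
import boto3

s3 = boto3.client("s3")

# Apply the 30-day expiration rule; an empty Prefix filter covers the whole bucket
s3.put_bucket_lifecycle_configuration(
    Bucket="my-example-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "DeleteOldFilesAfter30Days",
                "Filter": {"Prefix": ""},
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            }
        ]
    },
)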
Additional Options for Lifecycle Rules
- Transitions: You can also move files to a cheaper storage class like Glacier or S3 Infrequent Access after some time before eventual deletion.
- Versioning: If versioning is enabled on the bucket, you can configure separate rules for current and previous versions of the objects.
Python Questions
Q7). Take two inputs, from and to. Find the prime numbers in that range, reduce each two-digit prime to a single digit by repeatedly summing its digits, and create the final list.
Here’s how you can approach the task:
- Generate prime numbers within the given range (from and to).
- Sum the digits of each two-digit prime repeatedly until a single digit remains.
- Return the final list of single-digit sums of the two-digit primes.
Python Code:
# Function to check if a number is prime
def is_prime(n):
    if n < 2:
        return False
    for i in range(2, int(n ** 0.5) + 1):
        if n % i == 0:
            return False
    return True

# Function to sum digits of a number until it becomes a single digit
def sum_to_single_digit(n):
    while n >= 10:
        n = sum(int(digit) for digit in str(n))
    return n

# Function to find primes in a given range and process two-digit primes
def prime_sum_in_range(from_num, to_num):
    final_list = []
    for num in range(from_num, to_num + 1):
        if is_prime(num) and 10 <= num <= 99:  # Check if the number is prime and two-digit
            single_digit_sum = sum_to_single_digit(num)
            final_list.append(single_digit_sum)
    return final_list

# Input range from user
from_num = int(input("Enter the starting number (from): "))
to_num = int(input("Enter the ending number (to): "))

# Get the final list of single-digit sums for two-digit primes
result = prime_sum_in_range(from_num, to_num)

# Output the final list
print("Final list of single-digit sums of two-digit primes:", result)
Example Walkthrough:
Let’s say the user inputs the range from = 10 and to = 50.
- Step 1: Identify prime numbers between 10 and 50:
- Prime numbers: 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47
- Step 2: Sum the digits of the two-digit prime numbers:
- 11 → 1 + 1 = 2
- 13 → 1 + 3 = 4
- 17 → 1 + 7 = 8
- 19 → 1 + 9 = 10 → 1 + 0 = 1
- 23 → 2 + 3 = 5
- 29 → 2 + 9 = 11 → 1 + 1 = 2
- 31 → 3 + 1 = 4
- 37 → 3 + 7 = 10 → 1 + 0 = 1
- 41 → 4 + 1 = 5
- 43 → 4 + 3 = 7
- 47 → 4 + 7 = 11 → 1 + 1 = 2
- Final list: [2, 4, 8, 1, 5, 2, 4, 1, 5, 7, 2]
Output:
Final list of single-digit sums of two-digit primes: [2, 4, 8, 1, 5, 2, 4, 1, 5, 7, 2]