10 Tricky PySpark, SQL, Python Interview Questions

Explore the top ten interview questions covering PySpark, AWS, Python, Databricks, and SQL.

Data engineer interview questions — xr:d:DAF-z_NNlqI:3,j:716493247427626159,t:24030707

Table of contents

Data Engineer Interview Questions

Data Engineer Interview Questions

Python

01. How do you reverse each word instead of a character?

s=”My Name Is Srinivasa”

Output=”Srinivasa Is Name My”

Solution:

s="My Name Is Srinivasa"
words=s.split()
reversed_string = " ".join(reversed(words))
print(reversed_string)

Output
-----
Srinivasa Is Name My

02. How do you check if a given string is palindrome?

Solution:

a="malayalam"
b=a[::-1]
if a == b:
    print("A is palindrome")

Output
------
A is palindrome

03. How do you replace a specific digit in Python?

s1=190.099.123.215

Replace the 099 with 99. The output will now look as 190.99.123.215

Solution:

## Method-1
s1="190.099.123.215"
s2=s1.replace("099", "99", 1) # The '1' means replace first occurance
print(s2)

Output
------
190.99.123.215

## Method-2
## Use the index method to replace '099' with '99'
s1 = "190.099.123.215"
index = s1.index('099')
s1 = s1[:index] + '99' + s1[index + len('099'):]
print(s1)

Output
------
190.99.123.215

SQL

04. How do you check and concatenate values in a MYSQL query?

Here is an example tested in MySQL.

T1
---
code1   code2
---     -----
A       AA
B       BB
C       CC

Result
------
code1  code2   Combined
----   -----    -------
A       AA      AAA
B       BB      BBB
C       CC      CCC

SQL Query
=========
SELECT code1, code2, 
    CASE
        WHEN code1 = 'A' AND code2 = 'AA' THEN CONCAT('A', 'AA')
        WHEN code1 = 'B' AND code2 = 'BB' THEN CONCAT('B', 'BB')
        WHEN code1 = 'C' AND code2 = 'CC' THEN CONCAT('C', 'CC')
        ELSE NULL -- Optional: to handle cases that don't match
    END AS Combined
FROM T1;

Output

code1	code2	Combined
A	AA	AAA
B	BB	BBB
C	CC	CCC

05. Write an SQL query to achieve the result as here?

Table1
-----
id  name  title
1   Venkat developer
2   Rao    manager
3   Srini  developer
4   Swapna manager

The output is expected as:
name1    name2   title
----     -----     -----
Venkat   Srini    developer
Rao      Swapna   manager

Solution:

SELECT e1.name AS name1, e2.name AS name2, e1.title
FROM emp e1
JOIN emp e2
  ON e1.title = e2.title
WHERE e1.id < e2.id;

PySpark

06. groupByKey() vs reduceByKey() in PySpark?

The groupByKey groups values by keys and returns them as lists. ReduceByKey groups values by keys and applies a reduction function. This function computes the sum of values for each key. Here is a link to understand more.

07. What are SCD Type 1 & Type 2 changes in Data warehousing?

Type 1 Slowly Changing Dimension (SCD):
- In a Type 1 SCD approach, the process updates the existing record with the new value when a dimension attribute is changed (It overwrites the old value)
- This approach does not retain any historical information about the changes. Only the latest version of the data is stored.
- Type 1 SCDs are suitable when historical changes to dimension attributes are not important or when it’s acceptable to lose historical information
- It is simpler and requires less storage space compared to Type 2.
Type 2 Slowly Changing Dimension (SCD):
- In a Type 2 SCD approach, when a dimension attribute changes, a new record is inserted into the dimension table. This new record has the updated attribute values. The existing record is typically marked as inactive or historical.
- This approach retains the full history of changes to dimension attributes over time. It allows users to track changes and analyze historical data.
- Type 2 SCDs are suitable when historical changes to dimension attributes need to be preserved. This is necessary for analysis. It is also required for reporting or auditing purposes.
- It provides a more comprehensive view of historical data. However, it requires more storage space compared to Type 1. This is due to the retention of historical records.

Here’s a comparison between Type 1 and Type 2 SCD approaches:

Aspect	Type 1 SCD	Type 2 SCD
Historical Data	Only latest data retained	Full history retained
Storage	Less storage required	More storage required
Complexity	Simpler to implement	More complex to implement
Historical Analysis	Limited	Comprehensive
Updates	Overwrites existing data	Inserts new records

SCD Type 1 and Type 2 Comparision

Ultimately, the choice between Type 1 and Type 2 SCDs depends on the specific requirements of the data warehousing project. This includes the need for historical analysis. Data storage considerations are also important. Additionally, the desired level of complexity in managing dimension data changes over time must be considered.

08. What is Data Lineage?

Data lineage is the process of understanding, recording, and visualizing data as it flows from data sources to consumption.

Databricks

09. What’s the memory calculation in the Databricks cluster?

A cluster with two workers has 40 cores and 100 GB of RAM each. This setup has the same computing and memory as an eight-worker cluster with 10 cores and 25 GB of RAM.

2 (nodes) x 40 (cores) of 100 GB RAM = 800
8 (nodes) x 10 (cores) of 25 GB RAM = 800
A node --> is virtual machine
A codre --> is a CPU unit

10. What is ACID in DBMS?

A – Atomicity

Example: Consider a bank transfer transaction in which money is transferred from one account to another. The atomicity property ensures that if the transfer fails midway (e.g., due to a network error), the entire transaction is rolled back, and neither account is affected.

C – Consistency

Example: Let us assume the student database has courses. Consistency ensures that if a student enrolls in a new course, all related data are updated correctly. This ensures that no data is left in an inconsistent state.

I – Isolation

Example: Consider two bank transfer transactions happening concurrently, transferring money between the same two accounts. Isolation ensures that even if these transactions are executed simultaneously. They do not interfere with each other. The final state of the accounts remains consistent.

D – Durability

Example: After a successful bank transfer transaction, the changes to the account balances must be durable. They should be permanently saved to the database’s storage system. The database server crashes immediately after the transaction is committed. The transferred amount should not be lost upon recovery.

References

Pandas examples to practice

Srini

Data Engineer with deep AI and Generative AI expertise, crafting high-performance data pipelines in PySpark, Databricks, and SQL. Skilled in Python, AWS, and Linux—building scalable, cloud-native solutions for smart applications.