Explore the top ten interview questions covering PySpark, AWS, Python, Databricks, and SQL.

Table of contents
Data Engineer Interview Questions
Python
01. How do you reverse each word instead of a character?
s=”My Name Is Srinivasa”
Output=”Srinivasa Is Name My”
Solution:
s="My Name Is Srinivasa"
words=s.split()
reversed_string = " ".join(reversed(words))
print(reversed_string)
Output
-----
Srinivasa Is Name My
02. How do you check if a given string is palindrome?
Solution:
a="malayalam"
b=a[::-1]
if a == b:
print("A is palindrome")
Output
------
A is palindrome
03. How do you replace a specific digit in Python?
s1=190.099.123.215
Replace the 099 with 99. The output will now look as 190.99.123.215
Solution:
## Method-1
s1="190.099.123.215"
s2=s1.replace("099", "99", 1) # The '1' means replace first occurance
print(s2)
Output
------
190.99.123.215
## Method-2
## Use the index method to replace '099' with '99'
s1 = "190.099.123.215"
index = s1.index('099')
s1 = s1[:index] + '99' + s1[index + len('099'):]
print(s1)
Output
------
190.99.123.215
SQL
04. How do you check and concatenate values in a MYSQL query?
Here is an example tested in MySQL.
T1
---
code1 code2
--- -----
A AA
B BB
C CC
Result
------
code1 code2 Combined
---- ----- -------
A AA AAA
B BB BBB
C CC CCC
SQL Query
=========
SELECT code1, code2,
CASE
WHEN code1 = 'A' AND code2 = 'AA' THEN CONCAT('A', 'AA')
WHEN code1 = 'B' AND code2 = 'BB' THEN CONCAT('B', 'BB')
WHEN code1 = 'C' AND code2 = 'CC' THEN CONCAT('C', 'CC')
ELSE NULL -- Optional: to handle cases that don't match
END AS Combined
FROM T1;
Output
code1 code2 Combined
A AA AAA
B BB BBB
C CC CCC
05. Write an SQL query to achieve the result as here?
Table1
-----
id name title
1 Venkat developer
2 Rao manager
3 Srini developer
4 Swapna manager
The output is expected as:
name1 name2 title
---- ----- -----
Venkat Srini developer
Rao Swapna manager
Solution:
SELECT e1.name AS name1, e2.name AS name2, e1.title
FROM emp e1
JOIN emp e2
ON e1.title = e2.title
WHERE e1.id < e2.id;
PySpark
06. groupByKey() vs reduceByKey() in PySpark?
The groupByKey groups values by keys and returns them as lists. ReduceByKey groups values by keys and applies a reduction function. This function computes the sum of values for each key. Here is a link to understand more.
07. What are SCD Type 1 & Type 2 changes in Data warehousing?
- Type 1 Slowly Changing Dimension (SCD):
- In a Type 1 SCD approach, the process updates the existing record with the new value when a dimension attribute is changed (It overwrites the old value)
- This approach does not retain any historical information about the changes. Only the latest version of the data is stored.
- Type 1 SCDs are suitable when historical changes to dimension attributes are not important or when it’s acceptable to lose historical information
- It is simpler and requires less storage space compared to Type 2.
- Type 2 Slowly Changing Dimension (SCD):
- In a Type 2 SCD approach, when a dimension attribute changes, a new record is inserted into the dimension table. This new record has the updated attribute values. The existing record is typically marked as inactive or historical.
- This approach retains the full history of changes to dimension attributes over time. It allows users to track changes and analyze historical data.
- Type 2 SCDs are suitable when historical changes to dimension attributes need to be preserved. This is necessary for analysis. It is also required for reporting or auditing purposes.
- It provides a more comprehensive view of historical data. However, it requires more storage space compared to Type 1. This is due to the retention of historical records.
Here’s a comparison between Type 1 and Type 2 SCD approaches:
| Aspect | Type 1 SCD | Type 2 SCD |
|---|---|---|
| Historical Data | Only latest data retained | Full history retained |
| Storage | Less storage required | More storage required |
| Complexity | Simpler to implement | More complex to implement |
| Historical Analysis | Limited | Comprehensive |
| Updates | Overwrites existing data | Inserts new records |
Ultimately, the choice between Type 1 and Type 2 SCDs depends on the specific requirements of the data warehousing project. This includes the need for historical analysis. Data storage considerations are also important. Additionally, the desired level of complexity in managing dimension data changes over time must be considered.
08. What is Data Lineage?
Data lineage is the process of understanding, recording, and visualizing data as it flows from data sources to consumption.
Databricks
09. What’s the memory calculation in the Databricks cluster?
A cluster with two workers has 40 cores and 100 GB of RAM each. This setup has the same computing and memory as an eight-worker cluster with 10 cores and 25 GB of RAM.
2 (nodes) x 40 (cores) of 100 GB RAM = 800
8 (nodes) x 10 (cores) of 25 GB RAM = 800
A node --> is virtual machine
A codre --> is a CPU unit
10. What is ACID in DBMS?
A – Atomicity
Example: Consider a bank transfer transaction in which money is transferred from one account to another. The atomicity property ensures that if the transfer fails midway (e.g., due to a network error), the entire transaction is rolled back, and neither account is affected.
C – Consistency
Example: Let us assume the student database has courses. Consistency ensures that if a student enrolls in a new course, all related data are updated correctly. This ensures that no data is left in an inconsistent state.
I – Isolation
Example: Consider two bank transfer transactions happening concurrently, transferring money between the same two accounts. Isolation ensures that even if these transactions are executed simultaneously. They do not interfere with each other. The final state of the accounts remains consistent.
D – Durability
Example: After a successful bank transfer transaction, the changes to the account balances must be durable. They should be permanently saved to the database’s storage system. The database server crashes immediately after the transaction is committed. The transferred amount should not be lost upon recovery.
References







You must be logged in to post a comment.