Here are SQL and PySpark examples on ETL and string slicing examples. In a recent interview, these were asked.

PySpark String Slicing
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType
# Create a Spark session
spark = SparkSession.builder.appName("StringManipulation").getOrCreate()
# Sample data
data = [("sample",), ("samiple",)]
df = spark.createDataFrame(data, ["input_string"])
# Define a Python function to split and manipulate the string
def reverse_half(s):
length = len(s)
if length % 2 == 0:
# For even length, reverse the left and right halves equally
left_half = s[:length//2][::-1]
right_half = s[length//2:][::-1]
else:
# For odd length, reverse the left half (which is larger) and right half
left_half = s[:length//2 + 1][::-1]
right_half = s[length//2 + 1:][::-1]
return left_half + right_half
# Register the function as a UDF
reverse_half_udf = udf(reverse_half, StringType())
# Apply the UDF to the DataFrame
result_df = df.withColumn("manipulated_string", reverse_half_udf(col("input_string")))
# Show the result
result_df.show(truncate=False)
Explanation of the Logic
- Calculating
length:- For an odd-length string, the length will always be an odd number. For example, consider the string
"abcde"which has a length of 5.
- For an odd-length string, the length will always be an odd number. For example, consider the string
- Using
length//2:length // 2gives the index of the middle character in the string. For"abcde",length // 2evaluates to2, which corresponds to the character"c"(the middle character).
- Slicing for the Left Half:
left_half = s[:length//2 + 1][::-1]:- Here,
length // 2 + 1calculates to3. Therefore,s[:3]gives the first three characters, which is"abc". - The left half consists of the characters from the start of the string up to and including the middle character:
- In this case:
- Original:
"abcde" - Left Half:
"abc" - After reversing:
"cba"
- Original:
- In this case:
- Here,
- Slicing for the Right Half:
right_half = s[length//2 + 1:][::-1]:- Here,
length // 2 + 1evaluates to3, sos[3:]captures all characters starting from index3, which is"de". - The right half consists of all characters after the middle character:
- In this case:
- Original:
"abcde" - Right Half:
"de" - After reversing:
"ed"
- Original:
- In this case:
- Here,
Output
+------------+------------------+
|input_string|manipulated_string|
+------------+------------------+
|sample |maselp |
|samiple |imaselp |
+------------+------------------+

“The more you learn, the more you earn.”
Warren Buffett
SQL ETL Query
SELECT
ed.emp_id,
ed.mob_no,
ed.Address,
'Y' AS Flag -- Keep old records with Flag as 'Y'
FROM emp_data ed
UNION ALL
-- Selecting rows from emp_latest
SELECT
el.emp_id,
el.mob_no,
el.Address,
CASE
WHEN ed.emp_id IS NOT NULL THEN 'N' -- Existing entry with changes, set Flag to 'N'
ELSE 'Y' -- Brand new entry, set Flag to 'Y'
END AS Flag
FROM emp_latest el
LEFT JOIN emp_data ed
ON el.emp_id = ed.emp_id
AND (el.mob_no <> ed.mob_no OR el.Address <> ed.Address) -- Check for changes
WHERE ed.emp_id IS NULL OR (el.mob_no <> ed.mob_no OR el.Address <> ed.Address);
Tables and Data
CREATE TABLE emp_data (
emp_id INT PRIMARY KEY, -- Employee ID, primary key
mob_no VARCHAR(15), -- Mobile number, can store up to 15 characters
Address VARCHAR(255), -- Address, can store up to 255 characters
Flag CHAR(1) -- Flag indicating status, single character (Y/N)
);
--
CREATE TABLE emp_latest (
emp_id INT PRIMARY KEY, -- Employee ID, primary key
mob_no VARCHAR(15), -- Mobile number, can store up to 15 characters
Address VARCHAR(255) -- Address, can store up to 255 characters
);
--
INSERT INTO emp_data (emp_id, mob_no, Address, Flag) VALUES
(123, '7777777', 'Bangalore', 'Y'),
(456, '8888888', 'Chennai', 'Y'),
(789, '9999999', 'Pune', 'Y'),
(124, '6666666', 'Hyderabad', 'Y'),
(333, '9199191', 'Delhi', 'Y');
--
INSERT INTO emp_latest (emp_id, mob_no, Address) VALUES
(789, '9999999', 'Bangalore'),
(124, '6666666', 'Hyderabad'),
(333, '9199191', 'Bangalore'),
(444, '7171717', 'Jaipur');

“Live as if you were to die tomorrow. Learn as if you were to live forever.”
Mahatma Gandhi
Explanation of the Query:
- First SELECT Block:
- This selects all existing records from the
emp_datatable with theirFlagset to'Y'.
- This selects all existing records from the
- Second SELECT Block:
- This part selects records from the
emp_latesttable. - The
CASEstatement is used to determine the value ofFlag:- If the employee ID exists in both tables (
ed.emp_id IS NOT NULL), this indicates that the record is not new, and we setFlagto'N'if there are changes inmob_noorAddress. - If the record does not exist in
emp_data(ed.emp_id IS NULL), it means it’s a new entry, and we setFlagto'Y'.
- If the employee ID exists in both tables (
- This part selects records from the
- WHERE Clause:
- The condition
WHERE ed.emp_id IS NULLcaptures new entries that don’t exist inemp_data. - The additional condition
(el.mob_no <> ed.mob_no OR el.Address <> ed.Address)checks for modifications. If there’s a mismatch in eithermob_noorAddress, it implies that the entry inemp_latestis not identical to the one inemp_data.
- The condition
Output
emp_id mob_no Address Flag
-- ----- ------ -----
123 7777777 Bangalore Y
124 6666666 Hyderabad Y
333 9199191 Delhi Y
456 8888888 Chennai Y
789 9999999 Pune Y
124 6666666 Hyderabad Y
333 9199191 Bangalore N
444 7171717 Jaipur Y
789 9999999 Bangalore N
Result:
This query will:
- Keep the
Flagas'Y'for all existing records inemp_data. - Set the
Flagto'N'for modified records inemp_latestthat match existing records inemp_data. - Set the
Flagto'Y'for completely new entries inemp_latest.







You must be logged in to post a comment.