Here are SQL and PySpark examples on ETL and string slicing examples. In a recent interview, these were asked.

PySpark and SQL interview questions
Photo by Monstera Production on Pexels.com

PySpark String Slicing

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

# Create a Spark session
spark = SparkSession.builder.appName("StringManipulation").getOrCreate()

# Sample data
data = [("sample",), ("samiple",)]
df = spark.createDataFrame(data, ["input_string"])

# Define a Python function to split and manipulate the string
def reverse_half(s):
    length = len(s)
    if length % 2 == 0:
        # For even length, reverse the left and right halves equally
        left_half = s[:length//2][::-1]
        right_half = s[length//2:][::-1]
    else:
        # For odd length, reverse the left half (which is larger) and right half
        left_half = s[:length//2 + 1][::-1]
        right_half = s[length//2 + 1:][::-1]
    return left_half + right_half

# Register the function as a UDF
reverse_half_udf = udf(reverse_half, StringType())

# Apply the UDF to the DataFrame
result_df = df.withColumn("manipulated_string", reverse_half_udf(col("input_string")))

# Show the result
result_df.show(truncate=False)

Explanation of the Logic

  1. Calculating length:
    • For an odd-length string, the length will always be an odd number. For example, consider the string "abcde" which has a length of 5.
  2. Using length//2:
    • length // 2 gives the index of the middle character in the string. For "abcde", length // 2 evaluates to 2, which corresponds to the character "c" (the middle character).
  3. Slicing for the Left Half:
    • left_half = s[:length//2 + 1][::-1]:
      • Here, length // 2 + 1 calculates to 3. Therefore, s[:3] gives the first three characters, which is "abc".
      • The left half consists of the characters from the start of the string up to and including the middle character:
        • In this case:
          • Original: "abcde"
          • Left Half: "abc"
          • After reversing: "cba"
  4. Slicing for the Right Half:
    • right_half = s[length//2 + 1:][::-1]:
      • Here, length // 2 + 1 evaluates to 3, so s[3:] captures all characters starting from index 3, which is "de".
      • The right half consists of all characters after the middle character:
        • In this case:
          • Original: "abcde"
          • Right Half: "de"
          • After reversing: "ed"

Output

+------------+------------------+
|input_string|manipulated_string|
+------------+------------------+
|sample      |maselp            |
|samiple     |imaselp           |
+------------+------------------+

“The more you learn, the more you earn.”

Warren Buffett

SQL ETL Query

SELECT 
    ed.emp_id, 
    ed.mob_no, 
    ed.Address, 
    'Y' AS Flag  -- Keep old records with Flag as 'Y'
FROM emp_data ed

UNION ALL

-- Selecting rows from emp_latest
SELECT 
    el.emp_id,
    el.mob_no, 
    el.Address, 
    CASE 
        WHEN ed.emp_id IS NOT NULL THEN 'N'  -- Existing entry with changes, set Flag to 'N'
        ELSE 'Y'  -- Brand new entry, set Flag to 'Y'
    END AS Flag
FROM emp_latest el
LEFT JOIN emp_data ed 
    ON el.emp_id = ed.emp_id
    AND (el.mob_no <> ed.mob_no OR el.Address <> ed.Address)  -- Check for changes
WHERE ed.emp_id IS NULL OR (el.mob_no <> ed.mob_no OR el.Address <> ed.Address);

Tables and Data

CREATE TABLE emp_data (
    emp_id INT PRIMARY KEY,       -- Employee ID, primary key
    mob_no VARCHAR(15),          -- Mobile number, can store up to 15 characters
    Address VARCHAR(255),        -- Address, can store up to 255 characters
    Flag CHAR(1)                -- Flag indicating status, single character (Y/N)
);
-- 
CREATE TABLE emp_latest (
    emp_id INT PRIMARY KEY,       -- Employee ID, primary key
    mob_no VARCHAR(15),          -- Mobile number, can store up to 15 characters
    Address VARCHAR(255)         -- Address, can store up to 255 characters
);

-- 
INSERT INTO emp_data (emp_id, mob_no, Address, Flag) VALUES
(123, '7777777', 'Bangalore', 'Y'),
(456, '8888888', 'Chennai', 'Y'),
(789, '9999999', 'Pune', 'Y'),
(124, '6666666', 'Hyderabad', 'Y'),
(333, '9199191', 'Delhi', 'Y');

-- 

INSERT INTO emp_latest (emp_id, mob_no, Address) VALUES
(789, '9999999', 'Bangalore'),
(124, '6666666', 'Hyderabad'),
(333, '9199191', 'Bangalore'),
(444, '7171717', 'Jaipur');

Mahatma Gandhi

Explanation of the Query:

  1. First SELECT Block:
    • This selects all existing records from the emp_data table with their Flag set to 'Y'.
  2. Second SELECT Block:
    • This part selects records from the emp_latest table.
    • The CASE statement is used to determine the value of Flag:
      • If the employee ID exists in both tables (ed.emp_id IS NOT NULL), this indicates that the record is not new, and we set Flag to 'N' if there are changes in mob_no or Address.
      • If the record does not exist in emp_data (ed.emp_id IS NULL), it means it’s a new entry, and we set Flag to 'Y'.
  3. WHERE Clause:
    • The condition WHERE ed.emp_id IS NULL captures new entries that don’t exist in emp_data.
    • The additional condition (el.mob_no <> ed.mob_no OR el.Address <> ed.Address) checks for modifications. If there’s a mismatch in either mob_no or Address, it implies that the entry in emp_latest is not identical to the one in emp_data.

Output

emp_id	mob_no	Address	       Flag
--      -----   ------         -----
123	7777777	Bangalore	Y
124	6666666	Hyderabad	Y
333	9199191	Delhi	        Y
456	8888888	Chennai	        Y
789	9999999	Pune	        Y
124	6666666	Hyderabad	Y
333	9199191	Bangalore	N
444	7171717	Jaipur	        Y
789	9999999	Bangalore	N

Result:

This query will:

  • Keep the Flag as 'Y' for all existing records in emp_data.
  • Set the Flag to 'N' for modified records in emp_latest that match existing records in emp_data.
  • Set the Flag to 'Y' for completely new entries in emp_latest.