Here are SQL, Python, and PySpark interview questions asked at TCS and EXL. I have summarized them for readability and to help prepare for other interviews.

SQL Interview Questions
01. What is the difference between WITH and SELECT?
A SELECT statement retrieves specific or all rows from a table. The WITH clause, by contrast, creates a Common Table Expression (a.k.a. CTE), which acts like a temporary table with only the columns you need, so the main query does not have to touch all the columns.
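For instance, here is a minimal sketch run through Python's built-in sqlite3 module (the orders table and its columns are invented for illustration):
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 100), (2, 250), (3, 80)])

# The WITH clause defines a named temporary result set (the CTE)
# exposing only the columns we need; the outer SELECT queries it.
rows = conn.execute("""
    WITH big_orders AS (
        SELECT id, amount FROM orders WHERE amount > 90
    )
    SELECT id FROM big_orders
""").fetchall()
print(rows)  # [(1,), (2,)]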
02. What’s the difference between a clustered index and a non-clustered index?
A clustered index is faster because the physical storage order of the rows matches the order of the index. In contrast, a non-clustered index does not store rows in index order; simply put, the index structure and the storage of the actual data are distinct.
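SQLite offers a runnable analogue of the distinction (an assumption, since the question names no specific database): a WITHOUT ROWID table stores rows physically in primary-key order, much like a clustered index, while an ordinary CREATE INDEX builds a separate structure that points back at the rows:
import sqlite3

conn = sqlite3.connect(":memory:")
# Rows are physically ordered by the primary key -- the closest
# SQLite analogue of a clustered index.
conn.execute("CREATE TABLE emp (id INTEGER PRIMARY KEY, name TEXT) WITHOUT ROWID")
# A regular index lives in its own sorted structure, separate from
# the table's storage -- the non-clustered case.
conn.execute("CREATE INDEX idx_emp_name ON emp(name)")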
03. How do you write an SQL query to find the ids with count=1 in a table?
T1
===
id
===
1
2
1
1
3
SELECT ID FROM T1
GROUP BY ID
HAVING COUNT(ID) = 1;
Output
===
3
04. What’s star schema?
Star schema is a multi-dimensional relational database model. Star schemas are used in data warehouses and data marts. The schema is useful for querying large datasets.
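As a sketch using Python's sqlite3 module (table and column names are invented for illustration), a star schema keeps one central fact table referencing denormalized dimension tables:
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: denormalized, one hop from the fact table
    CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, full_date TEXT, year INTEGER);
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
    -- The fact table sits at the center of the "star"
    CREATE TABLE fact_sales (
        sale_id    INTEGER PRIMARY KEY,
        date_id    INTEGER REFERENCES dim_date(date_id),
        product_id INTEGER REFERENCES dim_product(product_id),
        amount     REAL
    );
""")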
05. What’s the snowflake schema?
It’s also a multi-dimensional data model and an extension of the star schema. Here, dimensions are broken down into sub-dimensions. The schema is useful for OLAP and relational databases.
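Reusing the illustrative names from the star-schema sketch above, snowflaking normalizes the product dimension so that category moves into its own sub-dimension:
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- The category column is split out into a sub-dimension
    CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category_name TEXT);
    CREATE TABLE dim_product (
        product_id  INTEGER PRIMARY KEY,
        name        TEXT,
        category_id INTEGER REFERENCES dim_category(category_id)
    );
    CREATE TABLE fact_sales (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES dim_product(product_id),
        amount     REAL
    );
""")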
06. How to create a Trigger and the use of it?
A trigger is a database object that fires automatically when a particular event occurs. In the example below, whenever we insert a new row into the “employees” table, the trigger sets the hire date to the system date for the newly inserted row. Triggers should be used judiciously; otherwise, they may lead to performance issues.
-- Creating a table for demonstration purposes
CREATE TABLE employees (
    employee_id   NUMBER PRIMARY KEY,
    employee_name VARCHAR2(50),
    hire_date     DATE
);

-- Creating a trigger
CREATE OR REPLACE TRIGGER trg_employee_insert
BEFORE INSERT ON employees
FOR EACH ROW
BEGIN
    -- Setting the hire_date to the current date when a new record is inserted
    :NEW.hire_date := SYSDATE;
END trg_employee_insert;
/
07. What’s ACID compliance?
ACID (Atomicity, Consistency, Isolation, Durability) compliance means the database processes transactions reliably: each transaction either completes fully or not at all, always leaves the data in a valid state, runs as if isolated from concurrent transactions, and, once committed, survives failures without data loss.
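Here is a small sketch of atomicity using Python's sqlite3 module (the accounts table is invented for illustration); the failed transfer rolls back in full:
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])

try:
    with conn:  # the with-block is one transaction: commit on success, rollback on error
        conn.execute("UPDATE accounts SET balance = balance - 70 WHERE id = 1")
        # Simulate a crash before the matching credit is applied
        raise RuntimeError("failure mid-transaction")
except RuntimeError:
    pass

# Atomicity: the debit above was rolled back with the failed transaction
print(conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone())  # (100,)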
Python Interview Questions
08. Why is a tuple faster than a List?
A tuple is faster than a list because of immutability: fewer operations apply to tuples than to lists, and creating a tuple is cheaper.
Also, tuples are stored internally in a single block of memory. Lists, on the other hand, are stored in two blocks: a fixed block holding the Python object's information and a variable block holding the actual data.
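You can check this yourself with timeit; exact numbers vary by machine and Python version:
import timeit

# Build the same five items as a tuple and as a list, one million times.
# CPython can even fold the constant tuple into a cached constant, so
# the tuple timing is typically far lower.
print("tuple:", timeit.timeit("(1, 2, 3, 4, 5)"))
print("list :", timeit.timeit("[1, 2, 3, 4, 5]"))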
09. Can we have a List in Tuple?
Yes, you can have a list in a tuple. Tuples support heterogeneous data types.
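A quick demonstration:
t = (1, [2, 3], "x")  # a tuple holding a list among other types
t[1].append(4)        # the tuple is immutable, but the list inside is not
print(t)              # (1, [2, 3, 4], 'x')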
10. What’s context manager in Python?
Context managers ensure resource cleanup and proper exception handling. With the “with” statement, __exit__() is called automatically, reducing the need for try-finally blocks.
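A minimal sketch of a hand-rolled context manager (the ManagedFile class and the demo.txt path are invented for illustration):
class ManagedFile:
    def __init__(self, path):
        self.path = path

    def __enter__(self):
        self.f = open(self.path, "w")  # acquire the resource
        return self.f

    def __exit__(self, exc_type, exc_value, traceback):
        self.f.close()   # runs even if the with-body raised
        return False     # do not suppress the exception, if any

with ManagedFile("demo.txt") as f:
    f.write("hello")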
11. Why are arrays faster? Will they have different data types or collection types?
Arrays are stored contiguously in memory, so accessing array elements is fast. Secondly, arrays can hold only homogeneous numeric items (the data type of every element must be the same).
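For example, with the standard-library array module:
from array import array

# Type code "i" fixes every element to a C int, stored in one
# contiguous block, so indexing is a direct offset calculation.
a = array("i", [10, 20, 30])
a.append(40)
print(a[2])        # 30
# a.append("x")    # TypeError: arrays hold homogeneous numeric items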
PySpark Interview Questions
12. Can we write a PySpark query to pull the ids with count=1?
In PySpark, to find the ids that appear exactly once, group by id, count, and filter:
from pyspark.sql.functions import count, col
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder.appName("Test").getOrCreate()
# Sample data
data = [("A", 1), ("B", 2), ("A", 3), ("C", 4), ("B", 5), ("D", 6)]
# Create a DataFrame
columns = ["id", "value"]
df = spark.createDataFrame(data, columns)
# Perform the aggregation and filter
result = df.groupBy("id") \
    .agg(count("id").alias("count_ids")) \
    .where(col("count_ids") == 1)
# Show the result
result.show(truncate=False)
13. Can we use SQL queries in PySpark, and how?
Yes, you can run SQL queries directly. First, register the DataFrame as a temporary view with a name (here, “people”); you can then query that view with spark.sql(). Here’s an example:
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder.appName("Example").getOrCreate()
# Sample data
data = [("Alice", 25), ("Bob", 30), ("Charlie", 22)]
# Create a DataFrame
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
# Create a temporary SQL table/view
df.createOrReplaceTempView("people")
# Execute SQL queries
result = spark.sql("SELECT Name, Age FROM people WHERE Age > 25")
# Show the result
result.show()