We’ve compiled a list of SQL and PySpark interview questions asked at Perficient.


Table of contents

  1. SQL Interview Questions
    01. Write a SQL JOIN query.
    02. Write a DELETE query to remove duplicates.
    03. How many rows will the inner join query return?
  2. PySpark Interview Questions
    04. How to insert a new row into a DataFrame using PySpark?
    05. How to delete a specific row from a DataFrame using PySpark?
    06. How to replace a specific value in a DataFrame using PySpark?

SQL Interview Questions

01. Write a SQL JOIN query.

SELECT * FROM EMP
JOIN ADDRESS
ON EMP.ID = ADDRESS.ID;
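
For reference, JOIN by itself performs an INNER JOIN. The same query in PySpark’s DataFrame API would look like the sketch below; the sample data here is hypothetical, standing in for EMP and ADDRESS.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Join Example").getOrCreate()

# Hypothetical sample tables standing in for EMP and ADDRESS
emp_df = spark.createDataFrame([(1, "John"), (2, "Alice")], ["ID", "Name"])
address_df = spark.createDataFrame([(1, "NYC"), (2, "LA")], ["ID", "City"])

# Equivalent of: SELECT * FROM EMP JOIN ADDRESS ON EMP.ID = ADDRESS.ID
joined_df = emp_df.join(address_df, on="ID", how="inner")
joined_df.show()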

02. Write a DELETE query to remove duplicates.

Two common approaches are a DELETE with a subquery in the WHERE clause and a DELETE driven by a common table expression (CTE). Note that the subquery form removes every copy of a duplicated ID, including the original; to keep one copy per ID, rank the rows with ROW_NUMBER() and delete the rest, as in the second method.

-- Method 1: remove every row whose ID is duplicated
-- (deletes the original copy as well as the extras)
DELETE FROM ADDRESS
WHERE ID IN (SELECT ID
             FROM ADDRESS
             GROUP BY ID
             HAVING COUNT(*) > 1);

-- Method 2: keep one copy of each ID and delete the rest
-- (SQL Server syntax, which allows DELETE through a CTE)
WITH DUPLICATE_CTE AS (
    SELECT ID,
           ROW_NUMBER() OVER (PARTITION BY ID ORDER BY ID) AS RN
    FROM ADDRESS
)
DELETE FROM DUPLICATE_CTE
WHERE RN > 1;
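
The PySpark counterpart is a one-liner: dropDuplicates() keeps the first row seen for each key and drops the other copies. A minimal sketch with hypothetical sample data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Dedup Example").getOrCreate()

# Hypothetical ADDRESS data with a duplicated ID
address_df = spark.createDataFrame([(1,), (1,), (2,)], ["ID"])

# Keep the first row seen for each ID and drop the other copies
deduped_df = address_df.dropDuplicates(["ID"])
deduped_df.show()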

03. How many rows will the inner join query return?

table_1
========
id
===
1
1
1

table_2
======
id
===
1
1
1
null

select * from table_1
join table_2
on table_1.id = table_2.id;
-- result: 9 rows

Each of the three rows in table_1 matches each of the three id = 1 rows in table_2, so the inner join returns 3 × 3 = 9 rows. The null row contributes nothing: in SQL, null is not equal to any value, including another null, so it never satisfies the join condition.
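
You can sanity-check this in PySpark. A minimal sketch; the table and column names mirror the example above, and None stands in for SQL null:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Join Count Check").getOrCreate()

# Recreate the two tables; None plays the role of SQL null
table_1 = spark.createDataFrame([(1,), (1,), (1,)], "id INT")
table_2 = spark.createDataFrame([(1,), (1,), (1,), (None,)], "id INT")

# The null row never matches, so the inner join yields 3 x 3 = 9 rows
print(table_1.join(table_2, table_1["id"] == table_2["id"]).count())  # 9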

PySpark Interview Questions

04. How to insert a new row into a DataFrame using PySpark?

Since DataFrames are immutable, “inserting” a row means building a one-row DataFrame and combining it with the original using union(). Here’s how you can do it.

from pyspark.sql import SparkSession
from pyspark.sql import Row
# Initialize SparkSession
spark = SparkSession.builder \
    .appName("Insert Row Example") \
    .getOrCreate()

# Sample DataFrame
data = [("John", 25), ("Alice", 30)]
columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)

# New row to insert
new_row = Row("Bob", 35)

# Convert row to DataFrame
new_df = spark.createDataFrame([new_row], columns)

# Concatenate the original DataFrame with the new DataFrame
df = df.union(new_df)

# Show the updated DataFrame
df.show()
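
One caveat: union() matches columns by position, not by name. If the two DataFrames might list their columns in a different order, unionByName() (available since Spark 2.3) is the safer choice:

# Safer append when column order may differ: match columns by name
df = df.unionByName(new_df)
df.show()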

05. How to delete a specific row from a DataFrame using PySpark?

In PySpark, you can’t directly delete a specific row from a DataFrame because DataFrames are immutable. However, you can filter out the specific row that you want to delete and create a new DataFrame without that row. Here’s how you can do it.

from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("Delete Row Example") \
    .getOrCreate()

# Sample DataFrame
data = [("John", 25), ("Alice", 30), ("Bob", 35)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

# Define the row to delete
row_to_delete = ("Alice", 30) # Example: the row with Name="Alice" and Age=30

# Filter out the row to delete
filtered_df = df.filter((df["Name"] != row_to_delete[0]) | (df["Age"] != row_to_delete[1]))

# Show the updated DataFrame
filtered_df.show()
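
Equivalently, by De Morgan’s law, you can negate the exact match with ~, which reads more directly as “drop this one row”:

from pyspark.sql.functions import col

# Keep every row except the exact (Name, Age) match
filtered_df = df.filter(~((col("Name") == row_to_delete[0]) & (col("Age") == row_to_delete[1])))
filtered_df.show()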

06. How to replace a specific value in a DataFrame using PySpark?

In PySpark, you can replace specific values in a DataFrame using the withColumn() method along with the when() function from the pyspark.sql.functions module. Here’s how you can do it.

from pyspark.sql import SparkSession
from pyspark.sql.functions import when

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("Replace Value Example") \
    .getOrCreate()

# Sample DataFrame
data = [("John", 25), ("Alice", 30), ("Bob", 35)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

# Define the value to replace and the new value
old_value = 30
new_value = 40

# Replace the specific value in the DataFrame
df = df.withColumn("Age", when(df["Age"] == old_value, new_value).otherwise(df["Age"]))

# Show the updated DataFrame
df.show()
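
For a simple literal substitution like this, DataFrame.replace() gives the same result without a conditional expression; a minimal equivalent of the example above:

# Replace 30 with 40, only in the Age column
df = df.replace(old_value, new_value, subset=["Age"])
df.show()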