We’ve compiled a comprehensive list of SQL and PySpark interview questions asked at Perficient.

SQL Interview Questions
01. How do you write a SQL JOIN query?
-- Inner join: return the matching rows from EMP and ADDRESS on ID
SELECT *
FROM EMP
JOIN ADDRESS
  ON EMP.ID = ADDRESS.ID;
02. How do you write a DELETE query to remove duplicate rows?
Two common approaches are a DELETE with a grouping subquery in the WHERE clause and a DELETE driven by a common table expression (CTE).
-- Method 1: DELETE with a grouping subquery in the WHERE clause
DELETE FROM ADDRESS
WHERE ID IN (
    SELECT ID
    FROM ADDRESS
    GROUP BY ID
    HAVING COUNT(*) > 1
);

-- Method 2: DELETE driven by a common table expression (CTE)
WITH DUPLICATE_CTE AS (
    SELECT ID
    FROM ADDRESS
    GROUP BY ID
    HAVING COUNT(*) > 1
)
DELETE FROM ADDRESS
WHERE ID IN (SELECT ID FROM DUPLICATE_CTE);
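Note that both methods delete every row whose ID is duplicated, including the one copy you may want to keep. To keep exactly one row per ID and remove only the extras, a window function works; this is a minimal sketch assuming a database that supports deleting through a CTE (e.g. SQL Server — syntax varies elsewhere):

-- Keep one row per ID and delete the rest
-- (assumes CTE-based DELETE support, e.g. SQL Server)
WITH RANKED AS (
    SELECT ID,
           ROW_NUMBER() OVER (PARTITION BY ID ORDER BY ID) AS RN
    FROM ADDRESS
)
DELETE FROM RANKED
WHERE RN > 1;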
03. How many rows will the inner join query below return?
table_1
=======
id
==
1
1
1

table_2
=======
id
==
1
1
1
null
SELECT * FROM table_1
JOIN table_2
  ON table_1.id = table_2.id;

-- result: 9 rows
Each of the three 1s in table_1 matches each of the three 1s in table_2, giving 3 × 3 = 9 rows; the null in table_2 never satisfies the equality condition, so it contributes no rows.
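You can sanity-check the count with a quick PySpark sketch (assuming an active SparkSession named spark, as in the examples below):

# Rebuild the two tables and count the inner-join result
t1 = spark.createDataFrame([(1,), (1,), (1,)], ["id"])
t2 = spark.createDataFrame([(1,), (1,), (1,), (None,)], ["id"])
print(t1.join(t2, t1["id"] == t2["id"], "inner").count())  # 9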
PySpark Interview Questions
04. How do you insert a new row into a DataFrame using PySpark?
DataFrames are immutable, so you append a row by wrapping it in a single-row DataFrame and unioning it with the original. Here’s how you can do it.
from pyspark.sql import SparkSession
from pyspark.sql import Row

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("Insert Row Example") \
    .getOrCreate()

# Sample DataFrame
data = [("John", 25), ("Alice", 30)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

# New row to insert
new_row = Row("Bob", 35)

# Wrap the row in a single-row DataFrame with the same columns
new_df = spark.createDataFrame([new_row], columns)

# Append the new DataFrame to the original
df = df.union(new_df)

# Show the updated DataFrame
df.show()
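One caveat: union() matches columns by position, not by name. If the new DataFrame’s columns might be ordered differently, unionByName() (available since Spark 2.3) aligns them by name:

# Align columns by name rather than position (Spark 2.3+)
df = df.unionByName(new_df)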
05. How do you delete a specific row from a DataFrame using PySpark?
In PySpark, you can’t directly delete a specific row from a DataFrame because DataFrames are immutable. Instead, filter out the row you want to delete and keep the result as a new DataFrame. Here’s how you can do it.
from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("Delete Row Example") \
    .getOrCreate()

# Sample DataFrame
data = [("John", 25), ("Alice", 30), ("Bob", 35)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

# Define the row to delete, e.g. the row with Name="Alice" and Age=30
row_to_delete = ("Alice", 30)

# Keep every row whose Name or Age differs from the target row
filtered_df = df.filter((df["Name"] != row_to_delete[0]) | (df["Age"] != row_to_delete[1]))

# Show the updated DataFrame
filtered_df.show()
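Equivalently (by De Morgan’s law), you can negate the combined match condition, which makes the intent of dropping one exact row more explicit:

from pyspark.sql.functions import col

# Drop rows that match BOTH fields by negating the combined condition
filtered_df = df.filter(~((col("Name") == row_to_delete[0]) & (col("Age") == row_to_delete[1])))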
06. How do you replace a specific value in a DataFrame using PySpark?
In PySpark, you can replace specific values in a DataFrame using the withColumn() method along with when() from the pyspark.sql.functions module, chained with otherwise() to keep non-matching values. Here’s how you can do it.
from pyspark.sql import SparkSession
from pyspark.sql.functions import when

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("Replace Value Example") \
    .getOrCreate()

# Sample DataFrame
data = [("John", 25), ("Alice", 30), ("Bob", 35)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

# Define the value to replace and the new value
old_value = 30
new_value = 40

# Rewrite the Age column: swap in new_value where Age equals old_value,
# otherwise keep the existing value
df = df.withColumn("Age", when(df["Age"] == old_value, new_value).otherwise(df["Age"]))

# Show the updated DataFrame
df.show()
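For a simple literal-for-literal substitution, DataFrame.replace() is a shorter alternative to the when()/otherwise() pattern:

# Shorter alternative for literal replacement, limited to the Age column
df = df.replace(old_value, new_value, subset=["Age"])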