We’ve compiled a list of SQL and PySpark interview questions asked at Perficient.


Table of contents

  1. SQL Interview Questions
    01. Write a SQL JOIN query.
    02. Write a DELETE query to remove duplicates.
    03. How many rows will the inner join query return?
  2. PySpark Interview Questions
    04. How to insert a new row into a DataFrame using PySpark?
    05. How to delete a specific row from a DataFrame using PySpark?
    06. How to replace a specific value in a DataFrame using PySpark?

SQL Interview Questions

01. Write a SQL JOIN query.

SELECT * FROM EMP
JOIN ADDRESS
ON EMP.ID = ADDRESS.ID;
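
For reference, JOIN by itself performs an INNER JOIN. The same query in PySpark’s DataFrame API would look like the sketch below; the sample data here is hypothetical, standing in for EMP and ADDRESS.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Join Example").getOrCreate()

# Hypothetical sample tables standing in for EMP and ADDRESS
emp_df = spark.createDataFrame([(1, "John"), (2, "Alice")], ["ID", "Name"])
address_df = spark.createDataFrame([(1, "NYC"), (2, "LA")], ["ID", "City"])

# Equivalent of: SELECT * FROM EMP JOIN ADDRESS ON EMP.ID = ADDRESS.ID
joined_df = emp_df.join(address_df, on="ID", how="inner")
joined_df.show()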

02. Write a DELETE query to remove duplicates.

Two common approaches are a DELETE with a subquery in the WHERE clause and a DELETE driven by a common table expression (CTE). Note that the subquery form removes every copy of a duplicated ID, including the original; to keep one copy per ID, rank the rows with ROW_NUMBER() and delete the rest, as in the second method.

-- Method 1: remove every row whose ID is duplicated
-- (deletes the original copy as well as the extras)
DELETE FROM ADDRESS
WHERE ID IN (SELECT ID
             FROM ADDRESS
             GROUP BY ID
             HAVING COUNT(*) > 1);

-- Method 2: keep one copy of each ID and delete the rest
-- (SQL Server syntax, which allows DELETE through a CTE)
WITH DUPLICATE_CTE AS (
    SELECT ID,
           ROW_NUMBER() OVER (PARTITION BY ID ORDER BY ID) AS RN
    FROM ADDRESS
)
DELETE FROM DUPLICATE_CTE
WHERE RN > 1;
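
The PySpark counterpart is a one-liner: dropDuplicates() keeps the first row seen for each key and drops the other copies. A minimal sketch with hypothetical sample data:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Dedup Example").getOrCreate()

# Hypothetical ADDRESS data with a duplicated ID
address_df = spark.createDataFrame([(1,), (1,), (2,)], ["ID"])

# Keep the first row seen for each ID and drop the other copies
deduped_df = address_df.dropDuplicates(["ID"])
deduped_df.show()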

03. How many rows will the inner join query return?

table_1
========
id
===
1
1
1

table_2
======
id
===
1
1
1
null

select * from table_1
join table_2
on table_1.id = table_2.id;
-- result: 9 rows

Each of the three rows in table_1 matches each of the three id = 1 rows in table_2, so the inner join returns 3 × 3 = 9 rows. The null row contributes nothing: in SQL, null is not equal to any value, including another null, so it never satisfies the join condition.
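
You can sanity-check this in PySpark. A minimal sketch; the table and column names mirror the example above, and None stands in for SQL null:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Join Count Check").getOrCreate()

# Recreate the two tables; None plays the role of SQL null
table_1 = spark.createDataFrame([(1,), (1,), (1,)], "id INT")
table_2 = spark.createDataFrame([(1,), (1,), (1,), (None,)], "id INT")

# The null row never matches, so the inner join yields 3 x 3 = 9 rows
print(table_1.join(table_2, table_1["id"] == table_2["id"]).count())  # 9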

PySpark Interview Questions

04. How to insert a new row into a DataFrame using PySpark?

Since DataFrames are immutable, “inserting” a row means building a one-row DataFrame and combining it with the original using union(). Here’s how you can do it.

from pyspark.sql import SparkSession
from pyspark.sql import Row
# Initialize SparkSession
spark = SparkSession.builder \
    .appName("Insert Row Example") \
    .getOrCreate()

# Sample DataFrame
data = [("John", 25), ("Alice", 30)]
columns = ["Name", "Age"]

df = spark.createDataFrame(data, columns)

# New row to insert
new_row = Row("Bob", 35)

# Convert row to DataFrame
new_df = spark.createDataFrame([new_row], columns)

# Concatenate the original DataFrame with the new DataFrame
df = df.union(new_df)

# Show the updated DataFrame
df.show()
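
One caveat: union() matches columns by position, not by name. If the two DataFrames might list their columns in a different order, unionByName() (available since Spark 2.3) is the safer choice:

# Safer append when column order may differ: match columns by name
df = df.unionByName(new_df)
df.show()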

05. How to delete a specific row from a DataFrame using PySpark?

In PySpark, you can’t directly delete a specific row from a DataFrame because DataFrames are immutable. However, you can filter out the specific row that you want to delete and create a new DataFrame without that row. Here’s how you can do it.

from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("Delete Row Example") \
    .getOrCreate()

# Sample DataFrame
data = [("John", 25), ("Alice", 30), ("Bob", 35)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

# Define the row to delete
row_to_delete = ("Alice", 30) # Example: the row with Name="Alice" and Age=30

# Filter out the row to delete
filtered_df = df.filter((df["Name"] != row_to_delete[0]) | (df["Age"] != row_to_delete[1]))

# Show the updated DataFrame
filtered_df.show()
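
Equivalently, by De Morgan’s law, you can negate the exact match with ~, which reads more directly as “drop this one row”:

from pyspark.sql.functions import col

# Keep every row except the exact (Name, Age) match
filtered_df = df.filter(~((col("Name") == row_to_delete[0]) & (col("Age") == row_to_delete[1])))
filtered_df.show()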

06. How to replace a specific value in a DataFrame using PySpark?

In PySpark, you can replace specific values in a DataFrame using the withColumn() method along with the when() function from the pyspark.sql.functions module. Here’s how you can do it.

from pyspark.sql import SparkSession
from pyspark.sql.functions import when

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("Replace Value Example") \
    .getOrCreate()

# Sample DataFrame
data = [("John", 25), ("Alice", 30), ("Bob", 35)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)

# Define the value to replace and the new value
old_value = 30
new_value = 40

# Replace the specific value in the DataFrame
df = df.withColumn("Age", when(df["Age"] == old_value, new_value).otherwise(df["Age"]))

# Show the updated DataFrame
df.show()
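
For a simple literal substitution like this, DataFrame.replace() gives the same result without a conditional expression; a minimal equivalent of the example above:

# Replace 30 with 40, only in the Age column
df = df.replace(old_value, new_value, subset=["Age"])
df.show()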