Tredence is a well-known data science and analytics services company. Here are some of the interview questions asked for the Data Engineer role.

PySpark Interview Questions
01. How do we get the counts of positive, negative, and zero values from a table in SQL?
Below are SQL and PySpark queries that return the counts of positive, negative, and zero values for the table shown.
Table1
-----
NUMBER1
-------
1
2
3
-1
-2
0
0
1
-- SQL Queries
SELECT COUNT(*) FROM TABLE1 WHERE NUMBER1 < 0;
SELECT COUNT(*) FROM TABLE1 WHERE NUMBER1 > 0;
SELECT COUNT(*) FROM TABLE1 WHERE NUMBER1 = 0;
-- PySpark queries
# Assuming df is your DataFrame representing TABLE1
# Perform the counts
count_negative = df.filter(df['NUMBER1'] < 0).count()
count_positive = df.filter(df['NUMBER1'] > 0).count()
count_zero = df.filter(df['NUMBER1'] == 0).count()
# Print the counts
print("Count of negative numbers:", count_negative)
print("Count of positive numbers:", count_positive)
print("Count of zeros:", count_zero)
02. How do we remove list duplicates in Python?
# 01. Using the set built-in (does not preserve the original order)
my_list = [1, 2, 3, 1, 2, 3, 5, 6]
output = list(set(my_list))
print(output)
# 02. Using a for loop (preserves order)
my_list = [1, 2, 2, 3, 4, 4, 5]
unique_elements = []
for item in my_list:
    if item not in unique_elements:
        unique_elements.append(item)
print(unique_elements)
# 03. Using collections.OrderedDict (preserves order)
from collections import OrderedDict
my_list = [1, 2, 2, 3, 4, 4, 5]
unique_elements = list(OrderedDict.fromkeys(my_list))
# 04. Using NumPy unique (returns a sorted NumPy array, not a list)
import numpy as np
my_list = [1, 2, 2, 3, 4, 4, 5]
unique_elements = np.unique(my_list)
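On Python 3.7 and later, a plain dict also preserves insertion order, so the OrderedDict approach can be written more compactly. A minimal sketch:
# 05. Using dict.fromkeys (Python 3.7+ keeps insertion order)
my_list = [1, 2, 2, 3, 4, 4, 5]
unique_elements = list(dict.fromkeys(my_list))
print(unique_elements)  # [1, 2, 3, 4, 5]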
03. How do we join two DataFrames in PySpark?
from pyspark.sql import SparkSession
# Initialize SparkSession
spark = SparkSession.builder \
.appName("DataFrameJoinExample") \
.getOrCreate()
# Assuming df1 and df2 are your DataFrames
# Joining df1 and df2 on a common column
joined_df = df1.join(df2, df1['common_column'] == df2['common_column'], 'inner')
# Show the joined DataFrame
joined_df.show()
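If both DataFrames use the same name for the join key, passing the column name (or a list of names) instead of an equality expression keeps only one copy of that column in the result and avoids ambiguous-column errors later. A minimal sketch with the same df1 and df2:
# Join on the shared column name; the result contains a single 'common_column'
joined_df = df1.join(df2, 'common_column', 'inner')
joined_df.show()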
04. How do we get the top n rows from the joined DataFrame?
The example below uses a window function to keep the top n rows per department, ordered by salary.
from pyspark.sql.window import Window
from pyspark.sql.functions import col, row_number
# n is the number of rows to keep per department
windowDept = Window.partitionBy("department").orderBy(col("salary").desc())
df.withColumn("row", row_number().over(windowDept)) \
    .filter(col("row") <= n) \
    .drop("row") \
    .show()
05. How do you decide whether a query involves a narrow or a wide transformation?
Narrow transformations: Operations such as filter and adding a column with withColumn can be computed within a single RDD/DataFrame partition, so they do not require shuffling data across partitions. Because no data has to move between executor or worker nodes, these transformations are relatively cheap.
Wide transformations: These operations require shuffling data across partitions, which means data must be moved between executor or worker nodes. Examples in Spark include joins, repartitioning, and groupBy, as illustrated below.
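A short sketch of both cases, assuming a DataFrame df with 'department' and 'salary' columns; filter and withColumn stay within partitions, while groupBy triggers a shuffle:
from pyspark.sql.functions import col, avg
# Narrow transformations: evaluated partition by partition, no shuffle
narrow_df = df.filter(col('salary') > 50000) \
    .withColumn('bonus', col('salary') * 0.10)
# Wide transformation: groupBy needs a shuffle (data exchange across nodes)
wide_df = df.groupBy('department').agg(avg('salary').alias('avg_salary'))
# The physical plan for wide_df typically shows an Exchange (shuffle) step
wide_df.explain()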






