Here are interview questions on SQL, Python, PySpark, and Databricks that came up in a recent interview, each explained with a worked solution.

PySpark Interview Questions

SQL

01. Write an SQL query to select the non-matching rows from the left-side table. Use only joins.

Tab1
===
id
==
1
2
3
4
5

Tab2
===
id
===
1
10
3
14
15

The output will be
================
2
4
5

SQL Query
==========

-- The left join keeps every row of tab1; rows with no match in tab2 have NULL on the tab2 side
select t1.id from tab1 t1
left join tab2 t2
on t1.id = t2.id
where t2.id is null;
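
As a quick sanity check, here is a minimal PySpark sketch (the app name is illustrative) that registers the two sample tables as temporary views and runs the same left-join query:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("AntiJoinCheck").getOrCreate()

# Build the two sample tables as temporary views
spark.createDataFrame([(i,) for i in [1, 2, 3, 4, 5]], ["id"]).createOrReplaceTempView("tab1")
spark.createDataFrame([(i,) for i in [1, 10, 3, 14, 15]], ["id"]).createOrReplaceTempView("tab2")

# Left join + IS NULL keeps only the tab1 rows without a match in tab2
spark.sql("""
    SELECT t1.id
    FROM tab1 t1
    LEFT JOIN tab2 t2 ON t1.id = t2.id
    WHERE t2.id IS NULL
""").show()   # expected output: 2, 4, 5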

02. Write an SQL query to capitalize the first letter of the name. Ensure the remaining portion of the name is in lowercase.

Table1
------
ename
=====
raJu
venKat
kRIshna

Solution:
==========
SELECT CONCAT(UPPER(SUBSTRING(ename, 1, 1)), LOWER(SUBSTRING(ename, 2))) AS capitalized_name
FROM Table1;
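
If the same requirement comes up in Spark, a minimal PySpark sketch can use the built-in initcap function, which upper-cases the first letter of each word and lower-cases the rest (the app name below is illustrative):

from pyspark.sql import SparkSession
from pyspark.sql.functions import initcap

spark = SparkSession.builder.appName("CapitalizeNames").getOrCreate()

df = spark.createDataFrame([("raJu",), ("venKat",), ("kRIshna",)], ["ename"])

# initcap: first letter of each word upper-cased, remaining letters lower-cased
df.select(initcap("ename").alias("capitalized_name")).show()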

Python

03. Write Python code to extract the character(s) common to all items in a list.

l1=["abc", "aca", "bbc", "cca"]
Here, "c" is the common char in all the elements. So we should get "c" in the output.

Solution
=======
l1 = ["abc", "aca", "bbc", "cca"]

# convert each string in the list to a set of its characters
mysets = [set(i) for i in l1]
# unpack the sets ("*") and intersect them to find the characters common to all
common_char = set.intersection(*mysets)
print(common_char)

Output
====
{'c'}


04. Using Pandas, write Python code to fill null values in specific columns.

import pandas as pd

# Sample DataFrame
data = {'ID': [1, 2, None, 4],
        'Name': ['Alice', 'Bob', None, 'David'],
        'Age': [25, None, 30, 35]}
df = pd.DataFrame(data)

# Specify columns to fill null values for
columns_to_fill = ['Name', 'Age']

# Fill null values only in the specified columns
fill_values = {'Name': 'Unknown', 'Age': 0} # Specify fill values for each column
df[columns_to_fill] = df[columns_to_fill].fillna(value=fill_values)

print(df)
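
The same result can be sketched more compactly: passing a dictionary straight to DataFrame.fillna fills only the columns named in the dictionary and leaves the rest untouched.

import pandas as pd

df = pd.DataFrame({'ID': [1, 2, None, 4],
                   'Name': ['Alice', 'Bob', None, 'David'],
                   'Age': [25, None, 30, 35]})

# A dict passed to fillna fills only the listed columns ('ID' keeps its NaN)
print(df.fillna({'Name': 'Unknown', 'Age': 0}))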

PySpark

05. How do you check whether a given column value is a valid timestamp or not?

You can do it in two ways.

Method#1
==============

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, to_timestamp

spark = SparkSession.builder.appName("Test").getOrCreate()

data = [("Ravi", 30, "2000-10-01 10:20:10"), ("Vasu", 20, "2011-11-01 05:10:00")]
cols = ["Name", "Age", "Mytimestamp"]

df = spark.createDataFrame(data, cols)

# Keep the value when to_timestamp can parse it; otherwise set it to NULL
df = df.withColumn("Newcol", when(to_timestamp(col("Mytimestamp")).isNotNull(), col("Mytimestamp")).otherwise(None))

df.show()

Method#2
=============

import pandas as pd

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType

# Create a SparkSession
spark = SparkSession.builder.appName("Test").getOrCreate()

# Sample data
data = [("Ravi", 30, "2000-10-01 10:20:10"), ("Vasu", 20, "2011-11-01 05:10:00")]
cols = ["Name", "Age", "Mytimestamp"]

# Create a DataFrame
df = spark.createDataFrame(data, cols)

# Define a UDF to check if the timestamp is valid
def is_valid_timestamp(timestamp):
    try:
        # Try converting the value to a timestamp with pandas
        pd.to_datetime(timestamp)
        return True
    except ValueError:
        # If a ValueError is raised, the value is not a valid timestamp
        return False


# Register the UDF
is_valid_timestamp_udf = udf(is_valid_timestamp, BooleanType())

# Apply the UDF to the DataFrame
df_with_valid_timestamps = df.withColumn("IsValidTimestamp", is_valid_timestamp_udf(col("Mytimestamp")))

# Filter out rows with invalid timestamps
df_valid_timestamps = df_with_valid_timestamps.filter(col("IsValidTimestamp"))

# Show the DataFrame with valid timestamps
df_valid_timestamps.show()

06. How can I find which columns of a DataFrame are of TimestampType using PySpark?

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_timestamp
from pyspark.sql.types import TimestampType

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("LocateTimestampColumn") \
    .getOrCreate()

# Sample DataFrame
data = [("2022-01-01 12:00:00", 1), ("2022-01-02 12:00:00", 2)]
df = spark.createDataFrame(data, ["timestamp_column", "other_column"])

# Convert the string column into an actual TimestampType column
df = df.select(
    to_timestamp(col("timestamp_column")).alias("ts_column"),
    col("other_column")
)
# Get column names with TimestampType
timestamp_columns = [col_name for col_name, col_type in df.dtypes if col_type == "timestamp"]

print("Columns with TimestampType:", timestamp_columns)

Output
=====
Columns with TimestampType: ['ts_column']
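
An alternative sketch walks the DataFrame schema instead of df.dtypes, which is where the TimestampType import above comes in (this continues from the df built in the snippet above):

from pyspark.sql.types import TimestampType

# Keep the names of fields whose data type is TimestampType
timestamp_columns = [f.name for f in df.schema.fields if isinstance(f.dataType, TimestampType)]
print("Columns with TimestampType:", timestamp_columns)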

Databricks

07. What is a DAG?

In PySpark, DAG stands for Directed Acyclic Graph: a directed graph that captures the entire computation flow of a Spark job. Its nodes represent the RDDs (Resilient Distributed Datasets) and its edges represent the operations applied to them.

The DAG lets Spark optimize execution by rearranging and parallelizing operations before any work is done, which makes the computation more efficient and scalable.
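
As a minimal illustration (the app name and expressions below are made up for the example), transformations only add nodes and edges to the DAG; nothing executes until an action runs, and explain() prints the plan Spark derived from that DAG:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("DagDemo").getOrCreate()

df = spark.range(1_000_000)                                           # lazy: a node in the DAG
filtered = df.filter(col("id") % 2 == 0)                              # lazy transformation
grouped = filtered.groupBy((col("id") % 10).alias("bucket")).count()  # still lazy

grouped.explain()       # shows the physical plan built from the DAG
print(grouped.count())  # action: triggers the actual computation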

08. How do you use a Bloom filter in Databricks?

A Bloom filter is a probabilistic data structure that tests whether an element is a member of a set: it can report false positives but never false negatives. In Databricks, you can build one from a DataFrame column through Spark's DataFrameStatFunctions (df.stat.bloomFilter). This method originates in Spark's Scala/Java DataFrame API, so check that the Spark/Databricks Runtime version you are on exposes it from PySpark before relying on the snippet below.

Here's an example of how you can use a Bloom filter in Databricks:

##Import necessary libraries
from pyspark.sql import SparkSession

##Create a SparkSession
spark = SparkSession.builder \
    .appName("Bloom Filter Example") \
    .getOrCreate()

##Sample data
data = [("Alice",), ("Bob",), ("Charlie",), ("David",), ("Emily",)]

##Create a DataFrame
df = spark.createDataFrame(data, ["Name"])

##Create a Bloom Filter with false positive rate 0.1 and capacity 1000
bloom_filter = df.stat.bloomFilter("Name", 1000, 0.1)

##Test membership
print("Alice:", bloom_filter.mightContain("Alice"))
print("John:", bloom_filter.mightContain("John"))

##Stop the SparkSession
spark.stop()

In this example:

  • We import the necessary libraries, including SparkSession.
  • We create a SparkSession.
  • We create a DataFrame df with sample data containing names.
  • We use the stat.bloomFilter function to create a Bloom filter on the “Name” column of the DataFrame, specifying an expected number of items of 1000 and a false positive rate of 0.1.
  • We test membership with the mightContain method of the Bloom filter: it returns True if the element is possibly in the set (with a small chance of a false positive) and False if the element is definitely not in the set.
  • Finally, we stop the SparkSession.

Note: The Bloom filter in Spark is part of the DataFrame statistics functions, so it is built from a DataFrame column rather than from individual elements; once built, you query it with mightContain. Because it is an approximate data structure, it may produce false positives but never false negatives.