Here are the interview questions on Python, SQL, PySpark, and Databricks asked in a recent interview. These are explained with resolutions.
Table of contents
Interview Questions
SQL
01. Write an SQL query to select the non-matching rows of the left-side table. Use only JOINs.
Tab1
====
id: 1, 2, 3, 4, 5

Tab2
====
id: 1, 10, 3, 14, 15

The output will be
==================
2, 4, 5

SQL Query
=========
SELECT t1.id
FROM tab1 t1
LEFT JOIN tab2 t2
  ON t1.id = t2.id
WHERE t2.id IS NULL;
02. Write an SQL query to capitalize the first letter of the name. Ensure the remaining portion of the name is in lowercase.
Table1
------
ename
=====
raJu
venKat
kRIshna

Solution:
=========
SELECT CONCAT(UPPER(SUBSTRING(ename, 1, 1)),
              LOWER(SUBSTRING(ename, 2))) AS capitalized_name
FROM Table1;
Python
03. Write Python code to extract the common character in the list of items.
l1=["abc", "aca", "bbc", "cca"]Here, "c" is the common char in all the elements. So we should get "c" in the output.Solution=======l1=["abc", "aca", "bbc", "cca"]#convert list to setsmysets = [set(i) for i in l1]# use intersection method to find common char of all ("*") setscommon_char=set.intersection(*mysets)print(common_char)Output===={'c'}** Process exited - Return Code: 0 **Press Enter to exit terminal
04. Write Python code using Pandas to fill null values.
import pandas as pd

# Sample DataFrame
data = {'ID': [1, 2, None, 4],
        'Name': ['Alice', 'Bob', None, 'David'],
        'Age': [25, None, 30, 35]}
df = pd.DataFrame(data)

# Specify the columns in which to fill null values
columns_to_fill = ['Name', 'Age']

# Fill null values only in the specified columns
fill_values = {'Name': 'Unknown', 'Age': 0}  # Specify fill values for each column
df[columns_to_fill] = df[columns_to_fill].fillna(value=fill_values)

print(df)
PySpark
05. How do you check whether a given column value is a valid timestamp or not?
You can do it in two ways.

Method #1
=========
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, to_timestamp

spark = SparkSession.builder.appName("Test").getOrCreate()

data = [("Ravi", 30, "2000-10-01 10:20:10"), ("Vasu", 20, "2011-11-01 05:10:00")]
cols = ["Name", "Age", "Mytimestamp"]
df = spark.createDataFrame(data, cols)
# df.show()

# Keep the value only when it parses as a timestamp; otherwise return null
df = df.withColumn("Newcol",
                   when(to_timestamp(col("Mytimestamp")).isNotNull(), col("Mytimestamp"))
                   .otherwise(None))
df.show()

Method #2
=========
import pandas as pd

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType

# Create a SparkSession
spark = SparkSession.builder.appName("Test").getOrCreate()

# Sample data
data = [("Ravi", 30, "2000-10-01 10:20:10"), ("Vasu", 20, "2011-11-01 05:10:00")]
cols = ["Name", "Age", "Mytimestamp"]

# Create a DataFrame
df = spark.createDataFrame(data, cols)

# Define a UDF to check if the timestamp is valid
def is_valid_timestamp(timestamp):
    try:
        # Try converting the value to a valid timestamp
        pd.to_datetime(timestamp)
        return True
    except ValueError:
        # If ValueError is raised, the timestamp is not valid
        return False

# Register the UDF
is_valid_timestamp_udf = udf(is_valid_timestamp, BooleanType())

# Apply the UDF to the DataFrame
df_with_valid_timestamps = df.withColumn("IsValidTimestamp", is_valid_timestamp_udf(col("Mytimestamp")))

# Filter out rows with invalid timestamps
df_valid_timestamps = df_with_valid_timestamps.filter(col("IsValidTimestamp"))

# Show the DataFrame with valid timestamps
df_valid_timestamps.show()
06. How can I locate the columns of TimestampType in a DataFrame using PySpark?
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_timestamp
from pyspark.sql.types import TimestampType

# Initialize SparkSession
spark = SparkSession.builder \
    .appName("LocateTimestampColumn") \
    .getOrCreate()

# Sample DataFrame
data = [("2022-01-01 12:00:00", 1), ("2022-01-02 12:00:00", 2)]
df = spark.createDataFrame(data, ["timestamp_column", "other_column"])

# Cast the string column to an actual timestamp column
df = df.select(
    to_timestamp(col("timestamp_column")).alias("ts_column"),
    col("other_column")
)

# Get column names with TimestampType
timestamp_columns = [col_name for col_name, col_type in df.dtypes if col_type == "timestamp"]
print("Columns with TimestampType:", timestamp_columns)

Output
======
Columns with TimestampType: ['ts_column']
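An alternative (a minimal sketch that reuses the df built above) is to inspect df.schema directly and compare each field's dataType against TimestampType, instead of matching the type-name string returned by df.dtypes:

from pyspark.sql.types import TimestampType

# Collect the names of all fields whose data type is TimestampType
timestamp_columns = [field.name for field in df.schema.fields
                     if isinstance(field.dataType, TimestampType)]
print("Columns with TimestampType:", timestamp_columns)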
Databricks
07. What is DAG?
In PySpark, DAG stands for Directed Acyclic Graph. It is a directed graph of the entire computation flow of a Spark job: the graph's nodes represent the RDDs (Resilient Distributed Datasets), and the edges represent the operations to be applied to those RDDs.
The DAG helps Spark optimize the execution of operations by rearranging and parallelizing them. This optimization and parallelization make the computation more efficient and scalable.
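As a quick illustration (a minimal sketch with an assumed app name, not tied to any particular cluster), you can see the plan Spark builds from the DAG before anything runs: toDebugString() prints the RDD lineage, and explain() prints the plan Spark derives for a DataFrame. Both are available lazily, before an action triggers execution.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("DagDemo").getOrCreate()

# Transformations only build up the DAG; nothing executes until an action is called
rdd = spark.sparkContext.parallelize(range(10)).map(lambda x: x * 2).filter(lambda x: x > 5)
print(rdd.toDebugString().decode("utf-8"))  # the RDD lineage (the DAG as text)

# The same idea for DataFrames: explain() shows the plan Spark will execute
df = spark.range(10).withColumn("doubled", col("id") * 2).filter(col("doubled") > 5)
df.explain()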
08. What is a Bloom Filter in Databricks?
A Bloom Filter is a probabilistic data structure that tests whether an element is a member of a set. In Databricks, you can create one with the bloomFilter method exposed through Spark's DataFrame statistics API (df.stat).
Here's an example of how you can use a Bloom Filter in Databricks:
# Import necessary libraries
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("Bloom Filter Example") \
    .getOrCreate()

# Sample data
data = [("Alice",), ("Bob",), ("Charlie",), ("David",), ("Emily",)]

# Create a DataFrame
df = spark.createDataFrame(data, ["Name"])

# Create a Bloom Filter on the "Name" column with capacity 1000 and false positive rate 0.1
bloom_filter = df.stat.bloomFilter("Name", 1000, 0.1)

# Test membership
print("Alice:", bloom_filter.mightContain("Alice"))
print("John:", bloom_filter.mightContain("John"))

# Stop the SparkSession
spark.stop()
In this example:
- We import the necessary libraries, including SparkSession.
- We create a SparkSession.
- We create a DataFrame df with sample data containing names.
- We use the stat.bloomFilter function to create a Bloom Filter on the “Name” column of the DataFrame. We specify a capacity of 1000 (maximum expected number of elements) and a false positive rate of 0.1.
- We test membership using the mightContain method of the Bloom Filter. It returns True if the element might be in the set (with a small probability of false positives) and False if it is definitely not in the set.
- Finally, we stop the SparkSession.
- Note: The Bloom Filter in Spark is part of the DataFrame statistics API, so it is built from a DataFrame column; it does not directly provide operations on individual elements. Also, a Bloom Filter is an approximate data structure, meaning it can produce false positives but never false negatives.






