Here are interview questions asked at Genpact. They are particularly useful for Data Engineer role interviews.

Databricks Interview Questions
01. What is your action if a node fails during the data processing?
When a node fails during processing in a distributed computing environment like Apache Spark, you can take several steps to best address the issue and ensure that your Spark job completes successfully:
- Monitor Cluster Health: Implement monitoring and alerting mechanisms to detect node failures as soon as they occur. Utilize Spark UI, cluster manager logs (e.g., YARN, Mesos), and external monitoring tools to identify the failed node.
- Identify the Cause: Determine the root cause of the node failure. Check the cluster logs, system logs, and any other relevant sources of information to understand what went wrong. Common causes of node failures include hardware issues, resource exhaustion, or software errors.
- Replace the Failed Node: If the failed node is recoverable, replace or repair the hardware/software components causing the failure. Depending on your cluster management system, this may involve manual intervention or automated recovery mechanisms.
- Reallocate Resources: If the failed node cannot be immediately replaced, reallocate its tasks and redistribute its resources to the remaining nodes in the cluster. Many cluster managers (e.g., YARN, Mesos) support automatic task reassignment and resource reallocation to handle node failures.
- Retry Failed Tasks: Spark automatically retries failed tasks by default. If a task fails due to a node failure, Spark will retry the task on another available node. Configure the number of retries and task failure handling behavior according to your application requirements.
- Checkpointing and Fault Tolerance: Utilize Spark’s checkpointing and fault tolerance mechanisms to recover from node failures gracefully. Checkpoint RDDs/DataFrames at appropriate intervals to minimize recomputation and ensure resilience to failures.
- Scaling Out: If failures occur frequently, or if the cluster workload exceeds the capacity of the remaining nodes, consider scaling out the cluster by adding additional nodes. This increases the cluster’s fault tolerance and capacity to handle failures.
- Data Recovery: If the failed node contains critical data or stateful information, recover the data from backups or replication mechanisms. Use distributed storage systems (e.g., HDFS, S3) with replication to ensure data durability and availability in the event of node failures.
- Manual Intervention: In some cases, manual intervention may be required to resolve the issue. This could involve restarting failed services, reconfiguring the cluster, or troubleshooting software/hardware issues.
- Post-Mortem Analysis: After resolving the immediate issue, conduct a post-mortem analysis to identify the root cause of the node failure, and implement preventive measures to mitigate similar issues in the future.
By following these steps and implementing proactive measures, you can effectively handle node failures in your Spark cluster. This will ensure the resilience and reliability of your distributed data processing workflows.
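For example, task retry behavior and checkpointing mentioned above can be configured when the session is created. The following is a minimal sketch; the property values and checkpoint path are illustrative assumptions rather than recommendations.
from pyspark.sql import SparkSession

# Illustrative settings; tune the values for your own cluster and workload.
spark = SparkSession.builder \
    .appName("FaultTolerantJob") \
    .config("spark.task.maxFailures", "4") \
    .getOrCreate()

# Use a checkpoint directory on fault-tolerant storage (path is hypothetical).
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")

df = spark.range(1_000_000)
# Checkpointing truncates the lineage, so recovery does not recompute the full DAG.
df_checkpointed = df.checkpoint()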
02. Data Lake Vs Delta Lake?
Delta Lake and Data Lake are related concepts in the realm of big data storage and processing, but they serve different purposes and have distinct characteristics:
- Data Lake:
- A Data Lake is a centralized repository that allows you to store structured, semi-structured, and unstructured data at scale. It provides a storage solution for storing raw data in its native format without having to pre-define its schema.
- Data Lakes are typically implemented using distributed file systems like Hadoop Distributed File System (HDFS), cloud object storage (e.g., Amazon S3, Azure Data Lake Storage), or distributed storage systems like Apache Hudi or Apache Iceberg.
- Data Lake offers flexibility in data ingestion and supports various data processing frameworks (e.g., Apache Spark, Apache Flink) and analytics tools for querying, processing, and analyzing data.
- The data stored in a Data Lake can be used for various purposes, including data exploration, analytics, machine learning, and data sharing across different teams and applications.
- Delta Lake:
- Delta Lake is an open-source storage layer that enhances the reliability, performance, and scalability of data lakes. It adds ACID (Atomicity, Consistency, Isolation, Durability) transactions, schema enforcement, and data versioning capabilities on top of existing data lakes.
- Delta Lake is built on Apache Spark and provides an optimized storage format for both batch and streaming data processing workloads. It leverages the Parquet file format for efficient storage and supports features like partition pruning and data skipping for faster query performance.
- Delta Lake enables data engineers and data scientists to build robust data pipelines while ensuring data quality, consistency, and reliability in data lake environments.
- With Delta Lake, you can perform operations like insert, update, delete, merge, and upsert on data lakes (see the merge sketch below), making it suitable for use cases such as real-time analytics, data warehousing, and operational analytics.
In summary, a Data Lake is a storage concept that provides a scalable and cost-effective solution for storing large volumes of raw data, while Delta Lake is a technology that adds transactional capabilities and reliability features on top of existing data lakes, enabling more advanced data processing and analytics workflows such as data warehousing and machine learning.
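To illustrate the transactional capabilities described above, here is a minimal upsert (merge) sketch using the DeltaTable API. It assumes a Databricks runtime or the delta-spark package, and the path and column names are hypothetical.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DeltaMergeExample").getOrCreate()

# Incoming updates (illustrative columns)
updates_df = spark.createDataFrame([(1, "Alice"), (4, "Dan")], ["id", "name"])

# Existing Delta table at a hypothetical path
target = DeltaTable.forPath(spark, "/mnt/datalake/customers")

# ACID upsert: update rows that match on id, insert the rest
target.alias("t") \
    .merge(updates_df.alias("s"), "t.id = s.id") \
    .whenMatchedUpdateAll() \
    .whenNotMatchedInsertAll() \
    .execute()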
03. Managed Tables vs External Tables in the context of Databricks?
In Databricks, where Unity Catalog is an integral part of the platform, the concept of managed and external tables aligns closely with broader database terminology. Here’s how managed and external tables are typically handled within Unity Catalog.
- Managed Tables:
- Managed tables in Unity Catalog are akin to managed tables in other database systems. When you create a managed table in Unity Catalog, both the metadata (table schema, statistics) and the data itself are managed and stored within the platform.
- The data for managed tables is stored in a managed storage layer provided by Databricks. This storage layer is typically backed by distributed file storage like Delta Lake or Apache Parquet files.
- Unity Catalog handles lifecycle management tasks such as data storage, data cleanup, and data consistency for managed tables. Dropping a managed table typically deletes both the metadata and the associated data from the managed storage layer.
- Managed tables in Unity Catalog are helpful when you want the platform to handle data storage and management transparently, without manual intervention.
- External Tables:
- External tables in Unity Catalog, like external tables in other systems, are tables where Unity Catalog manages the metadata while the data resides externally, outside the platform’s control.
- When you create an external table in Unity Catalog, you specify the location of the data, which can be cloud storage such as AWS S3 or Azure Blob Storage.
- Unity Catalog reads and queries the data from this external location without managing the data itself.
- Dropping an external table in Unity Catalog typically only removes the metadata associated with the table. The data in the external location remains untouched.
- The use of external tables allows for flexibility in accessing and querying data stored in different locations and formats without having to import it into Unity Catalog’s managed storage.
In summary, within Unity Catalog, managed tables are tables where both metadata and data are managed by the platform, while external tables are tables where only the metadata is managed by the platform and the data resides externally. The choice between managed and external tables depends on factors such as data storage location, governance requirements, and data lifecycle management preferences.
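For illustration, here is a minimal sketch of creating a managed and an external table from a Databricks notebook (where the spark session is predefined). The catalog, schema, table names, and storage path are hypothetical, and the external location is assumed to be already configured in Unity Catalog.
# Managed table: Unity Catalog stores both the metadata and the data.
spark.sql("""
CREATE TABLE IF NOT EXISTS main.sales.managed_orders (
  order_id INT,
  amount DOUBLE
)
""")

# External table: Unity Catalog stores only the metadata; the data stays at the given path.
spark.sql("""
CREATE TABLE IF NOT EXISTS main.sales.external_orders (
  order_id INT,
  amount DOUBLE
)
LOCATION 's3://my-bucket/landing/orders/'
""")

# Dropping the external table removes only the metadata; the files at LOCATION remain.
spark.sql("DROP TABLE main.sales.external_orders")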
04. Can we merge two data frames in PySpark without using the JOIN?
Yes, you can use either union() or unionByName(). The union() method merges rows vertically based on column position (both DataFrames should have the same schema and column order), while unionByName(allowMissingColumns=True) allows DataFrames with different schemas to be combined, with missing columns filled with null.
union() example
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder \
.appName("ConcatenateDataFrames") \
.getOrCreate()
# Create two example DataFrames
df1 = spark.createDataFrame([(1, "Alice"), (2, "Bob"), (3, "Charlie")], ["id", "name"])
df2 = spark.createDataFrame([(4, "David"), (5, "Eve"), (6, "Frank")], ["id", "name"])
# Concatenate DataFrames using union
concatenated_df = df1.union(df2)
# Show the concatenated DataFrame
concatenated_df.show()
# Stop the SparkSession
spark.stop()
+---+-------+
| id| name|
+---+-------+
| 1| Alice|
| 2| Bob|
| 3|Charlie|
| 4| David|
| 5| Eve|
| 6| Frank|
+---+-------+
unionByName() example
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
# Create a SparkSession
spark = SparkSession.builder \
.appName("UnionByName Example") \
.getOrCreate()
# DataFrames with different schemas
df1 = spark.createDataFrame([(1, 'A'), (2, 'B')], ["id", "value"])
df2 = spark.createDataFrame([(3, 'C', 100), (4, 'D', 200)], ["id", "value", "extra"])
# Perform unionByName
df_union_by_name = df1.unionByName(df2, allowMissingColumns=True)
df_union_by_name.show()
Output
+---+-----+-----+
| id|value|extra|
+---+-----+-----+
| 1| A| null|
| 2| B| null|
| 3| C| 100|
| 4| D| 200|
+---+-----+-----+
05. What is DataSkewness?
Data skew is a condition in which a table’s data is unevenly distributed among partitions in the cluster. Data skew can severely degrade the performance of queries, especially those with joins: joins between big tables require shuffling data, and skew can lead to an extreme imbalance of work across the cluster.
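One common mitigation, assuming Spark 3.x where Adaptive Query Execution (AQE) is available, is to let AQE detect and split skewed partitions at join time. The threshold values below are illustrative, not tuned recommendations.
# Enable Adaptive Query Execution and its skew-join handling (Spark 3.x)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# Treat a partition as skewed when it is several times larger than the median
# and above a size threshold (illustrative values)
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")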
06. Can we retrieve the Data after truncating the external Table?
Yes. The truncate command only deletes the data in the external table, leaving the underlying data sources intact. So even after truncating external tables, we can still access the data by reloading the external tables.
07. What is Unity Catalog in Databricks?
Unity Catalog provides centralized access control, auditing, lineage, and data discovery capabilities across Databricks workspaces.
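As a small illustration of the centralized access control, permissions can be granted with SQL from a notebook. The catalog, schema, table, and group names below are hypothetical.
# Grant read access on a table to a hypothetical group
spark.sql("GRANT SELECT ON TABLE main.sales.orders TO `data_analysts`")
# Allow the group to use the schema that contains the table
spark.sql("GRANT USE SCHEMA ON SCHEMA main.sales TO `data_analysts`")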
08. SparkSession Vs SparkContext?
- SparkContext is the entry point for low-level Spark APIs and RDD operations. It’s more focused on distributed computing and resource management.
- SparkSession is the entry point for higher-level Spark APIs like DataFrame and SQL operations. It’s designed for working with structured data and provides a more convenient and unified interface.
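A minimal sketch of the relationship between the two entry points (in modern PySpark, the SparkContext is accessed through the SparkSession rather than created separately):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("EntryPoints").getOrCreate()

# Higher-level structured API through SparkSession
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()

# Lower-level RDD API through the underlying SparkContext
sc = spark.sparkContext
rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.map(lambda x: x * 2).collect())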
09. What are the various performance tuning techniques used in Databricks?
Tuning performance in PySpark involves optimizing various aspects of your Spark application to make it run faster and more efficiently. Here are several techniques you can use to improve the performance of your PySpark jobs:
- Partitioning: Properly partitioning your data can significantly improve performance by distributing the workload across executors. Use repartition() or coalesce(), and adjust the number of partitions according to the size of your data and the available resources.
- Data Serialization: Choose an appropriate serialization format based on your data characteristics and processing requirements. The default serialization format in PySpark is Java serialization (org.apache.spark.serializer.JavaSerializer), but alternatives like Kryo (org.apache.spark.serializer.KryoSerializer) can offer better performance, especially for large-scale data processing.
- Broadcast Variables: Use broadcast variables for small lookup tables or datasets that are used in join operations. Broadcasting these variables avoids unnecessary data shuffling and reduces network I/O.
- Caching and Persistence: Cache intermediate RDD/DataFrame results in memory using cache() or persist() to avoid recomputation when they are used multiple times in your workflow. However, be mindful of the available memory and the size of the cached data.
- Data Locality: Minimize data shuffling by ensuring that data processing tasks are executed on nodes where the data resides. This can be achieved by repartitioning data on key columns or by using partitioning strategies that align with your processing logic.
- Optimized Transformations: Use optimized DataFrame and RDD transformations whenever possible. For example, prefer select() over map() for column projections, use filter() to push filters down early in the execution plan, and leverage built-in functions (pyspark.sql.functions) for common data manipulation tasks.
- Aggregate Pushdown: Push aggregation operations down to the data source whenever applicable. For instance, use built-in aggregation functions in SQL queries or DataFrame operations to leverage the underlying capabilities of the data source (e.g., Apache Parquet files).
- Resource Management: Configure Spark resource allocation parameters such as executor memory, executor cores, and driver memory appropriately based on the size of your data and the available cluster resources. Monitor resource utilization using the Spark UI or monitoring tools to identify bottlenecks and adjust configurations accordingly.
- Parallelism Control: Adjust parallelism settings such as the number of partitions, shuffle partitions, and task concurrency to optimize resource utilization and prevent resource contention. Experiment with different settings to find the optimal configuration for your workload.
- Monitoring and Profiling: Monitor job execution metrics such as task duration, data skew, and resource utilization using the Spark UI or monitoring tools, and profile your Spark application to identify performance hotspots and bottlenecks.
By applying these performance-tuning techniques and continuously optimizing your PySpark applications, you can achieve better performance and efficiency in your data processing workflows.
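Here is a minimal sketch combining a few of these techniques (Kryo serialization, key-based repartitioning, and caching); the configuration values are illustrative assumptions, not tuned recommendations.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder \
    .appName("TuningExample") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.sql.shuffle.partitions", "200") \
    .getOrCreate()

df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)

# Repartition on the key used downstream to limit later shuffling
df = df.repartition(10, "bucket")

# Cache a result that is reused multiple times
agg = df.groupBy("bucket").count().cache()
agg.show()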
10. What’s Broadcast Join in Pyspark?
Broadcast join is a join optimization technique used in distributed data processing frameworks like PySpark to improve the performance of JOIN operations. It is particularly effective when one of the DataFrames in the join is small enough to fit entirely in memory on each executor node.
In a broadcast join, the smaller DataFrame (the “broadcasted” DataFrame) is sent to all the worker nodes in the cluster, allowing each worker to perform the join locally without shuffling or redistributing data across the network. This can significantly reduce data movement and network traffic, leading to faster join performance.
Here’s how a broadcast join typically works in PySpark:
- Identify the smaller DataFrame: The DataFrame to broadcast is chosen based on its size relative to the available memory.
- Broadcast the smaller DataFrame: The smaller DataFrame is broadcast to all the worker nodes in the cluster. PySpark can also determine automatically whether a DataFrame should be broadcast based on its size and the available memory.
- Perform the join locally: Each worker node performs the join locally using the broadcasted DataFrame, which is available in memory, eliminating the need for data shuffling or network communication.
- Finalize the join: The results of the join operation from each worker node are aggregated to produce the final result.
Broadcast joins are particularly effective when one of the DataFrames is significantly smaller than the other and can fit entirely in memory on each executor node. By avoiding data shuffling and network communication, broadcast joins can lead to substantial performance improvements, especially for join operations involving small lookup tables or datasets.
from pyspark.sql.functions import broadcast
# Perform a broadcast join
joined_df = df1.join(broadcast(df2), "key")
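Spark can also broadcast automatically when a table is smaller than spark.sql.autoBroadcastJoinThreshold (10 MB by default); the value below is only an illustration.
# Raise the automatic broadcast threshold to 50 MB (illustrative value)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)
# Setting it to -1 disables automatic broadcast joins entirely
# spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)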
Databricks Magic Commands
Here are examples of how each of the listed magic commands can be used in a Databricks notebook:
%run: Runs another notebook inline, making its functions and variables available in the current notebook.
%run ./my_other_notebook
%sh: Executes shell commands on the cluster nodes.
%sh ls -l
%fs: Allows you to interact with the Databricks file system.
%fs ls /path/to/directory
%sql: Allows you to run SQL queries.
%sql
SELECT * FROM table_name
%scala: Switches the notebook context to Scala.
%scala
println("Hello, Databricks!")
%python: Switches the notebook context to Python.
%python
print("Hello, Databricks!")
%md: Allows you to write Markdown text.
%md
This is a Markdown cell
%r: Switches the notebook context to R.
%r
print("Hello, Databricks!")
%lsmagic: Lists all the available magic commands.
%lsmagic
%jobs: Lists all the running jobs.
%jobs
%config: Allows you to set configuration options for the notebook.
%config max_rows = 100
%reload: Reloads the contents of a module.
%reload my_module
%pip: Allows you to install Python packages.
%pip install pandas
%load: Loads the contents of a file into a cell.
%load my_script.py
%matplotlib: Sets up the matplotlib backend.
%matplotlib inline
%who: Lists all the variables in the current scope.
%who
%env: Allows you to set environment variables.
%env MY_VARIABLE=value
These examples demonstrate how each magic command can be used in a Databricks notebook to perform various tasks and operations.






