Databricks: Essential Interview Questions for Data Engineers

Here are the Data engineer interview questions asked in top companies. Knowing these in advance helps you to crack your interview.

Databricks data Engineer Interview Questions

Databricks: Essential Interview Questions for Data Engineers

Databricks Interview Questions

1. How do you read all the files in PySaprk?

In PySpark, you can use the SparkContext to access the files in a source folder. Here’s a simple example of how to find all the files present in a source folder:

from pyspark import SparkContext
# Initialize SparkContext
sc = SparkContext("local", "FileSearchApp")
# Specify the source folder
source_folder = "path_to_your_source_folder"
# Use sc.wholeTextFiles to read all files in the source folder
files_rdd = sc.wholeTextFiles(source_folder)
# Extract the file paths from the RDD
file_paths = files_rdd.keys().collect()
# Print the file paths
for path in file_paths:
    print(path)
# Stop SparkContext
sc.stop()

Replace “path_to_your_source_folder” with the actual path to your source folder. This script will print the paths of all files in the specified folder. You can adjust it according to your needs, such as filtering files( on certain criteria (or) performing further processing on files).

2. In a Table there is only one column which is “Description” it is varchar Type. How to retrieve the last row, in MySQL?

If the Description column contains only alphabetical values and you want to retrieve the last row based on this column, you can still use it for ordering. However, please note that ordering alphabetical values not give you the “last” row to the insertion order unless you have some additional mechanism to ensure that the ordering corresponds to the insertion order.

Here’s how you can use the Description column for ordering:

##Working SQL
SELECT Description
FROM (
    SELECT Description, ROW_NUMBER() OVER (ORDER BY Description DESC) AS row_num
    FROM your_table
) AS numbered_rows
WHERE row_num = 1;

This query will return the Description of the last row in the table. Moreover, it is ordered by the Description column in descending alphabetical order. However, it’s important to note that alphabetical order might not correspond to the insertion order unless the values in the Description column were inserted in the desired order.

To retrieve the last row based on the insertion order, it’s usually best to use a column like a timestamp or an auto-incrementing primary key column, as they more reliably represent the insertion order.

Recommended Books

03. Group by Vs Partition by?

In SQL, GROUP BY and PARTITION BY are both used for organizing data, but they serve different purposes:

GROUP BY:

GROUP BY is used to aggregate data based on one or more columns.
It groups rows that have the same values. Like, getting the sum, count, average, etc., of grouped data.
It’s often used with aggregate functions such as COUNT, SUM, AVG, MIN, MAX, etc.

Example:

SELECT department, COUNT(*) AS employee_count
FROM employees
GROUP BY department;

In this example, all rows with the same department value are grouped together, and then the COUNT(*) function is applied to count the number of employees in each department.

PARTITION BY:

The use of PARTITION BY is to divide the result set into partitions to which the window function is applied separately.
It’s typically used with window functions like ROW_NUMBER, RANK, DENSE_RANK, LAG, LEAD, etc.
It’s often used for analytical calculations within groups, but without collapsing the result set into a single row per group.

Example:

SELECT
    department,
    employee_name,
    salary,
    ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rank_within_department
FROM employees;

In this example, the ROW_NUMBER() function is applied to each partition defined by the department, ordering employees within each department by their salary.

In summary, GROUP BY aggregates data and collapses multiple rows into summary rows. Whereas PARTITION BY is used for analytical functions within groups without collapsing the result set.

32 Complex SQL Queries

04. Steps to create Workflow in Databricks?

Creating a workflow in Databricks typically involves defining and scheduling a series of steps that perform data processing tasks. Here’s a general outline of the steps you would follow to create a workflow in Databricks:

Set up Databricks:

Make sure you have access to a Databricks workspace and cluster.
Log in to the Databricks workspace.

Create Notebooks:

Create one or more notebooks in the Databricks workspace.
Each notebook should contain the code for a specific task or step in your workflow.

Write Code in Notebooks:

Write code in each notebook to perform the desired data processing tasks.
You can use languages supported by Databricks such as Scala, Python, SQL, or R.

Organize Notebooks:

Organize your notebooks into logical groups if needed.
You can use folders within the Databricks workspace to organize notebooks.

Define Workflow Steps:

Decide on the sequence of steps/tasks in your workflow.
Each step will correspond to running a specific notebook or set of notebooks.

Create a Notebook for Orchestration:

Create a new notebook that will serve as the orchestrator for your workflow.
In this notebook, you will define the sequence of steps and schedule the workflow.

Define Workflow Logic:

Write code in the orchestrator notebook to define the workflow logic.
This involve calling the notebooks that contain the processing tasks in the desired sequence.

Schedule Workflow Execution:

Use Databricks Jobs to schedule the execution of the workflow.
Configure the schedule to run at the desired frequency (e.g., hourly, daily, weekly).

Monitor Workflow Execution:

Monitor the execution of the workflow using the Databricks Jobs interface.
Check logs and output to ensure that each step is completed successfully.

Iterate and Improve:

Review and refine your workflow as needed based on feedback and changing requirements.
Update notebooks and job schedules accordingly.

You can set up a workflow in Databricks to automate your data tasks.

05. In Databricks, how can you access a notebook from one account to another [Databricks account]?

Accessing a notebook from another Databricks account involves sharing (or) exporting/importing the notebook. Here’s how you can do it:

Sharing Notebook:

Share Notebook with Another User:

Open the notebook you want to share.
Click on the “Share” button at the top-right corner of the notebook interface.
Enter the email address of the user from the other Databricks account.
Choose the desired permissions (e.g., Can Edit, Can Run, Can Manage).
Click “Share”.

Access Shared Notebook:

The user from the other Databricks account will receive an email notification with a link to the shared notebook.
They can click on the link to access the notebook.
Alternatively, they can go to the “Shared” tab in the Databricks workspace to view all notebooks shared with them.

Exporting and Importing Notebook:

Export Notebook:
Open the notebook you want to export.
Click on the “File” menu.
Select “Export” and choose the desired format (e.g., DBC Archive, Source Notebook).
Save the exported notebook file to your local system.

Transfer Notebook:

Share the exported notebook file with the user from the other Databricks account through email, file sharing service, etc.

The user from the other Databricks account can import the notebook into their Databricks workspace.
They can click on the “Workspace” tab in the Databricks workspace.
Click on the downward arrow next to the folder where you want to import the notebook.
Select “Import” and choose the notebook file from their local system.
Click “Import”.

You can share or transfer notebooks between Databricks accounts so users from one account can access notebooks from another.

Srini

Data Engineer with deep AI and Generative AI expertise, crafting high-performance data pipelines in PySpark, Databricks, and SQL. Skilled in Python, AWS, and Linux—building scalable, cloud-native solutions for smart applications.