Here is a list of challenging interview questions related to Azure Databricks that were asked in the Mphasis interview.


Tricky Azure Databricks Interview Questions

01. What is a cluster in Databricks?

A cluster is a set of computation resources (virtual machines) used to process data and run tasks. Clusters execute computations in a distributed, parallelized manner, which makes them suitable for big data processing and analytics.

02. What is Runtime in Databricks?

  • Runtime is a versioned, pre-configured computing environment. It includes the Apache Spark version along with the components and libraries needed for running distributed data processing and analytics workloads.
  • Each runtime is linked with a specific version of Apache Spark and includes tools, packages, and dependencies.

03. How do we create a DataFrame from a CSV file in PySpark?

df = spark.read.csv(file_path, header=True, inferSchema=True)  # see footnote 1 on inferSchema

04. What is header=True?

header=True: Specifies that the first row of the CSV file should be used as the header. The DataFrame is created with column names derived from the values in that header row.

05. What is header=False?

header=False (default): Specifies that there is no header row in the CSV file. The DataFrame is created with automatically generated column names (usually “_c0”, “_c1”, etc.).
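
A minimal sketch contrasting the two options (the file path and its contents are hypothetical; spark is the notebook's built-in session):

# Hypothetical CSV file whose first line is "name,marks"
path = "/tmp/students.csv"

df_with_header = spark.read.csv(path, header=True)   # columns: name, marks
df_no_header = spark.read.csv(path, header=False)    # columns: _c0, _c1

df_with_header.printSchema()
df_no_header.printSchema()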

06. How do we count a “Marks” column in Databricks PySpark?


data=(["N1", 20], ["N2", 30], ["N3", 40])
cols=("Sub", "Marks")

df=spark.createDataFrame(data, cols)
df.show()

df=df.groupBy("Sub").agg(count("Marks").alias("Countofmarks"))
display(df)

Output of df.show():

+---+-----+
|Sub|Marks|
+---+-----+
| N1|   20|
| N2|   30|
| N3|   40|
+---+-----+
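
The display(df) call then shows the aggregated result, one count per subject (each subject has a single row here, so every count is 1):

+---+------------+
|Sub|Countofmarks|
+---+------------+
| N1|           1|
| N2|           1|
| N3|           1|
+---+------------+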

07. How do we write a SQL query to calculate a cumulative sum from 1 to 10?

We can do it in two ways.

Method 1:

SELECT
  number,
  SUM(number) OVER (ORDER BY number) AS cumulative_sum
FROM
  your_table
WHERE
  number BETWEEN 1 AND 10;

Method 2:

SELECT
  t1.number,
  SUM(t2.number) AS cumulative_sum
FROM
  your_table t1
JOIN
  your_table t2
  ON t1.number >= t2.number
WHERE
  t1.number BETWEEN 1 AND 10
GROUP BY
  t1.number
ORDER BY
  t1.number;

Finally, Method 1 is more efficient than Method 2: the window function computes the running total in a single pass, while the self-join compares every pair of rows.
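
For completeness, a minimal PySpark sketch of the window-function approach (Method 1), assuming the numbers 1 to 10 in a column called number:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Build a DataFrame with numbers 1..10
nums = spark.range(1, 11).withColumnRenamed("id", "number")

w = Window.orderBy("number")
nums.withColumn("cumulative_sum", F.sum("number").over(w)).show()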

08. How do we see S3 bucket folder/file contents using magic commands?

Here %fs is the magic command (see footnote 2). The ls subcommand lists the contents of the S3 path.

%fs ls s3://my-awesome-bucket/data/
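
The same listing can be done programmatically with dbutils (the bucket name here is illustrative):

files = dbutils.fs.ls("s3://my-awesome-bucket/data/")
for f in files:
    print(f.path, f.size)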

09. How do we create a Delta Table?

Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads. It lets you create Delta tables, which behave like regular Spark tables but add features such as ACID transactions, time travel, and more.

# Assuming 'df' is an existing DataFrame
df.write.format("delta").save("/mnt/delta-table-path")

10. In PySpark, with header=True or header=False, will the row count change?

Yes, if the file actually contains a header line. With header=True the first row becomes the column names, while with header=False it is read as a data row, so the DataFrame has one more row.
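
A quick sketch that demonstrates this on a small hypothetical file:

# Write a 3-line CSV: one header line plus two data rows (hypothetical path)
dbutils.fs.put("/tmp/header_demo.csv", "a,b\n1,2\n3,4", True)

print(spark.read.csv("/tmp/header_demo.csv", header=True).count())   # 2 rows
print(spark.read.csv("/tmp/header_demo.csv", header=False).count())  # 3 rows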

11. How do we create a Job or task in Databricks?

In the “Workflows” section of the Databricks workspace, we create a job and add one or more tasks to it.
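
Jobs can also be created programmatically. A hedged sketch using the Jobs REST API (endpoint and payload shape per Jobs API 2.1; the host, token, notebook path, and cluster settings are placeholders):

import requests

host = "https://<your-workspace>.azuredatabricks.net"
token = "<personal-access-token>"

payload = {
    "name": "demo-job",
    "tasks": [
        {
            "task_key": "main",
            "notebook_task": {"notebook_path": "/Users/me/my_notebook"},
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 1,
            },
        }
    ],
}

resp = requests.post(f"{host}/api/2.1/jobs/create",
                     headers={"Authorization": f"Bearer {token}"},
                     json=payload)
print(resp.json())  # returns the new job_id on success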

12. Can we restart a Job in Databricks?

Yes. A job can be re-run from the Workflows UI (“Run now”), and a failed run can be repaired so that only the failed tasks are re-executed (“Repair run”).

13. What is Delta Lake in Databricks?

Delta Lake is the optimized storage layer that provides the foundation for storing data and tables in the Databricks Lakehouse (see footnote 3).

  1. Use inferSchema=True:
    – When you want Spark to automatically decide the correct data types for each column.
    – When type accuracy is crucial in downstream data analysis or processing.
    – When you are okay with the slight performance trade-off of schema inference.
    Use inferSchema=False:
    – When you want to explicitly define the schema later in the code, or when you already know the schema.
    – When performance is critical and you want to avoid the overhead of schema inference.
    – When the data is expected to be consistently formatted and you are comfortable treating all columns as strings initially. ↩︎
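    As footnote 1 mentions defining the schema explicitly, here is a minimal sketch (column names and types are illustrative):

    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    schema = StructType([
        StructField("Sub", StringType(), True),
        StructField("Marks", IntegerType(), True),
    ])

    df = spark.read.csv(file_path, header=True, schema=schema)  # no inference pass needed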
  2. Magic commands in Databricks notebooks are an efficient way to execute specific commands within a notebook environment. Here’s a list of some commonly used Databricks magic commands:
    1. %fs
    File system commands: Interact with the Databricks File System (DBFS) and other storage systems like S3, Azure Blob, etc.
    Examples: %fs ls /mnt/my-mount/ – Lists files in a directory.
    %fs cp /path/to/source /path/to/destination – Copies files.
    %fs rm /path/to/file – Removes a file.
    2. %sql
    SQL commands: Lets you run SQL queries on data.
    Example: %sql SELECT * FROM my_table LIMIT 10 – Runs a SQL query and displays the results.
    3. %python, %r, %scala, %sh
    Language-specific commands: Switch the interpreter to a specific language.
    Examples: %python print("Hello from Python") – Executes Python code.
    %r print("Hello from R") – Executes R code.
    %scala println("Hello from Scala") – Executes Scala code.
    %sh ls -la – Runs shell commands.
    4. %md
    Markdown: Lets you write formatted text using Markdown syntax.
    Example: %md # This is a Markdown heading – Creates a heading.
    5. %run
    Run other notebooks: Runs another notebook within the current notebook.
    Example: %run /path/to/other_notebook – Executes all the cells in the specified notebook.
    6. %pip
    Install Python packages: Installs Python packages directly in the notebook environment.
    Example: %pip install numpy – Installs the NumPy package.
    7. %conda
    Conda environment management: Manages conda environments (if enabled).
    Example: %conda install pandas – Installs the Pandas package.
    8. %matplotlib
    Matplotlib integration: Sets up how matplotlib plots are rendered.
    Example: %matplotlib inline – Displays matplotlib plots inline in the notebook.
    9. %scala and %python (within SQL notebooks)
    Switch the interpreter within SQL notebooks: We can run Scala or Python code in cells of a SQL notebook.
    Example: %python display(spark.range(100)) – Runs Python code in a SQL notebook.
    These commands improve productivity within notebooks. ↩︎
  3. The lakehouse is a modern data architecture that combines the benefits of data lakes and data warehouses. It is a unified platform for all types of data (structured, semi-structured, and unstructured) and supports various workloads, from BI and SQL analytics to data science and machine learning. ↩︎