Optimizing Azure Databricks is crucial for data engineering and analytics workloads: a well-tuned workspace runs faster, costs less, and is easier to secure. This guide covers practical ways to optimize Azure Databricks.


Table of contents

  1. Cluster Configuration (Right-Sizing)
  2. Auto-Scaling
  3. Instance Pools
  4. Optimized Storage Formats
  5. Partitioning
  6. Caching and Memoization
  7. Query Optimization
  8. Cluster Policies
  9. Monitoring and Logging
  10. Job Scheduling
  11. Network Optimization
  12. Security Optimization

Cluster Configuration (Right-Sizing)

  • Start with a cluster of 4 worker nodes (Standard_DS3_v2 instances) and monitor resource utilization.
  • Scale up to larger instance types such as Standard_DS4_v2, or add nodes, if CPU or memory utilization steadily exceeds 70%.

Steps to implement programmatically (Python):
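Cluster creation and right-sizing go through the Databricks Clusters REST API rather than PySpark itself. The sketch below, using Python's requests library, creates the 4-node cluster described above; the workspace URL, access token, cluster name, and runtime version are placeholders you would replace with your own values.

```python
import requests

# Assumed placeholders: replace with your workspace URL and a personal access token.
DATABRICKS_HOST = "https://<your-workspace>.azuredatabricks.net"
TOKEN = "<personal-access-token>"

# Create a 4-node cluster of Standard_DS3_v2 instances via the Clusters API 2.0.
payload = {
    "cluster_name": "etl-cluster",                # placeholder name
    "spark_version": "13.3.x-scala2.12",          # pick a runtime supported in your workspace
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 4,
    "autotermination_minutes": 60,
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])
```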

Auto-Scaling

  • Enable auto-scaling so Databricks can add up to 2 additional nodes under sustained load, for example when CPU utilization exceeds 70% for more than 5 minutes.
  • Nodes are removed again as load drops (for example, CPU utilization below 30%). Databricks manages these scaling triggers internally; you configure only the minimum and maximum worker counts, as shown in the sketch below.
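A minimal sketch of an autoscaling cluster via the Clusters API 2.0. The 4-to-6 worker range implements "up to 2 additional nodes" on top of the 4-node baseline above; host, token, and runtime version are placeholders.

```python
import requests

DATABRICKS_HOST = "https://<your-workspace>.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"                                  # placeholder

# Databricks decides *when* to scale based on load; you set only the bounds.
payload = {
    "cluster_name": "etl-cluster-autoscaling",
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "autoscale": {"min_workers": 4, "max_workers": 6},  # up to 2 extra nodes
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
resp.raise_for_status()
```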

Instance Pools

  • Create an instance pool with a minimum of 2 nodes and a maximum of 10 nodes.
  • Maintain a minimum idle instance count of 2 to keep the pool warm, so new clusters attach to pre-provisioned VMs instead of waiting for them to boot (see the sketch below).
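A minimal sketch using the Instance Pools API 2.0; host, token, pool name, and the idle auto-termination value are placeholders or assumptions.

```python
import requests

DATABRICKS_HOST = "https://<your-workspace>.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"                                  # placeholder

payload = {
    "instance_pool_name": "warm-pool",            # placeholder name
    "node_type_id": "Standard_DS3_v2",
    "min_idle_instances": 2,                      # keep 2 idle VMs warm
    "max_capacity": 10,                           # pool never grows past 10 nodes
    "idle_instance_autotermination_minutes": 30,  # assumption; tune to your usage
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/instance-pools/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json()["instance_pool_id"])
```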

Optimized Storage Formats

  • Use Delta Lake or Apache Parquet file format.
  • Partition the data based on frequently queried columns like date or customer ID.

Steps to implement

Note: Partitioning data on frequently queried columns can significantly improve query performance by reducing the amount of data scanned, which is particularly beneficial for large datasets. Concretely, when writing the transformed DataFrame to its target location, pass the frequently filtered columns to the partitionBy(col1, col2, …) method.

Note: The underlying storage layer of a Delta Lake table is Parquet files.

PartitionBy Date
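A minimal sketch, assuming a transformed DataFrame df with a date column; the Delta target path is a placeholder.

```python
# `df` and the target path are assumed placeholders.
(df.write
   .format("delta")
   .mode("overwrite")
   .partitionBy("date")
   .save("/mnt/datalake/sales_by_date"))
```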

PartitionBy single column
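The same pattern works for any single frequently filtered column; customer_id below is illustrative (be wary of very high-cardinality columns, which produce many small partitions).

```python
(df.write
   .format("delta")
   .mode("overwrite")
   .partitionBy("customer_id")  # illustrative; watch out for high cardinality
   .save("/mnt/datalake/sales_by_customer"))
```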

PartitionBy multiple columns
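Passing several columns creates a nested partition layout; queries benefit most when they filter on the leading column(s). Paths are placeholders.

```python
(df.write
   .format("delta")
   .mode("overwrite")
   .partitionBy("date", "customer_id")  # nested layout: date first, then customer_id
   .save("/mnt/datalake/sales_by_date_customer"))
```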

Partitioning

  • Aim for partition sizes of roughly 128 MB, which matches Spark's default input split size, for optimal scan performance.
  • Enable dynamic partition pruning so that only the partitions relevant to a query are scanned at execution time, reducing the amount of data read.

Partition sizing
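A sketch of the 128 MB target. spark.sql.files.maxPartitionBytes is a standard Spark setting (128 MB is its default); the OPTIMIZE path is a placeholder and assumes a Delta table.

```python
# How much data Spark packs into one input partition when reading files;
# 128 MB is the default and a good target for most scans.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))

# On the write side, Delta Lake can compact small files toward an optimal size.
spark.sql("OPTIMIZE delta.`/mnt/datalake/sales_by_date`")
```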

Dynamic partition pruning
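Dynamic partition pruning is enabled by default in Spark 3.x; the sketch below sets it explicitly and shows the join shape that benefits. sales_df and dates_df are assumed DataFrames.

```python
# On by default in Spark 3.x; shown explicitly for clarity.
spark.conf.set("spark.sql.optimizer.dynamicPartitionPruning.enabled", "true")

# A selective filter on the dimension side lets Spark prune partitions of the
# (date-partitioned) fact table at runtime, scanning only matching dates.
result = sales_df.join(dates_df.filter("is_holiday = true"), on="date")
```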

Caching and Memoization

  • Cache frequently accessed data using Databricks’ caching API.
  • Expire cached data based on freshness requirements; Spark has no built-in TTL, so this is typically done by unpersisting and re-reading on a schedule (see the TTL sketch below).

Cache
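A minimal caching sketch; the Delta path and table name are placeholders. The last line enables the Databricks disk cache, which is separate from Spark's in-memory cache.

```python
# Keep a frequently reused DataFrame in memory (spilling to disk as needed).
hot_df = spark.read.format("delta").load("/mnt/datalake/sales_by_date")
hot_df.cache()
hot_df.count()  # an action materializes the cache

# SQL equivalent for a registered table (name is a placeholder):
spark.sql("CACHE TABLE sales_hot")

# Databricks disk cache for Parquet/Delta reads (distinct from .cache()).
spark.conf.set("spark.databricks.io.cache.enabled", "true")
```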

TTL
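Spark does not expose a TTL on cached data, so the sketch below implements one manually: once the freshness window elapses, the cache is dropped with unpersist() and rebuilt. It reuses hot_df from the cache sketch above, and the one-hour TTL is an assumption.

```python
import time

CACHE_TTL_SECONDS = 3600  # assumed freshness requirement: refresh hourly
cached_at = time.time()

def get_sales_df():
    """Return the cached DataFrame, rebuilding it once the TTL has elapsed."""
    global hot_df, cached_at
    if time.time() - cached_at > CACHE_TTL_SECONDS:
        hot_df.unpersist()  # drop the stale cache
        hot_df = spark.read.format("delta").load("/mnt/datalake/sales_by_date")
        hot_df.cache()
        cached_at = time.time()
    return hot_df
```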

Query Optimization

  • Utilize Databricks’ query profiler to identify slow queries.
  • Optimize joins and transformations to avoid Cartesian products and unnecessary shuffling.

Steps to enable the query profiler (using configuration)
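Databricks captures query profiles automatically, so there is no single switch to flip; the settings below are a sketch of configuration that enriches the runtime statistics the profile and Spark UI can show.

```python
# Adaptive query execution collects runtime statistics and re-optimizes plans.
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Let AQE split skewed join partitions at runtime.
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
```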

Databricks UI:

  • Navigate to the Query History tab in the Databricks workspace.
  • Locate your query in the list of executed queries.
  • Click on the query to view its profile, which includes details like execution time, stages, tasks, and more.

Steps to enable the query profiler (programmatically)
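A programmatic sketch using Spark's built-in plan inspection; the query itself is illustrative.

```python
import time

query = "SELECT date, SUM(amount) FROM sales GROUP BY date"  # illustrative

# Inspect the optimized plan without executing the query.
spark.sql("EXPLAIN FORMATTED " + query).show(truncate=False)

# Equivalent from a DataFrame handle.
df = spark.sql(query)
df.explain(mode="formatted")

# Coarse wall-clock timing of the actual execution.
start = time.time()
df.collect()
print(f"elapsed: {time.time() - start:.1f}s")
```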

Analyzing Query Profile

Once you have the query profile, you can analyze it to identify areas for optimization:

  • Execution Time: Look for stages or tasks that are taking longer to execute.
  • Stages and Tasks: Analyze the stages and tasks to understand the execution flow and identify any bottlenecks.
  • Data Skew: Check for data skew across tasks, which can degrade performance.

Optimizing Queries Based on Profiling

Based on the insights from the query profile, you can optimize your queries by:

  • Tuning Spark Configurations: Adjusting Spark configurations such as spark.sql.shuffle.partitions and spark.executor.memory to optimize performance (see the sketch after this list).
  • Optimizing SQL Queries: Refining SQL queries by adding appropriate filters, joins, and aggregations to reduce data scanned and processed.
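A sketch with illustrative values; tune them against your own workload and cluster size.

```python
# 200 is Spark's default; raise it for very large shuffles, lower it for small ones.
spark.conf.set("spark.sql.shuffle.partitions", "200")

# Note: spark.executor.memory cannot be changed at runtime in a notebook; set it
# in the cluster's Spark config (e.g. spark.executor.memory 8g) instead.
```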

Cluster Policies

  • Configure clusters to terminate after 1 hour of inactivity.
  • Set a minimum cluster size of 1 worker node. Both rules can be enforced for all users with a cluster policy, as sketched below.
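A minimal sketch of creating such a policy via the Cluster Policies API; host, token, and policy name are placeholders.

```python
import json
import requests

DATABRICKS_HOST = "https://<your-workspace>.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"                                  # placeholder

# Policy definition: force 60-minute auto-termination, require at least 1 worker.
definition = {
    "autotermination_minutes": {"type": "fixed", "value": 60},
    "num_workers": {"type": "range", "minValue": 1},
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/policies/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"name": "etl-policy", "definition": json.dumps(definition)},
)
resp.raise_for_status()
```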

Monitoring and Logging

  • Set up alerts for CPU utilization exceeding 80% for over 15 minutes.
  • Monitor job execution duration and resource utilization using Databricks’ logs and metrics.

Job Scheduling

  • Schedule ETL jobs during off-peak hours to reduce costs and resource contention (see the sketch below).
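A minimal sketch using the Jobs API 2.1 to run a notebook at 02:00 UTC; 02:00 is an assumed off-peak hour, and the host, token, notebook path, and cluster ID are placeholders.

```python
import requests

DATABRICKS_HOST = "https://<your-workspace>.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"                                  # placeholder

payload = {
    "name": "nightly-etl",
    "tasks": [
        {
            "task_key": "etl",
            "notebook_task": {"notebook_path": "/Repos/etl/nightly"},  # placeholder
            "existing_cluster_id": "<cluster-id>",                     # placeholder
        }
    ],
    # Quartz cron: run daily at 02:00 in the given time zone.
    "schedule": {"quartz_cron_expression": "0 0 2 * * ?", "timezone_id": "UTC"},
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json()["job_id"])
```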

Network Optimization

  • Deploy resources in the same Azure region to minimize latency.
  • Use Azure ExpressRoute for dedicated, private network connectivity.

Security Optimization

  • Implement Azure AD-based authentication.
  • Encrypt data at rest using Azure Storage Service Encryption and in transit using TLS. Use Azure Key Vault to manage encryption keys and other secrets securely; in Databricks, a Key Vault-backed secret scope keeps credentials out of notebooks, as sketched below.
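A minimal sketch of reading a credential from a Key Vault-backed secret scope; the scope, key, and storage account names are placeholders.

```python
# Fetch a credential from an Azure Key Vault-backed secret scope instead of
# hard-coding it in the notebook (scope/key names are placeholders).
storage_key = dbutils.secrets.get(scope="kv-backed-scope", key="storage-account-key")

# Use it, e.g., to authenticate to ADLS Gen2 (account name is a placeholder).
spark.conf.set(
    "fs.azure.account.key.<storage-account>.dfs.core.windows.net",
    storage_key,
)
```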

Conclusion

These recommended values and practices provide a comprehensive starting point for optimizing Azure Databricks clusters for performance, efficiency, and security. Adjust them to your specific workload characteristics and requirements, and monitor regularly to sustain performance gains and cost-effectiveness.
