In today’s data-driven world, integrating data from traditional databases like MySQL and Oracle into Databricks is a common requirement for analytics, AI, and machine learning pipelines.
Databricks, powered by Apache Spark, provides a unified platform for processing data at scale — but extracting data from on-premises or cloud-hosted databases efficiently requires careful planning.
This guide walks you through the end-to-end process of reading data from MySQL and Oracle databases into Databricks, covering connection setup, optimization techniques, schema evolution, and security best practices.
1. Understanding the Integration Flow
At a high level, the workflow follows these steps:
- Set up JDBC connectivity to MySQL or Oracle.
- Securely store database credentials using a secret manager (like Databricks Secrets, AWS Secrets Manager, or Azure Key Vault).
- Read data using PySpark or Databricks SQL via JDBC.
- Write data to Delta tables for efficient querying and incremental updates.
This process enables you to use Databricks as a central data platform, combining structured and unstructured data for advanced analytics.
2. Reading from MySQL into Databricks
MySQL is a common transactional database in many enterprise systems. Databricks supports reading MySQL data through its built-in JDBC connector.
Step 1: Define Connection Properties
You’ll need the JDBC URL, username, and password. It’s best practice to store credentials in Databricks Secrets rather than hardcoding them.
jdbc_hostname = "mysql-server.example.com"
jdbc_port = 3306
jdbc_database = "sales_db"
jdbc_url = f"jdbc:mysql://{jdbc_hostname}:{jdbc_port}/{jdbc_database}"
username = dbutils.secrets.get("mysql-scope", "username")
password = dbutils.secrets.get("mysql-scope", "password")
Step 2: Read Data Using JDBC
Use the Spark DataFrame reader with the JDBC source for parallel, scalable reads.
df_mysql = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "orders")
    .option("user", username)
    .option("password", password)
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("fetchsize", 50000)
    .load()
)
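If you only need a subset of the table, you can push the projection and filter down to MySQL with the JDBC source's query option (mutually exclusive with dbtable). A minimal sketch; the filter itself is illustrative:
df_open_orders = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("query", "SELECT * FROM orders WHERE status = 'OPEN'")  # illustrative filter
    .option("user", username)
    .option("password", password)
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .load()
)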
Step 3: Parallelize the Read (Performance Optimization)
For large MySQL tables, define partitioning columns to parallelize data retrieval.
df_mysql = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "orders")
    .option("user", username)
    .option("password", password)
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("partitionColumn", "order_id")
    .option("lowerBound", 1)
    .option("upperBound", 1000000)
    .option("numPartitions", 16)
    .load()
)
This distributes reads across Spark executors, drastically improving throughput.
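Hardcoded bounds go stale as the table grows. A minimal sketch, reusing the connection settings above, that derives the bounds from the source at read time:
# Query MIN/MAX of the partition column so the bounds track the live table.
bounds = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "(SELECT MIN(order_id) AS lo, MAX(order_id) AS hi FROM orders) b")
    .option("user", username)
    .option("password", password)
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .load()
    .first()
)

df_mysql = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "orders")
    .option("user", username)
    .option("password", password)
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("partitionColumn", "order_id")
    .option("lowerBound", int(bounds["lo"]))
    .option("upperBound", int(bounds["hi"]))
    .option("numPartitions", 16)
    .load()
)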
Step 4: Write to a Delta Table
Once the data is loaded, persist it as a Delta table for efficient analytics.
df_mysql.write.format("delta").mode("overwrite").saveAsTable("bronze.orders")
3. Reading from Oracle into Databricks
Oracle databases often contain mission-critical enterprise data. Databricks can read Oracle tables using the Oracle JDBC driver.
Step 1: Configure the JDBC Connection
oracle_host = "oracle-server.example.com"
oracle_port = 1521
oracle_sid = "orcl"
jdbc_url = f"jdbc:oracle:thin:@{oracle_host}:{oracle_port}:{oracle_sid}"
username = dbutils.secrets.get("oracle-scope", "username")
password = dbutils.secrets.get("oracle-scope", "password")
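If your database is addressed by a service name rather than a SID (an assumption about your environment), the thin driver expects a slash-style URL; oracle_service_name below is a placeholder:
jdbc_url = f"jdbc:oracle:thin:@//{oracle_host}:{oracle_port}/{oracle_service_name}"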
Step 2: Read from Oracle Using Partitioning
df_oracle = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "HR.EMPLOYEES")
    .option("user", username)
    .option("password", password)
    .option("driver", "oracle.jdbc.driver.OracleDriver")
    .option("partitionColumn", "EMPLOYEE_ID")
    .option("lowerBound", 1)
    .option("upperBound", 200000)
    .option("numPartitions", 32)
    .option("fetchsize", 50000)
    .load()
)
The fetch size and partitioning strategy are key to improving performance and avoiding timeouts.
Step 3: Handle Data Types and Schema Evolution
Oracle-specific data types like NUMBER, CLOB, and DATE may require explicit casting.
df_oracle = df_oracle.withColumn("SALARY", df_oracle["SALARY"].cast("double"))
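A slightly broader sketch of the same idea; EMPLOYEE_ID and HIRE_DATE exist in the standard HR schema, while RESUME_TEXT is a hypothetical CLOB column used for illustration:
from pyspark.sql import functions as F

df_oracle = (
    df_oracle
    .withColumn("EMPLOYEE_ID", F.col("EMPLOYEE_ID").cast("long"))    # NUMBER columns often arrive as decimals
    .withColumn("HIRE_DATE", F.col("HIRE_DATE").cast("timestamp"))   # Oracle DATE also carries a time component
    .withColumn("RESUME_TEXT", F.col("RESUME_TEXT").cast("string"))  # hypothetical CLOB column
)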
If schema evolution occurs (for example, new columns are added in Oracle), enable automatic schema merging when writing to Delta:
df_oracle.write.format("delta") \
    .mode("overwrite") \
    .option("mergeSchema", "true") \
    .saveAsTable("bronze.employees")
4. Best Practices for Secure and Scalable Data Loads
a. Use Secrets Management
Avoid storing credentials in notebooks. Use:
- Databricks Secrets
- AWS Secrets Manager
- Azure Key Vault
This ensures compliance with enterprise security policies.
b. Incremental Loading
Instead of full reloads, implement incremental ingestion by filtering on a timestamp or monotonically increasing ID column. The filter below uses Oracle's SYSDATE; for MySQL, use an equivalent such as NOW() - INTERVAL 1 DAY.
query = "(SELECT * FROM orders WHERE updated_at > SYSDATE - 1) tmp"
connection_props = {
    "user": username,
    "password": password,
    "driver": "oracle.jdbc.driver.OracleDriver",  # or com.mysql.cj.jdbc.Driver for MySQL
}
df_incremental = spark.read.jdbc(url=jdbc_url, table=query, properties=connection_props)
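To keep incremental loads idempotent, upsert the changes into the Delta table with a MERGE rather than appending. A minimal sketch, assuming the MySQL orders source from section 2, an updated_at column, and the bronze.orders Delta table created earlier:
from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Use the latest timestamp already in the Delta table as the watermark
# (formatting it as a SQL literal is simplified here for illustration).
last_watermark = spark.table("bronze.orders").agg(F.max("updated_at")).first()[0]

incremental_query = f"(SELECT * FROM orders WHERE updated_at > '{last_watermark}') tmp"
df_changes = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", incremental_query)
    .option("user", username)
    .option("password", password)
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .load()
)

# Upsert the changes into the Delta table keyed on order_id.
(
    DeltaTable.forName(spark, "bronze.orders").alias("t")
    .merge(df_changes.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)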
c. Use Delta Format for Storage
Writing to Delta Lake allows:
- Schema evolution
- Time travel
- ACID transactions
- Efficient Z-Ordering and data skipping
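For example, against the bronze.orders table created earlier (the Z-Order column choice is illustrative):
# Co-locate data for faster lookups on order_id.
spark.sql("OPTIMIZE bronze.orders ZORDER BY (order_id)")

# Time travel: query the table as of an earlier version.
df_previous = spark.sql("SELECT * FROM bronze.orders VERSION AS OF 0")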
d. Monitor and Optimize
Use the Spark UI to monitor:
- Task parallelism
- Shuffle read/write
- JDBC read performance
Partition tuning and fetch size adjustments often yield significant performance gains.
5. Automating and Scaling the Pipeline
For production workloads, consider automating ingestion using:
- Delta Live Tables (DLT) for declarative ETL pipelines
- Databricks Workflows for scheduled jobs
- Auto Loader for incrementally ingesting CDC files that MySQL or Oracle replication tools land in cloud storage
DLT simplifies data quality enforcement and schema management while ensuring fault tolerance.
Example DLT pipeline configuration:
import dlt

@dlt.table(
    name="silver.orders_cleaned",
    comment="Cleaned and deduplicated orders data from MySQL"
)
def load_orders():
    # bronze.orders is loaded outside this pipeline, so read it from the metastore
    # rather than with dlt.read(), which only resolves datasets defined in the pipeline.
    return (
        spark.read.table("bronze.orders")
        .dropDuplicates(["order_id"])
        .filter("status IS NOT NULL")
    )
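DLT expectations add the data quality enforcement mentioned above. A minimal sketch; the order_total column and threshold are illustrative assumptions:
@dlt.table(
    name="silver.orders_validated",
    comment="Orders that pass basic quality checks"
)
@dlt.expect_or_drop("valid_order_total", "order_total >= 0")  # order_total is an assumed column
def validated_orders():
    return dlt.read("silver.orders_cleaned")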
6. Troubleshooting Common Issues
| Issue | Likely Cause | Fix |
| --- | --- | --- |
| Slow reads | Lack of partitioning | Define partitionColumn and numPartitions |
| Connection timeout | Network issues or large data fetch | Increase fetchsize, check the JDBC connection pool |
| Schema mismatch | Source table schema changed | Use mergeSchema=true |
| Lost rows (Oracle) | Skewed partitions | Balance partition bounds, reduce partitions |
7. Summary
Integrating MySQL and Oracle data into Databricks empowers enterprises to unify transactional and analytical workloads.
By using JDBC connectors, secure secrets management, parallelized reads, and Delta Lake storage, you can build scalable, secure, and reliable ingestion pipelines.
This approach ensures:
- High performance through partitioned reads
- Secure credentials via secret management
- Resilient storage with Delta Lake
- Automation via Databricks Workflows or DLT
Final Thoughts
Whether you’re migrating legacy databases to the cloud or powering modern analytics, Databricks provides a flexible, scalable foundation.
A well-structured ingestion strategy — following the practices above — ensures that your MySQL and Oracle data is always available, consistent, and ready for analytics.