In today’s data-driven world, integrating data from traditional databases like MySQL and Oracle into Databricks is a common requirement for analytics, AI, and machine learning pipelines.
Databricks, powered by Apache Spark, provides a unified platform for processing data at scale — but extracting data from on-premises or cloud-hosted databases efficiently requires careful planning.
This guide walks you through the end-to-end process of reading data from MySQL and Oracle databases into Databricks, covering connection setup, optimization techniques, schema evolution, and security best practices.
1. Understanding the Integration Flow
At a high level, the workflow follows these steps:
- Set up JDBC connectivity to MySQL or Oracle.
- Securely store database credentials using a secret manager (like Databricks Secrets, AWS Secrets Manager, or Azure Key Vault).
- Read data using PySpark or Databricks SQL via JDBC.
- Write data to Delta tables for efficient querying and incremental updates.
This process enables you to use Databricks as a central data platform, combining structured and unstructured data for advanced analytics.
2. Reading from MySQL into Databricks
MySQL is a common transactional database in many enterprise systems. Databricks supports reading MySQL data through its built-in JDBC connector.
Step 1: Define Connection Properties
You’ll need the JDBC URL, username, and password. It’s best practice to store credentials in Databricks Secrets rather than hardcoding them.
jdbc_hostname = "mysql-server.example.com"
jdbc_port = 3306
jdbc_database = "sales_db"
jdbc_url = f"jdbc:mysql://{jdbc_hostname}:{jdbc_port}/{jdbc_database}"
username = dbutils.secrets.get("mysql-scope", "username")
password = dbutils.secrets.get("mysql-scope", "password")
Step 2: Read Data Using JDBC
Use the Spark DataFrame reader with the JDBC source for parallel, scalable reads.
df_mysql = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "orders")
    .option("user", username)
    .option("password", password)
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("fetchsize", 50000)
    .load()
)
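If you only need a subset of the table, you can push the projection and filter down to MySQL with the JDBC source's query option (mutually exclusive with dbtable). A minimal sketch; the filter itself is illustrative:
df_open_orders = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("query", "SELECT * FROM orders WHERE status = 'OPEN'")  # illustrative filter
    .option("user", username)
    .option("password", password)
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .load()
)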
Step 3: Parallelize the Read (Performance Optimization)
For large MySQL tables, define partitioning columns to parallelize data retrieval.
df_mysql = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "orders")
    .option("user", username)
    .option("password", password)
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("partitionColumn", "order_id")
    .option("lowerBound", 1)
    .option("upperBound", 1000000)
    .option("numPartitions", 16)
    .load()
)
This distributes reads across Spark executors, drastically improving throughput.
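Hardcoded bounds go stale as the table grows. A minimal sketch, reusing the connection settings above, that derives the bounds from the source at read time:
# Query MIN/MAX of the partition column so the bounds track the live table.
bounds = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "(SELECT MIN(order_id) AS lo, MAX(order_id) AS hi FROM orders) b")
    .option("user", username)
    .option("password", password)
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .load()
    .first()
)

df_mysql = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "orders")
    .option("user", username)
    .option("password", password)
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("partitionColumn", "order_id")
    .option("lowerBound", int(bounds["lo"]))
    .option("upperBound", int(bounds["hi"]))
    .option("numPartitions", 16)
    .load()
)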
Step 4: Write to a Delta Table
Once the data is loaded, persist it as a Delta table for efficient analytics.
df_mysql.write.format("delta").mode("overwrite").saveAsTable("bronze.orders")
3. Reading from Oracle into Databricks
Oracle databases often contain mission-critical enterprise data. Databricks can read Oracle tables using the Oracle JDBC driver.
Step 1: Configure the JDBC Connection
oracle_host = "oracle-server.example.com"
oracle_port = 1521
oracle_sid = "orcl"
jdbc_url = f"jdbc:oracle:thin:@{oracle_host}:{oracle_port}:{oracle_sid}"
username = dbutils.secrets.get("oracle-scope", "username")
password = dbutils.secrets.get("oracle-scope", "password")
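If your database is addressed by a service name rather than a SID (an assumption about your environment), the thin driver expects a slash-style URL; oracle_service_name below is a placeholder:
jdbc_url = f"jdbc:oracle:thin:@//{oracle_host}:{oracle_port}/{oracle_service_name}"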
Step 2: Read from Oracle Using Partitioning
df_oracle = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "HR.EMPLOYEES")
    .option("user", username)
    .option("password", password)
    .option("driver", "oracle.jdbc.driver.OracleDriver")
    .option("partitionColumn", "EMPLOYEE_ID")
    .option("lowerBound", 1)
    .option("upperBound", 200000)
    .option("numPartitions", 32)
    .option("fetchsize", 50000)
    .load()
)
The fetch size and partitioning strategy are key to improving performance and avoiding timeouts.
Step 3: Handle Data Types and Schema Evolution
Oracle-specific data types like NUMBER, CLOB, and DATE may require explicit casting.
df_oracle = df_oracle.withColumn("SALARY", df_oracle["SALARY"].cast("double"))
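A slightly broader sketch of the same idea; EMPLOYEE_ID and HIRE_DATE exist in the standard HR schema, while RESUME_TEXT is a hypothetical CLOB column used for illustration:
from pyspark.sql import functions as F

df_oracle = (
    df_oracle
    .withColumn("EMPLOYEE_ID", F.col("EMPLOYEE_ID").cast("long"))    # NUMBER columns often arrive as decimals
    .withColumn("HIRE_DATE", F.col("HIRE_DATE").cast("timestamp"))   # Oracle DATE also carries a time component
    .withColumn("RESUME_TEXT", F.col("RESUME_TEXT").cast("string"))  # hypothetical CLOB column
)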
If schema evolution occurs (for example, new columns are added in Oracle), enable automatic schema merging when writing to Delta:
df_oracle.write.format("delta") \
    .mode("overwrite") \
    .option("mergeSchema", "true") \
    .saveAsTable("bronze.employees")
4. Best Practices for Secure and Scalable Data Loads
a. Use Secrets Management
Avoid storing credentials in notebooks. Use:
- Databricks Secrets
- AWS Secrets Manager
- Azure Key Vault
This ensures compliance with enterprise security policies.
b. Incremental Loading
Instead of full reloads, implement incremental ingestion by filtering on a timestamp or monotonically increasing ID column. The filter below uses Oracle's SYSDATE; for MySQL, use an equivalent such as NOW() - INTERVAL 1 DAY.
query = "(SELECT * FROM orders WHERE updated_at > SYSDATE - 1) tmp"
connection_props = {
    "user": username,
    "password": password,
    "driver": "oracle.jdbc.driver.OracleDriver",  # or com.mysql.cj.jdbc.Driver for MySQL
}
df_incremental = spark.read.jdbc(url=jdbc_url, table=query, properties=connection_props)
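To keep incremental loads idempotent, upsert the changes into the Delta table with a MERGE rather than appending. A minimal sketch, assuming the MySQL orders source from section 2, an updated_at column, and the bronze.orders Delta table created earlier:
from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Use the latest timestamp already in the Delta table as the watermark
# (formatting it as a SQL literal is simplified here for illustration).
last_watermark = spark.table("bronze.orders").agg(F.max("updated_at")).first()[0]

incremental_query = f"(SELECT * FROM orders WHERE updated_at > '{last_watermark}') tmp"
df_changes = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", incremental_query)
    .option("user", username)
    .option("password", password)
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .load()
)

# Upsert the changes into the Delta table keyed on order_id.
(
    DeltaTable.forName(spark, "bronze.orders").alias("t")
    .merge(df_changes.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)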
c. Use Delta Format for Storage
Writing to Delta Lake allows:
- Schema evolution
- Time travel
- ACID transactions
- Efficient Z-Ordering and data skipping
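For example, against the bronze.orders table created earlier (the Z-Order column choice is illustrative):
# Co-locate data for faster lookups on order_id.
spark.sql("OPTIMIZE bronze.orders ZORDER BY (order_id)")

# Time travel: query the table as of an earlier version.
df_previous = spark.sql("SELECT * FROM bronze.orders VERSION AS OF 0")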
d. Monitor and Optimize
Use the Spark UI to monitor:
- Task parallelism
- Shuffle read/write
- JDBC read performance
Partition tuning and fetch size adjustments often yield significant performance gains.
5. Automating and Scaling the Pipeline
For production workloads, consider automating ingestion using:
- Delta Live Tables (DLT) for declarative ETL pipelines
- Databricks Workflows for scheduled jobs
- Auto Loader for incrementally ingesting CDC files that MySQL or Oracle replication tools land in cloud storage
DLT simplifies data quality enforcement and schema management while ensuring fault tolerance.
Example DLT pipeline configuration:
import dlt

@dlt.table(
    name="silver.orders_cleaned",
    comment="Cleaned and deduplicated orders data from MySQL"
)
def load_orders():
    # bronze.orders is loaded outside this pipeline, so read it from the metastore
    # rather than with dlt.read(), which only resolves datasets defined in the pipeline.
    return (
        spark.read.table("bronze.orders")
        .dropDuplicates(["order_id"])
        .filter("status IS NOT NULL")
    )
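DLT expectations add the data quality enforcement mentioned above. A minimal sketch; the order_total column and threshold are illustrative assumptions:
@dlt.table(
    name="silver.orders_validated",
    comment="Orders that pass basic quality checks"
)
@dlt.expect_or_drop("valid_order_total", "order_total >= 0")  # order_total is an assumed column
def validated_orders():
    return dlt.read("silver.orders_cleaned")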
6. Troubleshooting Common Issues
| Issue | Likely Cause | Fix |
| --- | --- | --- |
| Slow reads | Lack of partitioning | Define partitionColumn and numPartitions |
| Connection timeout | Network issues or large data fetch | Increase fetchsize, check the JDBC connection pool |
| Schema mismatch | Source table schema changed | Use mergeSchema=true |
| Lost rows (Oracle) | Skewed partitions | Balance partition bounds, reduce partitions |
7. Summary
Integrating MySQL and Oracle data into Databricks empowers enterprises to unify transactional and analytical workloads.
By using JDBC connectors, secure secrets management, parallelized reads, and Delta Lake storage, you can build scalable, secure, and reliable ingestion pipelines.
This approach ensures:
- High performance through partitioned reads
- Secure credentials via secret management
- Resilient storage with Delta Lake
- Automation via Databricks Workflows or DLT
Final Thoughts
Whether you’re migrating legacy databases to the cloud or powering modern analytics, Databricks provides a flexible, scalable foundation.
A well-structured ingestion strategy — following the practices above — ensures that your MySQL and Oracle data is always available, consistent, and ready for analytics.