How to Extract Databricks Job Logs Using the REST API, Convert Them to a DataFrame, and Store Them in PostgreSQL. Here are the details.
🧩 Q1: How to Extract HTTPS Website Links Using Python's re Module?
You can easily find all HTTPS links in a text using Python’s re module.
import re
text = """
Visit https://www.google.com and https://openai.com/research for details.
"""
pattern = r"""https?://[^\s'"]+"""
https_links = re.findall(pattern, text)
print(https_links)
Output:
['https://www.google.com', 'https://openai.com/research']
🔍 Regex Breakdown:
| Part | Meaning |
|---|---|
| r | Raw string prefix; prevents escape issues |
| https? | Matches http or https |
| :// | Matches the literal :// |
| [^\s'"]+ | Matches one or more characters that are not whitespace or quotes |
This pattern cleanly extracts both HTTP and HTTPS URLs from raw text.
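If you want strictly HTTPS links (dropping any plain http:// matches) without duplicates, here is a small follow-up sketch reusing https_links from above:
# Keep only https:// matches and de-duplicate while preserving order
https_only = [link for link in https_links if link.startswith("https://")]
unique_https = list(dict.fromkeys(https_only))
print(unique_https)  # ['https://www.google.com', 'https://openai.com/research']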
🧩 Q2: What Happens in Delta Live Tables (DLT) When maxFilesPerTrigger and maxBytesPerTrigger Are Set?
In Databricks Delta Live Tables (DLT), these options control how much data is processed per micro-batch.
| Option | Meaning |
|---|---|
| maxFilesPerTrigger | Maximum number of new files processed in each micro-batch |
| maxBytesPerTrigger | Maximum total size (in bytes) of new files processed in each micro-batch |
When both are set, whichever limit is reached first takes effect, i.e. the more restrictive of the two wins.
Example:
import dlt

@dlt.table
def bronze_data():
    return (
        spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .option("maxFilesPerTrigger", 100)
            .option("maxBytesPerTrigger", 104857600)  # 100 MB
            .load("s3://my-bucket/raw/")
    )
This configuration ensures stable and incremental ingestion without overloading your Databricks cluster.
🧩 Q3: If maxFilesPerTrigger Is 10,000 and You Have 10,010 Files — What Happens?
If you set:
.option("maxFilesPerTrigger", 10000)
and 10,010 new files are found, Databricks will:
- Process 10,000 files in the current micro-batch.
- Record their checkpoints.
- Automatically process the remaining 10 files in the next trigger.
✅ No files are skipped or lost — DLT ensures checkpoint-based continuity.
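As a quick sanity check on how the backlog splits, here is a back-of-the-envelope sketch (plain Python arithmetic, not DLT API code):
backlog_files = 10_010
max_files_per_trigger = 10_000

first_batch = min(backlog_files, max_files_per_trigger)   # 10,000 files in the first trigger
second_batch = backlog_files - first_batch                # 10 files carried to the next trigger
print(first_batch, second_batch)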
🧩 Q4: How to Write Databricks Job Logs to PostgreSQL (pgAdmin)?
You can store Databricks job logs into PostgreSQL for monitoring and reporting.
Step 1 — Create a Table in PostgreSQL:
CREATE TABLE databricks_job_logs (
    id           SERIAL PRIMARY KEY,
    job_id       BIGINT,
    run_id       BIGINT,
    job_name     TEXT,
    status       TEXT,
    start_time   TIMESTAMP,
    end_time     TIMESTAMP,
    duration_sec INT,
    message      TEXT,
    inserted_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Step 2 — Write Log Data from Databricks:
import psycopg2
from datetime import datetime

conn = psycopg2.connect(
    host="your-postgres-host",
    dbname="your_db",
    user="your_user",
    password="your_password"
)
cur = conn.cursor()
cur.execute("""
    INSERT INTO databricks_job_logs
        (job_id, run_id, job_name, status, start_time, end_time, duration_sec, message)
    VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
""", (123, 456, "daily_pipeline", "SUCCESS",
      datetime(2025, 10, 4, 9, 0, 0),
      datetime.now(),
      120, "Job completed successfully."))
conn.commit()
cur.close()
conn.close()
✅ You can query logs directly in pgAdmin:
SELECT * FROM databricks_job_logs ORDER BY inserted_at DESC;
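For ad-hoc analysis outside pgAdmin, the same table can also be pulled into pandas. A minimal sketch, assuming the same connection details as in Step 2:
import pandas as pd
import psycopg2

conn = psycopg2.connect(
    host="your-postgres-host",
    dbname="your_db",
    user="your_user",
    password="your_password"
)

# pandas prefers a SQLAlchemy engine; a raw psycopg2 connection also works but may emit a warning
logs_df = pd.read_sql_query(
    "SELECT * FROM databricks_job_logs ORDER BY inserted_at DESC LIMIT 100;",
    conn
)
conn.close()
print(logs_df.head())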
🧩 Q5: How to Get Databricks Job Logs Using REST API and Load into a DataFrame?
Step 1 — Call the REST API:
import requests, pandas as pd
url = "https://adb-<workspace-id>.<random-number>.azuredatabricks.net/api/2.1/jobs/runs/list"  # your workspace URL + Jobs API endpoint
headers = {"Authorization": f"Bearer {TOKEN}"}
response = requests.get(url, headers=headers)
data = response.json()
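Optionally, here is a slightly more defensive version of the same call (a sketch: limit and job_id are optional query parameters of the runs/list endpoint, and raise_for_status stops early on HTTP errors):
import requests

params = {"limit": 25}  # add "job_id": <your-job-id> to filter runs of a single job
response = requests.get(url, headers=headers, params=params, timeout=30)
response.raise_for_status()   # fail fast on 4xx/5xx instead of parsing an error payload
data = response.json()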
Step 2 — Convert to DataFrame:
df = pd.json_normalize(data["runs"])
df["start_time"] = pd.to_datetime(df["start_time"], unit="ms")
df["end_time"] = pd.to_datetime(df["end_time"], unit="ms")
df["duration_sec"] = (df["end_time"] - df["start_time"]).dt.total_seconds()
Step 3 — Convert to PySpark DataFrame (Optional):
spark_df = spark.createDataFrame(df)
spark_df.show()
Now you can analyze the job logs or write them straight into your PostgreSQL table, as sketched below.
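For example, to append these runs to the databricks_job_logs table from Q4, one option is Spark's JDBC writer. This is a sketch: it assumes the PostgreSQL JDBC driver is available on the cluster, reuses the placeholder connection details from earlier, and the selected column names depend on what the runs/list response actually contains.
from pyspark.sql.functions import col

# Keep only columns that exist in the target table; the remaining table columns are nullable
logs_to_write = spark_df.select(
    "job_id",
    "run_id",
    "start_time",
    "end_time",
    col("duration_sec").cast("int").alias("duration_sec"),
)

(logs_to_write.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://your-postgres-host:5432/your_db")
    .option("dbtable", "databricks_job_logs")
    .option("user", "your_user")
    .option("password", "your_password")
    .option("driver", "org.postgresql.Driver")
    .mode("append")
    .save())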
🧩 Q6: How to Access Databricks REST API Without Hardcoding Token?
There are three secure options:
| Method | Description | Best for |
|---|---|---|
| 🔸 Databricks Secret Scope | Store token securely using dbutils.secrets | Databricks notebooks |
| 🔸 Azure Managed Identity / AWS IAM Role | Automatically obtain tokens | Cloud-native jobs |
| 🔸 Environment Variable | Store token as OS variable | Local / CI/CD |
Example using Secret Scope:
TOKEN = dbutils.secrets.get(scope="dbx-api", key="token")
headers = {"Authorization": f"Bearer {TOKEN}"}
response = requests.get(f"{DATABRICKS_INSTANCE}/api/2.1/jobs/runs/list", headers=headers)
✅ No token exposure in notebooks or logs.
✅ Recommended for production Databricks jobs.
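For the environment-variable option from the table above, a minimal sketch (assuming the token was exported as DATABRICKS_TOKEN before the process started):
import os
import requests

# Read the token from the environment instead of hardcoding it
TOKEN = os.environ["DATABRICKS_TOKEN"]   # raises KeyError if the variable is not set
headers = {"Authorization": f"Bearer {TOKEN}"}
response = requests.get(f"{DATABRICKS_INSTANCE}/api/2.1/jobs/runs/list", headers=headers)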
🏁 Conclusion
In this post, we covered:
- How to extract HTTPS links using regex.
- How DLT’s file and byte triggers behave.
- What happens when file counts exceed limits.
- How to log Databricks job data into PostgreSQL.
- How to securely access Databricks REST APIs and convert JSON to DataFrames.
These techniques are key to building secure, reliable, and auditable data pipelines in Databricks environments.