How to Extract Databricks Job Logs Using the REST API, Convert Them to a DataFrame, and Store Them in PostgreSQL. Here are the details.
🧩 Q1: How to Extract HTTPS Website Links Using Python's re Module?
You can easily find all HTTPS links in a text using Python’s re module.
import re
text = """
Visit https://www.google.com and https://openai.com/research for details.
"""
pattern = r"""https?://[^\s'"]+"""
https_links = re.findall(pattern, text)
print(https_links)
Output:
['https://www.google.com', 'https://openai.com/research']
🔍 Regex Breakdown:
| Part | Meaning |
|---|---|
| r | Raw string prefix; prevents escape issues |
| https? | Matches http or https |
| :// | Matches the literal :// |
| [^\s'"]+ | Matches one or more characters that are not whitespace or quotes |
This pattern cleanly extracts both HTTP and HTTPS URLs from raw text.
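If you want strictly HTTPS links (dropping any plain http:// matches) without duplicates, here is a small follow-up sketch reusing https_links from above:
# Keep only https:// matches and de-duplicate while preserving order
https_only = [link for link in https_links if link.startswith("https://")]
unique_https = list(dict.fromkeys(https_only))
print(unique_https)  # ['https://www.google.com', 'https://openai.com/research']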
🧩 Q2: What Happens in Delta Live Tables (DLT) When maxFilesPerTrigger and maxBytesPerTrigger Are Set?
In Databricks Delta Live Tables (DLT), these options control how much data is processed per micro-batch.
| Option | Meaning |
|---|---|
| maxFilesPerTrigger | Maximum number of new files processed in each micro-batch |
| maxBytesPerTrigger | Maximum total size (in bytes) of new files processed in each micro-batch |
When both are set, whichever limit is reached first takes effect, i.e. the more restrictive of the two wins.
Example:
import dlt

@dlt.table
def bronze_data():
    return (
        spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            .option("maxFilesPerTrigger", 100)
            .option("maxBytesPerTrigger", 104857600)  # 100 MB
            .load("s3://my-bucket/raw/")
    )
This configuration ensures stable and incremental ingestion without overloading your Databricks cluster.
🧩 Q3: If maxFilesPerTrigger Is 10,000 and You Have 10,010 Files — What Happens?
If you set:
.option("maxFilesPerTrigger", 10000)
and 10,010 new files are found, Databricks will:
- Process 10,000 files in the current micro-batch.
- Record their checkpoints.
- Automatically process the remaining 10 files in the next trigger.
✅ No files are skipped or lost — DLT ensures checkpoint-based continuity.
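As a quick sanity check on how the backlog splits, here is a back-of-the-envelope sketch (plain Python arithmetic, not DLT API code):
backlog_files = 10_010
max_files_per_trigger = 10_000

first_batch = min(backlog_files, max_files_per_trigger)   # 10,000 files in the first trigger
second_batch = backlog_files - first_batch                # 10 files carried to the next trigger
print(first_batch, second_batch)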
🧩 Q4: How to Write Databricks Job Logs to PostgreSQL (pgAdmin)?
You can store Databricks job logs into PostgreSQL for monitoring and reporting.
Step 1 — Create a Table in PostgreSQL:
CREATE TABLE databricks_job_logs (
    id           SERIAL PRIMARY KEY,
    job_id       BIGINT,
    run_id       BIGINT,
    job_name     TEXT,
    status       TEXT,
    start_time   TIMESTAMP,
    end_time     TIMESTAMP,
    duration_sec INT,
    message      TEXT,
    inserted_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Step 2 — Write Log Data from Databricks:
import psycopg2
from datetime import datetime

conn = psycopg2.connect(
    host="your-postgres-host",
    dbname="your_db",
    user="your_user",
    password="your_password"
)
cur = conn.cursor()
cur.execute("""
    INSERT INTO databricks_job_logs
        (job_id, run_id, job_name, status, start_time, end_time, duration_sec, message)
    VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
""", (123, 456, "daily_pipeline", "SUCCESS",
      datetime(2025, 10, 4, 9, 0, 0),
      datetime.now(),
      120, "Job completed successfully."))
conn.commit()
cur.close()
conn.close()
✅ You can query logs directly in pgAdmin:
SELECT * FROM databricks_job_logs ORDER BY inserted_at DESC;
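For ad-hoc analysis outside pgAdmin, the same table can also be pulled into pandas. A minimal sketch, assuming the same connection details as in Step 2:
import pandas as pd
import psycopg2

conn = psycopg2.connect(
    host="your-postgres-host",
    dbname="your_db",
    user="your_user",
    password="your_password"
)

# pandas prefers a SQLAlchemy engine; a raw psycopg2 connection also works but may emit a warning
logs_df = pd.read_sql_query(
    "SELECT * FROM databricks_job_logs ORDER BY inserted_at DESC LIMIT 100;",
    conn
)
conn.close()
print(logs_df.head())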
🧩 Q5: How to Get Databricks Job Logs Using REST API and Load into a DataFrame?
Step 1 — Call the REST API:
import requests, pandas as pd
url = "https://adb-<workspace-id>.<random-number>.azuredatabricks.net/api/2.1/jobs/runs/list"  # your workspace URL + Jobs API endpoint
headers = {"Authorization": f"Bearer {TOKEN}"}
response = requests.get(url, headers=headers)
data = response.json()
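Optionally, here is a slightly more defensive version of the same call (a sketch: limit and job_id are optional query parameters of the runs/list endpoint, and raise_for_status stops early on HTTP errors):
import requests

params = {"limit": 25}  # add "job_id": <your-job-id> to filter runs of a single job
response = requests.get(url, headers=headers, params=params, timeout=30)
response.raise_for_status()   # fail fast on 4xx/5xx instead of parsing an error payload
data = response.json()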
Step 2 — Convert to DataFrame:
df = pd.json_normalize(data["runs"])
df["start_time"] = pd.to_datetime(df["start_time"], unit="ms")
df["end_time"] = pd.to_datetime(df["end_time"], unit="ms")
df["duration_sec"] = (df["end_time"] - df["start_time"]).dt.total_seconds()
Step 3 — Convert to PySpark DataFrame (Optional):
spark_df = spark.createDataFrame(df)
spark_df.show()
Now you can analyze the job logs or write them straight into your PostgreSQL table, as sketched below.
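For example, to append these runs to the databricks_job_logs table from Q4, one option is Spark's JDBC writer. This is a sketch: it assumes the PostgreSQL JDBC driver is available on the cluster, reuses the placeholder connection details from earlier, and the selected column names depend on what the runs/list response actually contains.
from pyspark.sql.functions import col

# Keep only columns that exist in the target table; the remaining table columns are nullable
logs_to_write = spark_df.select(
    "job_id",
    "run_id",
    "start_time",
    "end_time",
    col("duration_sec").cast("int").alias("duration_sec"),
)

(logs_to_write.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://your-postgres-host:5432/your_db")
    .option("dbtable", "databricks_job_logs")
    .option("user", "your_user")
    .option("password", "your_password")
    .option("driver", "org.postgresql.Driver")
    .mode("append")
    .save())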
🧩 Q6: How to Access Databricks REST API Without Hardcoding Token?
There are three secure options:
| Method | Description | Best for |
|---|---|---|
| 🔸 Databricks Secret Scope | Store token securely using dbutils.secrets | Databricks notebooks |
| 🔸 Azure Managed Identity / AWS IAM Role | Automatically obtain tokens | Cloud-native jobs |
| 🔸 Environment Variable | Store token as OS variable | Local / CI/CD |
Example using Secret Scope:
TOKEN = dbutils.secrets.get(scope="dbx-api", key="token")
headers = {"Authorization": f"Bearer {TOKEN}"}
response = requests.get(f"{DATABRICKS_INSTANCE}/api/2.1/jobs/runs/list", headers=headers)
✅ No token exposure in notebooks or logs.
✅ Recommended for production Databricks jobs.
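For the environment-variable option from the table above, a minimal sketch (assuming the token was exported as DATABRICKS_TOKEN before the process started):
import os
import requests

# Read the token from the environment instead of hardcoding it
TOKEN = os.environ["DATABRICKS_TOKEN"]   # raises KeyError if the variable is not set
headers = {"Authorization": f"Bearer {TOKEN}"}
response = requests.get(f"{DATABRICKS_INSTANCE}/api/2.1/jobs/runs/list", headers=headers)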
🏁 Conclusion
In this post, we covered:
- How to extract HTTPS links using regex.
- How DLT’s file and byte triggers behave.
- What happens when file counts exceed limits.
- How to log Databricks job data into PostgreSQL.
- How to securely access Databricks REST APIs and convert JSON to DataFrames.
These techniques are key to building secure, reliable, and auditable data pipelines in Databricks environments.