How to Extract Databricks Job Logs Using the REST API, Convert Them to a DataFrame, and Store Them in PostgreSQL. Here are the details.

🧩 Q1: How to Extract HTTPS Website Links Using Python's re Module?

You can easily find all HTTPS (and HTTP) links in a block of text using Python's re module.

import re

text = """
Visit https://www.google.com and https://openai.com/research for details.
"""

pattern = r"https?://[^\s'\"]+"
https_links = re.findall(pattern, text)
print(https_links)

Output:

['https://www.google.com', 'https://openai.com/research']

🔍 Regex Breakdown:

Part | Meaning
r | Raw string prefix; prevents backslash escape issues
https? | Matches http or https
:// | Matches the literal ://
[^\s'"]+ | Matches one or more characters that are not whitespace or quotes

This pattern cleanly extracts both HTTP and HTTPS URLs from raw text.
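
One practical refinement (not part of the original example): URLs embedded in prose often pick up trailing punctuation, so you may want to strip it and drop duplicates. A minimal sketch, reusing the same pattern:

import re

text = "See https://openai.com/research, and also https://www.google.com."
pattern = r"https?://[^\s'\"]+"

links = []
for match in re.findall(pattern, text):
    cleaned = match.rstrip(".,);")          # strip common trailing punctuation
    if cleaned not in links:                # keep order, drop duplicates
        links.append(cleaned)

print(links)  # ['https://openai.com/research', 'https://www.google.com']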

🧩 Q2: What Happens in Delta Live Tables (DLT) When maxFilesPerTrigger and maxBytesPerTrigger Are Set?

In Databricks Delta Live Tables (DLT), these options control how much data each streaming micro-batch ingests. With an Auto Loader (cloudFiles) source, they are set with the cloudFiles. prefix, i.e. cloudFiles.maxFilesPerTrigger and cloudFiles.maxBytesPerTrigger.

Option | Meaning
maxFilesPerTrigger | Maximum number of new files processed in each micro-batch
maxBytesPerTrigger | Maximum total size (in bytes) of new files processed in each micro-batch

When both are set, a micro-batch stops adding files as soon as either limit is reached, so the more restrictive of the two takes effect.

Example:

import dlt

@dlt.table
def bronze_data():
    return (
        spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "json")
            # Auto Loader rate limits use the cloudFiles. prefix
            .option("cloudFiles.maxFilesPerTrigger", 100)
            .option("cloudFiles.maxBytesPerTrigger", 104857600)  # 100 MB
            .load("s3://my-bucket/raw/")
    )

This configuration ensures stable and incremental ingestion without overloading your Databricks cluster.

🧩 Q3: If maxFilesPerTrigger Is 10,000 and You Have 10,010 Files — What Happens?

If you set:

.option("maxFilesPerTrigger", 10000)

and 10,010 new files are found, Databricks will:

  1. Process 10,000 files in the current micro-batch.
  2. Record those files in the stream's checkpoint.
  3. Automatically pick up the remaining 10 files in the next trigger.

No files are skipped or lost — DLT ensures checkpoint-based continuity.
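
As a quick sanity check on the numbers, here is a trivial sketch (plain Python, just to illustrate the split, not DLT code):

import math

total_files = 10_010
max_files_per_trigger = 10_000

# The backlog drains over ceil(10_010 / 10_000) = 2 micro-batches:
# 10,000 files in the first trigger, the remaining 10 in the next one.
print(math.ceil(total_files / max_files_per_trigger))  # 2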

🧩 Q4: How to Write Databricks Job Logs to PostgreSQL (pgAdmin)?

You can store Databricks job logs into PostgreSQL for monitoring and reporting.

Step 1 — Create a Table in PostgreSQL:

CREATE TABLE databricks_job_logs (
    id SERIAL PRIMARY KEY,
    job_id BIGINT,
    run_id BIGINT,
    job_name TEXT,
    status TEXT,
    start_time TIMESTAMP,
    end_time TIMESTAMP,
    duration_sec INT,
    message TEXT,
    inserted_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Step 2 — Write Log Data from Databricks:

import psycopg2
from datetime import datetime

conn = psycopg2.connect(
    host="your-postgres-host",
    dbname="your_db",
    user="your_user",
    password="your_password"
)

cur = conn.cursor()

cur.execute("""
    INSERT INTO databricks_job_logs
    (job_id, run_id, job_name, status, start_time, end_time, duration_sec, message)
    VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
""", (123, 456, "daily_pipeline", "SUCCESS",
      datetime(2025, 10, 4, 9, 0, 0),
      datetime.now(),
      120, "Job completed successfully."))

conn.commit()
cur.close()
conn.close()

✅ You can query logs directly in pgAdmin:

SELECT * FROM databricks_job_logs ORDER BY inserted_at DESC;
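
Beyond ad hoc queries in pgAdmin, you can pull the same table back into pandas for reporting. A minimal sketch, reusing the placeholder connection details from Step 2:

import pandas as pd
import psycopg2

conn = psycopg2.connect(
    host="your-postgres-host",
    dbname="your_db",
    user="your_user",
    password="your_password"
)

# Load the most recent log rows into a DataFrame for quick analysis
logs_df = pd.read_sql(
    "SELECT * FROM databricks_job_logs ORDER BY inserted_at DESC LIMIT 100",
    conn
)
conn.close()

print(logs_df[["job_name", "status", "duration_sec"]].head())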

🧩 Q5: How to Get Databricks Job Logs Using REST API and Load into a DataFrame?

Step 1 — Call the REST API:

import requests
import pandas as pd

# DATABRICKS_INSTANCE is your workspace URL, e.g. https://adb-<workspace-id>.<n>.azuredatabricks.net,
# and TOKEN is a personal access token (see Q6 for reading it without hardcoding)
url = f"{DATABRICKS_INSTANCE}/api/2.1/jobs/runs/list"
headers = {"Authorization": f"Bearer {TOKEN}"}
response = requests.get(url, headers=headers)
response.raise_for_status()  # fail fast on authentication or URL errors
data = response.json()

Step 2 — Convert to DataFrame:

df = pd.json_normalize(data["runs"])

# The Jobs API returns start_time and end_time as Unix epoch milliseconds
df["start_time"] = pd.to_datetime(df["start_time"], unit="ms")
df["end_time"] = pd.to_datetime(df["end_time"], unit="ms")
df["duration_sec"] = (df["end_time"] - df["start_time"]).dt.total_seconds()

Step 3 — Convert to PySpark DataFrame (Optional):

spark_df = spark.createDataFrame(df)
spark_df.show()

Now you can easily analyze or write the job logs into your PostgreSQL table.
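
For example, here is a rough sketch of appending the Spark DataFrame to the databricks_job_logs table from Q4 over JDBC. It assumes the PostgreSQL JDBC driver is available on the cluster, the connection details are placeholders, and only columns that exist in both the DataFrame and the table are written:

from pyspark.sql import functions as F

jdbc_url = "jdbc:postgresql://your-postgres-host:5432/your_db"  # placeholder

(spark_df
    # keep only columns that match the databricks_job_logs table
    .select("job_id", "run_id", "start_time", "end_time",
            F.col("duration_sec").cast("int").alias("duration_sec"))
    .write
    .format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "databricks_job_logs")
    .option("user", "your_user")
    .option("password", "your_password")
    .option("driver", "org.postgresql.Driver")
    .mode("append")
    .save())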

🧩 Q6: How to Access Databricks REST API Without Hardcoding Token?

There are three secure options:

Method | Description | Best for
🔸 Databricks Secret Scope | Store the token securely and read it with dbutils.secrets | Databricks notebooks
🔸 Azure Managed Identity / AWS IAM Role | Obtain tokens automatically from the cloud provider | Cloud-native jobs
🔸 Environment Variable | Store the token as an OS environment variable | Local / CI/CD

Example using Secret Scope:

TOKEN = dbutils.secrets.get(scope="dbx-api", key="token")
headers = {"Authorization": f"Bearer {TOKEN}"}
response = requests.get(f"{DATABRICKS_INSTANCE}/api/2.1/jobs/runs/list", headers=headers)

✅ No token exposure in notebooks or logs.
✅ Recommended for production Databricks jobs.
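
For the environment-variable option from the table above, a minimal sketch (the variable names DATABRICKS_HOST and DATABRICKS_TOKEN are assumptions; use whatever names your CI/CD secret store provides):

import os
import requests

DATABRICKS_INSTANCE = os.environ["DATABRICKS_HOST"]   # e.g. https://adb-<workspace-id>.<n>.azuredatabricks.net
TOKEN = os.environ["DATABRICKS_TOKEN"]                # never hardcoded in the script

headers = {"Authorization": f"Bearer {TOKEN}"}
response = requests.get(f"{DATABRICKS_INSTANCE}/api/2.1/jobs/runs/list", headers=headers)
response.raise_for_status()
print(response.json().get("runs", []))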

🏁 Conclusion

In this post, we covered:

  • How to extract HTTPS links using regex.
  • How DLT’s file and byte triggers behave.
  • What happens when file counts exceed limits.
  • How to log Databricks job data into PostgreSQL.
  • How to securely access Databricks REST APIs and convert JSON to DataFrames.

These techniques are key to building secure, reliable, and auditable data pipelines in Databricks environments.