PySpark DBUtils: How to Use Each Command Effectively

Databricks Utilities (DBUtils) is a set of helper modules that Databricks notebooks expose through the dbutils object. From PySpark code you can use it to work with files in DBFS, manage notebook-scoped libraries, read secrets, create notebook widgets, pass values between job tasks, and run one notebook from another.

Table of contents

  1. PySpark DBUtils common commands
    1. dbutils.fs
    2. dbutils.library
    3. dbutils.widgets
    4. dbutils.secrets
    5. dbutils.cluster
    6. dbutils.jobs
    7. dbutils.notebook
    8. dbutils.widgets

PySpark DBUtils common commands

dbutils.fs

This module allows you to interact with the Databricks File System (DBFS). Common commands include:

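  • dbutils.fs.ls(path): List the files and directories directly under a path.
  • dbutils.fs.cp(src, dst, recurse=False): Copy a file from source to destination; pass recurse=True to copy a directory tree.
  • dbutils.fs.rm(path, recurse=False): Remove a file or an empty directory; pass recurse=True to delete a directory and everything under it.
  • dbutils.fs.mkdirs(path): Create a directory, along with any missing parent directories.
  • dbutils.fs.mount(source, mount_point): Mount external cloud storage at a mount point in DBFS.
  • dbutils.fs.unmount(mount_point): Remove a mount point from DBFS.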
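
The file commands above combine naturally into small helpers. The following is a minimal sketch, not an official recipe: the helper names total_size_bytes and backup_file are hypothetical, and dbutils is passed in as an argument (in a notebook it is already a global) so the functions can also be tried with a stand-in object outside Databricks.

```python
def total_size_bytes(dbutils, path):
    """Add up the sizes of all files under `path`, descending into subdirectories."""
    total = 0
    for info in dbutils.fs.ls(path):
        if info.isDir():
            # ls() is not recursive, so walk each subdirectory explicitly.
            total += total_size_bytes(dbutils, info.path)
        else:
            total += info.size
    return total


def backup_file(dbutils, src, backup_dir):
    """Copy `src` into `backup_dir`, creating the directory first, and return the new path."""
    dbutils.fs.mkdirs(backup_dir)
    file_name = src.rstrip("/").rsplit("/", 1)[-1]
    dst = backup_dir.rstrip("/") + "/" + file_name
    dbutils.fs.cp(src, dst)
    return dst
```

In a notebook you would call these as total_size_bytes(dbutils, "/some/dir/") or backup_file(dbutils, src, backup_dir), with paths of your own.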

dbutils.library

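This module manages notebook-scoped libraries. Most of it is deprecated: install, installPyPI, and list were removed in Databricks Runtime 11.0 and above, where the %pip magic command is the recommended way to install packages.

  • dbutils.library.installPyPI(pypi_package, version, repo, extras): Install a PyPI package for the current notebook session (older runtimes only).
  • dbutils.library.install(path): Install a library from a file path (older runtimes only).
  • dbutils.library.list(): List the notebook-scoped libraries installed so far (older runtimes only).
  • dbutils.library.restartPython(): Restart the Python process so that newly installed or upgraded libraries are picked up. Variables and other notebook state are lost on restart.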

dbutils.widgets

This module provides utilities for creating widgets in notebooks.

  • dbutils.widgets.text(name, defaultValue, label): Create a free-text input widget.
  • dbutils.widgets.dropdown(name, defaultValue, choices, label): Create a dropdown widget; defaultValue must be one of choices.
  • dbutils.widgets.get(name): Get the current value of a widget, always as a string.
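
A typical use of these three commands is to collect run parameters at the top of a notebook. The sketch below is illustrative only: the function name read_run_parameters, the widget names, defaults, and choices are all hypothetical, and dbutils is passed as an argument so the function can be exercised with a stand-in object.

```python
def read_run_parameters(dbutils):
    """Create two input widgets and return their current values."""
    # Hypothetical widget names, defaults, and choices, chosen for illustration.
    dbutils.widgets.text("table_name", "events", "Table name")
    dbutils.widgets.dropdown("env", "dev", ["dev", "staging", "prod"], "Environment")
    # Widget values always come back as strings.
    return {
        "table_name": dbutils.widgets.get("table_name"),
        "env": dbutils.widgets.get("env"),
    }
```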

dbutils.secrets

This module provides utilities for accessing secrets stored in Databricks.

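  • dbutils.secrets.get(scope, key): Return the value of a secret as a string. Databricks redacts the value if you try to display it in notebook output.
  • dbutils.secrets.getBytes(scope, key): Return the value of a secret as bytes.
  • dbutils.secrets.listScopes(): List all secret scopes.
  • dbutils.secrets.list(scope): List the metadata (the keys, not the values) of the secrets in a scope.
  • dbutils.secrets.help(): Show the help for the secrets utilities.

DBUtils can only read secrets. There is no dbutils.secrets.set; secrets are created and updated with the Databricks CLI or the Secrets REST API.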
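
Reading credentials from a secret scope keeps them out of notebook source. The following is a minimal sketch under stated assumptions: the helper names and the default key names jdbc-user and jdbc-password are hypothetical, and dbutils is passed as an argument so the code can be tried with a stand-in object.

```python
def secret_exists(dbutils, scope, key):
    """Return True if `key` is present in `scope`, without reading its value."""
    return any(meta.key == key for meta in dbutils.secrets.list(scope))


def jdbc_properties(dbutils, scope, user_key="jdbc-user", password_key="jdbc-password"):
    """Build a JDBC properties dict from two secrets stored in the same scope."""
    # The key names above are hypothetical defaults; use the keys in your own scope.
    return {
        "user": dbutils.secrets.get(scope=scope, key=user_key),
        "password": dbutils.secrets.get(scope=scope, key=password_key),
    }
```

The resulting dictionary can be passed as the properties argument of spark.read.jdbc.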

dbutils.cluster

DBUtils does not include a cluster module: dbutils.cluster does not exist, so clusters cannot be listed, restarted, resized, or terminated from dbutils. Cluster management is done through the Clusters REST API, the Databricks CLI, or the Databricks SDK. The corresponding REST operations are:

  • clusters/list: List all clusters.
  • clusters/restart (cluster_id): Restart a cluster.
  • clusters/resize (cluster_id, num_workers): Change the number of workers in a cluster.
  • clusters/delete (cluster_id): Terminate a cluster, stopping its Spark jobs and releasing its resources.
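
Cluster operations are requests to the Clusters REST API rather than dbutils calls. As a rough sketch, the function below builds (but does not send) a resize request using only the standard library. The function name build_resize_request is hypothetical, and host and token stand for your workspace URL and a personal access token, which are assumptions here.

```python
import json
import urllib.request


def build_resize_request(host, token, cluster_id, num_workers):
    """Build (but do not send) a Clusters API request that resizes a cluster."""
    body = json.dumps({"cluster_id": cluster_id, "num_workers": num_workers})
    return urllib.request.Request(
        url=host.rstrip("/") + "/api/2.0/clusters/resize",
        data=body.encode("utf-8"),
        headers={
            "Authorization": "Bearer " + token,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it. Keeping construction separate from sending makes the request easy to inspect first.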

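dbutils.jobs

This module exposes the taskValues sub-utility, which passes small values between the tasks of a job run. It does not start or submit jobs: run_now and submit are operations of the Jobs REST API (jobs/run-now and jobs/runs/submit), not dbutils methods.

  • dbutils.jobs.taskValues.set(key, value): Store a value under a key for the current task so that later tasks in the same job run can read it.
  • dbutils.jobs.taskValues.get(taskKey, key, default, debugValue): Read a value set by the task named taskKey. default is returned when the key is missing; debugValue is returned when the notebook runs outside a job.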
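
Inside a multi-task job, dbutils.jobs.taskValues lets one task hand a small value to the next. The sketch below is illustrative only: the function names, the key row_count, and the task name ingest are hypothetical, and dbutils is passed as an argument so the code can be tried with a stand-in object.

```python
def publish_row_count(dbutils, row_count):
    """Upstream task: store a small value for later tasks in the same job run."""
    dbutils.jobs.taskValues.set(key="row_count", value=row_count)


def read_row_count(dbutils, upstream_task="ingest"):
    """Downstream task: read the value published by the upstream task."""
    return dbutils.jobs.taskValues.get(
        taskKey=upstream_task,  # hypothetical task name; use the one in your job
        key="row_count",
        default=0,      # returned when the upstream task never set the key
        debugValue=0,   # returned when the notebook runs interactively, outside a job
    )
```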

dbutils.notebook

This module chains notebooks together. It has two commands, plus help:

  • dbutils.notebook.run(path, timeout_seconds, arguments): Run another notebook from the current one and return the string that the called notebook passes to exit. arguments is a dictionary that sets the called notebook's widget values.
  • dbutils.notebook.exit(value): Stop the current notebook and return value, a string, to the caller.
  • dbutils.notebook.help(): Show the help for the notebook utilities.

Other names that are sometimes listed under this module (runNotebook, list, rename, export, import, pin, runAll, exitAll, drop, suspend, resume, save, and clear) are not part of dbutils.notebook, and getContext is not a documented command either. Listing, importing, exporting, and deleting notebooks is done through the Workspace REST API or the Databricks CLI.
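
The two commands that do exist in dbutils.notebook, run and exit, are enough to build a simple parent and child workflow. The following is a minimal sketch, not a definitive pattern: the helper names and the JSON result shape are hypothetical, and dbutils is passed as an argument so the code can be tried with a stand-in object.

```python
import json


def run_with_retry(dbutils, path, timeout_seconds, arguments, max_attempts=3):
    """Run a child notebook, retrying on failure, and return its exit value."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return dbutils.notebook.run(path, timeout_seconds, arguments)
        except Exception as error:  # run() raises if the child fails or times out
            last_error = error
    raise RuntimeError(f"{path} failed after {max_attempts} attempts") from last_error


def exit_with_result(dbutils, status, rows_written):
    """Child notebook: return a structured result to the caller as a JSON string."""
    dbutils.notebook.exit(json.dumps({"status": status, "rows_written": rows_written}))
```

Because exit only returns a string, serializing to JSON in the child and calling json.loads on the value returned by run is a common way to pass structured results back.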

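dbutils.widgets

Beyond text, dropdown, and get, the widgets module has a few more commands:

  • dbutils.widgets.combobox(name, defaultValue, choices, label): Create a widget that accepts either a value chosen from a list or free text.
  • dbutils.widgets.multiselect(name, defaultValue, choices, label): Create a widget that lets users select several options from a list; get returns the selection as one comma-separated string.
  • dbutils.widgets.remove(name): Remove a widget from the notebook.
  • dbutils.widgets.removeAll(): Remove all widgets from the notebook.
  • dbutils.widgets.help(): Show the help for the widget utilities.

There is no password widget. Text widgets show their value in plain view, so keep credentials in dbutils.secrets rather than in a widget.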
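
A multiselect widget returns its selection as a single comma-separated string, so it usually needs splitting before use. The sketch below is illustrative only: the function name, the widget name regions, and its choices are hypothetical, and dbutils is passed as an argument so the code can be tried with a stand-in object.

```python
def selected_regions(dbutils):
    """Create a multiselect widget and return the chosen options as a list."""
    # Hypothetical widget name and choices, chosen for illustration.
    dbutils.widgets.multiselect("regions", "emea", ["emea", "amer", "apac"], "Regions")
    raw = dbutils.widgets.get("regions")  # a string such as "emea,apac"
    # Drop empty pieces so that an empty selection yields an empty list.
    return [value for value in raw.split(",") if value]
```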

These commands cover file access, libraries, widgets, secrets, job task values, and notebook chaining in Databricks. To see exactly which commands your runtime supports, run dbutils.help() in a notebook, or call help() on an individual module such as dbutils.fs.help().

Author: Srini

Experienced Data Engineer, having skills in PySpark, Databricks, Python SQL, AWS, Linux, and Mainframe