
Development Environment

This guide describes the setup of the development environment for nessy.

Local

Locally, the framework should be developed within the devcontainer that is defined inside the repository. This ensures that all developers use the same setup and prevents the CI pipeline from failing due to missing pre-commit checks.

The recommended way to run the devcontainer is from within WSL.

Also check the CLOE Docs on devcontainers!

This can be done by following these steps:

  1. Install a WSL distribution (unless already done): wsl --install Ubuntu
  2. Clone the repository: git clone https://<PLACE_YOUR_PAT>@dev.azure.com/initions-consulting/CLOE/_git/cloe-nessy-py¹

    Cloning the repository might also work with the Git Credential Manager, in which case a PAT is not needed.

  3. Open the repository from within WSL in VSCode: cd path/to/repository && code . (the first time you do this, WSL automatically installs the VSCode server in WSL)
  4. In VSCode, install the ms-vscode-remote.remote-containers extension (unless already done)
  5. Reopen the repository from the extension: [cmd] + [shift] + [p], then Reopen in Container (the first time you do this, you might be asked whether Docker should be installed in WSL)

The devcontainer defines an optimal working environment for Python projects. This includes pre-configured uv, pre-commit, and Python configurations. Of course, you are welcome to change the configuration if anything is missing or not optimal!
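For orientation, a devcontainer definition for such a setup could look roughly as follows. This is a hypothetical sketch, not the repository's actual .devcontainer/devcontainer.json; the image, commands, and extension list are placeholders:

```json
{
    // Hypothetical sketch of a devcontainer.json for a Python/uv setup;
    // not the repository's actual configuration.
    "name": "nessy-dev",
    "image": "mcr.microsoft.com/devcontainers/python:3.11",
    "postCreateCommand": "pip install uv && uv sync && pre-commit install",
    "customizations": {
        "vscode": {
            "extensions": ["ms-python.python", "databricks.databricks"]
        }
    }
}
```

The actual file in the repository is authoritative; the sketch only illustrates where tool setup (uv, pre-commit) and editor extensions are typically wired in.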

Databricks Connect

Connect to Databricks from VSCode

It is possible to connect VSCode to a Databricks workspace, upload code to the workspace, and execute it on a cluster. To simplify this connection, Databricks Connect offers a Python library and a VSCode extension. The following steps are necessary to establish a connection to a DBX workspace.

Prerequisites

  • The workspace to connect to must be Unity Catalog enabled.
  • The DBX Connect Python library and VSCode extension should be installed automatically with the nessy devcontainer.

Authentication

We recommend using OAuth to authenticate to Databricks, since it is the most convenient option. This can be done with the following steps:

  1. Create a Databricks configuration profile file.

    Within the devcontainer, the profile is created automatically.

    This can be done via the Databricks VSCode extension: click on Configure Databricks and then on Open Databricks Config File. Alternatively, you can manually create a file .databrickscfg under the path /home/vscode/ in the devcontainer.

  2. For OAuth, the file must include a name for the environment, the Databricks host, and a cluster ID. The cluster ID can also be assigned manually through the DBX Connect extension. Note that when running commands directly in the shell, e.g. make test, you also need to set a value for cluster_id in the config file. The file should look like this:

[DEFAULT]
host      = https://adb-XXXXXXXXXXXXXXXXX.azuredatabricks.net/
auth_type = databricks-cli
cluster_id = XXXX-XXXXXX-XXXXXXX

  3. If the profile is different from DEFAULT, make sure to set the DATABRICKS_CONFIG_PROFILE environment variable to the appropriate name.
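The profile file is plain INI, so it can be inspected with Python's standard configparser. A minimal sketch for double-checking what a profile contains; the read_profile helper and the sample values are ours, for illustration only:

```python
# Sketch: parsing a .databrickscfg profile with configparser.
# read_profile and the SAMPLE values are illustrative, not a Databricks API.
import configparser

SAMPLE = """\
[DEFAULT]
host       = https://adb-1234567890123456.azuredatabricks.net/
auth_type  = databricks-cli
cluster_id = 0123-456789-abcdefgh
"""

def read_profile(text: str, profile: str = "DEFAULT") -> dict:
    """Return host, auth_type, and cluster_id for the given profile."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser[profile]  # "DEFAULT" is exposed like a regular section
    return {key: section[key] for key in ("host", "auth_type", "cluster_id")}

print(read_profile(SAMPLE)["cluster_id"])  # prints 0123-456789-abcdefgh
```

To read the real file, pass the contents of /home/vscode/.databrickscfg instead of SAMPLE.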

The file may include more than one environment. The Databricks connect VSCode extension also offers convenience features through the GUI to switch between environments and to choose the right authentication method.

Switch to the terminal and use the Databricks CLI to authenticate to the workspace specified in the configuration profile using OAuth.² This is done with the following command:

databricks auth login

Enter the name of your profile, matching the environment name you specified above (DEFAULT in the example). The CLI will return a link that you should open in a browser to finish authentication. You should see a page confirming that authentication was successful, which you can close afterwards.

Next, click on Configure Databricks again in the VSCode Extension. You should then be able to click on your Profile Name and then on OAuth. You're all set.

If everything worked, you should see a small popup stating that connection to DBX was successfully established. The Databricks connect extension should show the cluster and workspace information. It is also possible to interact with Databricks, e.g. to start a cluster, through the extension.

Uploading and executing code from VSCode to a DBX Cluster

After the connection has been established successfully, it is possible to run code and access Unity Catalog through Databricks. Code can be executed either directly on the cluster, or as a workflow.

When navigating to a python file, the Run Python File "Play" symbol at the top right of the GUI now includes the two options to Upload and Run File on Databricks, and Run File as Workflow on Databricks.
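Besides the GUI buttons, a remote Spark session can presumably also be obtained programmatically through the databricks-connect library, which is useful for scripts and tests. A sketch, assuming databricks-connect is installed and the profile from above is configured; the get_spark helper is our own name, not a Databricks API:

```python
# Sketch: obtaining a remote Spark session via Databricks Connect.
# Assumes the databricks-connect package and a configured profile;
# get_spark is our own helper, for illustration.
try:
    from databricks.connect import DatabricksSession
except ImportError:  # databricks-connect not installed
    DatabricksSession = None

def get_spark(profile: str = "DEFAULT"):
    """Build a Spark session against the cluster from the given profile."""
    if DatabricksSession is None:
        raise RuntimeError("databricks-connect is not installed")
    return DatabricksSession.builder.profile(profile).getOrCreate()

# Example (requires a reachable workspace and a running cluster):
# spark = get_spark()
# spark.sql("SELECT current_catalog()").show()
```

The returned session behaves like a regular PySpark session, but all queries execute on the remote cluster.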

Troubleshooting

  • If the CLI command (databricks) cannot be found in the terminal, you might need to add the path to the Databricks CLI to the PATH variable. Since the CLI is installed together with the VSCode extension, you can find it here: /home/vscode/.vscode-server/extensions/databricks.databricks-<VERSION_NUMBER>/bin/databricks. To add it to your PATH, run
  ls /home/vscode/.vscode-server/extensions/ # to find the correct version number
  # then replace <VERSION_NUMBER> with the correct version number and run:
  export PATH=$PATH:/home/vscode/.vscode-server/extensions/databricks.databricks-<VERSION_NUMBER>/bin/
  • If you run into a "sync error" or similar, this could be because no path is specified to where the files should be uploaded within the connected workspace. This can be set in the project.json file in the .databricks directory. The file should look similar to this:
{
  "host": "https://adb-XXXXXXXXXXXXXXXXXXX.azuredatabricks.net/",
  "authType": "databricks-cli",
  "databricksPath": "/home/vscode/.vscode-server/extensions/databricks.databricks-1.3.0-linux-x64/bin/databricks",
  "clusterId": "XXXX-XXXXXX-XXXXXXX",
  "workspacePath": "/Users/firstname.lastname@initions-consulting.com/.ide/cloe-nessy-py-f5a0aa5e"
}
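As a quick sanity check, the presence of the expected keys can be verified with a few lines of Python. The key list mirrors the example above; the check_project helper is our own, for illustration:

```python
# Sketch: sanity-checking a .databricks/project.json for the keys
# shown above. check_project is our own helper, not a Databricks API.
import json

REQUIRED_KEYS = {"host", "authType", "databricksPath", "clusterId", "workspacePath"}

def check_project(raw: str) -> set:
    """Return the set of required keys missing from the JSON document."""
    config = json.loads(raw)
    return REQUIRED_KEYS - config.keys()

sample = '{"host": "https://adb-1.azuredatabricks.net/", "authType": "databricks-cli"}'
print(sorted(check_project(sample)))  # ['clusterId', 'databricksPath', 'workspacePath']
```

An empty result means all required keys are present; anything else names the keys that still need to be filled in.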

The workspacePath value should be the path within the workspace where code is synchronized to.

Remote

To work with "real" data, a Databricks workspace, a Unity Catalog catalog, and a storage account have been prepared. These can also be used when defining integration tests, e.g. tests that require a Spark session.

If you are lacking permissions, please approach David Achilles.


Development Tools

This section provides an overview of the tools that are used in the development of nessy. The tools are pre-configured in the devcontainer and should be used from there.

Note

The CLOE Documentation explains typing, devcontainers, and general Python conventions to follow; these are therefore not repeated here.

A detailed guide on uv can be found here.

Makefile

The Makefile is used to define common tasks that are executed during the development process, e.g.:

  • make test: Runs the tests.
  • make test_local: Runs the tests that are not marked with databricks or spark.
  • make cov: Runs the tests and reports the coverage.
  • make lint: Lints the code by running mypy and pre-commit.
  • make doc: Builds and serves the documentation.
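make test_local presumably deselects tests by pytest markers. Assuming the project registers databricks and spark markers with pytest, the registration might look like the following fragment; this is a hypothetical sketch, not the project's actual configuration:

```toml
# Hypothetical pyproject.toml fragment: registering the markers that
# `make test_local` would deselect, e.g. via:
#   pytest -m "not databricks and not spark"
[tool.pytest.ini_options]
markers = [
    "databricks: tests that require a Databricks workspace",
    "spark: tests that require a Spark session",
]
```

Registering markers keeps pytest from warning about unknown marks and documents which tests need remote infrastructure.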

  1. To create a Personal Access Token (PAT), you can refer to the Microsoft docs.

  2. See the Troubleshooting section if the CLI cannot be found.