# Development Environment

This guide describes how to set up the development environment for nessy.
## Local

Locally, the framework should be developed within the `.devcontainer` that is defined inside the repository. This ensures that all developers use the same setup and prevents the CI pipeline from failing due to missing pre-commit checks.
The recommended way to run the devcontainer is from within WSL.

Also check the CLOE Docs on devcontainers!

This can be done by following these steps:
- Install a WSL distribution (unless already done): `wsl --install Ubuntu`
- Clone the repository: `git clone https://<PLACE_YOUR_PAT>@dev.azure.com/initions-consulting/CLOE/_git/cloe-nessy-py`¹ (cloning the repository might also work with the Git Credential Manager, in which case a PAT is not needed)
- Open the repository from within WSL in VSCode: `cd path/to/repository && code .`
    - The first time you do this, WSL will automatically install the VSCode server in WSL
- In VSCode, install the `ms-vscode-remote.remote-containers` extension (unless already done)
- Reopen the repository from the extension: [cmd] + [shift] + [p], then `Reopen in Container`
    - The first time you do this, you might be asked whether Docker should be installed in WSL
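Condensed, the steps above look like this in a terminal (the repository directory name is a placeholder, and the commands are interactive, so run them one at a time):

```shell
# From a Windows terminal: install a WSL distribution (once).
wsl --install Ubuntu

# Inside WSL: clone the repository. <PLACE_YOUR_PAT> is a placeholder for
# your Azure DevOps Personal Access Token.
git clone https://<PLACE_YOUR_PAT>@dev.azure.com/initions-consulting/CLOE/_git/cloe-nessy-py

# Open the repository in VSCode; the first run installs the VSCode server in WSL.
cd cloe-nessy-py && code .
```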
The devcontainer defines an optimal working environment for Python projects. This includes pre-configured uv, pre-commit, and Python configurations. Of course, you are welcome to change the configuration if anything is missing or not optimal!
## Databricks Connect

### Connect to Databricks from VSCode

It is possible to connect VSCode to a Databricks workspace, upload code to the workspace, and execute it on a cluster. To simplify this connection, Databricks Connect offers a Python library and a VSCode extension. The following steps are necessary to establish a connection to a DBX workspace.
#### Prerequisites

- The workspace to connect to must be Unity Catalog enabled.
- The DBX Connect Python library and VSCode extension should be installed automatically with the nessy devcontainer.
#### Authentication

We recommend using OAuth to authenticate to Databricks, since it is the most convenient option. This can be done with the following steps:

- Create a Databricks configuration profile file. Within the devcontainer, the profile is created automatically.
    - This can be done by following the Databricks VSCode extension: click on `Configure Databricks`, then `Open Databricks Config File`.
    - When running commands directly in the shell, e.g. `make test`, you also need to set a value for `cluster_id` in the config file.
    - Alternatively, you can manually create a file `.databrickscfg` under the path `/home/vscode/` in the devcontainer.
- For OAuth, the file must include a name for the environment, the Databricks host, and a cluster id. The latter can also be assigned manually through the DBX Connect extension. The file should look like this:
- If the profile is different from `DEFAULT`, make sure to set the `DATABRICKS_CONFIG_PROFILE` environment variable to the appropriate name.
```ini
[DEFAULT]
host       = https://adb-XXXXXXXXXXXXXXXXX.azuredatabricks.net/
auth_type  = databricks-cli
cluster_id = XXXX-XXXXXX-XXXXXXX
```
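For scripted sanity checks, the profile file can be read with Python's stdlib `configparser`. The following is an illustrative sketch; the `read_profile` helper and all values are placeholders, not part of nessy:

```python
# Sketch: validate a .databrickscfg profile with the stdlib configparser.
# The contents below mirror the example above; host and cluster_id are
# placeholders, not real workspace identifiers.
import configparser
import tempfile

CFG = """
[DEFAULT]
host       = https://adb-XXXXXXXXXXXXXXXXX.azuredatabricks.net/
auth_type  = databricks-cli
cluster_id = XXXX-XXXXXX-XXXXXXX
"""


def read_profile(path: str, profile: str = "DEFAULT") -> dict:
    """Return the given profile section as a plain dict, or raise if keys are missing."""
    parser = configparser.ConfigParser()
    parser.read(path)
    # The DEFAULT section is special-cased by configparser.
    section = parser.defaults() if profile == "DEFAULT" else dict(parser[profile])
    missing = {"host", "auth_type", "cluster_id"} - section.keys()
    if missing:
        raise ValueError(f"profile {profile!r} is missing keys: {missing}")
    return dict(section)


# Write the example config to a temporary file and read it back.
with tempfile.NamedTemporaryFile("w", suffix=".databrickscfg", delete=False) as f:
    f.write(CFG)
    cfg_path = f.name

profile = read_profile(cfg_path)
print(profile["auth_type"])  # databricks-cli
```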
The file may include more than one environment. The Databricks Connect VSCode extension also offers convenience features through the GUI to switch between environments and to choose the right authentication method.

Switch to the terminal and use the databricks CLI to authenticate to the workspace specified in the configuration profile using OAuth.² This is done with the following command:
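With recent versions of the Databricks CLI, the OAuth flow is typically started like this (the host is a placeholder, and flags may vary by CLI version):

```shell
# Start the OAuth flow; the CLI prompts for a profile name and returns a
# browser link to finish authentication.
databricks auth login --host https://adb-XXXXXXXXXXXXXXXXX.azuredatabricks.net/
```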
Enter a profile name matching the environment name you specified above (`DEFAULT` in the example). The CLI will return a link that you should open in a browser to finish authentication. You should see a website confirming that authentication was successful, which you can close afterwards.
Next, click on `Configure Databricks` again in the VSCode extension. You should then be able to click on your profile name and then on `OAuth`. You're all set.

If everything worked, you should see a small popup stating that the connection to DBX was established successfully. The Databricks Connect extension should show the cluster and workspace information. It is also possible to interact with Databricks through the extension, e.g. to start a cluster.
### Uploading and executing code from VSCode to a DBX cluster

After the connection has been established successfully, it is possible to run code and access Unity Catalog through Databricks. Code can be executed either directly on the cluster, or as a workflow.

When navigating to a Python file, the `Run Python File` "Play" symbol at the top right of the GUI now includes the two options `Upload and Run File on Databricks` and `Run File as Workflow on Databricks`.
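As a minimal smoke test for either option, a file like the following can be uploaded and run. It assumes the `databricks-connect` package from the devcontainer and an attached cluster; it is a sketch, not part of the nessy codebase:

```python
# Minimal smoke-test file to run on the connected cluster via Databricks Connect.
# Assumes databricks-connect is installed and a cluster is configured in the
# active profile.
from databricks.connect import DatabricksSession

# Builds a Spark session backed by the remote cluster.
spark = DatabricksSession.builder.getOrCreate()

# A trivial query to confirm the session reaches the workspace.
df = spark.range(5)
print(df.count())
```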
### Troubleshooting

- If the CLI command (`databricks`) cannot be found in the terminal, you might need to add the path of the databricks CLI to the `PATH` variable. Since the CLI is installed together with the VSCode extension, you can find it here: `/home/vscode/.vscode-server/extensions/databricks.databricks-<VERSION_NUMBER>/bin/databricks`. To add it to your `PATH`, run

  ```shell
  ls /home/vscode/.vscode-server/extensions/ # to find the correct version number
  # then replace <VERSION_NUMBER> with the correct version number and run:
  export PATH=$PATH:/home/vscode/.vscode-server/extensions/databricks.databricks-<VERSION_NUMBER>/bin/
  ```
- If you run into a "sync error" or similar, this could be because no path is specified to where the files should be uploaded within the connected workspace. This can be set in the `project.json` file in the `.databricks` folder. The file should look similar to this:
```json
{
  "host": "https://adb-XXXXXXXXXXXXXXXXXXX.azuredatabricks.net/",
  "authType": "databricks-cli",
  "databricksPath": "/home/vscode/.vscode-server/extensions/databricks.databricks-1.3.0-linux-x64/bin/databricks",
  "clusterId": "XXXX-XXXXXX-XXXXXXX",
  "workspacePath": "/Users/firstname.lastname@initions-consulting.com/.ide/cloe-nessy-py-f5a0aa5e"
}
```
The `workspacePath` value is the path within the workspace that code should be synchronized to; `databricksPath` points to the local CLI binary used by the extension.
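A quick way to sanity-check this file is with Python's stdlib `json` module. A sketch, with the key names taken from the example above and placeholder values:

```python
# Sketch: check that a project.json contains the keys the sync needs.
import json

# Example project.json contents, mirroring the snippet above (placeholders).
PROJECT_JSON = """
{
  "host": "https://adb-XXXXXXXXXXXXXXXXXXX.azuredatabricks.net/",
  "authType": "databricks-cli",
  "clusterId": "XXXX-XXXXXX-XXXXXXX",
  "workspacePath": "/Users/firstname.lastname@initions-consulting.com/.ide/cloe-nessy-py-f5a0aa5e"
}
"""

# workspacePath in particular must be present, or uploads have no target.
REQUIRED = {"host", "authType", "clusterId", "workspacePath"}

config = json.loads(PROJECT_JSON)
missing = REQUIRED - config.keys()
if missing:
    raise SystemExit(f"project.json is missing keys: {missing}")
print("ok")
```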
## Remote

To work with "real" data, a Databricks workspace, a Unity Catalog catalog, and a storage account have been prepared. These can also be used when defining integration tests, e.g. tests that require a Spark session.

If you are lacking permissions, please approach David Achilles.
Resources:

- Databricks Workspace: `ics-dp-05-nessy-euw-ws`
- Catalog: `nessy_dp05`
- Storage Account: `icsdp05nessysa`
# Development Tools

This section provides an overview of the tools used in the development of nessy. The tools are pre-configured in the devcontainer and should be used from there.
Note

The CLOE documentation explains typing, devcontainers, and general Python conventions to follow; these are therefore not repeated here.

A detailed guide on uv can be found here.
## Makefile

The Makefile is used to define common tasks that are executed during the development process, e.g.:

- `make test`: Runs the tests.
- `make test_local`: Runs the tests that are not marked with `databricks` or `spark`.
- `make cov`: Runs the tests and reports the coverage.
- `make lint`: Lints the code by running `mypy` and `pre-commit`.
- `make doc`: Builds and serves the documentation.
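The targets above map to plain tool invocations. A hypothetical sketch of what two of these recipes might look like (the real definitions live in the repository's Makefile; the `uv run` prefix and the pytest marker names are assumptions):

```makefile
# Hypothetical sketch -- see the repository's Makefile for the real recipes.
# Note that Makefile recipes must be indented with tabs, not spaces.

test_local:  # tests that need neither Databricks nor Spark
	uv run pytest -m "not databricks and not spark"

lint:  # static checks
	uv run mypy .
	uv run pre-commit run --all-files
```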
---

1. To create a Personal Access Token (PAT), you can refer to the Microsoft docs.
2. See troubleshooting if the CLI cannot be found.