# How To: Using a local Spark environment
You can use a local Spark environment with Nessy. Currently this only works with PySpark and the local filesystem; a standalone Spark cluster using the S3 filesystem does not yet work.
## Prerequisites
You first need to have Python 3.11 or 3.12 and Java 11 on your local machine.
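You can quickly check that matching versions are available on your `PATH` (the exact version strings will vary with your installation):

```bash
# Verify the prerequisite runtimes; expect Python 3.11/3.12 and a Java 11 build
python3 --version
java -version
```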
Create a virtual environment, activate it, and install the dependencies, as shown in the sketch below.
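A minimal sketch of the setup, assuming you use the built-in `venv` module and install `cloe_nessy` with the `local-spark` extra from your package index (adapt the package source and extra name to your environment):

```bash
# Create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install Nessy with local Spark support
pip install "cloe_nessy[local-spark]"
```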
**Delta Spark vs Databricks Connect Conflict**

**Important:** Do not install `delta-spark` when using `databricks-connect`: the PySpark versions conflict and will cause dependency resolution issues.

- For local Spark development: use `cloe_nessy[local-spark]`, but uninstall `databricks-connect`.
- For Databricks Connect development: use the default `cloe_nessy` installation (`databricks-connect` is included in the dev dependencies).

Never install both simultaneously; the sketch below shows how to check for and remove a conflicting install.
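A quick way to check for a conflicting install and remove it, assuming a standard `pip`-based environment:

```bash
# List any Databricks Connect / PySpark / Delta packages currently installed
pip list | grep -iE 'databricks-connect|pyspark|delta-spark'

# For local Spark development, remove databricks-connect if it is present
pip uninstall -y databricks-connect
```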
## Running Nessy
To run Nessy with Delta table support, you have to apply some Spark configuration. You can do that using the `NESSY_SPARK_CONFIG` environment variable:
```bash
export NESSY_SPARK_CONFIG='{"spark.master":"local[*]",
                        "spark.driver.bindAddress":"0.0.0.0",
                        "spark.ui.bindAddress":"0.0.0.0",
                        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
                        "spark.sql.catalog.spark_catalog": "io.unitycatalog.spark.UCSingleCatalog",
                        "spark.sql.catalog.unity": "io.unitycatalog.spark.UCSingleCatalog",
                        "spark.sql.catalog.unity.uri": "http://localhost:28392",
                        "spark.sql.catalog.unity.token": "",
                        "spark.sql.defaultCatalog": "unity",
                        "spark.sql.sources.default": "delta",
                        "spark.jars.packages": "io.delta:delta-spark_2.12:3.2.1,io.unitycatalog:unitycatalog-spark_2.12:0.2.1,io.unitycatalog:unitycatalog-client:0.2.1"
                        }'
```
Make sure that `spark.sql.catalog.unity.uri` points to your local Unity Catalog instance. You can easily create one using Platys and Docker Compose. Also adapt `spark.jars.packages` to the versions you want to use.
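Once the Unity Catalog container is up, you can verify that the URI configured above is reachable, for example with a quick REST call (port 28392 matches the configuration above; the API path follows the Unity Catalog OSS REST API and may differ in your version):

```bash
# Start the Platys-generated stack (service names depend on your platys configuration)
docker compose up -d

# Check that the local Unity Catalog server answers on the configured port
curl http://localhost:28392/api/2.1/unity-catalog/catalogs
```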
With that in place, you can run your Nessy pipeline, for example as sketched below.
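A minimal sketch, assuming your pipeline is started from a Python entry point (the script name `run_pipeline.py` is a placeholder for whatever starts your Nessy pipeline):

```bash
# Activate the environment and make sure NESSY_SPARK_CONFIG is exported (see above)
source .venv/bin/activate

# Start the pipeline; replace run_pipeline.py with your own entry point
python run_pipeline.py
```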
Nessy will automatically detect that a local Spark environment is used and
create a `SparkSession` using the configuration specified in
`NESSY_SPARK_CONFIG`.