# How To: Using a local Spark environment
You can use a local Spark environment with Nessy. Currently this only works with PySpark and the local filesystem; a standalone Spark cluster using the S3 filesystem does not yet work.
## Prerequisites
You first need to have Python 3.11 or 3.12 and Java 11 on your local machine.
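You can quickly check that matching versions are available on your `PATH` (the exact version strings will vary with your installation):

```bash
# Verify the prerequisite runtimes; expect Python 3.11/3.12 and a Java 11 build
python3 --version
java -version
```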
Create a virtual environment, activate it, and install the dependencies, as shown in the sketch below.
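A minimal sketch of the setup, assuming you use the built-in `venv` module and install `cloe_nessy` with the `local-spark` extra from your package index (adapt the package source and extra name to your environment):

```bash
# Create and activate a virtual environment
python3 -m venv .venv
source .venv/bin/activate

# Install Nessy with local Spark support
pip install "cloe_nessy[local-spark]"
```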
**Delta Spark vs Databricks Connect Conflict**

**Important:** Do not install `delta-spark` when using `databricks-connect`: the PySpark versions conflict and will cause dependency resolution issues.

- For local Spark development: use `cloe_nessy[local-spark]`, but uninstall `databricks-connect`.
- For Databricks Connect development: use the default `cloe_nessy` installation (`databricks-connect` is included in the dev dependencies).

Never install both simultaneously; the sketch below shows how to check for and remove a conflicting install.
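A quick way to check for a conflicting install and remove it, assuming a standard `pip`-based environment:

```bash
# List any Databricks Connect / PySpark / Delta packages currently installed
pip list | grep -iE 'databricks-connect|pyspark|delta-spark'

# For local Spark development, remove databricks-connect if it is present
pip uninstall -y databricks-connect
```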
## Running Nessy
To run Nessy with Delta table support, you have to apply some Spark configuration. You can do that using the `NESSY_SPARK_CONFIG` environment variable:
```bash
export NESSY_SPARK_CONFIG='{"spark.master":"local[*]",
                        "spark.driver.bindAddress":"0.0.0.0",
                        "spark.ui.bindAddress":"0.0.0.0",
                        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
                        "spark.sql.catalog.spark_catalog": "io.unitycatalog.spark.UCSingleCatalog",
                        "spark.sql.catalog.unity": "io.unitycatalog.spark.UCSingleCatalog",
                        "spark.sql.catalog.unity.uri": "http://localhost:28392",
                        "spark.sql.catalog.unity.token": "",
                        "spark.sql.defaultCatalog": "unity",
                        "spark.sql.sources.default": "delta",
                        "spark.jars.packages": "io.delta:delta-spark_2.12:3.2.1,io.unitycatalog:unitycatalog-spark_2.12:0.2.1,io.unitycatalog:unitycatalog-client:0.2.1"
                        }'
```
Make sure that `spark.sql.catalog.unity.uri` points to your local Unity Catalog instance. You can easily create one using Platys and Docker Compose. Also adapt `spark.jars.packages` to the versions you want to use.
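Once the Unity Catalog container is up, you can verify that the URI configured above is reachable, for example with a quick REST call (port 28392 matches the configuration above; the API path follows the Unity Catalog OSS REST API and may differ in your version):

```bash
# Start the Platys-generated stack (service names depend on your platys configuration)
docker compose up -d

# Check that the local Unity Catalog server answers on the configured port
curl http://localhost:28392/api/2.1/unity-catalog/catalogs
```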
With that in place, you can run your Nessy pipeline, for example as sketched below.
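A minimal sketch, assuming your pipeline is started from a Python entry point (the script name `run_pipeline.py` is a placeholder for whatever starts your Nessy pipeline):

```bash
# Activate the environment and make sure NESSY_SPARK_CONFIG is exported (see above)
source .venv/bin/activate

# Start the pipeline; replace run_pipeline.py with your own entry point
python run_pipeline.py
```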
Nessy will automatically detect that a local Spark environment is used and
create a `SparkSession` using the configuration specified in
`NESSY_SPARK_CONFIG`.