
Getting Started

This guide is meant to provide an easy entry point for working with Nessy.

Introduction

Why develop a Framework?

  • 🧠 Swarm Intelligence: A problem solved once in the framework benefits everyone who uses it.
  • ⚙️ Efficiency: Frameworks eliminate repetitive coding.
  • 📏 Consistency: Framework use standardizes code for easier maintenance.
  • 🤝 Collaboration: Shared frameworks foster knowledge sharing and teamwork.
  • Quality: Collective testing and maintenance improve code quality.
  • 💡 Knowledge Utilization: Leverage collective talent in our organization.
  • 🌱 Evolutionary Development: Extend the Framework with new capabilities as needed.
  • 🔒 Security: Implement security best practices across all projects.
  • 📚 Documentation: Comprehensive documentation helps onboard new developers quickly.
  • 🔄 Reusability: Promote code reuse across multiple projects.
  • 🛠 Tooling: Integrate with a suite of development tools for enhanced productivity.
  • 🌐 Interoperability: Ensure compatibility with other systems and frameworks.

Nessy Scope

Nessy is designed as a toolbox for Spark ETL and ELT workloads. While its main purpose is to be used as a comprehensive framework, you can also choose to use individual tools from the toolbox. For example, the APIClient is a useful tool that implements many best practices and reduces overhead, such as handling pagination and OAuth authentication in your code.

The workloads can be run on any platform that supports Spark, such as Databricks or Fabric. At the moment, Databricks is considered the first-class citizen, due to the number of projects that have used Nessy in a Databricks environment and the general maturity of the platform.

Nessy will support you with operations such as reading, transforming, and writing data. The goal is to reduce the overhead of implementing best practices while giving developers the freedom to write code as needed in complex project situations.

Known Limitations

  • Streaming: While streaming is supported, it has limitations. The Nessy pipeline must not materialize the DataFrame before writing the stream. A workaround using the "Change Data Feed" for delta loads is available, but it does not, of course, cover all real-time streaming use cases.

What's outside of the scope

  • Customer- or project-specific solutions: The framework aims to provide solutions that follow general best practices; it is not intended to implement every use case for every customer. Of course, it will be extended where a requirement makes sense for more than one project.
  • Table lifecycle management: While Nessy supports the creation of tables as part of the workflow, features such as schema evolution are out of scope. Once a common solution exists, Nessy should reuse the table definitions from that solution to allow reusability across the boundaries of the framework.
  • Infrastructure: Nessy relies on some infrastructure to function properly, but creating the infrastructure is outside the scope of the package.

Installation

Please find detailed instructions on how to install the Nessy package in the installation section.

Nessy is published as a Python package and can, for example, be installed from the public initions Azure DevOps PyPI feed:

pip install cloe_nessy # specify the version if needed

Basic Concepts

Nessy pipelines are essentially like cooking recipes. The pipeline defines a descriptive name and a list of steps to perform. Each step describes what to do with the input context (ingredients) and produces an output. Between the steps, the ingredients are passed around for further processing.

To stretch the metaphor, a Nessy pipeline might define many recipes that are worked on in parallel. Let's take a closer look at the individual parts of a Nessy Pipeline.

Pipeline Components

Pipeline

Besides the name, every pipeline defines a list of steps (actually an OrderedDict 😉, but that doesn't matter for this tutorial). The steps are ordered based on their dependencies and can be executed in parallel.

The name is used to identify the pipeline for logging purposes and should be human-readable and descriptive. Ideally, it is unique within the scope of a project.

PipelineStep

A step is executed as part of a pipeline. It also defines a name, which is used to identify the step and must be unique within the pipeline.

The step defines an action. The behavior of the action can be configured by defining an options dictionary.

Additionally, the definition of a PipelineStep allows fine-grained control over the context the step is executed in.

PipelineAction

An action represents an operation performed on a context. For example, the READ_FILES action allows you to read files from a specified location. You can specify details like the location to check, whether to search recursively, and the file extension. The WRITE_CATALOG_TABLE action writes data from the input context to a specified table.

The options are passed to the action as keyword arguments as part of the step definition.

PipelineContext

Actions use a context as input and produce a context as output. The context includes:

  • a data object, which is a DataFrame.
  • a table_metadata object, which is a Nessy Table.
  • a runtime_info dictionary, used to store runtime information, such as which files were read.
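As a minimal sketch, a step's input and output context can be inspected after a run. The example assumes a pipeline p that has already been parsed and executed and a step called My Step; the attribute names follow the list above and the debugging section below.

step = p.steps.get("My Step")           # look up the step by its name
df = step.result.data                   # the DataFrame produced by the step
table = step.result.table_metadata      # the Nessy Table, if the action sets one
info = step.result.runtime_info         # runtime information, e.g. which files were read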

Quick Start: Run your first Pipeline

Pipelines can be defined either in code or in YAML format. Defining pipelines in code is generally not necessary, because the PipelineParsingService simplifies the process of creating a valid Pipeline object from a YAML definition.

This section provides an overview and some boilerplate code to help you define your first pipeline. For a comprehensive reference on pipeline definitions, please refer to the Pipeline Metadata page.

Here is an example of a basic pipeline:

name: My Pipeline
steps:
    My Step:
        action: READ_FILES
        options:  #(1)!
            location: path/to/my/files
  1. The available options depend on the Action. Refer to the reference for details on configuring specific actions.

The Pipeline is given the name My Pipeline and defines a single PipelineStep My Step. The step defines the PipelineAction READ_FILES, which reads files from a location.

The Pipeline definition must now be parsed, which is the job of the PipelineParsingService. This is usually the only Nessy component that needs to be imported into your Notebook. After parsing, you can invoke the run() method on the Pipeline object to execute the Pipeline. This can be done as shown below.

from cloe_nessy.pipeline import PipelineParsingService

yaml_str = """
name: My Pipeline
steps:
    My Step:
        action: READ_FILES
        options:
            location: path/to/my/files
"""

p = PipelineParsingService.parse(yaml_str=yaml_str)  # parse the YAML definition into a Pipeline object
p.run()  # execute the pipeline's steps

Pipeline Definition Location

The Pipeline definition can be included directly in your notebook, as shown above, or it can be defined in a separate file, such as one stored in cloud storage (e.g. accessible via a Volume). In this case, the PipelineParsingService's parse method supports a path keyword argument. The optimal location for the file depends on the project's requirements regarding, for example, versioning and CI/CD.
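A minimal sketch, assuming the definition is stored as a YAML file on a Volume (the path below is only a placeholder):

from cloe_nessy.pipeline import PipelineParsingService

# parse the definition from a file instead of an inline string
p = PipelineParsingService.parse(path="/Volumes/my_catalog/my_schema/pipelines/my_pipeline.yaml")
p.run()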

Debugging

Your pipeline might not work perfectly the first time you run it. It's common to take an iterative approach. Fortunately, you can inspect each step of the pipeline. To check the data an action used or produced, you can do this:

input_df = p.steps.get("Failing Steps Name").context.data # input data - context attribute
output_df = p.steps.get("Failing Steps Name").result.data # output data - result attribute
# do whatever with the DataFrame, e.g. display it

A simple debugging process might look like this:

  1. Identify the last working step.
  2. Get that step's input data.
  3. Debug the operation on the data manually (see the sketch below).
  4. Update the pipeline definition.
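
Steps 2 and 3 might look like this in a notebook cell. This is only a sketch; it assumes the step following the last working one is called My Step, and it uses standard Spark DataFrame methods for inspection.

# input of the step that comes after the last working one
input_df = p.steps.get("My Step").context.data

input_df.printSchema()       # inspect the schema the step received
input_df.limit(10).show()    # look at a small sample of the data
# ...apply the step's transformation manually, then update the pipeline definition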

CI/CD for Pipeline Definitions

Pipeline definitions are typically designed to be environment-independent. This can be achieved by referencing dynamic values in the pipeline definition:

  1. environment variables: {{env:<Environment Variable Name>}}
  2. secrets from a secret scope: {{<Secret Scope Name>:<Secret name>}}

This step definition can only be used in one environment, because it references a specific catalog name:

Read Excel File:
    action: READ_EXCEL
    options:
        path: /Volumes/my_dev_catalog/

This simple change, however, allows the same pipeline definition to be used across all environments:

Read Excel File:
    action: READ_EXCEL
    options:
        path: /Volumes/{{env:CATALOG_NAME}}/  # mind the dynamic value
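
How the value of CATALOG_NAME is provided depends on your platform, for example via cluster, job, or workspace configuration. For a quick test, it may be enough to set the variable in the notebook process before parsing and running the pipeline; whether this works depends on when Nessy resolves the placeholders, so treat the following as an assumption rather than the recommended setup.

import os

# assumption: the {{env:...}} placeholder is resolved from the process environment at parse/run time
os.environ["CATALOG_NAME"] = "my_dev_catalog"  # placeholder value for illustration

p = PipelineParsingService.parse(path="path/to/pipeline.yaml")  # placeholder path
p.run()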

Configuration

Some features of the Framework require configuration. This is done through environment variables and is currently only used for the logging module. A guide on how to configure logging can be found here.