Metadata guide¶
Introduction¶
This guide presents the different forms of metadata used by nessy. It is structured by - object: the object which the metadata describes. This is currently tables, schemas, and views. - the necessity of the metadata: either required or optional.
In the following, metadata and necessity are presented by object.
Schema metadata¶
Schema metadata reference¶
| Field | Required? | Metadata Type | Description | Example | 
|---|---|---|---|---|
| catalog | true | string | The catalog in which the schema is located. | my_catalog | 
| name | true | string | Name of schema within a catalog. | my_schema | 
| storage_path | false | string | Storage path of the schema. Acts as a fallback for table storage paths. | "abfss://container@storageaccount.company.core.windows.net/schema_folder/schemas/" | 
| business_properties | false | dictionary | A dictionary containing case-specific metadata required by business or other stakeholders. | |
| comment | false | string | A comment added to the schema. | What a wonderful schema to store data. | 
Example Schema metadata¶
catalog: "my_catalog"
name: "raw"
storage_path: "abfss://mycontainer@mystorageaccount.blob.core.windows.net/schemas/"
Info
The definition of the schema allows for dynamic values by addressing them in the format of {{env:NAME_OF_ENV_VARIABLE}}.
This allows for the definition of environment-specific values, such as the storage path or catalog name.
Table metadata¶
Table metadata reference¶
| Field | Required? | Metadata Type | Description | 
|---|---|---|---|
| name | true | string | Name of the table. | 
| columns | true | list of columns | A list of columns containing column metadata for each column. | 
| is_external | false | boolean | Indicates whether the table is managed in DBX or external. | 
| partition_by | false | list | List of columns by which the table is partitioned or clustered (depending on the attribute liquid_clustering). | 
| liquid_clustering | false | boolean | Defines whether to to cluster the table by the partition_by columns (liquid, multi-dimensional clustering for a Delta Lake table) or to partition by the partition_by columns. Required if partition_by is not None. | 
| composite_primary_key | false | list | List of columns which denote unique rows. | 
| properties | false | dictionary | DBX delta table properties as described here | 
| constraints | false | dictionary | One or more constraint definitions as dictionary of constraint names, SQL code to filter on one or multiple columns, and a description. | 
| foreign_keys | false | list | A list of dictionaries containing foreign_key_columns, parent_table, parent_columns, and foreign_key_options for each foreign key in the table. | 
| storage_path | false | string | Storage path of the table. Acts as overwrite for schema storage path. | 
| business_properties | false | dictionary | A dictionary containing case-specific metadata required by business or other stakeholders. | 
| comment | false | string | A comment on the table. | 
| data_source_format | false | string | The format of the data, e.g. delta or parquet. | 
Foreign key metadata reference¶
| Field | Required? | Metadata Type | Description | 
|---|---|---|---|
| foreign_key_columns | false | string | The name of the columns in the current table that are foreign keys referencing columns in another table. | 
| parent_table | false | string | The name of the parent table containing the column being referenced by the foreign keys. | 
| parent_columns | false | string | The name of the columns in the parent table being referenced by the foreign keys. | 
| foreign_key_option | false | list of strings | The options for the foreign key constraint, such as "MATCH FULL", "ON UPDATE NO ACTION" or "ON DELETE NO ACTION". For details, see the Databricks documentation. | 
Column metadata reference¶
| Field | Required? | Metadata Type | Description | 
|---|---|---|---|
| name | true | string | Physical column name. | 
| data_type | true | string | The data types supported in Azure Databricks. | 
| nullable | true | boolean | Whether the column allows null values. By default it should not. | 
| default_value | false | In case of non-nullable columns, the default value according to data modeling standards. | |
| generated | false | string | Optional generated expression for auto-generation of column values. | 
| business_properties | false | dictionary | A dictionary containing case-specific metadata required by business or other stakeholders. | 
| comment | false | string | A comment on the column. | 
Example table metadata¶
name: customers
is_external : false
partition_by:
    - "region"
liquid_clustering: true
data_source_format: "delta"
properties:
    "delta.autoOptimize.optimizeWrite": "true"
    "delta.enableDeletionVectors": "true"
constraints:
  realisticBirthConstraint:
      expression: "(birthDate > '1900-01-01')"
      description: "Ensures birthDate is realistic"
foreign_keys:
    - foreign_key_columns:
        - "service_level"
      parent_table: "service_level_lookup"
      parent_columns:
        - "service_level_name"
      foreign_key_option:
        - "MATCH FULL"
columns:
    - name: "customer_id"
      data_type: "long"
      nullable: false
      generated: "GENERATED ALWAYS AS IDENTITY"
    - name: "customer_name"
      data_type: "string"
      nullable: false
      default_value: "Max Mustermann"
business_properties:
  responsible_manager: "Dr. Michaela Mustermensch"
External vs. Managed Tables¶
Managed and external tables can be configured in multiple ways in the metadata. The following table provides an overview of the different possibilities:
| is_external | table_storage_path | Result | 
|---|---|---|
| false | not set | managed table | 
| true | not set | external table derived from schema_storage_path | 
| true | set | external table based on table_storage_path | 
| false | set | no valid table |