Skip to content

Snowflake Crawler

Introduction

The Snowflake crawler can crawl a Snowflake tenant and retrieve metadata of tables only(!). It then saves these metadata to a CLOE compliant Database metadata format. Alternatively it can also update an existing Database metadata format.

Use cases might be if there is an existing database solution in the project that should be used by CLOE. It can also be used in conjunction with the Snowflake role concept to automatically onboard new schemas/databases.

Requirements

  • DevOps Pipeline (cloud execution) OR Python (local execution)
    • See downloads section for example
  • Snowflake technical user

    • With privileges to call get_ddl function on objects
  • Grant Build Service Contributor

Flow

The exact crawler flow might differ based on use case and infrastructure.

snowflake-crawler-flow.png

Arguments

Argument Optional Description
output_json_path Path where to output the final json to
snowflake-user The snowflake user the crawler should use. If not set is expected as CLOE_SNOWFLAKE_USER env variable.
snowflake-password The snowflake password the crawler should use. If not set is expected as CLOE_SNOWFLAKE_PASSWORD env variable.
snowflake-account The snowflake account the crawler should use. If not set is expected as CLOE_SNOWFLAKE_ACCOUNT env variable.
snowflake-warehouse The snowflake warehouse the crawler should use. If not set is expected as CLOE_SNOWFLAKE_WAREHOUSE env variable.
snowflake-role The snowflake role the crawler should use. Can also be set with the CLOE_SNOWFLAKE_ROLE env variable. If not set uses users default role.
existing-repository-json-path Reference an existing Repository json to update.
ignore-columns Ignore columns of tables and just retrieve information about the table itself.
database-filter Filters databases based on defined filter. Is used as Snowflake wildcard pattern in SHOW DATABASES. If no filter defined all databases are retrieved. Can be used to limit databases to environment.
database-name-replace Replaces parts of a name with the CLOE env placeholder. Can be regex. Can be used to remove the environment part in a database name.