
Windpark Tutorial

Welcome to the CLOE tutorial. This tutorial will be the starting point of your CLOE journey. In this tutorial, you will learn the basics of the ETL process realized with the help of the CLOE modules.

This tutorial is specifically targeted at initions employees who are new to CLOE.

Modeling windpark data

Prerequisites

Complete the following steps before starting this tutorial:

  1. Follow the steps in Environment Setup to set up your Azure DevOps workspace.
    • Take a look at the Setup-Guide and read it carefully.
    • Step 2 is not necessary for us, since CLOE brings its own templates when working within the initions tenant. But keep in mind that you can always add your own templates later if you feel the need to modify them.
    • When you reach Step 4, follow the instructions for SQL Orchestrator and Modeler and pay attention to the subsequent configuration.
    • In Step 5 you are supposed to create branches for each environment. Keep in mind that for this tutorial, having only one environment branch is sufficient. Feel free to create multiple branches for practice, e.g. dev and prod.
  2. Have either SSMS or DBeaver installed to preview your data.
  3. In the initions OneNote Notizbuch für Mitarbeiter follow the instructions on the page Einarbeitung und Tutorials > SQL Environment > Installation SQL Server - Local to set up a local SQL server for this tutorial.
    • For testing and debugging your CLOE model, the local installation of SQL Server is sufficient. However:
    • When setting up the orchestration for your CLOE model, you will need access to an Azure SQL Database. Talk to your team lead to have an instance created for you.
  4. Download the latest CLOE GUI.

Goal

The goal of this tutorial project is to create an ETL process from the source data provided below and to load the data into a data warehouse using this data model.

Raw Data

The data provided is a cleaned-up and modified version of the governmental data on wind turbines in the state of Schleswig-Holstein in Germany. The link provides the information for the columns. If you are not fluent in German, please use the DeepL translation tool. The coordinate columns from the source have been replaced by:

  • Latitude: latitude coordinate of the facility
  • Longitude: longitude coordinate of the facility

A second source from Wikipedia covers the wind parks. With it, the wind turbines can be linked to their wind parks and respective locations.

Download the raw data from the Downloads. In the data folder you need the following two files:

  • tblsrc_parks.csv for the windpark information
  • tblsrc_turbines.csv for the wind turbine information

Keep in mind that those two files have been altered for the sake of this tutorial. They might differ from the original source.

The Data Model

New employees will have the task of creating this data model themselves. In case you missed this step, here is the data model that will be used in this tutorial. windpark data model

ETL Step by Step with CLOE

CLOE is a flexible application. It can be used independently of the database/system you are working with. CLOE will only know the information you provide as part of the metadata model. Thus you need to define the structure of the database, schemas, tables, and columns you are working with. This will be done in the repository.

After defining the tables, the next action is to import the data into the stage layer of the SQL server.

Now it is time to transform the data and move it from the stage layer to the core layer. For this step, we use CLOE DataFlows. In general, CLOE has two flows to choose from:

  • Dataflows are used to move and curate data from the source tables in the stage layer into their respective tables in the core layer. They cover 90 % of all use cases you encounter in your project.
  • CustomDataFlows are used for highly specific queries that cannot be generalized (e.g. queries using CTEs). We will not use them in this tutorial.

Hands-on

This section will guide your steps toward creating the ETL process. Please follow this tutorial step by step to get the most out of it.

The CLOEGUI

The CLOEGUI is the app you use to define the metadata. The data is stored in .json files; however, it is more convenient and efficient to navigate those files using a frontend tool, in our case the CLOEGUI. For more information on how the CLOEGUI is structured, have a look at the Reference.

Getting Started

Let's start building your first Jobs in CLOE! To make your life easier, remember to click Validate in the top right corner once in a while, or after each step. Doing this will tell you if your model contains any errors, thus saving you from pushing faulty metadata to your repository.

Define Repository

To define all relevant metadata, open cloegui.exe and navigate in the ribbon menu to Repository. CLOERepo Add a new database via the button above the grid menu: first, click on Repository in the tree menu and then on Databases, then navigate to the Add New button in the grid menu. CLOERepo Now you can edit the Database Name by double-clicking on it in the grid. Enter the name windpark. If you are using an Azure SQL Database that has been set up for you, make sure to give the database in your CLOE model the same name it has in Azure. Then navigate to the database in the tree menu and add two new schemas to the model: stage and core.

Navigate down to the stage layer and add two tables:

  • tblsrc_park
  • tblsrc_turbine

Within tblsrc_park add the following columns to the grid:

  • _ID
  • Name
  • Turbine_BK
  • Park_Leistung_MW
  • Typ_WKA
  • Ort
  • Land_Kreis
  • Koordinaten
  • Projektierer_Betreiber
  • Bemerkungen

Use NVARCHAR in the Data Type field and 255 as the DataType Length. For _ID use INT IDENTITY(1,1) for the Data Type and set IsKey to 1.

Now create the second stage table tblsrc_turbine and add the following columns:

  • _ID
  • TYP
  • HERSTELLER
  • TURBINE_BK
  • KREIS
  • ORT
  • NABENHOEHE
  • ROTORDURCHMESSER
  • NENNLEISTUNG_kW
  • GENEHMIGT_AM
  • INBETRIEBNAHME
  • STATUS
  • BST_NR
  • ANL_NR
  • DATENDATUM
  • DATENQUELLE
  • LATITUDE
  • LONGITUDE

Use NVARCHAR in the Data Type field and 255 as the DataType Length. For _ID use INT IDENTITY(1,1) for the Data Type and set IsKey to 1.

The column _ID in both table definitions has to be added to ensure that the source tables are always fetched in the same order.
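If you want to cross-check your metadata against plain SQL, the stage table definitions above correspond roughly to the following DDL. This is only a sketch; the DDL that CLOE generates later may differ in details.

```sql
-- Sketch of stage.tblsrc_park as defined above; the DDL generated by CLOE may differ in details.
CREATE TABLE stage.tblsrc_park (
    _ID                    INT IDENTITY(1,1) NOT NULL, -- technical key ensuring a stable fetch order
    Name                   NVARCHAR(255),
    Turbine_BK             NVARCHAR(255),
    Park_Leistung_MW       NVARCHAR(255),
    Typ_WKA                NVARCHAR(255),
    Ort                    NVARCHAR(255),
    Land_Kreis             NVARCHAR(255),
    Koordinaten            NVARCHAR(255),
    Projektierer_Betreiber NVARCHAR(255),
    Bemerkungen            NVARCHAR(255)
);
```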

If everything was done correctly, CLOEGUI should look like this:

windpark stage layer

Your Task

Your job is now to insert the information from the data model into the core schema! Use the datatypes provided by the model. For the DECIMAL datatype, fill DataType NumericScale with 18 and DataType Precision with 5. Hint: DCR & DMR stand for "date create record" and "date modified record".

Let's give you a little head start! For our core schema we need all the tables from the data model. With that said, go ahead and create entries for those tables:

  • tbl_windpark
  • tbl_wind_turbine_model
  • tbl_location
  • tbl_windpark_windturbine

With the tables created, we need to set up the columns for each table. Let's do one together!

Select tbl_windpark from the tree menu and start by adding all the columns according to the data model. You should now have the following columns added:

| Table Column Name | Ordinal Position | IsKey | IsNullable | DataType | DataType Length | DataType NumericScale | DataType Precision |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ID | 0 | yes | no | INT IDENTITY(1,1) | | | |
| BK | 0 | no | no | NVARCHAR | 255 | | |
| park_name | 0 | no | no | NVARCHAR | 255 | | |
| operator_name | 0 | no | no | NVARCHAR | 255 | | |
| coordinate_lattitude | 0 | no | no | NVARCHAR | 255 | | |
| coordinate_longitude | 0 | no | no | NVARCHAR | 255 | | |
| FK_location_BK | 0 | no | no | NVARCHAR | 255 | | |
| FK_location_ID | 0 | no | no | INT | | | |
| ETL_DCR | 0 | no | no | DATETIME | | 0 | 0 |
| ETL_DMR | 0 | no | no | DATETIME | | | |
| ETL_Load_Datetime | 0 | no | no | DATETIME | | | |
| ETL_Deleted_Date | 0 | no | no | DATE | | | |
| ETL_IsDeleted | 0 | no | no | INT | | | |

Take note of the way the foreign keys are named here. By default, the foreign key BK always has to be named according to this convention: FK_{{ lookupname }}_BK. This behavior of CLOE can be changed; however, that is not part of this tutorial. Now take your time and define the three remaining tables.
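For orientation, the column list above translates roughly into the following DDL for core.tbl_windpark. Treat it as a sketch; the DDL generated by CLOE may differ in details.

```sql
-- Sketch of core.tbl_windpark derived from the column list above;
-- the DDL generated by CLOE may differ in details.
CREATE TABLE core.tbl_windpark (
    ID                   INT IDENTITY(1,1) NOT NULL,
    BK                   NVARCHAR(255)     NOT NULL,
    park_name            NVARCHAR(255)     NOT NULL,
    operator_name        NVARCHAR(255)     NOT NULL,
    coordinate_lattitude NVARCHAR(255)     NOT NULL,
    coordinate_longitude NVARCHAR(255)     NOT NULL,
    FK_location_BK       NVARCHAR(255)     NOT NULL,
    FK_location_ID       INT               NOT NULL,
    ETL_DCR              DATETIME          NOT NULL,
    ETL_DMR              DATETIME          NOT NULL,
    ETL_Load_Datetime    DATETIME          NOT NULL,
    ETL_Deleted_Date     DATE              NOT NULL,
    ETL_IsDeleted        INT               NOT NULL
);
```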

Your finished core schema should look like this:

windpark core layer

Now you need to insert the metadata from CLOE into the actual database. The datatypes should match the data model.

Good job, you successfully inserted your first metadata into CLOEGUI! 🥳

Note!

In reality, crawlers are often used to gather and format the metadata, e.g. when dealing with Snowflake as a source.

It's time to set up your SQL Server. On your server, you need to define your database, schemas, and tables according to your metadata model. A handy way of doing that is to use CLOE's function to generate DDL from your metadata: select your database in the tree menu of the Repository and click on DDL. The generated DDL can then be used to create your database with all schemas and tables (a short sketch of what it contains follows the list below). Set up the following schemas:

  • stage
  • core
  • cloe_dwh
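The schema part of that DDL boils down to statements along these lines; in SSMS, CREATE SCHEMA has to run in its own batch, hence the GO separators.

```sql
-- The three schemas required for this tutorial.
CREATE SCHEMA stage;
GO
CREATE SCHEMA core;
GO
CREATE SCHEMA cloe_dwh;
GO
```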

Set up the stage and core layer tables according to the data model. You need to fill the stage layer tables with data using a "Flat File Import" in SSMS: right-click the database in your SSMS application, select Tasks > "Import Flat File...", and follow the instructions. The cloe_dwh schema is needed for DataFlows to work properly. Define the following tables for each DQ tier (1, 2, and 3) in this schema:

DQ Tables
--8<-- "./includes/tutorial.dq_tables.sql"

Any data quality violations will be logged in these tables by default.

Also make sure to download the DACPAC for your chosen database system and deploy it to prepare your database for working with CLOE. To do that, right-click the database in SSMS and choose Tasks > Upgrade Data-tier Application, then follow the instructions to deploy the DACPAC.

Your database on the server should look something like this and is ready to accept your queries. windpark on SQL server
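Before moving on, a quick sanity check confirms that the flat file import actually filled the stage tables:

```sql
-- Both counts should be greater than zero after the flat file import.
SELECT COUNT(*) AS park_rows    FROM stage.tblsrc_park;
SELECT COUNT(*) AS turbine_rows FROM stage.tblsrc_turbine;
```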

Now it's time to fill the core layer tables using DataFlows.

Fill tbl_wind_turbine_model

Now, in the CLOEGUI, switch to the Modeler tab and expand "DataFlows". Create a new DataFlow and name it "Wind Turbine". Select the SQL template "MERGE with distinct (Field2Field UPDATE)(DataFlow)" and choose the table that should be filled by the DataFlow: in Sink Table, choose tbl_wind_turbine_model. Your grid will now look like this: DF_turbine_1

It was mentioned above that DQ1 and DQ2 are activated by default. That is indeed the case; the options are just hidden in the "Columns" menu above the grid. Expand it to find the following columns: DF_turbine_2 Now you can be sure that DQ1 and DQ2, as well as their logging counterparts logDQ1 and logDQ2, are activated.

Remember!

Most of the default settings are hidden this way in the CLOEGUI. Just make sure to check for hidden columns and you will most likely find the column you are looking for.

In the tree, expand the DataFlow "Wind Turbine" by clicking on the arrow and select Source Tables. In this part of the Modeler, you specify the source tables from which your core table tbl_wind_turbine_model will be filled. Moving on to "Column Mappings", you need to tell CLOE which column from the source table will be transferred to which column of the sink table. Additionally, you specify the BK, the Null handling, and the conversion error handling.

Null Handling

If you encounter a Null value in your column, you can specify a standard value to substitute the Null with. It is common for any numeric column to use "0" as a default and for text fields to use "#".
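In the generated SQL this substitution typically boils down to an ISNULL (or COALESCE) around the source column. The snippet below is only an illustration of the idea, not the literal output of CLOE:

```sql
-- Illustration of the null substitution; the exact expression CLOE generates may differ.
SELECT
    ISNULL(NENNLEISTUNG_kW, '0') AS nennleistung_kw, -- numeric column: default '0'
    ISNULL(HERSTELLER, '#')      AS hersteller       -- text column: default '#'
FROM stage.tblsrc_turbine;
```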

BK Generation

The BK is simply defined by flagging a column with "1" in the BK Column Order field. It is a numeric value because it is possible to concatenate multiple columns into one composite BK; the number defines the sequence in which the BKs will be concatenated. We will use TYP_BK as the single BK, though.
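To make the idea concrete, a composite BK built from two columns with BK Column Order 1 and 2 ends up as a concatenation roughly like the one below. The separator is an assumption; the actual one is defined in the SQL templates.

```sql
-- Illustration of a composite BK built from two columns (BK Column Order 1 and 2).
-- The separator '_' is an assumption; the actual one comes from the SQL templates.
SELECT
    CONCAT(KREIS, '_', ORT) AS BK,
    KREIS,
    ORT
FROM stage.tblsrc_turbine;
```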

Conversion Errors

DataFlows use the DQ views to pass the data from the stage into the core layer. The core layer will only be filled with rows from the source table that passed all enabled data quality checks. In our case, CLOE first checks whether the BK in the table is unique and passes the data to the DQ1 view. Next, the type conversion defined in the Column Mapping of the DataFlow is applied. If it fails, the affected rows are isolated into the cloe_dwh.DQ2_Log table. Additionally, you can choose which value to pass on to the core layer: the On Convert Error Value will be the value associated with the row in the core table.
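Conceptually, the type check behind DQ2 can be pictured with TRY_CONVERT: rows where the conversion returns NULL for a non-NULL input are treated as violations. This is a sketch of the idea, not the view definition CLOE actually generates, and the target type DECIMAL(18, 5) is an assumption.

```sql
-- Sketch of the DQ2 idea: flag rows whose type conversion fails.
-- Not the literal view generated by CLOE; DECIMAL(18, 5) is an assumed target type.
SELECT
    TURBINE_BK,
    NENNLEISTUNG_kW,
    TRY_CONVERT(DECIMAL(18, 5), NENNLEISTUNG_kW) AS nennleistung_converted,
    CASE
        WHEN NENNLEISTUNG_kW IS NOT NULL
         AND TRY_CONVERT(DECIMAL(18, 5), NENNLEISTUNG_kW) IS NULL
        THEN 1 ELSE 0
    END AS has_conversion_error
FROM stage.tblsrc_turbine;
```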

Set up the column mappings

Fill the mappings according to the data model. The result should look like this: DF_turbine_2

Note that CLOE will always write all BKs into a specific column that is defined in the SQL templates. In most cases this column is called BK, so when marking a column as BK, make sure that there actually is a column called BK in the core table. A specific mapping for the BK columns is not necessary, as CLOE handles BKs automatically, but you can of course define a mapping for the BK, which is especially helpful when dealing with composite BKs where you need to retain the information of each part of the BK. Take a look at this example for reference. DF_Composite_BK

Once you are done mapping all the columns and setting the BKs, it is time to have CLOE create your first SQL statements. There is a preview feature built into CLOE that lets you generate all SQL directly from the CLOEGUI.

For that select "DataFlows" in the Modeler tab and click on PREVIEW in the menu ribbon above the grid. CLOE will now generate all of the SQL Statements for your DataFlows. If you want to preview the SQL for a specific DataFlow only, you can just select the DataFlow from the grid and press PREVIEW again.

Use the pipeline output

The next thing to do is to commit and push all the changes to the DevOps repository. This will trigger the CLOE backend to generate the required SQL for you as well. The CLOEGUI has a built-in function for this, but you can also commit via the VSCode git interface by first committing the changes and then pressing the sync button. Remember to write meaningful commit messages! ;)

If you have set up your pipeline before and added a trigger to it according to configurePipelineTrigger.md, your pipeline should run automatically. If you have not set up your pipelines yet, please do so now by following configurePipelineTrigger.md.

On the cloe_einarbeitung DevOps, navigate to Pipelines and select your pipeline. insert The orchestrator pipeline will have failed; no worries, we will take care of the orchestration later. Click on the successful pipeline, then click on the latest commit (at the top of the list) and find the Related property with 1 published; 1 consumed. That is where the artifacts are located (results produced by a pipeline, usually files containing code). insert Click on this section and find your finished code by expanding the cloemodeler. You will find two files:

  • dq_views_ddls.sql is the definition of the DQ views that ensure the logging of rows that violate the data quality checks.
  • rendered_pipe_queries.sql is the SQL code generated by CLOE. This code is identical to the one created with the Preview feature of the CLOEGUI.

Go ahead and execute the SQL commands from both files in SSMS, starting with dq_views_ddls.sql. Now query the core.tbl_wind_turbine_model table to observe the result of your successful pipeline!
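A quick way to check the result, for example:

```sql
-- Verify that the DataFlow populated the core table.
SELECT COUNT(*) AS loaded_rows FROM core.tbl_wind_turbine_model;
SELECT TOP (10) * FROM core.tbl_wind_turbine_model;
```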

Congrats! You successfully generated a complicated SQL query just using the CLOEGUI and your model metadata!! 🥳🎉

Remember!

This might seem like a lot of work for just one query. But remember that we have done all the initial setup here as well. The key value of CLOE is that it does not matter whether you have to create 10 or 1,000 DataFlows. Once set up, editing a query is as quick as adjusting your settings and pushing your model to the repository. Everything else is taken care of automatically.

Fill tbl_location

Now it's your turn to fill out the metadata for the DataFlow that fills the core table tbl_location. Use tblsrc_turbine as the source table and map the field "KREIS" to "district" and "ORT" to "community". Use the same SQL template "MERGE with distinct (Field2Field UPDATE)(DataFlow)" as before and use the combination "KREIS"-"ORT" as the BK.

Fill tbl_windpark

Next, select tbl_windpark and fill in the column mapping. This table has a lookup, which we will cover next.

Lookup

For further reading, the general lookup is described in the DataFlow Reference. The general lookup schema looks like this: Lookup

The translation of the generic names to our case can be found in the following table, and a conceptual SQL sketch follows it:

| Generic Column Name | Example Column Names | Associated Table |
| --- | --- | --- |
| Lookup Parameter | "KREIS"-"ORT" | tblsrc_park |
| Lookup Return Column | ID | tbl_location |
| Lookup Column | BK | tbl_location |
| Lookup Table | | tbl_location |
| Sink Table | | tbl_windpark |
| Sink Column | FK_location_ID | tbl_windpark |
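Conceptually, the lookup resolves the sink column by joining the lookup parameter against the BK of the lookup table, roughly as sketched below. This is an illustration only, not the SQL CLOE generates, and the concatenation separator is an assumption.

```sql
-- Sketch of what the lookup does conceptually: FK_location_ID is resolved by matching
-- the lookup parameter ("KREIS"-"ORT", here built from Land_Kreis and Ort of tblsrc_park)
-- against the BK of tbl_location. Not the literal SQL generated by CLOE.
SELECT
    src.Name,
    loc.ID AS FK_location_ID
FROM stage.tblsrc_park AS src
LEFT JOIN core.tbl_location AS loc
    ON loc.BK = CONCAT(src.Land_Kreis, '_', src.Ort); -- separator is an assumption
```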

To define the lookup, navigate to the Lookups section in the tree. Add a new lookup called "location". As with the BK columns, naming a lookup after the table you want to retrieve a value from is the usual choice in CLOE. Again, this behavior can be changed in the metadata. Lookup

Navigate to ReturnColumnMapping and fill out the information according to the table above: Lookup

The last step is to define the lookup parameters. Navigate to the LookupParameters in the grid and fill it out as follows: Lookup

Question

Note that in this case we are using a composite BK for the lookup. Thus, the Lookup Parameter is a composite key as well. We can tell CLOE to construct this BK by defining the Parameter Column Order the same way we did with the BK Column Order in tbl_location above. The Parameter Columns are the columns associated with the composite BK.

Now it is time to use the Validate and Preview features again and test the script on your SQL Server. If everything works without any errors, your model can be saved and committed to your repo. This way we ensure that the pipeline won't fail and that the produced script will execute successfully.

Create Jobs

Now that we have a way of converting our CLOE model into valid SQL, let's set up the orchestration of those statements so they run on a given schedule, in a given order, and without any manual interaction with the process. For that we will use Jobs and Batches. A Job in this case represents the execution of a SQL query; those queries are automatically generated by the DataFlows. To create a new Job, click on Jobs in the ribbon. In the tree menu you should now see the following entries:

  • Connections
  • db2fs
  • fs2db
  • ExecSQL

You can ignore db2fs and fs2db for now, since we will not be using them; you can read more about them in the metadata docs. We will be focusing on ExecSQL which, as the name implies, allows you to run SQL commands. Before we can create any Jobs, we have to add a new connection to our metadata. A connection holds the metadata for a source system as well as its type and information on how to connect to that source, e.g. a secret name.

With that out of the way, click on Connections and then on Add new. For the connection name you can choose anything you like; we will go with Azure SQL DB this time. For Connection systemType you have to select SQL Server >= 2016 or Azure SQL Server, depending on your setup. Your connection is now ready to go.

Now you can give your ExecSQL a name and specify the connection from above. In this case we will call our ExecSQL Wind_Park. Once you have named the ExecSQL and selected Azure SQL DB as the connection, you can expand the tree menu by one level and click on Queries. Here you can add multiple Queries to your Job; in most cases, a single Query is enough.

We could now go ahead and create all of the Jobs by hand and then copy-paste the Queries from the pipeline results into the correct ExecSQL, but that would be way too tedious, wouldn't it? CLOE has a solution for that.

Take a look at your pipeline and click on Run pipeline. Here you have the option to choose either sql_single or json_single; this determines the format of the pipeline output. CLOE will then generate the ExecSQL Jobs and their SQL Queries from the DataFlows defined in your model.

Remember setting up the modeler pipeline? When preparing the .yaml file for the pipeline, you had to specify a commitToBranch, which CLOE uses to determine where to commit the results of the pipeline run when json_single is chosen. The file produced is published to that branch after the pipeline has run successfully. Just update your local branch and all of the Jobs will appear in the CLOEGUI. We only need to define the connection and update the Jobs created by CLOE.

To test that out, run your pipeline again from DevOps and select json_single as well as Auto commit results. Once the pipeline has finished, simply do a git pull to update your local repo and then reload your project in the CLOEGUI. Your Jobs should now be updated and look like this. Jobs

To finish the setup of the Jobs, we now need to tell each DataFlow or CustomDataFlow in the Modeler which Job it is represented by. So go ahead and switch to the Modeler. All you need to do now is click on DataFlows and select the corresponding Jobs in the main area. Take a look at this image for reference: Modeler_Jobs With that, everything regarding Jobs is done! Now let us see how we can orchestrate them.

Orchestrate your Jobs

Now all that is left to do is to tell CLOE when and in which order our Jobs have to be executed. The reason is that some tables can only be filled correctly if one or more other tables have been populated before. Look at tbl_windpark_windturbine, for example: here, tbl_windpark as well as tbl_wind_turbine_model have to be populated before the lookups will work. That is where the Orchestration comes into play.

Start by selecting Orchestration in the navigation menu and then click on Batches in the tree menu. This is where all Batches will be defined. A Batch is simply a collection of Batchsteps that share a specific schedule. Each Batchstep references a given Job, thus enabling CLOE to run the commands stored in the Jobs at a certain point in time.

With Batches selected, click on Add new and give the newly created Batch a name; in our case this can be anything, like windpark. Since the goal of orchestration is also to run all Jobs automatically and repeatedly at a specific point in time, we need to specify a cron schedule as well. Cron expressions are defined with the following pattern (see also crontab.guru):

| Minute | Hour | Day (of Month) | Month | Day (of Week) | Meaning |
| --- | --- | --- | --- | --- | --- |
| * | * | * | * | * | Every minute (* matches any value) |
| 0 | 0 | * | * | * | Every day at 00:00 |
| 5 | 4 | * | * | sun | Every Sunday at 04:05 |
| 0 | 22 | * | * | 1-5 | Monday to Friday at 22:00 |
| 15 | 3 | 1 | 8 | * | On the first day of August at 03:15 |

You can choose anything you like for this tutorial, e.g. 0 0 * * *. As said before, each Batch can include multiple Batchsteps. By expanding windpark and clicking on Batchsteps we can start to define those. As we have four Jobs in our model, we should also create four Batchsteps, so go ahead and add those. Within each Batchstep you can define which Job it refers to and whether it depends on any other Batchsteps. Dependencies are added by selecting Batchstep_Dependencies in the tree menu.

Once done, your Batch should look like this. Batch_Done

The Batchstep responsible for tbl_windpark_windturbine will have two dependencies just like in this picture. Batchstep_tbl_windpark_windturbine

With the Batch created and all Batchsteps and Batchstep_Dependencies set up, you are finally done! Everything is good to go, and by committing your changes to your repo, the orchestrator pipeline should run successfully. It will connect to the database and deploy your Jobs as stored procedures.

Connect to your database using SSMS or DBeaver and you should find all of your Jobs under Programmability > Stored Procedures.
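If you prefer a query over browsing the object tree, something like this lists the deployed procedures:

```sql
-- List all stored procedures in the database, e.g. the Jobs deployed by the orchestrator.
SELECT SCHEMA_NAME(schema_id) AS schema_name, name AS procedure_name
FROM sys.procedures
ORDER BY schema_name, procedure_name;
```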

Stored_Procedures

Last thing to do is to...

Be done :)