Examples¶
This section provides practical examples of CLOE Synthetic Data Generator configurations for common use cases. Each example demonstrates specific patterns and best practices.
Quick Examples by Domain¶
User Management System¶
A simple user table with essential fields:
user_data.yaml
name: "User Management Data"
target:
catalog: "main"
schema: "app_data"
table: "users"
write_mode: "overwrite"
num_records: 1000
batch_size: 500
columns:
- name: "user_id"
data_type: "string"
nullable: false
faker_function: "uuid4"
- name: "username"
data_type: "string"
nullable: false
faker_function: "user_name"
- name: "email"
data_type: "string"
nullable: false
faker_function: "email"
- name: "first_name"
data_type: "string"
nullable: false
faker_function: "first_name"
- name: "last_name"
data_type: "string"
nullable: false
faker_function: "last_name"
- name: "age"
data_type: "integer"
nullable: true
faker_function: "random_int"
faker_options:
min: 18
max: 80
- name: "is_active"
data_type: "boolean"
nullable: false
faker_function: "boolean"
faker_options:
chance_of_getting_true: 85
- name: "created_at"
data_type: "timestamp"
nullable: false
faker_function: "date_time_between"
faker_options:
start_date: "-2y"
end_date: "now"
E-commerce Product Catalog¶
Product data with SKUs, pricing, and categories:
product_catalog.yaml
name: "Product Catalog Data"
target:
catalog: "ecommerce"
schema: "inventory"
table: "products"
write_mode: "overwrite"
num_records: 5000
batch_size: 1000
columns:
- name: "product_id"
data_type: "string"
nullable: false
faker_function: "uuid4"
- name: "sku"
data_type: "string"
nullable: false
faker_function: "bothify"
faker_options:
text: "PRD-####-???"
- name: "product_name"
data_type: "string"
nullable: false
faker_function: "catch_phrase"
- name: "category"
data_type: "string"
nullable: false
faker_function: "random_element"
faker_options:
elements: ["Electronics", "Clothing", "Home & Garden", "Books", "Sports"]
- name: "price"
data_type: "double"
nullable: false
faker_function: "pyfloat"
faker_options:
left_digits: 3
right_digits: 2
positive: true
min_value: 5.00
max_value: 999.99
- name: "in_stock"
data_type: "boolean"
nullable: false
faker_function: "boolean"
faker_options:
chance_of_getting_true: 90
- name: "rating"
data_type: "double"
nullable: true
faker_function: "pyfloat"
faker_options:
left_digits: 1
right_digits: 1
positive: true
min_value: 1.0
max_value: 5.0
- name: "created_at"
data_type: "timestamp"
nullable: false
faker_function: "date_time_between"
faker_options:
start_date: "-1y"
end_date: "now"
Employee HR Data¶
Human resources data with departments and roles:
employee_data.yaml
name: "Employee HR Data"
target:
catalog: "main"
schema: "hr"
table: "employees"
write_mode: "overwrite"
num_records: 2000
batch_size: 500
columns:
- name: "employee_id"
data_type: "string"
nullable: false
faker_function: "bothify"
faker_options:
text: "EMP-######"
- name: "first_name"
data_type: "string"
nullable: false
faker_function: "first_name"
- name: "last_name"
data_type: "string"
nullable: false
faker_function: "last_name"
- name: "email"
data_type: "string"
nullable: false
faker_function: "email"
- name: "department"
data_type: "string"
nullable: false
faker_function: "random_element"
faker_options:
elements: ["Engineering", "Sales", "Marketing", "HR", "Finance", "Operations"]
- name: "job_title"
data_type: "string"
nullable: false
faker_function: "job"
- name: "salary"
data_type: "double"
nullable: false
faker_function: "pyfloat"
faker_options:
left_digits: 6
right_digits: 2
positive: true
min_value: 35000.00
max_value: 200000.00
- name: "hire_date"
data_type: "date"
nullable: false
faker_function: "date_between"
faker_options:
start_date: "-10y"
end_date: "today"
- name: "is_manager"
data_type: "boolean"
nullable: false
faker_function: "boolean"
faker_options:
chance_of_getting_true: 15
Financial Transactions¶
Banking transaction data with account information:
transactions.yaml
name: "Financial Transactions"
target:
catalog: "finance"
schema: "banking"
table: "transactions"
write_mode: "overwrite"
num_records: 10000
batch_size: 2000
columns:
- name: "transaction_id"
data_type: "string"
nullable: false
faker_function: "uuid4"
- name: "account_number"
data_type: "string"
nullable: false
faker_function: "bothify"
faker_options:
text: "####-####-####"
- name: "transaction_type"
data_type: "string"
nullable: false
faker_function: "random_element"
faker_options:
elements: ["deposit", "withdrawal", "transfer", "payment"]
- name: "amount"
data_type: "double"
nullable: false
faker_function: "pyfloat"
faker_options:
left_digits: 6
right_digits: 2
positive: false # Allow negative values for withdrawals
min_value: -5000.00
max_value: 10000.00
- name: "currency"
data_type: "string"
nullable: false
faker_function: "currency_code"
- name: "description"
data_type: "string"
nullable: true
faker_function: "sentence"
faker_options:
nb_words: 6
- name: "transaction_date"
data_type: "timestamp"
nullable: false
faker_function: "date_time_between"
faker_options:
start_date: "-90d"
end_date: "now"
- name: "is_approved"
data_type: "boolean"
nullable: false
faker_function: "boolean"
faker_options:
chance_of_getting_true: 95
Common Patterns¶
Weighted Distributions¶
Create realistic data distributions using weighted choices:
columns:
- name: "subscription_tier"
data_type: "string"
faker_function: "random_choices"
faker_options:
elements: ["free", "basic", "premium", "enterprise"]
weights: [60, 25, 12, 3] # 60% free, 25% basic, 12% premium, 3% enterprise
length: 1
Custom Identifier Patterns¶
Generate domain-specific identifiers:
columns:
# Order numbers: ORD-2024-001234
- name: "order_number"
data_type: "string"
faker_function: "bothify"
faker_options:
text: "ORD-####-######"
# License plates: ABC-1234
- name: "license_plate"
data_type: "string"
faker_function: "bothify"
faker_options:
text: "???-####"
# Product codes: PROD-A1B2C3
- name: "product_code"
data_type: "string"
faker_function: "hexify"
faker_options:
text: "PROD-^^^^^^"
upper: true
Realistic Date Ranges¶
Generate dates that make business sense:
columns:
# Account creation (past 3 years)
- name: "account_created"
data_type: "timestamp"
faker_function: "date_time_between"
faker_options:
start_date: "-3y"
end_date: "now"
# Last login (within last 30 days for active users)
- name: "last_login"
data_type: "timestamp"
faker_function: "date_time_between"
faker_options:
start_date: "-30d"
end_date: "now"
# Birth dates (18-65 years old)
- name: "birth_date"
data_type: "date"
faker_function: "date_between"
faker_options:
start_date: "-65y"
end_date: "-18y"
Correlated Boolean Fields¶
Create logical relationships between boolean fields:
columns:
# 90% of accounts are active
- name: "is_active"
data_type: "boolean"
faker_function: "boolean"
faker_options:
chance_of_getting_true: 90
# Only 20% of accounts are premium (subset of active)
- name: "is_premium"
data_type: "boolean"
faker_function: "boolean"
faker_options:
chance_of_getting_true: 20
# Most accounts are email verified (85%)
- name: "email_verified"
data_type: "boolean"
faker_function: "boolean"
faker_options:
chance_of_getting_true: 85
Testing and Development Workflows¶
Small Test Dataset¶
For development and testing:
name: "Test Dataset"
target:
catalog: "dev"
schema: "testing"
table: "sample_data"
write_mode: "overwrite"
num_records: 100 # Small dataset for quick testing
batch_size: 50 # Small batches for fast iteration
columns:
- name: "id"
data_type: "string"
nullable: false
faker_function: "uuid4"
- name: "name"
data_type: "string"
nullable: false
faker_function: "name"
- name: "status"
data_type: "string"
nullable: false
faker_function: "random_element"
faker_options:
elements: ["active", "inactive"]
Large Production Dataset¶
For production-like data volumes:
name: "Production Scale Dataset"
target:
catalog: "main"
schema: "analytics"
table: "large_dataset"
write_mode: "overwrite"
num_records: 1000000 # 1 million records
batch_size: 10000 # Large batches for efficiency
columns:
- name: "id"
data_type: "string"
nullable: false
faker_function: "uuid4" # Fast function for large datasets
- name: "category"
data_type: "string"
nullable: false
faker_function: "random_element" # Fast categorical data
faker_options:
elements: ["A", "B", "C", "D"]
- name: "value"
data_type: "integer"
nullable: false
faker_function: "random_int" # Fast numeric generation
faker_options:
min: 1
max: 1000
Column Dependencies¶
Generate realistic data relationships using column dependencies:
employee_with_dependencies.yaml
name: "Employee Data with Dependencies"
target:
catalog: "main"
schema: "hr"
table: "employees"
write_mode: "overwrite"
num_records: 1000
batch_size: 500
columns:
- name: "employee_id"
data_type: "string"
nullable: false
faker_function: "uuid4"
- name: "department"
data_type: "string"
nullable: false
faker_function: "random_element"
faker_options:
elements: ["Engineering", "Sales", "Marketing", "HR"]
# Job title depends on department
- name: "job_title"
data_type: "string"
nullable: false
depends_on: "department"
reference_mapping:
"Engineering": ["Senior Engineer", "Software Engineer", "DevOps Engineer"]
"Sales": ["Sales Director", "Account Manager", "Sales Representative"]
"Marketing": ["Marketing Manager", "Content Creator", "SEO Specialist"]
"HR": ["HR Manager", "Recruiter", "HR Coordinator"]
- name: "salary"
data_type: "double"
nullable: false
depends_on: "job_title"
reference_mapping:
"Senior Engineer": [120000, 140000, 160000]
"Software Engineer": [80000, 100000, 120000]
"DevOps Engineer": [90000, 110000, 130000]
"Sales Director": [150000, 180000, 200000]
"Account Manager": [70000, 85000, 95000]
"Sales Representative": [45000, 55000, 65000]
"Marketing Manager": [75000, 90000, 105000]
"Content Creator": [50000, 60000, 70000]
"SEO Specialist": [55000, 65000, 75000]
"HR Manager": [80000, 95000, 110000]
"Recruiter": [55000, 65000, 75000]
"HR Coordinator": [45000, 55000, 60000]
Next Steps¶
For more complex scenarios:
- 📚 Configuration Guide - Complete YAML reference
- 🔗 Column Dependencies - Advanced dependency patterns
- 🎭 Faker Integration - Advanced Faker patterns
- 🔍 Table Discovery - Auto-generate configurations
- 🌐 Faker Documentation - Full provider library
Tips for Creating Examples¶
- Start Simple: Begin with basic data types and add complexity gradually
- Use Realistic Patterns: Match real-world data distributions and formats
- Consider Performance: For large datasets, prefer faster Faker functions
- Test Incrementally: Start with small
num_recordsand increase gradually - Document Business Logic: Add comments explaining domain-specific choices