Configuration Guide
This guide explains how to create and configure the YAML files that drive synthetic data generation.
Configuration File Structure
Every configuration file follows this basic structure:
name: "Descriptive name for your configuration"
target:
  catalog: "catalog_name"
  schema: "schema_name"
  table: "table_name"
  write_mode: "overwrite"
num_records: 1000
batch_size: 1000
columns:
  - name: "column_name"
    data_type: "string"
    nullable: true
    faker_function: "faker_method"
    faker_options: {}
Configuration Sections
Target Table Configuration
The target section defines where your generated data will be written:
target:
  catalog: "main"           # Unity Catalog name
  schema: "hr_data"         # Schema within the catalog
  table: "employees"        # Table name
  write_mode: "overwrite"   # How to handle existing data
Write Modes
| Mode | Description | Use Case |
|---|---|---|
| overwrite | Replace all existing data | Development, testing, complete refresh |
| append | Add new data to existing table | Incremental data generation |
| error | Fail if table already exists | Safety check for new tables |
| ignore | Skip if table already exists | Safe re-runs |
Choosing Write Mode
- Use overwrite for development and testing environments
- Use append when you want to add more data incrementally
- Use error for production safety when creating new tables
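The four modes in the table can be modeled on an in-memory list to make their differences concrete. This is a minimal sketch, not CLOE's writer: the function name `apply_write_mode` is hypothetical, and the behavior for a table that does not exist yet (it is created in every mode) is an assumption.

```python
def apply_write_mode(existing, new_rows, mode):
    """Return the table contents after a write.

    `existing` is None when the table does not exist yet, otherwise a list
    of rows. In-memory model only; it does not touch any real table.
    """
    if existing is None:
        # Assumption: a missing table is simply created in every mode.
        return list(new_rows)
    if mode == "overwrite":
        return list(new_rows)                    # old rows are discarded
    if mode == "append":
        return list(existing) + list(new_rows)   # old rows are kept
    if mode == "error":
        raise RuntimeError("table already exists")
    if mode == "ignore":
        return list(existing)                    # the write is skipped
    raise ValueError(f"unknown write_mode: {mode!r}")


print(apply_write_mode([1, 2], [3], "overwrite"))  # [3]
print(apply_write_mode([1, 2], [3], "append"))     # [1, 2, 3]
print(apply_write_mode([1, 2], [3], "ignore"))     # [1, 2]
```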
Generation Settings
Control how much data is generated and processed:
num_records: 10000 # Total number of records to generate
batch_size: 1000 # Process in batches of this size
Batch Size Considerations
- Larger batch sizes reduce per-batch overhead but hold more records in memory at once
- Smaller batch sizes provide better progress feedback
- Default batch size of 1000 works well for most use cases
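How num_records and batch_size interact can be sketched as a small generator. The helper name `batch_sizes` is hypothetical, and the handling of a final partial batch is an assumption about how a batched generator would typically behave, not a description of CLOE's internals.

```python
def batch_sizes(num_records, batch_size):
    """Yield the size of each batch needed to produce num_records rows."""
    if num_records < 0 or batch_size <= 0:
        raise ValueError("num_records must be >= 0 and batch_size must be > 0")
    remaining = num_records
    while remaining > 0:
        current = min(batch_size, remaining)  # last batch may be smaller
        yield current
        remaining -= current


sizes = list(batch_sizes(10000, 1000))
print(len(sizes), sizes[-1])            # 10 1000
print(list(batch_sizes(2500, 1000)))    # [1000, 1000, 500]
```

With the values from the example above, 10000 records at a batch size of 1000 give ten full batches.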
Column Definitions
Each column is defined with these properties:
Required Properties
- name: "column_name"        # Column name in the target table
  data_type: "string"        # Spark SQL data type
  faker_function: "email"    # Faker method to use
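A quick pre-check for the three required properties can be written against an already parsed configuration. This is a sketch only: `REQUIRED_COLUMN_KEYS` and `missing_required_keys` are hypothetical names, the columns are plain dictionaries standing in for parsed YAML, and the check is no substitute for the validate-config command.

```python
REQUIRED_COLUMN_KEYS = ("name", "data_type", "faker_function")


def missing_required_keys(column):
    """Return the required keys that are absent from one column mapping."""
    return [key for key in REQUIRED_COLUMN_KEYS if key not in column]


columns = [
    {"name": "email", "data_type": "string", "faker_function": "email"},
    {"name": "age", "data_type": "integer"},  # faker_function is missing
]

for column in columns:
    print(column.get("name"), missing_required_keys(column))
# email []
# age ['faker_function']
```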
Optional Properties
- name: "column_name"
  data_type: "string"
  nullable: true                  # Allow NULL values (default: true)
  faker_function: "email"
  faker_options:                  # Options passed to Faker method
    domain: "company.com"
  description: "User email"       # Column description (optional)
  depends_on: "parent_column"     # Column dependency (optional)
  reference_mapping:              # Value mapping for dependencies (optional)
    "parent_val1": "child_val1"
    "parent_val2": ["child_val1", "child_val2"]
  reference_table:                # External table reference (optional)
    catalog: "ref_catalog"
    schema: "ref_schema"
    table: "ref_table"
    key_column: "column_name"
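One plausible reading of reference_mapping is sketched below: the parent column's value selects the child value, and a list on the right-hand side means one of its entries is picked at random. That semantics is an assumption made for illustration, and `resolve_child` is a hypothetical helper; the Column Dependencies page is the authority on how CLOE actually resolves dependencies.

```python
import random


def resolve_child(parent_value, reference_mapping, rng):
    """Pick a child value for parent_value under the assumed semantics."""
    mapped = reference_mapping[parent_value]
    if isinstance(mapped, list):
        return rng.choice(mapped)  # several candidates: choose one
    return mapped                  # single candidate: use it directly


mapping = {
    "parent_val1": "child_val1",
    "parent_val2": ["child_val1", "child_val2"],
}
rng = random.Random(0)  # seeded so repeated runs give the same picks

print(resolve_child("parent_val1", mapping, rng))  # child_val1
print(resolve_child("parent_val2", mapping, rng))  # one of the two list entries
```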
Supported Data Types
CLOE supports all common Spark SQL data types:
Basic Types
| Type | Description | Example Values |
|---|---|---|
| string | Text data | "John Doe", "example@email.com" |
| integer | 32-bit integers | 42, -123 |
| long | 64-bit integers | 1234567890123 |
| double | Double precision floats | 3.14159, 123.456 |
| float | Single precision floats | 3.14, 123.45 |
| boolean | True/false values | true, false |
Date and Time Types
| Type | Description | Example Values |
|---|---|---|
| date | Date only | "2024-01-15" |
| timestamp | Date and time | "2024-01-15 14:30:00" |
Decimal Type
| Type | Description | Example Values |
|---|---|---|
| decimal | Precise decimal numbers | 123.45, 999.99 |
Type Conversion
CLOE automatically converts Python data types from Faker to appropriate Spark SQL types. For example, Faker's date_time() is automatically converted to Spark's timestamp type.
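The kind of mapping this conversion implies can be sketched with the standard library alone. The function `spark_type_name` is hypothetical and is not CLOE's converter; in particular, sending Python int to long and Python float to double is an assumption.

```python
import datetime
import decimal


def spark_type_name(value):
    """Map a Python value to one of the Spark SQL type names in this guide."""
    # Order matters: bool is a subclass of int, and datetime.datetime is a
    # subclass of datetime.date, so the narrower type is tested first.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"        # assumption: widest integer type
    if isinstance(value, float):
        return "double"      # assumption: Python floats are double precision
    if isinstance(value, decimal.Decimal):
        return "decimal"
    if isinstance(value, datetime.datetime):
        return "timestamp"
    if isinstance(value, datetime.date):
        return "date"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"no mapping for {type(value).__name__}")


print(spark_type_name(datetime.datetime(2024, 1, 15, 14, 30)))  # timestamp
print(spark_type_name(datetime.date(2024, 1, 15)))              # date
print(spark_type_name(True))                                    # boolean
```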
Faker Integration
CLOE uses the Faker library to generate realistic data. You can use any Faker provider method.
Basic Faker Functions
Common faker functions for different types of data:
Personal Information
# Names
- name: "first_name"
  data_type: "string"
  faker_function: "first_name"
- name: "last_name"
  data_type: "string"
  faker_function: "last_name"
- name: "full_name"
  data_type: "string"
  faker_function: "name"
# Contact Information
- name: "email"
  data_type: "string"
  faker_function: "email"
- name: "phone"
  data_type: "string"
  faker_function: "phone_number"
Business Data
- name: "company_name"
  data_type: "string"
  faker_function: "company"
- name: "job_title"
  data_type: "string"
  faker_function: "job"
- name: "department"
  data_type: "string"
  faker_function: "random_element"
  faker_options:
    elements: ["Engineering", "Sales", "Marketing", "HR"]
Numbers and IDs
- name: "user_id"
  data_type: "string"
  faker_function: "uuid4"
- name: "age"
  data_type: "integer"
  faker_function: "random_int"
  faker_options:
    min: 18
    max: 80
- name: "salary"
  data_type: "double"
  faker_function: "random_number"
  faker_options:
    digits: 5
Dates and Times
- name: "birth_date"
  data_type: "date"
  faker_function: "date_between"
  faker_options:
    start_date: "-65y"   # 65 years ago
    end_date: "-18y"     # 18 years ago
- name: "created_at"
  data_type: "timestamp"
  faker_function: "date_time_between"
  faker_options:
    start_date: "-1y"    # 1 year ago
    end_date: "now"      # Current time
Advanced Faker Options
Most Faker methods accept options to customize the generated data:
String Length Control
- name: "description"
  data_type: "string"
  faker_function: "text"
  faker_options:
    max_nb_chars: 200   # Maximum 200 characters
Localization
- name: "local_phone"
  data_type: "string"
  faker_function: "phone_number"
  faker_options:
    locale: "en_US"   # US phone format
Custom Choices
- name: "status"
  data_type: "string"
  faker_function: "random_element"
  faker_options:
    elements: ["active", "inactive", "pending", "suspended"]
- name: "priority"
  data_type: "string"
  faker_function: "random_choices"
  faker_options:
    elements: ["low", "medium", "high", "critical"]
    length: 1   # Faker returns a one-item list here; use random_element for a bare value
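Faker's random_element returns one item, while random_choices returns a list even when length is 1. The standard library's random module has the same split between choice and choices, so the difference can be shown without installing Faker. The snippet below is that analogy, not a call into Faker or CLOE.

```python
import random

rng = random.Random(42)  # seeded so repeated runs give the same picks
elements = ["low", "medium", "high", "critical"]

single = rng.choice(elements)          # one item, analogous to random_element
as_list = rng.choices(elements, k=1)   # one-item list, analogous to random_choices

print(type(single).__name__, type(as_list).__name__)  # str list
print(len(as_list))                                   # 1
```

For a string column that should hold exactly one of the listed values, the single-item form is usually the one you want.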
Numeric Ranges
- name: "score"
  data_type: "integer"
  faker_function: "random_int"
  faker_options:
    min: 0
    max: 100
- name: "price"
  data_type: "double"
  faker_function: "pyfloat"
  faker_options:
    left_digits: 3    # 3 digits before decimal
    right_digits: 2   # 2 digits after decimal
    positive: true    # Only positive numbers
Working with NULL Values
Control null value generation:
- name: "optional_field"
  data_type: "string"
  nullable: true   # Allow nulls (default: true)
  faker_function: "word"
NULL Generation
When nullable: true, CLOE generates NULL values for approximately 10% of the records in that column. This percentage is not currently configurable.
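A roughly 10% null rate can be pictured as an independent draw per value. The sketch below assumes that model; `with_nulls` is a hypothetical helper and says nothing about how CLOE selects which records become NULL.

```python
import random


def with_nulls(values, null_rate, rng):
    """Replace each value with None independently with probability null_rate."""
    return [None if rng.random() < null_rate else value for value in values]


rng = random.Random(7)  # seeded for repeatability
column = with_nulls(["word"] * 10000, 0.10, rng)
null_share = column.count(None) / len(column)
print(f"{null_share:.3f}")  # close to 0.100; the exact figure depends on the seed
```

Because each value is drawn independently, small samples can land well away from 10%, which is worth remembering when you inspect a 10-record test run.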
Validation and Testing
Always validate your configuration before generating large datasets:
# Validate configuration syntax and structure
cloe-synthetic-data-generator validate-config my_config.yaml
# Generate a small sample first
cloe-synthetic-data-generator generate --config my_config.yaml --num-records 10
Configuration Testing
- Start with small record counts (10-100) to verify your configuration works
- Use the validation command to catch errors early
- Check generated data samples to ensure they meet your expectations
- Gradually increase record counts for performance testing
Next Steps
- 🔗 Column Dependencies - Learn about intra-table and inter-table dependencies
- 🔍 Explore Table Discovery - Auto-generate configurations from existing tables
- ⚡ CLI Reference - Learn about all CLI commands and options
- 🎭 Faker Integration - Deep dive into Faker capabilities
- 📊 Advanced Examples - See complex real-world configuration examples