Getting Started¶
This guide will help you set up and run your first synthetic data generation with CLOE Synthetic Data Generator.
Prerequisites¶
Before you begin, make sure you have:
- Python 3.12+: The library requires Python 3.12 or later
- Databricks Access: Valid Databricks workspace with Unity Catalog enabled
- Databricks Connect: Configured connection to your Databricks workspace
Databricks Connect Setup
If you haven't set up Databricks Connect yet, follow the official Databricks Connect documentation to configure your environment.
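As a quick starting point, one common approach is to configure the connection through environment variables (variable names come from the Databricks SDK; the values below are placeholders):
# Example environment-variable configuration (placeholder values)
export DATABRICKS_HOST=https://your-workspace.cloud.databricks.com
export DATABRICKS_TOKEN=your-personal-access-token
export DATABRICKS_CLUSTER_ID=your-cluster-id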
Installation¶
CLOE Synthetic Data Generator is available on PyPI and can be installed using your preferred Python package manager:
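For example, with pip (the PyPI package name is assumed here to match the CLI entry point):
# Install the latest release from PyPI
pip install cloe-synthetic-data-generator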
Verify Installation¶
Test your installation by checking the available commands:
# Check if the CLI is available
cloe-synthetic-data-generator --help
# Test Databricks connection
test-connection
Expected Output
If everything is set up correctly, you should see:
- The help message for the CLI tool
- A successful connection message from the test-connection command
Your First Data Generation¶
The easiest way to get started is to use the discovery feature to automatically generate configurations from existing tables in your Databricks workspace.
Step 1: Discover Existing Tables¶
First, let's discover tables in your workspace and generate configurations automatically:
# Discover all tables in a catalog and schema
cloe-synthetic-data-generator discover \
--catalog main \
--schema default \
--num-records 100 \
--output-dir ./my_configs
Choosing Catalog and Schema
Replace main and default with your actual catalog and schema names. If you're unsure what's available, check your Databricks workspace or ask your administrator.
This command will:
- Connect to your Databricks workspace
- Scan all tables in the specified catalog and schema
- Analyze column names and types to suggest appropriate Faker functions
- Generate YAML configuration files in the ./my_configs directory
Step 2: Review Generated Configurations¶
After discovery completes, you'll see output like:
Discovered Tables in main.default
┌────────────┬─────────┬─────────────────────┐
│ Table Name │ Columns │ Full Path           │
├────────────┼─────────┼─────────────────────┤
│ users      │ 6       │ main.default.users  │
│ orders     │ 8       │ main.default.orders │
└────────────┴─────────┴─────────────────────┘
Successfully discovered 2 tables and generated 2 YAML configuration files
List the generated configurations:
# See what configurations were created
cloe-synthetic-data-generator list-configs ./my_configs --verbose
Step 3: Examine and Customize a Configuration¶
Let's look at one of the generated configurations:
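The discover step names files after the table's full path, so the users configuration can be printed with:
# Print the generated configuration for the users table
cat ./my_configs/main_default_users_config.yaml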
You'll see something like:
name: Users Data Generation
num_records: 100
columns:
  - name: user_id
    data_type: string
    nullable: false
    faker_function: uuid4 # Auto-detected as ID field
  - name: first_name
    data_type: string
    nullable: false
    faker_function: first_name # Auto-detected from column name
  - name: email
    data_type: string
    nullable: true
    faker_function: email # Auto-detected from column name
  - name: created_at
    data_type: timestamp
    nullable: false
    faker_function: date_time_between
    faker_options:
      start_date: -1y
      end_date: now
target:
  catalog: main
  schema: default
  table: users
  write_mode: overwrite
Smart Detection
Notice how the discovery process automatically:
- Detected user_id as an ID field and suggested uuid4
- Recognized first_name and suggested the appropriate Faker function
- Identified email columns and suggested email generation
- Set realistic date ranges for timestamp fields
Step 4: Validate Your Configuration¶
Before generating data, validate the discovered configuration:
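For example (assuming validate-config accepts the same --config option as generate):
# Check the YAML before writing any data
cloe-synthetic-data-generator validate-config --config ./my_configs/main_default_users_config.yaml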
Validation Success
You should see a green checkmark and details about your configuration, including column definitions.
Step 5: Generate the Data¶
Now generate synthetic data using the discovered configuration:
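# Generate and write the data defined in the discovered configuration
cloe-synthetic-data-generator generate --config ./my_configs/main_default_users_config.yaml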
What Happens Next
The tool will:
- Connect to your Databricks workspace
- Generate 100 rows of fake data using the auto-detected Faker functions
- Convert the data to a Spark DataFrame
- Write the data to your Unity Catalog table
- Verify the write was successful
Step 6: Verify Your Data¶
You can verify the data was created by querying it in Databricks:
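For example, in a notebook or the SQL editor:
-- Preview the generated rows
SELECT * FROM main.default.users LIMIT 10;

-- Confirm the count matches the configured num_records (100)
SELECT COUNT(*) AS record_count FROM main.default.users;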
Step 7: Customize for Your Needs (Optional)¶
The auto-generated configuration provides a great starting point, but you can customize it:
# Edit the generated file to customize Faker options
- name: "email"
  data_type: "string"
  nullable: false
  faker_function: "email"
  faker_options:
    domain: "yourcompany.com" # Use your company domain
- name: "age"
  data_type: "integer"
  nullable: true
  faker_function: "random_int"
  faker_options:
    min: 25 # Adjust age range
    max: 65 # for your use case
Discovery for New Tables
If you don't have existing tables to discover from, you can:
- Create a simple table with just column names and types in Databricks (see the sketch below)
- Use the discovery feature to generate the initial configuration
- Drop the empty table and use the generated config to create realistic data
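A minimal sketch of such a placeholder table, reusing the users schema from this guide (table and column names are illustrative):
-- Empty placeholder table; discovery only reads the schema, not the data
CREATE TABLE main.default.users (
  user_id    STRING NOT NULL,
  first_name STRING NOT NULL,
  email      STRING,
  created_at TIMESTAMP NOT NULL
);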
Alternative: Manual Configuration¶
If you prefer to create configurations manually or don't have existing tables to discover from, you can still create YAML files manually. See the Configuration Guide for detailed instructions.
Understanding the Output¶
When you run the generate command, you'll see rich console output showing:
- Configuration Details: Summary of your target table and settings
- Progress Indicators: Real-time progress of each step
- Sample Data: Preview of the generated data
- Success Confirmation: Final confirmation with record counts
Example output:
Configuration: Users Data Generation
┌──────────────┬────────────────────┐
│ Property     │ Value              │
├──────────────┼────────────────────┤
│ Target Table │ main.default.users │
│ Columns      │ 6                  │
│ Records      │ 100                │
│ Write Mode   │ overwrite          │
└──────────────┴────────────────────┘
✅ Connected to Databricks!
✅ Completed successfully!
Successfully generated 100 records and wrote to main.default.users
Common CLI Options¶
Here are some useful command-line options to get you started:
Override Record Count¶
# Generate 500 records instead of the configured 100
cloe-synthetic-data-generator generate --config ./my_configs/main_default_users_config.yaml --num-records 500
Verbose Logging¶
# Enable detailed logging for troubleshooting
cloe-synthetic-data-generator generate --config ./my_configs/main_default_users_config.yaml --verbose
Process Multiple Configurations¶
# Generate data for all discovered configurations at once
cloe-synthetic-data-generator generate --config-dir ./my_configs/
Discover Specific Tables¶
# Use regex to discover only specific tables
cloe-synthetic-data-generator discover \
--catalog main \
--schema default \
--table-regex "user.*|customer.*" \
--output-dir ./my_configs
Next Steps¶
Now that you've successfully generated your first synthetic dataset:
- Learn about Configuration Options - Explore all available configuration options
- Discover Existing Tables - Auto-generate configs from existing tables
- CLI Reference - Explore all CLI commands and options
- Faker Integration - Learn about advanced Faker usage
Troubleshooting¶
Common Issues¶
Connection Issues
Problem: Failed to connect to Databricks
Solution:
- Verify your Databricks Connect configuration
- Check your workspace URL and access token
- Ensure you have Unity Catalog access
- Run test-connection to diagnose connection issues
Permission Issues
Problem: Permission denied to write to catalog/schema
Solution:
- Verify you have CREATE TABLE permissions in the target catalog/schema
- Check with your Databricks administrator about Unity Catalog permissions
- Try using a different catalog/schema you have access to
Configuration Errors
Problem: Configuration validation failed
Solution:
- Use the validate-config command to check your YAML syntax
- Ensure all required fields are present
- Check that data types match supported Spark SQL types
- Verify that Faker function names are correct
Getting Help¶
If you encounter issues:
- Check the logs: Use the --verbose flag for detailed error information
- Validate configuration: Use the validate-config command
- Test connection: Use the test-connection command
- Review examples: Check the sample configurations in the repository