CLOE Synthetic Data Generator¶
- Generate Fake Data: Create realistic synthetic data for your Databricks Unity Catalog tables with ease
- YAML Configuration: Define data generation rules declaratively using simple YAML configuration files
- CLI Commands: Powerful command-line interface for data generation, validation, and table discovery
- Auto-Discovery: Automatically discover existing tables and generate configuration files
What is CLOE Synthetic Data Generator?¶
CLOE Synthetic Data Generator is a powerful Python library that helps you create realistic synthetic data for your Databricks Unity Catalog tables. It's designed to solve common challenges in data engineering and analytics:
Problem it Solves¶
Common Data Challenges
- Development & Testing: Need realistic data for development environments without exposing sensitive production data
- Data Privacy: Generate synthetic data that maintains statistical properties while protecting privacy
- Performance Testing: Create large datasets to test query performance and data pipeline scalability
- Demo & Training: Generate consistent, realistic datasets for demonstrations and training purposes
Key Features¶
- 🎭 Faker Integration: Leverage the powerful Faker library with 100+ data providers
- 📄 YAML Configuration: Define data generation rules using simple, declarative YAML files
- 🏗️ Type-Safe Schema: Pydantic v2 models ensure your configurations are valid
- 🎯 Unity Catalog Support: Direct integration with Databricks Unity Catalog
- 🔧 Flexible Data Types: Support for all common Spark SQL data types
- 🚀 CLI Interface: Easy-to-use command-line tools for common operations
- 🔍 Auto-Discovery: Automatically discover existing tables and generate configurations
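To illustrate what a type-safe schema buys you, here is a minimal sketch of validating column definitions before any data is generated. It is not the library's actual model code: the library is described as using Pydantic v2, while this stand-in uses only standard-library dataclasses. The class `ColumnSpec`, the function `parse_columns`, and the `ALLOWED_TYPES` set are hypothetical names; only the field names are taken from the YAML keys in the Quick Example.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical subset of Spark SQL type names accepted by this sketch.
ALLOWED_TYPES = {"string", "int", "bigint", "double", "boolean", "date", "timestamp"}


@dataclass(frozen=True)
class ColumnSpec:
    """One column entry; field names mirror the YAML keys in the Quick Example."""

    name: str
    data_type: str
    nullable: bool = True
    faker_function: str = ""

    def __post_init__(self) -> None:
        # Fail early with a readable message instead of at write time.
        if not self.name:
            raise ValueError("column name must not be empty")
        if self.data_type not in ALLOWED_TYPES:
            raise ValueError(f"unsupported data_type: {self.data_type!r}")
        if not isinstance(self.nullable, bool):
            raise TypeError("nullable must be a boolean")


def parse_columns(raw_columns: list[dict]) -> list[ColumnSpec]:
    """Validate raw column dicts; unknown keys raise TypeError."""
    specs = [ColumnSpec(**raw) for raw in raw_columns]
    names = [spec.name for spec in specs]
    if len(names) != len(set(names)):
        raise ValueError("duplicate column names")
    return specs
```

A misspelled key or an unknown type is then rejected when the configuration is loaded, rather than surfacing later as a failed table write.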
Architecture Overview¶
graph LR
    A[YAML Config] --> B[Config Loader]
    B --> C[Data Generator]
    C --> D[Faker Library]
    C --> E[Pandas DataFrame]
    E --> F[Spark DataFrame]
    F --> G[Unity Catalog]
    H[Table Discovery] --> I[Auto Config Generation]
    I --> A
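The discovery branch of the diagram (Table Discovery, then Auto Config Generation, then YAML Config) can be pictured with a small sketch. Everything here is an assumption made for illustration: the function `build_config`, the `NAME_HINTS` and `TYPE_DEFAULTS` tables, and the placement of `write_mode` under `target` are hypothetical, and the library's real discovery rules are not described on this page. The Faker method names used as values (`email`, `first_name`, `last_name`, `word`, `random_int`, `date_object`) do exist in Faker.

```python
from __future__ import annotations

# Hypothetical name-based hints: column name -> Faker method name.
NAME_HINTS = {"email": "email", "first_name": "first_name", "last_name": "last_name"}

# Hypothetical fallbacks per Spark SQL type, used when no name hint matches.
TYPE_DEFAULTS = {"string": "word", "int": "random_int", "date": "date_object"}


def build_config(
    catalog: str,
    schema: str,
    table: str,
    discovered_columns: list[tuple[str, str, bool]],
    num_records: int = 1000,
) -> dict:
    """Turn (name, data_type, nullable) tuples into a config-shaped dict."""
    columns = []
    for name, data_type, nullable in discovered_columns:
        # Prefer a hint based on the column name, then fall back to the type.
        faker_function = NAME_HINTS.get(name) or TYPE_DEFAULTS.get(data_type, "word")
        columns.append(
            {
                "name": name,
                "data_type": data_type,
                "nullable": nullable,
                "faker_function": faker_function,
            }
        )
    return {
        "name": f"{table} data generation",
        # Assumed nesting: write_mode is placed under target in this sketch.
        "target": {
            "catalog": catalog,
            "schema": schema,
            "table": table,
            "write_mode": "overwrite",
        },
        "num_records": num_records,
        "columns": columns,
    }
```

The resulting dictionary has the same keys as the YAML file in the Quick Example, so it could be serialized and then edited by hand where the guessed `faker_function` is not a good fit.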
Quick Example¶
Here's a simple example to get you started:
- Create a configuration file (user_data.yaml):
name: "User Data Generation"
target:
  catalog: "main"
  schema: "test_data"
  table: "users"
  write_mode: "overwrite"
num_records: 1000
columns:
  - name: "user_id"
    data_type: "string"
    nullable: false
    faker_function: "uuid4"
  - name: "first_name"
    data_type: "string"
    nullable: false
    faker_function: "first_name"
  - name: "email"
    data_type: "string"
    nullable: false
    faker_function: "email"
- Generate the data:
That's it! Your Unity Catalog table will be populated with 1000 rows of realistic user data.
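To make the generation step less of a black box, the sketch below shows the general idea of mapping each column's `faker_function` name to a callable and producing one dictionary per record. It is a simplified stand-in under stated assumptions, not the library's implementation: the standard-library `random` and `uuid` modules replace Faker, the pandas and Spark conversion and the Unity Catalog write are omitted, and `make_providers`, `generate_rows`, and `FIRST_NAMES` are hypothetical names.

```python
from __future__ import annotations

import random
import uuid

# Tiny stand-in vocabulary; Faker would supply far more variety.
FIRST_NAMES = ["Ada", "Grace", "Linus", "Margaret"]


def make_providers(rng: random.Random) -> dict:
    """Map faker_function names to stand-in generators (assumption: not Faker)."""
    return {
        "uuid4": lambda: str(uuid.UUID(int=rng.getrandbits(128), version=4)),
        "first_name": lambda: rng.choice(FIRST_NAMES),
        "email": lambda: f"{rng.choice(FIRST_NAMES).lower()}{rng.randrange(1000)}@example.com",
    }


def generate_rows(columns: list[dict], num_records: int, seed: int = 0) -> list[dict]:
    """Produce num_records rows, one value per configured column."""
    rng = random.Random(seed)  # seeded so repeated runs give the same rows
    providers = make_providers(rng)
    missing = [c["faker_function"] for c in columns if c["faker_function"] not in providers]
    if missing:
        raise KeyError(f"no provider for: {missing}")
    return [
        {c["name"]: providers[c["faker_function"]]() for c in columns}
        for _ in range(num_records)
    ]


# Column definitions copied from the configuration file above.
columns = [
    {"name": "user_id", "data_type": "string", "nullable": False, "faker_function": "uuid4"},
    {"name": "first_name", "data_type": "string", "nullable": False, "faker_function": "first_name"},
    {"name": "email", "data_type": "string", "nullable": False, "faker_function": "email"},
]
rows = generate_rows(columns, num_records=1000)
print(len(rows), sorted(rows[0]))
```

Seeding the random generator is a design choice worth copying when you need consistent datasets for demos or tests, since the same seed yields the same rows on every run.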
What's Next?¶
- Install the library and run your first data generation
- Learn how to create and customize YAML configuration files
- Explore all available command-line options and features