
CLOE Synthetic Data Generator

  • Generate Fake Data


    Create realistic synthetic data for your Databricks Unity Catalog tables with ease

    Getting Started

  • YAML Configuration


    Define data generation rules declaratively using simple YAML configuration files

    Configuration Guide

  • CLI Commands


    Powerful command-line interface for data generation, validation, and table discovery

    CLI Reference

  • Auto-Discovery


    Automatically discover existing tables and generate configuration files

    Table Discovery

What is CLOE Synthetic Data Generator?

CLOE Synthetic Data Generator is a Python library that creates realistic synthetic data for your Databricks Unity Catalog tables. It addresses several common challenges in data engineering and analytics:

Problems It Solves

Common Data Challenges

  • Development & Testing: Need realistic data for development environments without exposing sensitive production data
  • Data Privacy: Generate synthetic data that maintains statistical properties while protecting privacy
  • Performance Testing: Create large datasets to test query performance and data pipeline scalability
  • Demo & Training: Generate consistent, realistic datasets for demonstrations and training purposes

Key Features

  • 🎭 Faker Integration: Leverage the powerful Faker library with 100+ data providers
  • 📄 YAML Configuration: Define data generation rules using simple, declarative YAML files
  • 🏗️ Type-Safe Schema: Pydantic v2 models ensure your configurations are valid
  • 🎯 Unity Catalog Support: Direct integration with Databricks Unity Catalog
  • 🔧 Flexible Data Types: Support for all common Spark SQL data types
  • 🚀 CLI Interface: Easy-to-use command-line tools for common operations
  • 🔍 Auto-Discovery: Automatically discover existing tables and generate configurations

Architecture Overview

graph LR
    A[YAML Config] --> B[Config Loader]
    B --> C[Data Generator]
    C --> D[Faker Library]
    C --> E[Pandas DataFrame]
    E --> F[Spark DataFrame]
    F --> G[Unity Catalog]

    H[Table Discovery] --> I[Auto Config Generation]
    I --> A

Quick Example

Here's a simple example to get you started:

  1. Create a configuration file (user_data.yaml):
name: "User Data Generation"
target:
  catalog: "main"
  schema: "test_data"
  table: "users"
  write_mode: "overwrite"

num_records: 1000

columns:
  - name: "user_id"
    data_type: "string"
    nullable: false
    faker_function: "uuid4"

  - name: "first_name"
    data_type: "string"
    nullable: false
    faker_function: "first_name"

  - name: "email"
    data_type: "string"
    nullable: false
    faker_function: "email"
  2. Generate the data:
cloe-synthetic-data-generator generate --config user_data.yaml

That's it! The Unity Catalog table main.test_data.users will be populated with 1,000 rows of realistic user data.
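To give a feel for how a faker_function entry in the configuration could drive row generation, here is a minimal sketch. It is an illustration under stated assumptions, not the library's actual implementation: generate_rows and StubProvider are hypothetical names, and the stub stands in for a Faker instance so the snippet runs without extra dependencies.

```python
import uuid
from typing import Any, Dict, List

# Column definitions mirroring the `columns` section of user_data.yaml.
COLUMNS = [
    {"name": "user_id", "faker_function": "uuid4"},
    {"name": "first_name", "faker_function": "first_name"},
    {"name": "email", "faker_function": "email"},
]


def generate_rows(
    columns: List[Dict[str, str]], num_records: int, provider: Any
) -> List[Dict[str, Any]]:
    """Hypothetical helper: build one dict per record by calling the
    provider method named in each column's `faker_function`."""
    rows = []
    for _ in range(num_records):
        row = {}
        for col in columns:
            # Look the generator method up by name, e.g. provider.email
            method = getattr(provider, col["faker_function"])
            row[col["name"]] = method()
        rows.append(row)
    return rows


class StubProvider:
    """Hypothetical stand-in for a Faker instance, returning fixed values."""

    def uuid4(self) -> str:
        return str(uuid.uuid4())

    def first_name(self) -> str:
        return "Ada"

    def email(self) -> str:
        return "ada@example.com"


rows = generate_rows(COLUMNS, 3, StubProvider())
print(rows[0])
```

Because the Faker class from the faker package exposes uuid4(), first_name(), and email() methods, passing Faker() as the provider in place of StubProvider() should yield randomized values for the same columns. Following the architecture diagram, the resulting list of dicts is the kind of structure that could then be loaded into a Pandas DataFrame.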

What's Next?