Faker Integration

CLOE Synthetic Data Generator uses the Faker library to generate realistic synthetic data. This guide covers the core concepts and configuration patterns for using Faker effectively.

What is Faker?

Faker is a Python library that generates fake data. Its providers cover personal information, business data, addresses, dates, identifiers, and more.

Complete Faker Documentation

For the full list of available providers and methods, visit the official Faker documentation. This guide covers the core concepts; refer to the official docs for provider-specific details.

Basic Usage

Simple Functions

Most Faker functions work without any options:

columns:
  - name: "first_name"
    data_type: "string"
    faker_function: "first_name"    # Generates: "John", "Sarah", "Michael"

  - name: "email"
    data_type: "string"
    faker_function: "email"         # Generates: "john@example.com"

  - name: "company"
    data_type: "string"
    faker_function: "company"       # Generates: "Smith Inc", "Johnson LLC"

Functions with Options

Customize output with faker_options, a mapping of arguments for the chosen Faker function:

columns:
  - name: "age"
    data_type: "integer"
    faker_function: "random_int"
    faker_options:
      min: 18
      max: 80

  - name: "description"
    data_type: "string"
    faker_function: "text"
    faker_options:
      max_nb_chars: 200

  - name: "salary"
    data_type: "double"
    faker_function: "pyfloat"
    faker_options:
      left_digits: 6       # max_value 150000 has six integer digits
      right_digits: 2
      positive: true
      min_value: 30000
      max_value: 150000
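
The mapping from a column definition to a Faker call can be pictured with a short sketch. It assumes that the generator looks up faker_function by name on a provider object and passes faker_options as keyword arguments; that is an assumption about the internals, not documented behavior. StubProvider and generate_value are hypothetical names, and the stub stands in for a real faker.Faker() instance so the sketch runs without Faker installed.

```python
from typing import Any


class StubProvider:
    """Hypothetical stand-in for a faker.Faker instance (assumption, not the real API)."""

    def random_int(self, min: int = 0, max: int = 9999) -> int:
        # Returns the midpoint instead of a random draw to keep the demo predictable.
        return (min + max) // 2

    def first_name(self) -> str:
        return "Sarah"


def generate_value(provider: Any, column: dict) -> Any:
    """Resolve column['faker_function'] on the provider and call it with faker_options."""
    method = getattr(provider, column["faker_function"], None)
    if not callable(method):
        raise ValueError(f"Unknown faker_function: {column['faker_function']!r}")
    options = column.get("faker_options") or {}
    return method(**options)


age_column = {
    "name": "age",
    "data_type": "integer",
    "faker_function": "random_int",
    "faker_options": {"min": 18, "max": 80},
}
name_column = {"name": "first_name", "data_type": "string", "faker_function": "first_name"}

provider = StubProvider()
print(generate_value(provider, age_column))   # 49
print(generate_value(provider, name_column))  # Sarah
```

Under this reading, a misspelled option name would surface as a TypeError from the underlying function, so option keys need to match the Faker method's parameter names exactly.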

Core Faker Patterns

Custom Choices

Create domain-specific data with random_element, which picks one value from a list you supply:

columns:
  - name: "status"
    data_type: "string"
    faker_function: "random_element"
    faker_options:
      elements: ["active", "inactive", "pending", "suspended"]

  - name: "priority"
    data_type: "string"
    faker_function: "random_element"
    faker_options:
      elements: ["low", "medium", "high", "critical"]

Pattern-Based Generation

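Use bothify (each # becomes a digit, each ? a letter) and hexify (each ^ becomes a hexadecimal digit) to generate values that follow a fixed pattern: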

columns:
  # Product SKUs: "PRD-1234-ABC"
  - name: "sku"
    data_type: "string"
    faker_function: "bothify"
    faker_options:
      text: "PRD-####-???"
      letters: "ABCDEFGHIJKLMNOPQRSTUVWXYZ"   # by default "?" can also yield lowercase letters

  # License keys made of uppercase hex digits, e.g. "3F9A-0C1B-77DE-A045"
  - name: "license_key"
    data_type: "string"
    faker_function: "hexify"
    faker_options:
      text: "^^^^-^^^^-^^^^-^^^^"
      upper: true
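
The placeholder rules can be illustrated with a standard-library sketch. It is not Faker's implementation: expand_pattern is a hypothetical helper that merges the bothify and hexify placeholders into one function, and it ignores the other placeholder characters Faker supports.

```python
import random
import string


def expand_pattern(text: str, rng: random.Random,
                   letters: str = string.ascii_letters,
                   upper_hex: bool = False) -> str:
    """Replace '#' with a digit, '?' with a letter, and '^' with a hex digit."""
    hex_digits = "0123456789ABCDEF" if upper_hex else "0123456789abcdef"
    out = []
    for ch in text:
        if ch == "#":
            out.append(rng.choice(string.digits))
        elif ch == "?":
            out.append(rng.choice(letters))
        elif ch == "^":
            out.append(rng.choice(hex_digits))
        else:
            out.append(ch)  # literal characters such as "PRD-" pass through unchanged
    return "".join(out)


rng = random.Random(7)  # seeded so repeated runs give the same values
sku = expand_pattern("PRD-####-???", rng, letters=string.ascii_uppercase)
license_key = expand_pattern("^^^^-^^^^-^^^^-^^^^", rng, upper_hex=True)
print(sku)
print(license_key)
```

Restricting letters to uppercase characters is what keeps the SKU suffix in the "ABC" style; the default alphabet in this sketch mixes upper and lower case.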

Date and Time Patterns

Generate dates within a range. Bounds can be relative offsets such as -1y (one year ago) or -30d (30 days ago), or the keyword now:

columns:
  - name: "created_at"
    data_type: "timestamp"
    faker_function: "date_time_between"
    faker_options:
      start_date: "-1y"    # 1 year ago
      end_date: "now"      # Current time

  - name: "birth_date"
    data_type: "date"
    faker_function: "date_between"
    faker_options:
      start_date: "-65y"   # 65 years ago
      end_date: "-18y"     # 18 years ago
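
To make the relative offsets concrete, here is a small sketch of how such strings could be resolved against a reference time. resolve_offset is a hypothetical helper, it supports only the d, w, and y units, and it approximates a year as 365 days, which may differ from the conversion Faker applies internally.

```python
import re
from datetime import datetime, timedelta

_UNIT_DAYS = {"d": 1, "w": 7, "y": 365}  # assumption: a year counts as 365 days
_OFFSET = re.compile(r"^([+-])(\d+)([dwy])$")


def resolve_offset(value: str, now: datetime) -> datetime:
    """Turn 'now' or an offset such as '-30d' into an absolute datetime."""
    if value == "now":
        return now
    match = _OFFSET.match(value)
    if match is None:
        raise ValueError(f"Unsupported offset: {value!r}")
    sign, amount, unit = match.groups()
    delta = timedelta(days=int(amount) * _UNIT_DAYS[unit])
    return now + delta if sign == "+" else now - delta


now = datetime(2024, 6, 1, 12, 0, 0)  # fixed reference time for a reproducible demo
start = resolve_offset("-30d", now)
end = resolve_offset("now", now)
print(start.isoformat())  # 2024-05-02T12:00:00
print(start <= end)       # True
```

A check like start <= end is worth running on every date column, since swapped bounds are an easy mistake when both are negative offsets.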

Boolean with Probability

Control the share of true values with chance_of_getting_true, a percentage from 0 to 100:

columns:
  # 80% chance of being true
  - name: "is_active"
    data_type: "boolean"
    faker_function: "boolean"
    faker_options:
      chance_of_getting_true: 80

  # 20% chance of being true  
  - name: "is_premium"
    data_type: "boolean"
    faker_function: "boolean"
    faker_options:
      chance_of_getting_true: 20
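
The effect of chance_of_getting_true can be checked with a quick simulation. The sketch below uses only the standard library; weighted_boolean is a hypothetical helper that draws an integer from 1 to 100 and compares it with the threshold, which is our understanding of how Faker's boolean behaves.

```python
import random


def weighted_boolean(rng: random.Random, chance_of_getting_true: int = 50) -> bool:
    """Return True with roughly the given percentage probability."""
    return rng.randint(1, 100) <= chance_of_getting_true


rng = random.Random(42)  # seeded so the simulation is repeatable
samples = [weighted_boolean(rng, 80) for _ in range(10_000)]
true_share = sum(samples) / len(samples)
print(f"share of True values: {true_share:.3f}")
```

With 10,000 draws the observed share should land close to 0.80; small tables will deviate more, so an exact 80/20 split should not be expected in short runs.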

Common Providers by Category

Provider Categories

Faker organizes providers into categories. Here are the most commonly used ones:

Personal Data: first_name, last_name, name, email, phone_number, ssn

Address: address, city, state, country, zipcode, latitude, longitude

Business: company, job, catch_phrase, bs

Finance: credit_card_number, iban, currency_code

Internet: user_name, domain_name, url, ipv4, mac_address

Text: sentence, paragraph, text, words

Identifiers: uuid4, ean13, isbn13

Dates: date, date_between, date_time, date_time_between

For detailed documentation of each provider and their options, see the Faker Providers Documentation.

Localization

Faker supports multiple locales for region-specific data:

columns:
  # US-specific phone numbers
  - name: "us_phone"
    data_type: "string"
    faker_function: "phone_number"
    faker_options:
      locale: "en_US"

  # German addresses  
  - name: "de_address"
    data_type: "string"
    faker_function: "address"
    faker_options:
      locale: "de_DE"

Available locales include en_US, de_DE, fr_FR, es_ES, ja_JP, and many more. See the Faker Localization Documentation for the complete list.

Advanced Patterns

Weighted Random Choices

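Assign a different probability to each choice. In upstream Faker, random_choices returns a list and takes no weights keyword, so confirm how your generator version handles the options below: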

columns:
  - name: "plan_type"
    data_type: "string"
    faker_function: "random_choices"
    faker_options:
      elements: ["free", "basic", "premium", "enterprise"]
      weights: [50, 30, 15, 5]    # 50% free, 30% basic, etc.
      length: 1
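
The intended distribution can be sketched with the standard library. The example keeps the weights in an OrderedDict that maps each element to its weight, which is the form upstream Faker accepts for weighted selection in random_element and random_elements. pick_weighted is a hypothetical helper, not part of Faker or the generator.

```python
import random
from collections import Counter, OrderedDict

# Element -> weight, kept in a fixed order.
plan_weights = OrderedDict([("free", 50), ("basic", 30), ("premium", 15), ("enterprise", 5)])


def pick_weighted(rng: random.Random, weighted: OrderedDict) -> str:
    """Pick one key with probability proportional to its weight."""
    return rng.choices(list(weighted.keys()), weights=list(weighted.values()), k=1)[0]


rng = random.Random(2024)  # seeded so the counts are repeatable
total = 10_000
counts = Counter(pick_weighted(rng, plan_weights) for _ in range(total))
for plan in plan_weights:
    print(f"{plan}: {counts[plan] / total:.3f}")
```

Returning a single string rather than a one-element list keeps the value compatible with a string column.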

Complex Numeric Patterns

Constrain the range and precision of numbers with pyfloat. Make sure left_digits is wide enough to hold max_value:

columns:
  # Prices with 2 decimal places ($1.00 to $999.99)
  - name: "price"
    data_type: "double"
    faker_function: "pyfloat"
    faker_options:
      left_digits: 3
      right_digits: 2
      positive: true
      min_value: 1.00
      max_value: 999.99

  # Completion rate as a fraction (0.0 to 1.0)
  - name: "completion_rate"
    data_type: "double"
    faker_function: "pyfloat"
    faker_options:
      left_digits: 1
      right_digits: 3
      min_value: 0.0     # positive: true is omitted; recent Faker versions reject it with a zero min_value
      max_value: 1.0
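
Conflicting pyfloat options are easy to write and only fail at generation time, so a pre-flight check can help. The sketch below mirrors validation rules that recent Faker releases are understood to apply; check_pyfloat_options is a hypothetical helper, and the exact rules may vary by Faker version.

```python
import math
from typing import List


def check_pyfloat_options(options: dict) -> List[str]:
    """Return human-readable problems found in a pyfloat faker_options mapping."""
    problems = []
    left_digits = options.get("left_digits")
    min_value = options.get("min_value")
    max_value = options.get("max_value")

    if min_value is not None and max_value is not None and min_value >= max_value:
        problems.append("min_value must be smaller than max_value")
    if options.get("positive") and min_value is not None and min_value <= 0:
        problems.append("positive: true conflicts with a zero or negative min_value")
    if left_digits is not None:
        for label, bound in (("min_value", min_value), ("max_value", max_value)):
            # Zero and missing bounds are skipped because log10 is undefined for them.
            if bound and math.ceil(math.log10(abs(bound))) > left_digits:
                problems.append(f"{label} does not fit within left_digits={left_digits}")
    return problems


price = {"left_digits": 3, "right_digits": 2, "positive": True,
         "min_value": 1.00, "max_value": 999.99}
too_narrow = {"left_digits": 5, "right_digits": 2, "min_value": 30000, "max_value": 150000}
zero_floor = {"positive": True, "min_value": 0.0, "max_value": 1.0}

for label, options in (("price", price), ("too_narrow", too_narrow), ("zero_floor", zero_floor)):
    print(label, check_pyfloat_options(options))
```

An empty list means no conflict was detected by these particular checks; it does not guarantee that every Faker version accepts the options.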

Performance Tips

Function Performance

Some Faker functions are faster than others:

Fast: uuid4, random_int, boolean, random_element
Medium: first_name, email, company, address
Slower: paragraph, text (with large content), complex pattern matching

For very large datasets (millions of records), prefer faster functions when possible.

Best Practices

Use simple functions for high-volume generation
Increase batch_size in your configuration for better performance
Cache repeated patterns by using random_element with predefined lists
Monitor memory usage when generating large text content
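
The idea of caching repeated patterns can be sketched as follows: build a small pool of values once, then sample from it for every row instead of generating a fresh value each time. build_pool is a hypothetical helper using only the standard library; in a configuration, the resulting pool would play the role of the elements list passed to random_element.

```python
import random
import string


def build_pool(rng: random.Random, size: int, length: int = 12) -> list:
    """Generate `size` random codes once so they can be reused across many rows."""
    alphabet = string.ascii_uppercase + string.digits
    return ["".join(rng.choice(alphabet) for _ in range(length)) for _ in range(size)]


rng = random.Random(1)  # seeded so the pool is repeatable
pool = build_pool(rng, size=50)

# Each row costs one cheap choice() call instead of building a new 12-character code.
rows = [rng.choice(pool) for _ in range(10_000)]
print(len(pool), len(rows), len(set(rows)))
```

The trade-off is lower cardinality: the column can never contain more distinct values than the pool holds, so this suits categorical fields rather than unique identifiers.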

Complete Example

Here's a realistic configuration combining multiple patterns:

name: "User Management System"
target:
  catalog: "main"
  schema: "user_data"
  table: "users"
  write_mode: "overwrite"

num_records: 10000
batch_size: 1000

columns:
  # Unique identifier
  - name: "user_id"
    data_type: "string"
    nullable: false
    faker_function: "uuid4"

  # Personal information
  - name: "first_name"
    data_type: "string"
    nullable: false
    faker_function: "first_name"

  - name: "last_name"
    data_type: "string"
    nullable: false
    faker_function: "last_name"

  - name: "email"
    data_type: "string"
    nullable: false
    faker_function: "email"

  # Demographics
  - name: "age"
    data_type: "integer"
    nullable: true
    faker_function: "random_int"
    faker_options:
      min: 18
      max: 80

  - name: "country"
    data_type: "string"
    nullable: false
    faker_function: "country_code"   # Two-letter codes such as "US" or "DE", not country names

  # Account details
  - name: "account_type"
    data_type: "string"
    nullable: false
    faker_function: "random_element"
    faker_options:
      elements: ["free", "premium", "enterprise"]

  - name: "is_active"
    data_type: "boolean"
    nullable: false
    faker_function: "boolean"
    faker_options:
      chance_of_getting_true: 85

  # Timestamps
  - name: "created_at"
    data_type: "timestamp"
    nullable: false
    faker_function: "date_time_between"
    faker_options:
      start_date: "-2y"
      end_date: "now"

  - name: "last_login"
    data_type: "timestamp"
    nullable: true
    faker_function: "date_time_between"
    faker_options:
      start_date: "-30d"
      end_date: "now"
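
Before running a configuration of this size, a quick structural check can catch typos early. The sketch below works on the columns list after the YAML has been loaded into Python dictionaries. lint_columns is a hypothetical helper, and the set of required keys is an assumption based on the examples in this guide, not a documented schema.

```python
from collections import Counter
from typing import List

# Assumption: every column needs these keys, as in the examples in this guide.
REQUIRED_KEYS = ("name", "data_type", "faker_function")


def lint_columns(columns: List[dict]) -> List[str]:
    """Report missing required keys and duplicate column names."""
    problems = []
    for index, column in enumerate(columns):
        missing = [key for key in REQUIRED_KEYS if key not in column]
        if missing:
            problems.append(f"column #{index}: missing {', '.join(missing)}")
    name_counts = Counter(column["name"] for column in columns if "name" in column)
    for name, count in name_counts.items():
        if count > 1:
            problems.append(f"duplicate column name: {name}")
    return problems


columns = [
    {"name": "user_id", "data_type": "string", "faker_function": "uuid4"},
    {"name": "email", "data_type": "string", "faker_function": "email"},
    {"name": "email", "data_type": "string"},  # deliberately broken entry
]
for problem in lint_columns(columns):
    print(problem)
```

The same pattern extends to other checks, such as verifying that each faker_function name exists on the Faker instance in use.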

Next Steps

📚 Configuration Guide - Learn about complete YAML structure
🔍 Table Discovery - Auto-discover and generate configs
📊 Examples - See domain-specific examples
🌐 Faker Documentation - Complete provider reference