Delta Loader Overview
The DeltaLoader performs incremental data loads on Delta tables. It provides
multiple strategies for reading data incrementally, so each run processes only
new or changed data, and it maintains an audit trail of every load through
automated metadata management.
Key Capabilities
- Multiple strategies: Choose between CDF and timestamp-based loading
- Automatic metadata tracking: Seamless continuation across load operations
- Flexible configuration: Customize behavior for different use cases
How Delta Loading Works
Delta loading processes only the data that has changed since your last load. On large datasets this avoids rereading unchanged rows, which shortens load times and reduces resource consumption.
Core Workflow
- Configure: Choose your loading strategy and set options
- Create: Use DeltaLoaderFactory to instantiate the appropriate loader
- Load: Call get_data() to retrieve incremental data
- Process: Apply transformations and business logic
- Write: Update your target tables
- Commit: Mark the load as processed to update metadata
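The steps above can be sketched end to end. The following is a self-contained, in-memory illustration of the workflow, not the library's API: the class InMemoryDeltaLoader, its commit() method, the run_load helper, and the metadata dictionary are hypothetical stand-ins. Only the method name get_data() comes from this page.

```python
class InMemoryDeltaLoader:
    """Hypothetical stand-in for a loader; not the library's DeltaLoader."""

    def __init__(self, source_rows):
        self.source_rows = source_rows        # each row carries a "version"
        self.metadata = {"last_version": 0}   # stand-in for the metadata table
        self._pending_version = None

    def get_data(self):
        """Load: return only rows newer than the last committed version."""
        last = self.metadata["last_version"]
        rows = [r for r in self.source_rows if r["version"] > last]
        if rows:
            self._pending_version = max(r["version"] for r in rows)
        return rows

    def commit(self):
        """Commit (hypothetical method name): persist the load progress."""
        if self._pending_version is not None:
            self.metadata["last_version"] = self._pending_version
            self._pending_version = None


def run_load(loader, target):
    rows = loader.get_data()                                      # Load
    processed = [{**r, "amount": r["amount"] * 2} for r in rows]  # Process
    target.extend(processed)                                      # Write
    loader.commit()                                               # Commit
    return len(processed)


source = [
    {"id": 1, "amount": 10, "version": 1},
    {"id": 2, "amount": 20, "version": 2},
]
target = []
loader = InMemoryDeltaLoader(source)      # Configure and Create
first = run_load(loader, target)          # picks up both existing rows
source.append({"id": 3, "amount": 30, "version": 3})
second = run_load(loader, target)         # picks up only the new row
third = run_load(loader, target)          # nothing new, so nothing is loaded
```

Committing only after the write succeeds is the point of the last step: if processing fails midway, the stored progress is unchanged and the next run retrieves the same rows again.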
| Component | Purpose | 
|---|---|
| DeltaLoadOption | Configuration object defining strategy and options | 
| DeltaLoaderFactory | Factory for creating appropriate loader instances | 
| DeltaLoader | Base interface for all loading strategies | 
| Metadata Table | Tracks loading progress and state | 
- Performance: Process only changed data
- Reliability: Automatic progress tracking
- Scalability: Handle large datasets efficiently
Available Strategies
Choose the right delta loading strategy based on your data characteristics and requirements.
CDF Strategy
Best for: Tables with frequent updates, deletes, and complex change patterns
The DeltaCDFLoader leverages Delta Lake's Change Data Feed to capture all table changes at the transaction level.
Key Features:
- Tracks change types (INSERT and UPDATE)
- Version-based progress tracking
- Automatic deduplication support
- Handles complex change scenarios
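Requirements:
- Change Data Feed enabled on the source Delta table (table property delta.enableChangeDataFeed = true)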
When to Use CDF
- Data undergoes frequent updates
- You need to capture all change types
- Source table supports Change Data Feed
- Complex merge operations are required
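As a rough illustration of version-based progress tracking and deduplication, the sketch below works on plain Python dictionaries instead of a Spark DataFrame. The column names _change_type and _commit_version, and the values insert, update_preimage, and update_postimage, are the ones Delta Lake's Change Data Feed attaches to change rows. The function latest_changes_since, the key column id, and the sample data are assumptions made for this example and are not part of DeltaCDFLoader.

```python
def latest_changes_since(changes, last_version, key="id"):
    """Return the newest change per key with _commit_version > last_version,
    together with the version to store as the new progress marker."""
    latest = {}
    for row in changes:
        if row["_commit_version"] <= last_version:
            continue                  # already processed by an earlier load
        if row["_change_type"] == "update_preimage":
            continue                  # keep only the state after an update
        current = latest.get(row[key])
        if current is None or row["_commit_version"] >= current["_commit_version"]:
            latest[row[key]] = row    # deduplicate: newest version wins
    rows = sorted(latest.values(), key=lambda r: r[key])
    newest = max((r["_commit_version"] for r in changes), default=last_version)
    return rows, max(newest, last_version)


changes = [
    {"id": 1, "value": "a", "_change_type": "insert", "_commit_version": 1},
    {"id": 2, "value": "b", "_change_type": "insert", "_commit_version": 2},
    {"id": 1, "value": "a", "_change_type": "update_preimage", "_commit_version": 3},
    {"id": 1, "value": "a2", "_change_type": "update_postimage", "_commit_version": 3},
    {"id": 2, "value": "b", "_change_type": "update_preimage", "_commit_version": 4},
    {"id": 2, "value": "b2", "_change_type": "update_postimage", "_commit_version": 4},
]

# Version 1 was handled by a previous load, so start after it.
rows, next_version = latest_changes_since(changes, last_version=1)
```

Here the key with id 2 changed twice after version 1 (an insert and an update), and deduplication keeps only its latest state. Storing next_version as the progress marker means a repeated call returns no rows until new commits arrive.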
Timestamp Strategy
Best for: Time-series data and append-only scenarios
The DeltaTimestampLoader filters data based on timestamp columns to identify new records.
Key Features:
- Timestamp-based filtering
- Ideal for time-series data
- Simple append operations
- Custom time range support
Requirements:
- One or more timestamp columns in your data
- Timestamps should be monotonically increasing for new records
When to Use Timestamp
- Working with time-series data
- Append-only data patterns
- No updates to historical records
- Simple incremental processing needs
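Timestamp-based filtering can be pictured as a moving watermark. The sketch below is a minimal plain-Python illustration under stated assumptions: the function rows_after, the column name event_time, the optional upper_bound argument, and the sample data are all hypothetical and do not reflect the DeltaTimestampLoader API.

```python
from datetime import datetime


def rows_after(rows, watermark, ts_column="event_time", upper_bound=None):
    """Return rows with watermark < timestamp <= upper_bound (if given),
    together with the watermark to store for the next load."""
    selected = [
        r for r in rows
        if r[ts_column] > watermark
        and (upper_bound is None or r[ts_column] <= upper_bound)
    ]
    new_watermark = max((r[ts_column] for r in selected), default=watermark)
    return selected, new_watermark


events = [
    {"id": 1, "event_time": datetime(2024, 1, 1, 8, 0)},
    {"id": 2, "event_time": datetime(2024, 1, 1, 9, 0)},
    {"id": 3, "event_time": datetime(2024, 1, 1, 10, 0)},
]

# Everything up to 08:00 was loaded previously.
batch, watermark = rows_after(events, datetime(2024, 1, 1, 8, 0))

# A custom time range: stop at 09:00 instead of reading to the end.
bounded, bounded_watermark = rows_after(
    events, datetime(2024, 1, 1, 8, 0), upper_bound=datetime(2024, 1, 1, 9, 0)
)
```

The sketch also shows why timestamps should be monotonically increasing for new records: a row that arrives late with a timestamp at or below the stored watermark is never selected by a later load.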
Reference Documentation
For detailed API documentation, see the Delta Loader Reference.
Additional Resources
- Scenarios Guide: Detailed examples of both strategies in action