read_files
ReadFilesAction
Bases: PipelineAction
Reads files from a specified location.
If an extension is provided, all files with the given extension will be read
using the FileReader. If no
extension is provided, the spark_format must be set, and all files in the
location will be read using a DataFrameReader with the specified format.
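The selection rule described above can be sketched as a small function. This is only an illustration, not the code in read_files.py: `select_reader` is a hypothetical helper, and the returned strings simply name the two readers mentioned in the text. Raising a `ValueError` when both options are missing is an assumption, based on the statement that spark_format must be set if no extension is given.

```python
from __future__ import annotations


def select_reader(extension: str | None = None, spark_format: str | None = None) -> str:
    """Return the name of the reader implied by the description above.

    Hypothetical helper for illustration only; not part of cloe_nessy.
    """
    if extension:
        # An extension is given: files with that extension go through the FileReader.
        return "FileReader"
    if spark_format:
        # No extension: every file in the location is read with the given format.
        return "DataFrameReader"
    # Assumption: with neither option set there is nothing to read with, so fail early.
    raise ValueError("Provide at least one of 'extension' or 'spark_format'.")


print(select_reader(extension="csv"))
print(select_reader(spark_format="JSON"))
```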
Example
Read Files:
    action: READ_FILES
    options:
        location: json_file_folder/
        search_subdirs: True
        spark_format: JSON
Define Spark Format
Use the spark_format option to specify the format with which
to read the files. Supported formats are e.g., CSV, JSON,
PARQUET, TEXT, and XML.
Read Files:
    action: READ_FILES
    options:
        location: csv_file_folder/
        search_subdirs: True
        extension: csv
Define Extension
Use the extension option to specify the extension of the files
to read. If spark_format is not specified, it will be derived from
the extension.
Read Files:
    action: READ_FILES
    options:
        location: file_folder/
        extension: abc_custom_extension  # specifies the files to read
        spark_format: CSV  # specifies the format to read the files with
Define both Extension & Spark Format
Use the extension option to specify the extension of the files
to read. Additionally, use the spark_format option to specify
the format with which to read the files.
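How the two options interact can be sketched as follows. `resolve_spark_format` is a hypothetical helper, not part of cloe_nessy, and the derivation rule (upper-casing the extension) is an assumption made for illustration; the library's actual mapping from extension to format is not shown in this page.

```python
from __future__ import annotations


def resolve_spark_format(extension: str | None, spark_format: str | None) -> str:
    """Pick the format to read with. Hypothetical helper, not part of cloe_nessy."""
    if spark_format:
        # An explicit format wins, so files with a custom extension can still be read as CSV.
        return spark_format
    if extension:
        # Assumption: the format is the upper-cased extension without a leading dot.
        return extension.lstrip(".").upper()
    raise ValueError("Cannot determine a format without 'extension' or 'spark_format'.")


print(resolve_spark_format("csv", None))
print(resolve_spark_format("abc_custom_extension", "CSV"))
```

Under this assumption, the custom-extension example above only works because spark_format is set explicitly; the extension alone would not name a known format.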
Read Delta Files:
    action: READ_FILES
    options:
        location: /path/to/delta/table
        spark_format: delta
    delta_load_options:
        strategy: CDF
        delta_load_identifier: my_delta_files_load
        strategy_options:
            deduplication_columns: ["id"]
            enable_full_load: false
Delta Loading for Files
Use delta_load_options when reading Delta Lake tables to enable
incremental loading. This works with both CDF and timestamp strategies.
Source code in src/cloe_nessy/pipeline/actions/read_files.py
run(context, *, location=None, search_subdirs=False, extension=None, spark_format=None, schema=None, add_metadata_column=True, options=None, delta_load_options=None, **_) staticmethod
    Reads files from a specified location.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| context | `PipelineContext` | The context in which this Action is executed. | required |
| location | `str \| None` | The location from which to read files. | `None` |
| search_subdirs | `bool` | Recursively search subdirectories for files if an extension is provided. | `False` |
| extension | `str \| None` | The file extension to filter files by. | `None` |
| spark_format | `str \| None` | The format to use for reading the files. If not provided, it will be derived from the file extension. | `None` |
| schema | `str \| None` | The schema of the data. If None, schema is obtained from the context metadata. | `None` |
| add_metadata_column | `bool` | Whether to include the  | `True` |
| options | `dict[str, str] \| None` | Additional options passed to the reader. | `None` |
| delta_load_options | `dict[Any, Any] \| DeltaLoadOptions \| None` | Options for delta loading, if applicable. When provided for Delta format files, enables incremental loading using delta loader strategies. | `None` |
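The effect of search_subdirs when an extension is set can be illustrated on a local directory. This is a sketch under stated assumptions: `list_files` is a hypothetical helper built on `pathlib`, and the action itself may list files differently (for example on cloud storage), so treat it only as a picture of the recursive versus top-level behaviour.

```python
from __future__ import annotations

from pathlib import Path


def list_files(location: str, extension: str, search_subdirs: bool = False) -> list[str]:
    """List files under `location` that end in `.extension`.

    Hypothetical helper for illustration only; not part of cloe_nessy.
    """
    pattern = f"*.{extension.lstrip('.')}"
    root = Path(location)
    # rglob descends into subdirectories; glob only looks at the top level.
    matches = root.rglob(pattern) if search_subdirs else root.glob(pattern)
    return sorted(str(p) for p in matches if p.is_file())
```

With `search_subdirs=False`, a file such as `csv_file_folder/2024/data.csv` would be skipped, while `search_subdirs=True` would include it.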
Raises:
| Type | Description | 
|---|---|
| ValueError | If neither  | 
Returns:
| Type | Description | 
|---|---|
| PipelineContext | The context after the Action has been executed, containing the read data as a DataFrame. | 
Source code in src/cloe_nessy/pipeline/actions/read_files.py