The filesystem source provides the building blocks to load data from files. This section explains how you can customize the filesystem source for your use case.

## Standalone filesystem resource

You can use the [standalone filesystem](../../../general-usage/resource#declare-a-standalone-resource) resource to list files in cloud storage or a local filesystem. This allows you to customize file readers or manage files using [fsspec](https://filesystem-spec.readthedocs.io/en/latest/index.html).
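
For example, here is a minimal sketch that lists files from a bucket and loads only their metadata; the bucket path and glob below are placeholders:

```py
import dlt
from dlt.sources.filesystem import filesystem

# enumerate matching files; their content is not read at this point
files = filesystem(bucket_url="s3://bucket_name", file_glob="posts/*.json")

pipeline = dlt.pipeline(pipeline_name="list_files", destination="duckdb")
# each loaded row is the file's metadata (FileItem), not its content
pipeline.run(files.with_name("file_listing"))
```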

The filesystem ensures consistent file representation across bucket types and offers methods to access and read data.

- File content is typically not loaded (you can control this with the `extract_content` parameter of the filesystem resource, as shown in the sketch after this list). Instead, full file info and methods to access content are available.
- Users can request an authenticated [fsspec AbstractFileSystem](https://filesystem-spec.readthedocs.io/en/latest/_modules/fsspec/spec.html#AbstractFileSystem) instance.
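
A minimal sketch of the `extract_content` toggle mentioned above (the bucket path and glob are placeholders):

```py
from dlt.sources.filesystem import filesystem

# with extract_content=True, each yielded FileItem also carries the raw
# bytes of the file in its `file_content` field
files = filesystem(bucket_url="s3://bucket_name", file_glob="*.txt", extract_content=True)
```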

#### `FileItem` fields

- `file_url` - complete URL of the file (e.g., `s3://bucket-name/path/file`). This field serves as a primary key.
- `file_name` - name of the file from the bucket URL.

When using a nested or recursive glob pattern, `relative_path` will include the file's path relative to the `bucket_url`.

- `open()` - a method that provides a file object when opened.
- `filesystem` - a field that gives access to the authorized `AbstractFilesystem` with standard fsspec methods.
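
A minimal sketch of both access paths, iterating an extracted resource over an illustrative bucket:

```py
from dlt.sources.filesystem import filesystem

files = filesystem(bucket_url="s3://bucket_name", file_glob="*.txt")

for file_obj in files:
    # read the content through the lazily opened file object
    with file_obj.open() as f:
        first_line = f.readline()
    # or use the authorized fsspec client directly
    size = file_obj.filesystem.size(file_obj["file_url"])
```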

## Create your own transformer

Although the `filesystem` resource yields the files from cloud storage or a local filesystem, you need to apply a transformer resource to retrieve the records from files. `dlt` natively supports three file types: `csv`, `parquet`, and `jsonl` (more details in [filesystem transformer resource](../filesystem/basic#2-choose-the-right-transformer-resource)).

But you can easily create your own: all you need is a function that takes a `FileItemDict` iterator as input and yields a list of records (recommended for performance) or individual records.

### Example: read data from Excel files

The code below sets up a pipeline that reads from an Excel file using a standalone transformer:
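
A sketch of such a transformer, assuming `pandas` and `openpyxl` are installed; the bucket path, glob, and sheet name are placeholders:

```py
from typing import Iterator

import dlt
from dlt.common.typing import TDataItems
from dlt.sources.filesystem import FileItemDict, filesystem


# a standalone transformer: consumes FileItemDict items, yields records
@dlt.transformer(standalone=True)
def read_excel(items: Iterator[FileItemDict], sheet_name: str) -> Iterator[TDataItems]:
    import pandas as pd

    for file_obj in items:
        with file_obj.open() as file:
            # yield a list of records per file (recommended for performance)
            yield pd.read_excel(file, sheet_name).to_dict(orient="records")


example_xls = filesystem(
    bucket_url="s3://bucket_name", file_glob="directory/example.xlsx"
) | read_excel("example_table")

pipeline = dlt.pipeline(pipeline_name="excel_example", destination="duckdb")
load_info = pipeline.run(example_xls.with_name("example_xls_data"))
print(load_info)
```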

### Example: read data from XML files

You can use any third-party library to parse an `xml` file (e.g., [BeautifulSoup](https://pypi.org/project/beautifulsoup4/), [pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_xml.html)). In the following example, we will be using the [xmltodict](https://pypi.org/project/xmltodict/) Python library.
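
A sketch following the same transformer pattern, assuming `xmltodict` is installed; the bucket path and glob are placeholders:

```py
from typing import Iterator

import dlt
from dlt.common.typing import TDataItems
from dlt.sources.filesystem import FileItemDict, filesystem


# a standalone transformer that parses each XML file into a dict
@dlt.transformer(standalone=True)
def read_xml(items: Iterator[FileItemDict]) -> Iterator[TDataItems]:
    import xmltodict

    for file_obj in items:
        with file_obj.open() as file:
            yield xmltodict.parse(file.read())


example_xml = filesystem(
    bucket_url="s3://bucket_name", file_glob="directory/*.xml"
) | read_xml()

pipeline = dlt.pipeline(pipeline_name="xml_example", destination="duckdb")
print(pipeline.run(example_xml.with_name("xml_data")))
```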

You can get an fsspec client from the filesystem resource after it has been extracted, e.g., in order to delete processed files. The filesystem module contains a convenient method `fsspec_from_resource` that can be used as follows:
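
A sketch of this, assuming `fsspec_from_resource` is importable from the filesystem module alongside the resource (the bucket path is a placeholder):

```py
import dlt
from dlt.sources.filesystem import filesystem, read_csv, fsspec_from_resource

# extract the files first
gs_resource = filesystem(bucket_url="gs://bucket_name")
pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")
pipeline.run(gs_resource | read_csv())

# obtain the authenticated fsspec client from the extracted resource
fs_client = fsspec_from_resource(gs_resource)
# then use any standard fsspec method, e.g., list or delete processed files
fs_client.ls("bucket_name/processed")
```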

The filesystem source allows seamlessly loading files from remote locations (AWS S3, Google Cloud Storage, Google Drive, Azure) or the local filesystem. It natively supports `csv`, `parquet`, and `jsonl` files and allows customization for loading any type of structured file.

To load unstructured data (`.pdf`, `.txt`, e-mail), please refer to the [unstructured data source](https://github.com/dlt-hub/verified-sources/tree/master/sources/unstructured_data).

## How the filesystem source works

The filesystem source doesn't just give you an easy way to load data from both remote and local files; it also comes with a powerful set of tools that let you customize the loading process to fit your specific needs.

The filesystem source loads data in two steps:

1. It [accesses the files](#1-initialize-a-filesystem-resource) in your remote or local file storage without actually reading the content yet. At this point, you can [filter files by metadata or name](#6-filter-files). You can also set up [incremental loading](#5-incremental-loading) to load only new files.
2. [The transformer](#2-choose-the-right-transformer-resource) reads the files' content and yields the records. At this step, you can filter the actual data, enrich records with metadata from the files, or [perform incremental loading](#load-new-records-based-on-a-specific-column) based on the file content.

## Quick example

```py
import dlt
from dlt.sources.filesystem import filesystem, read_parquet

# the local path below is illustrative; point the resource at your own files
filesystem_resource = filesystem(
    bucket_url="file://Users/admin/Documents/parquet_files",
    file_glob="**/*.parquet",
)
filesystem_pipe = filesystem_resource | read_parquet()
# load only files modified since the last run
filesystem_pipe.apply_hints(incremental=dlt.sources.incremental("modification_date"))

pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")
load_info = pipeline.run(filesystem_pipe.with_name("parquet_data"))
print(load_info)
```
## Usage

The filesystem source is quite unique since it provides you with building blocks for loading data from files. First, it iterates over files in the storage and then processes each file to yield the records.

Usually, you need two resources:

1. The `filesystem` resource enumerates files in a selected bucket using a glob pattern, returning details as `FileItem` in customizable page sizes.
2. One of the available transformer resources to process each file with a specific transforming function and yield the records.
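
Put together, a minimal sketch of the typical pipe of the two resources (the local path is a placeholder):

```py
import dlt
from dlt.sources.filesystem import filesystem, read_csv

# pipe the file listing into a reader transformer
filesystem_pipe = filesystem(
    bucket_url="file://Users/admin/Documents/csv_files", file_glob="*.csv"
) | read_csv()

pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")
info = pipeline.run(filesystem_pipe.with_name("csv_data"))
print(info)
```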

### 5. Incremental loading

Here are a few simple ways to load your data incrementally:

1. [Load files based on modification date](#load-files-based-on-modification-date). Only load files that have been updated since the last time `dlt` processed them. `dlt` checks the files' metadata (like the modification date) and skips those that haven't changed.
2. [Load new records based on a specific column](#load-new-records-based-on-a-specific-column). You can load only the new or updated records by looking at a specific column, like `updated_at`. Unlike the first method, this approach reads all the files every time and then filters out the records that were updated.
3. [Combine loading only updated files and records](#combine-loading-only-updated-files-and-records). Finally, you can combine both methods. This is useful if new records can be added to existing files, so you want to filter not only the modified files but the modified records as well.

#### Load files based on modification date

For example, to load only new CSV files with [incremental loading](../../../general-usage/incremental-loading), you can use the `apply_hints` method.

```py
import dlt
from dlt.sources.filesystem import filesystem, read_csv

# This configuration will only consider new csv files
# (the bucket path and glob are illustrative)
new_files = filesystem(bucket_url="s3://bucket_name", file_glob="directory/*.csv")
# add incremental loading on the files' modification time
new_files.apply_hints(incremental=dlt.sources.incremental("modification_date"))

pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")
info = pipeline.run((new_files | read_csv()).with_name("csv_files"))
print(info)
```