
Commit 0add817

Rebase on the latest devel
1 parent 277613f commit 0add817

File tree: 5 files changed (+148, -40 lines)


docs/website/docs/dlt-ecosystem/verified-sources/filesystem/advanced.md
Lines changed: 8 additions & 9 deletions

@@ -1,12 +1,12 @@
 ---
-title: Advanced filesystem usage
+title: Advanced Filesystem Usage
 description: Use filesystem source as a building block
 keywords: [readers source and filesystem, files, filesystem, readers source, cloud storage]
 ---

 The filesystem source provides the building blocks to load data from files. This section explains how you can customize the filesystem source for your use case.

-## Standalone Filesystem Resource
+## Standalone filesystem resource

 You can use the [standalone filesystem](../../../general-usage/resource#declare-a-standalone-resource) resource to list files in cloud storage or a local filesystem. This allows you to customize file readers or manage files using [fsspec](https://filesystem-spec.readthedocs.io/en/latest/index.html).
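For orientation, a minimal sketch of listing files with the standalone resource and loading the listing itself, assuming the `filesystem` resource shown in this diff; the bucket URL, glob, and pipeline name are illustrative:

```py
import dlt
from dlt.sources.filesystem import filesystem

# enumerate matching files; file content is not read at this point
files = filesystem(bucket_url="s3://bucket_name", file_glob="**/*.csv")

# running the resource as-is loads one row per file with its FileItem metadata
pipeline = dlt.pipeline(pipeline_name="file_listing", destination="duckdb")
print(pipeline.run(files.with_name("file_listing")))
```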

@@ -30,7 +30,7 @@ The filesystem ensures consistent file representation across bucket types and of
 - File content is typically not loaded (you can control it with the `extract_content` parameter of the filesystem resource). Instead, full file info and methods to access content are available.
 - Users can request an authenticated [fsspec AbstractFileSystem](https://filesystem-spec.readthedocs.io/en/latest/_modules/fsspec/spec.html#AbstractFileSystem) instance.

-#### `FileItem` fields:
+#### `FileItem` fields

 - `file_url` - complete URL of the file (e.g. `s3://bucket-name/path/file`). This field serves as a primary key.
 - `file_name` - name of the file from the bucket URL.
@@ -52,13 +52,13 @@ When using a nested or recursive glob pattern, `relative_path` will include the
 - `open()` - method which provides a file object when opened.
 - `filesystem` - field, which gives access to authorized `AbstractFilesystem` with standard fsspec methods.

-## Create Your Own Transformer
+## Create your own transformer

 Although the `filesystem` resource yields the files from cloud storage or a local filesystem, you need to apply a transformer resource to retrieve the records from files. `dlt` natively supports three file types: `csv`, `parquet`, and `jsonl` (more details in [filesystem transformer resource](../filesystem/basic#2-choose-the-right-transformer-resource)).

 But you can easily create your own. In order to do this, you just need a function that takes as input a `FileItemDict` iterator and yields a list of records (recommended for performance) or individual records.

-### Example: Read Data from Excel Files
+### Example: read data from Excel files

 The code below sets up a pipeline that reads from an Excel file using a standalone transformer:
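The Excel snippet itself lies outside this hunk; as a reference, a minimal sketch of such a standalone transformer, assuming the `FileItemDict` iterator interface described above (the `read_excel` name, the `standalone=True` flag, and the pandas-based parsing are illustrative, not lines from this commit):

```py
from typing import Iterator

import dlt
from dlt.common.storages.fsspec_filesystem import FileItemDict
from dlt.common.typing import TDataItems
from dlt.sources.filesystem import filesystem

@dlt.transformer(standalone=True)
def read_excel(items: Iterator[FileItemDict], sheet_name: str) -> Iterator[TDataItems]:
    # import pandas lazily so the dependency is only required when the transformer runs
    import pandas as pd

    for file_obj in items:
        # FileItemDict.open() returns a file object for the remote or local file
        with file_obj.open() as file:
            yield pd.read_excel(file, sheet_name).to_dict(orient="records")

example_xls = filesystem(
    bucket_url="s3://bucket_name", file_glob="directory/example.xlsx"
) | read_excel("example_table")

pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")
load_info = pipeline.run(example_xls.with_name("example_xls_data"))
print(load_info)
```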

@@ -97,7 +97,7 @@ load_info = pipeline.run(example_xls.with_name("example_xls_data"))
 print(load_info)
 ```

-### Example: Read Data from XML Files
+### Example: read data from XML files

 You can use any third-party library to parse an `xml` file (e.g., [BeautifulSoup](https://pypi.org/project/beautifulsoup4/), [pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_xml.html)). In the following example, we will be using the [xmltodict](https://pypi.org/project/xmltodict/) Python library.

@@ -135,7 +135,7 @@ load_info = pipeline.run(example_xml.with_name("example_xml_data"))
 print(load_info)
 ```

-## Clean Files After Loading
+## Clean files after loading

 You can get an fsspec client from the filesystem resource after it was extracted, i.e., in order to delete processed files, etc. The filesystem module contains a convenient method `fsspec_from_resource` that can be used as follows:

@@ -153,7 +153,7 @@ fs_client = fsspec_from_resource(gs_resource)
 fs_client.ls("ci-test-bucket/standard_source/samples")
 ```

-## Copy Files Locally
+## Copy files locally

 To copy files locally, add a step in the filesystem resource and then load the listing to the database:

@@ -162,7 +162,6 @@ import os

 import dlt
 from dlt.common.storages.fsspec_filesystem import FileItemDict
-from dlt.common.typing import TDataItems
 from dlt.sources.filesystem import filesystem

 def _copy(item: FileItemDict) -> FileItemDict:
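The hunk cuts off at the `_copy` signature; for context, a sketch of how such a copy step might continue and be attached to the resource with `add_map`, assuming the `FileItemDict.fsspec` client and an illustrative `local_folder` destination (not lines from this commit):

```py
import os

import dlt
from dlt.common.storages.fsspec_filesystem import FileItemDict
from dlt.sources.filesystem import filesystem

local_folder = "_storage/local_copies"  # illustrative destination directory

def _copy(item: FileItemDict) -> FileItemDict:
    # build the local destination path and make sure the folder exists
    dest_file = os.path.join(local_folder, item["file_name"])
    os.makedirs(os.path.dirname(dest_file), exist_ok=True)
    # download the file using the authenticated fsspec client attached to the item
    item.fsspec.download(item["file_url"], dest_file)
    # return the item unchanged so the file listing can still be loaded
    return item

downloader = filesystem(
    bucket_url="s3://bucket_name", file_glob="directory/*.csv"
).add_map(_copy)

pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")
print(pipeline.run(downloader.with_name("listed_files")))
```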

docs/website/docs/dlt-ecosystem/verified-sources/filesystem/basic.md
Lines changed: 134 additions & 26 deletions

@@ -6,10 +6,18 @@ keywords: [readers source and filesystem, files, filesystem, readers source, clo
 import Header from '../_source-info-header.md';
 <Header/>

-Filesystem source is a generic source that allows loading files from remote locations (AWS S3, Google Cloud Storage, Google Drive, Azure) or the local filesystem seamlessly. Filesystem source natively supports `csv`, `parquet`, and `jsonl` files and allows customization for loading any type of structured files.
+Filesystem source allows loading files from remote locations (AWS S3, Google Cloud Storage, Google Drive, Azure) or the local filesystem seamlessly. Filesystem source natively supports `csv`, `parquet`, and `jsonl` files and allows customization for loading any type of structured files.

 To load unstructured data (`.pdf`, `.txt`, e-mail), please refer to the [unstructured data source](https://github.com/dlt-hub/verified-sources/tree/master/sources/unstructured_data).

+## How the filesystem source works
+
+The filesystem source doesn't just give you an easy way to load data from both remote and local files; it also comes with a powerful set of tools that let you customize the loading process to fit your specific needs.
+
+The filesystem source loads data in two steps:
+1. It [accesses the files](#1-initialize-a-filesystem-resource) in your remote or local file storage without actually reading the content yet. At this point, you can [filter files by metadata or name](#6-filter-files). You can also set up [incremental loading](#5-incremental-loading) to load only new files.
+2. [The transformer](#2-choose-the-right-transformer-resource) reads the files' content and yields the records. At this step, you can filter out the actual data, enrich records with metadata from files, or [perform incremental loading](#load-new-records-based-on-a-specific-column) based on the file content.
+
 ## Quick example

 ```py
@@ -228,10 +236,10 @@ and default credentials. To learn more about adding credentials to your pipeline
 ## Usage

 The filesystem source is quite unique since it provides you with building blocks for loading data from files.
-First, it iterates over files in the storage and then process each file to yield the records.
+First, it iterates over files in the storage and then processes each file to yield the records.
 Usually, you need two resources:

-1. The filesystem resource enumerates files in a selected bucket using a glob pattern, returning details as `FileInfo` in customizable page sizes.
+1. The `filesystem` resource enumerates files in a selected bucket using a glob pattern, returning details as `FileInfo` in customizable page sizes.
 2. One of the available transformer resources to process each file in a specific transforming function and yield the records.

 ### 1. Initialize a `filesystem` resource
@@ -246,11 +254,21 @@ filesystem_source = filesystem(
 )
 ```
 or taken from the config:
-```py
-from dlt.sources.filesystem import filesystem

-filesystem_source = filesystem()
-```
+* Python code:
+
+```py
+from dlt.sources.filesystem import filesystem
+
+filesystem_source = filesystem()
+```
+
+* Configuration file:
+```toml
+[sources.filesystem]
+bucket_url="file://Users/admin/Documents/csv_files"
+file_glob="*.csv"
+```
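For context, a minimal sketch of how the config-driven initialization above combines with a transformer; `bucket_url` and `file_glob` are assumed to come from the `[sources.filesystem]` section shown above, and the pipeline and table names are illustrative:

```py
import dlt
from dlt.sources.filesystem import filesystem, read_csv

# bucket_url and file_glob are resolved from the [sources.filesystem] config section
files = filesystem()

pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")
load_info = pipeline.run((files | read_csv()).with_name("csv_files"))
print(load_info)
```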

 Full list of `filesystem` resource parameters:

@@ -263,7 +281,7 @@ Full list of `filesystem` resource parameters:
 ### 2. Choose the right transformer resource

 The current implementation of the filesystem source natively supports three file types: `csv`, `parquet`, and `jsonl`.
-You can apply any of the above or create your own [transformer](advanced#create-your-own-transformer). To apply the selected transformer
+You can apply any of the above or [create your own transformer](advanced#create-your-own-transformer). To apply the selected transformer
 resource, use pipe notation `|`:

 ```py
@@ -277,10 +295,10 @@ filesystem_pipe = filesystem(

 #### Available transformers

-- `read_csv()`
-- `read_jsonl()`
-- `read_parquet()`
-- `read_csv_duckdb()`
+- `read_csv()` - processes `csv` files using `pandas`
+- `read_jsonl()` - processes `jsonl` files chunk by chunk
+- `read_parquet()` - processes `parquet` files using `pyarrow`
+- `read_csv_duckdb()` - processes `csv` files using DuckDB, which usually performs better than `pandas`

 :::tip
 We advise that you give each resource a
@@ -318,27 +336,117 @@ filesystem_pipe.apply_hints(write_disposition="merge", merge_key="date")
 pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")
 load_info = pipeline.run(filesystem_pipe.with_name("table_name"))
 print(load_info)
-print(pipeline.last_trace.last_normalize_info)
 ```

 ### 5. Incremental loading

-To load only new CSV files with [incremental loading](../../../general-usage/incremental-loading):
+Here are a few simple ways to load your data incrementally:
+
+1. [Load files based on modification date](#load-files-based-on-modification-date). Only load files that have been updated since the last time `dlt` processed them. `dlt` checks the files' metadata (like the modification date) and skips those that haven't changed.
+2. [Load new records based on a specific column](#load-new-records-based-on-a-specific-column). You can load only the new or updated records by looking at a specific column, like `updated_at`. Unlike the first method, this approach reads all files every time and then filters out only the updated records.
+3. [Combine loading only updated files and records](#combine-loading-only-updated-files-and-records). Finally, you can combine both methods. This is useful when new records can be added to existing files, so you want to filter not only the modified files but also the modified records.

-```py
-import dlt
-from dlt.sources.filesystem import filesystem, read_csv
+#### Load files based on modification date
+For example, to load only new CSV files with [incremental loading](../../../general-usage/incremental-loading), you can use the `apply_hints` method.

-# This configuration will only consider new csv files
-new_files = filesystem(bucket_url="s3://bucket_name", file_glob="directory/*.csv")
-# add incremental on modification time
-new_files.apply_hints(incremental=dlt.sources.incremental("modification_date"))
+```py
+import dlt
+from dlt.sources.filesystem import filesystem, read_csv
+
+# This configuration will only consider new csv files
+new_files = filesystem(bucket_url="s3://bucket_name", file_glob="directory/*.csv")
+# add incremental on modification time
+new_files.apply_hints(incremental=dlt.sources.incremental("modification_date"))
+
+pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")
+load_info = pipeline.run((new_files | read_csv()).with_name("csv_files"))
+print(load_info)
+```
+
+#### Load new records based on a specific column
+
+In this example, we load only new records based on the field called `updated_at`. This method may be useful if you are not able to
+filter files by modification date, for example, because all files are modified each time a new record appears.
+```py
+import dlt
+from dlt.sources.filesystem import filesystem, read_csv
+
+# We consider all csv files
+all_files = filesystem(bucket_url="s3://bucket_name", file_glob="directory/*.csv")
+
+# But filter out only updated records
+filesystem_pipe = (all_files | read_csv()).apply_hints(incremental=dlt.sources.incremental("updated_at"))
+pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")
+load_info = pipeline.run(filesystem_pipe)
+print(load_info)
+```
+
+#### Combine loading only updated files and records
+
+```py
+import dlt
+from dlt.sources.filesystem import filesystem, read_csv

-pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")
-load_info = pipeline.run((new_files | read_csv()).with_name("csv_files"))
-print(load_info)
-print(pipeline.last_trace.last_normalize_info)
-```
+# This configuration will only consider modified csv files
+new_files = filesystem(bucket_url="s3://bucket_name", file_glob="directory/*.csv")
+new_files.apply_hints(incremental=dlt.sources.incremental("modification_date"))
+
+# And in each modified file we filter out only updated records
+filesystem_pipe = (new_files | read_csv()).apply_hints(incremental=dlt.sources.incremental("updated_at"))
+pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")
+load_info = pipeline.run(filesystem_pipe)
+print(load_info)
+```
+
+### 6. Filter files
+
+If you need to filter out files based on their metadata, you can easily do this using the `add_filter` method.
+Within your filtering function, you'll have access to [any field](advanced#fileitem-fields) of the `FileItem` representation.
+
+#### Filter by name
+To filter only files that have `London` or `Berlin` in their names, you can do the following:
+```py
+import dlt
+from dlt.sources.filesystem import filesystem, read_csv
+
+# Filter files accessing file_name field
+filtered_files = filesystem(bucket_url="s3://bucket_name", file_glob="directory/*.csv")
+filtered_files.add_filter(lambda item: ("London" in item.file_name) or ("Berlin" in item.file_name))
+
+filesystem_pipe = (filtered_files | read_csv())
+pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")
+load_info = pipeline.run(filesystem_pipe)
+print(load_info)
+```
+
+:::tip
+You could also use `file_glob` to filter files by name. It works very well in simple cases, for example, filtering by extension:
+```py
+from dlt.sources.filesystem import filesystem
+
+filtered_files = filesystem(bucket_url="s3://bucket_name", file_glob="**/*.json")
+```
+:::
+
+#### Filter by size
+
+If for some reason you only want to load small files, you can also do that:
+
+```py
+import dlt
+from dlt.sources.filesystem import filesystem, read_csv
+
+MAX_SIZE_IN_BYTES = 10
+
+# Filter files accessing size_in_bytes field
+filtered_files = filesystem(bucket_url="s3://bucket_name", file_glob="directory/*.csv")
+filtered_files.add_filter(lambda item: item.size_in_bytes < MAX_SIZE_IN_BYTES)
+
+filesystem_pipe = (filtered_files | read_csv())
+pipeline = dlt.pipeline(pipeline_name="my_pipeline", destination="duckdb")
+load_info = pipeline.run(filesystem_pipe)
+print(load_info)
+```

 ## Troubleshooting

docs/website/docs/dlt-ecosystem/verified-sources/filesystem/index.md
Lines changed: 3 additions & 3 deletions

@@ -1,17 +1,17 @@
 ---
 title: Filesystem & Buckets
-description: dlt verified source for Filesystem & Buckets
+description: dlt-verified source for Filesystem & Buckets
 keywords: [readers source and filesystem, files, filesystem, readers source, cloud storage]
 ---

-The Filesystem source is a generic source that allows seamless loading files from the following locations:
+The Filesystem source allows seamless loading of files from the following locations:
 * AWS S3
 * Google Cloud Storage
 * Google Drive
 * Azure
 * local filesystem

-The Filesystem source natively supports `csv`, `parquet`, and `jsonl` files, and allows customization for loading any type of structured files.
+The Filesystem source natively supports `csv`, `parquet`, and `jsonl` files and allows customization for loading any type of structured files.

 import DocCardList from '@theme/DocCardList';

docs/website/docs/dlt-ecosystem/verified-sources/index.md
Lines changed: 2 additions & 2 deletions

@@ -12,7 +12,7 @@ Planning to use `dlt` in production and need a source that isn't listed? We're h
 ### Core sources

 <DocCardList items={useCurrentSidebarCategory().items.filter(
-item => item.label === '30+ SQL Databases' || item.label === 'REST API generic source' || item.label === 'Filesystem'
+item => item.label === '30+ SQL Databases' || item.label === 'REST API generic source' || item.label === 'Filesystem & buckets'
 )} />

 ### Verified sources
@@ -24,7 +24,7 @@ If you couldn't find a source implementation, you can easily create your own, ch
 :::

 <DocCardList items={useCurrentSidebarCategory().items.filter(
-item => item.label !== '30+ SQL Databases' && item.label !== 'REST API generic source'&& item.label !== 'Filesystem'
+item => item.label !== '30+ SQL Databases' && item.label !== 'REST API generic source'&& item.label !== 'Filesystem & buckets'
 )} />

 ### What's the difference between core and verified sources?

docs/website/sidebars.js
Lines changed: 1 addition & 0 deletions

@@ -59,6 +59,7 @@ const sidebars = {
 {
 type: 'category',
 label: 'Filesystem & buckets',
+description: 'AWS S3, GCP, Azure, local files',
 link: {
 type: 'doc',
 id: 'dlt-ecosystem/verified-sources/filesystem/index',
