recognize uris as data locations in pipeline.run method and auto use core sources #338

@rudolfix

Background
With fsspec working, we can recognize URIs and automatically load data from them. We can combine this with a few additional data types, such as pandas data frames.

  • accept strings as dlt data if they are URIs to resources
  • recognize fsspec URIs
  • allow loading the following data formats (by extension): json, jsonl, csv ... from those URIs
  • accept gzipped files
  • accept pandas data frames
  • allow streaming of large json files: recognize files containing lists of objects and a few other streamable cases (we have concept code for this)
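
The recognition step in the list above could look roughly like the sketch below. It is only an illustration, not dlt's API: `classify_data_location`, `DataLocation` and `KNOWN_FORMATS` are hypothetical names, the sketch uses only the standard library, and it assumes that a string counts as a data location when it has an explicit `scheme://` prefix and a known extension, optionally followed by `.gz`. Actually opening the file would be left to fsspec.

```python
from __future__ import annotations

import posixpath
from typing import NamedTuple, Optional
from urllib.parse import urlsplit

# Hypothetical mapping from file extension to format name. The issue names
# json, jsonl and csv explicitly and leaves the rest of the list open.
KNOWN_FORMATS = {".json": "json", ".jsonl": "jsonl", ".csv": "csv"}


class DataLocation(NamedTuple):
    protocol: str      # URI scheme, e.g. "s3", "file", "https"
    file_format: str   # one of the values in KNOWN_FORMATS
    gzipped: bool      # True if the path ends with ".gz"


def classify_data_location(data: object) -> Optional[DataLocation]:
    """Return a DataLocation if `data` looks like a URI to a supported file, else None."""
    if not isinstance(data, str):
        return None  # e.g. lists, dicts, data frames are handled elsewhere
    # Assumption: only strings with an explicit scheme are treated as URIs,
    # so plain text and Windows paths like "C:\\x.json" are not matched.
    if "://" not in data:
        return None
    parts = urlsplit(data)
    if not parts.scheme:
        return None
    path = parts.path
    gzipped = path.endswith(".gz")
    if gzipped:
        path = path[: -len(".gz")]  # look at the extension under the .gz suffix
    ext = posixpath.splitext(path)[1].lower()
    file_format = KNOWN_FORMATS.get(ext)
    if file_format is None:
        return None
    return DataLocation(parts.scheme, file_format, gzipped)
```

With a helper like this, `pipeline.run` could check string arguments first and route matching ones to the corresponding core source, while passing everything else through unchanged.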

Implementation Outline
Extend our internal sources by merging the following into the main library:

  • pandas source (enumerate pandas data frames)
  • json and jsonl sources
  • jsonl streaming source with format autodetection
  • we can use pandas for csv, xml, xls etc.
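
A streaming source with format autodetection might work along the lines of the sketch below. This is not the concept code mentioned above, which is not shown in this issue. `iter_json_records` is a hypothetical name, and the detection rule is an assumption: if the first non-whitespace character is `[`, the input is treated as one JSON list and its elements are decoded one at a time; anything else is treated as JSON Lines. A jsonl file whose lines are themselves arrays, or a single pretty-printed object, would need a smarter rule.

```python
from __future__ import annotations

import io
import json
from typing import Any, Iterator, TextIO


def iter_json_records(stream: TextIO, chunk_size: int = 65536) -> Iterator[Any]:
    """Yield records from a text stream holding either a JSON list or JSON Lines."""
    # Read until we see the first non-whitespace character, then pick a strategy.
    head = ""
    while not head.strip():
        chunk = stream.read(chunk_size)
        if not chunk:
            return  # empty input: nothing to yield
        head += chunk
    head = head.lstrip()
    if head[0] == "[":
        yield from _iter_array(stream, head[1:], chunk_size)
    else:
        yield from _iter_lines(stream, head, chunk_size)


def _iter_lines(stream: TextIO, buf: str, chunk_size: int) -> Iterator[Any]:
    # JSON Lines: every complete line is one document.
    while True:
        *lines, buf = buf.split("\n")  # the last piece may be an incomplete line
        for line in lines:
            if line.strip():
                yield json.loads(line)
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf += chunk
    if buf.strip():
        yield json.loads(buf)  # final line without a trailing newline


def _iter_array(stream: TextIO, buf: str, chunk_size: int) -> Iterator[Any]:
    # JSON list: decode one element at a time so only a single element plus
    # one chunk is buffered. Separators are skipped leniently.
    decoder = json.JSONDecoder()
    eof = False
    while True:
        buf = buf.lstrip(" \t\r\n,")
        if buf.startswith("]"):
            return
        try:
            value, end = decoder.raw_decode(buf)
            # Only accept the value once its terminator is visible; otherwise
            # a number split across two chunks would be cut short.
            complete = eof or buf[end:].lstrip()[:1] in (",", "]")
        except json.JSONDecodeError:
            if eof:
                raise  # truncated or malformed input
            complete = False
        if complete:
            yield value
            buf = buf[end:]
        else:
            chunk = stream.read(chunk_size)
            eof = not chunk
            buf += chunk
```

For example, `list(iter_json_records(io.StringIO('[{"a": 1}, {"a": 2}]')))` and `list(iter_json_records(io.StringIO('{"a": 1}\n{"a": 2}\n')))` both produce the same two dictionaries. A gzipped file can be fed in by opening it in text mode first, e.g. with `gzip.open(path, "rt")`.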

Future Work
At some point we want to change how the normalizer works so that it can deal directly with (serialized) pandas data frames (e.g. feather), parquet files etc., so that we are not forced to convert all of them into Python objects and back.

Metadata

Assignees: No one assigned
Labels: QoL (Quality of Life: improve the developer experience), enhancement (New feature or request)
Status: Todo