Data dumps feedback #2078

Closed
@kornelski

Description

Overall, having full data access is great! However, the implementation could be improved in a few ways:

  1. The tarball format is problematic. It doesn't allow random access, so extracting only the interesting parts requires downloading and decompressing most of the archive. A ZIP archive or individually gzipped files would be more convenient for selective consumption (see the first sketch after this list).

  2. crates.csv is needed to map crate names to crates.io's internal IDs, so parsing this file is required to make sense of the rest of the data. However, it also contains the bodies of README files, which are relatively big; it'd be nice to publish READMEs separately. It also has textsearchable_index_col, which is Postgres-specific and redundant. (The second sketch below shows what consumers currently have to do with this file.)

  3. version_downloads.csv is the largest file, and it will keep growing. It'd be nice to shard this data by time (e.g. a separate file per year or even per day; see the third sketch below). I'd like to fetch downloads data daily, but I'd rather download one day of data each day, not all days every day.
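For illustration, here's roughly what selective access could look like if the dump were a ZIP archive (point 1). This is only a sketch using the third-party `zip` crate; the archive name `db-dump.zip` and the member path `data/crates.csv` are assumptions, since today the dump is only published as a single tarball.

```rust
use std::fs::File;
use std::io::Read;

use zip::ZipArchive;

/// Read a single member out of a hypothetical ZIP-packaged dump without
/// decompressing the rest of the archive. The file name and member path
/// are assumptions for illustration only.
fn read_one_member() -> Result<String, Box<dyn std::error::Error>> {
    let file = File::open("db-dump.zip")?;
    let mut archive = ZipArchive::new(file)?;

    // ZIP has a central directory, so this seeks straight to the entry
    // instead of streaming through everything that precedes it.
    let mut entry = archive.by_name("data/crates.csv")?;

    let mut contents = String::new();
    entry.read_to_string(&mut contents)?;
    Ok(contents)
}
```

With the current tarball, the equivalent operation has to read and decompress every preceding archive member before reaching the one of interest.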
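And this is roughly the work every consumer repeats today just to get the name → id mapping out of crates.csv (point 2): the parser still has to scan past the embedded README bodies in every row even though only two columns are needed. A minimal sketch using the `csv` crate; the exact column names (`id`, `name`) are assumptions about the dump's schema.

```rust
use std::collections::HashMap;
use std::error::Error;

use csv::ReaderBuilder;

/// Build a crate-name -> internal-id map from crates.csv.
/// The column names "id" and "name" are assumptions about the dump schema.
fn crate_ids(path: &str) -> Result<HashMap<String, u64>, Box<dyn Error>> {
    let mut reader = ReaderBuilder::new().has_headers(true).from_path(path)?;

    let headers = reader.headers()?.clone();
    let id_col = headers.iter().position(|h| h == "id").ok_or("no id column")?;
    let name_col = headers.iter().position(|h| h == "name").ok_or("no name column")?;

    let mut map = HashMap::new();
    for record in reader.records() {
        // Each record still carries the full README body, so this loop
        // parses far more bytes than the two columns it actually needs.
        let record = record?;
        let id: u64 = record[id_col].parse()?;
        map.insert(record[name_col].to_string(), id);
    }
    Ok(map)
}
```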
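Finally, a sketch of what per-day sharding (point 3) could look like from the consumer side. The path pattern is purely hypothetical and not something the dump provides today; the point is just that a daily consumer would fetch only yesterday's shard instead of re-downloading the full history.

```rust
/// Build the path of a hypothetical daily shard of version_downloads,
/// e.g. "version_downloads/2019-09-01.csv.gz". The layout is a proposal,
/// not an existing part of the dump.
fn shard_path(year: u16, month: u8, day: u8) -> String {
    format!("version_downloads/{:04}-{:02}-{:02}.csv.gz", year, month, day)
}
```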

Labels

C-enhancement ✨ Category: Adding new behavior or a change to the way an existing feature works