
Provide a database snapshot to facilitate development. #630

Closed
LeopoldArkham opened this issue Mar 15, 2017 · 13 comments
Labels
A-infrastructure 📡 C-enhancement ✨ Category: Adding new behavior or a change to the way an existing feature works

Comments

@LeopoldArkham
Contributor

It would be nice to have a snapshot of (a subset of) the database that someone wishing to contribute could load locally to test out the code on real data. You can fake crates, dependencies, downloads, etc., but that only gets you so far.

@Eh2406
Contributor

Eh2406 commented Mar 15, 2017

I recall there being an issue for creating sanitized SQL dumps of the database. Such a dump would be useful for this as well. Of course, now I cannot find the issue.

@carols10cents
Member

#410 is the closest one I know of; I'm going to close that in favor of this one since we want to keep the index small.

Notable info from that issue: rubygems.org provides a sanitized snapshot. I'll probably look into their infrastructure to see if there's something we could reuse.

The easiest thing to do sanitization-wise is probably to just not dump the users and follows tables. Actually, we should probably maintain a whitelist of tables to dump rather than a blacklist, so that we don't leak data by adding a table that should be stripped out and forgetting to add it to the blacklist.

@carols10cents carols10cents added C-enhancement ✨ Category: Adding new behavior or a change to the way an existing feature works A-infrastructure 📡 labels Mar 18, 2017
@Eh2406
Contributor

Eh2406 commented Mar 18, 2017

That is the one I had in mind, thanks! When we have a game plan for this I'd love to help.

@moore3071
Contributor

Would it be possible to use the crates.io-index repository to create a quick development database? This would make it a lot harder to accidentally leak any private data.

@icefoxen

icefoxen commented Feb 21, 2019

Do people still want this? Not having to scrape the crates.io API for things like downloads and crate owners would be great. I'll happily throw together a PR.

Edit: Would it be easier for you to just have a script that dumps to a static file, gets run by a cron job, and puts the result in a static location? Or have an API endpoint that serves it somehow? I don't know much about how Heroku and the rest of your infrastructure actually works. Upload it to a particular S3 location or something?

@icefoxen

icefoxen commented Feb 21, 2019

Ok, after a little research, it looks like Heroku lets you make backups, pull them to a local machine, push them to Heroku, and so on, but it doesn't directly do what we want, which is to carefully dump only certain tables to a backup and then make it easily public. The easy way seems to be to use heroku pg:pull to pull the remote database to a local Postgres instance, call pg_dump with a whitelist of tables to dump, then upload the result to an S3 bucket or similar and have a URL that redirects to it. I'm not sure how big or unwieldy the database is, and I don't know how easy it is for crates.io to set up this sort of ad-hoc script; this is really not a programming problem so much as an infrastructure problem.

@smarnach
Contributor

smarnach commented Jul 11, 2019

Given the recent discussion about squashing the index, I took a look at this issue. Here is a suggestion for a design.

Goals

  • Allow crawlers to download a single archive with all information instead of hitting crates.io thousands of times to retrieve all information on all crates.
  • Facilitate local testing in crates.io development (e.g. query optimisation with a realistic dataset).
  • Facilitate running a crates.io read-only mirror.

Data to export and privacy considerations

We should only export data that users expect to be public, which is roughly the data that is already exposed via the API.

Some tables should not be exported at all, for some tables we can only export a subset of the columns, and for some tables we should even filter the rows.

The users table is in the last category. I believe that users have a reasonable expectation of remaining private if they just create an account on crates.io without publishing anything (though I can see that others may disagree with this sentiment).

Here is a prototype suggestion for the data to export, in the form of a psql script:

-- Only select crate owners that have not been deleted.
CREATE TEMPORARY VIEW crate_owners_export AS (
    SELECT
        crate_id, owner_id, created_at, updated_at, owner_kind
    FROM crate_owners
    WHERE NOT deleted
);

-- Only select users who are publicly visible through public activity.
-- This query can be simplified by introducing a Boolean `public` column that is
-- flipped to `true` when a user has their first public activity.
CREATE TEMPORARY VIEW users_export AS (
    SELECT
        id, gh_login, name, gh_avatar, gh_id
    FROM users
    WHERE
        id in (
            SELECT owner_id AS user_id FROM crate_owners_export WHERE owner_kind = 0
            UNION
            SELECT published_by as user_id FROM versions
        )
);

-- \copy statements can't be broken up into multiple lines.
\copy badges (crate_id, badge_type, attributes) TO 'badges.csv' WITH CSV HEADER
\copy categories (id, category, slug, description, crates_cnt, created_at, path) TO 'categories.csv' WITH CSV HEADER
\copy (SELECT * FROM crate_owners_export) TO 'crate_owners.csv' WITH CSV HEADER
\copy crates (id, name, updated_at, created_at, downloads, description, homepage, documentation, readme, textsearchable_index_col, license, repository, max_upload_size) TO 'crates.csv' WITH CSV HEADER
\copy crates_categories (crate_id, category_id) TO 'crates_categories.csv' WITH CSV HEADER
\copy crates_keywords (crate_id, keyword_id) TO 'crates_keywords.csv' WITH CSV HEADER
\copy dependencies (id, version_id, crate_id, req, optional, default_features, features, target, kind) TO 'dependencies.csv' WITH CSV HEADER
\copy keywords (id, keyword, crates_cnt, created_at) TO 'keywords.csv' WITH CSV HEADER
\copy metadata (total_downloads) TO 'metadata.csv' WITH CSV HEADER
\copy readme_renderings (version_id, rendered_at) TO 'readme_renderings.csv' WITH CSV HEADER
\copy reserved_crate_names (name) TO 'reserved_crate_names.csv' WITH CSV HEADER
\copy teams (id, login, github_id, name, avatar) TO 'teams.csv' WITH CSV HEADER
\copy (SELECT * FROM users_export) TO 'users.csv' WITH CSV HEADER
\copy version_authors (id, version_id, name) TO 'version_authors.csv' WITH CSV HEADER
\copy version_downloads (version_id, downloads, counted, date) TO 'version_downloads.csv' WITH CSV HEADER
\copy versions (id, crate_id, num, updated_at, created_at, downloads, features, yanked, license, crate_size, published_by) TO 'versions.csv' WITH CSV HEADER

These CSV dumps can be bundled in a single tarball or zip archive, together with an import script that allows reimporting them.
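
For illustration, a minimal sketch of what such an import script could look like, assuming the CSV file names produced above and a target database that already has the schema (columns omitted from the dump would have to be nullable or have defaults for this to succeed):

BEGIN;
-- Load referenced tables before the tables that reference them, so that
-- foreign-key constraints are satisfied during the import.
\copy users (id, gh_login, name, gh_avatar, gh_id) FROM 'users.csv' WITH CSV HEADER
\copy crates (id, name, updated_at, created_at, downloads, description, homepage, documentation, readme, textsearchable_index_col, license, repository, max_upload_size) FROM 'crates.csv' WITH CSV HEADER
\copy versions (id, crate_id, num, updated_at, created_at, downloads, features, yanked, license, crate_size, published_by) FROM 'versions.csv' WITH CSV HEADER
-- ... analogous \copy ... FROM statements for the remaining exported tables ...
COMMIT;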

The export script above explicitly lists the names of all tables and columns to export, in order to make it a conscious decision to export the data from newly introduced columns.

As an additional mechanism to protect against accidental data leaks, we could run the script as a user who only has permission to access the public columns (and we could even consider enabling row security for the tables that we filter by row).

No data from these tables is exported by the above script:

  • __diesel_schema_migrations
  • api_tokens
  • background_jobs
  • crate_owner_invitations
  • emails
  • follows
  • publish_limit_buckets
  • publish_rate_overrides
  • versions_published_by

Data consistency

The above \copy statements do not necessarily result in a consistent export. As an example, a new crate may be published between exporting the crates table and the versions table. The CSV export of the versions table will then contain a reference to a crate that is not in the CSV export of the crates table, which will result in an error when trying to re-import the data to Postgres.

It should be possible to mitigate this issue by carefully selecting the order for the table export.

Alternatives considered

SQL dumps

The pg_dump tool can only dump full tables. Since we omit columns and filter rows for several tables, we would need to create temporary export tables in order to export them with pg_dump. In addition, SQL dumps can only be consumed by importing them into a proper SQL database, while CSV dumps can also be loaded directly into memory and processed ad hoc.
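
For example, exporting the filtered users data with pg_dump would need an extra materialization step along these lines (a sketch; users_export mirrors the temporary view above, but as a regular table so that pg_dump can see it):

-- Materialize the public subset into an export table, which can then be
-- dumped as a whole, e.g. with pg_dump --table=users_export.
CREATE TABLE users_export AS
    SELECT id, gh_login, name, gh_avatar, gh_id
    FROM users
    WHERE id IN (
        SELECT owner_id FROM crate_owners WHERE NOT deleted AND owner_kind = 0
        UNION
        SELECT published_by FROM versions
    );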

JSON dumps

The JSON schema exported by the crates/:crate_name endpoint could be expanded to include all relevant information for a single crate, including everything that is currently only reachable via links. This would result in a format that is easy to consume for crawlers that currently download the data via the API. However, this format would be very verbose. While the data would be only moderately denormalized – most data belongs to only a single crate anyway – the constant repetition of field names results in significant overhead.

@carols10cents
Member

The users table is in the last category. I believe that users have a reasonable expectation of remaining private if they just create an account on crates.io without publishing anything (though I can see that others may disagree with this sentiment).

This is an interesting point that hasn't been raised before; I agree with your thoughts and your implementation here.

The script above explicitly lists the names of all tables and columns to export, in order to make it a conscious decision to export the data from newly introduced columns.

I agree that managing a list of allowed columns is the right direction. One concern I have is how do we manage updates to this script, if we add columns to the database that SHOULD be exported? I haven't been able to come up with a reasonable solution that isn't labor intensive and/or brittle.

@smarnach
Contributor

One concern I have is how do we manage updates to this script, if we add columns to the database that SHOULD be exported?

As mentioned above, the tarball with CSV files should also contain an SQL script to re-import the data. A basic test we can run in CI is exporting some data, then trying to re-import it in a clean database. This will catch some of the failure modes of forgetting to export a required column (e.g. if the column is not nullable, or if some database-level constraints are unmet without the new values).

In other cases, the result of not exporting a column is simply that some data that should be in the exports is missing. This generally won't break existing use cases of the dumps, since clients obviously don't rely on a column that has never been exported in a database dump. In these cases, eventually someone will complain that they want that data, but that seems tolerable to me.

There are some failure modes that aren't easy to catch, e.g. when the import succeeds but does not maintain some invariants that are not enforced at the database level. I expect these failure modes to be extremely rare. (Adding new public columns is rare by itself.)

@smarnach
Contributor

Or a better solution: We add a test that builds the set of all columns on all tables, subtracts the set of columns contained in the export and compares the result to the known set of private columns and tables. If we add a new column, we either need to add it to the export script, or to the known set of private columns in the test.
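
For illustration, the full set of columns could be enumerated with a query along these lines, and the result then diffed against the union of the exported columns and the known-private columns:

-- List every (table, column) pair in the public schema of the live database.
SELECT table_name, column_name
FROM information_schema.columns
WHERE table_schema = 'public'
ORDER BY table_name, column_name;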

@icefoxen

I agree that managing a list of allowed columns is the right direction. One concern I have is how do we manage updates to this script, if we add columns to the database that SHOULD be exported? I haven't been able to come up with a reasonable solution that isn't labor intensive and/or brittle.

Frankly, if I have a tool that imports this dump and you add fields to the database but forget to update the dumping system to include them, that's a nice soft error, since it will not make my tool break. It's just also a silent error.

We add a test that builds the set of all columns on all tables, subtracts the set of columns contained in the export and compares the result to the known set of private columns and tables.

I was just about to suggest that! It makes the silent error noisy.

@smarnach
Contributor

smarnach commented Jul 15, 2019

I suggest two improvements over my previous proposal.

Using row-level security instead of temporary views

A variation of the above approach would be to use row-level security instead of temporary views to filter rows. Specifically, this would involve the following steps:

  1. Add a Boolean column public to the users table. This column is FALSE by default and gets flipped to TRUE when the user performs a publicly visible action (publishing a crate version or becoming owner of a crate).
    ALTER TABLE users ADD COLUMN public BOOLEAN NOT NULL DEFAULT 'f';
    UPDATE users
    SET public = 't'
    WHERE
        id IN (
            SELECT owner_id AS user_id FROM crate_owners WHERE NOT deleted AND owner_kind = 0
            UNION
            SELECT published_by as user_id FROM versions
        );
  2. Create a backup user with adequate permissions.
    CREATE USER db_backup WITH PASSWORD 'some_password';
    GRANT SELECT ON ALL TABLES IN SCHEMA public TO db_backup;
    ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO db_backup;
  3. Enable row-level security on the tables we want to filter. The table owner is not affected by row-level security, so we only need to add policies for the backup user.
    ALTER TABLE crate_owners ENABLE ROW LEVEL SECURITY;
    CREATE POLICY db_backup_crate_owners ON crate_owners TO db_backup USING (NOT deleted);
    ALTER TABLE users ENABLE ROW LEVEL SECURITY;
    CREATE POLICY db_backup_users ON users TO db_backup USING (public);

After these steps, the backup user can only see the rows we want to export, so we can use \copy commands of the same form as we do for all other tables. This allows us to auto-generate the export script based exclusively on the set of columns we want to export. This set of columns can come from a configuration file, or even be hard-coded in Rust code.

enum Visibility {
    Private,
    Public,
}

use Visibility::*;

static VISIBILITY: &[(&str, &[(&str, Visibility)])] = &[
    (
        "api_tokens",
        &[
            ("id", Private),
            ("user_id", Private),
            ...
        ],
    ),
    (
        "crate_owners",
        &[
            ("crate_id", Public),
            ("owner_id", Public),
            ("created_at", Public),
            ("created_by", Private),
            ("deleted", Private),
            ("updated_at", Public),
            ("owner_kind", Public),
        ],
    ),
    ...
];
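
Given such a configuration and the row-level security policies above, the generated psql export script could reduce to uniform \copy commands over the columns marked Public, along these lines (a sketch for crate_owners and users only, assuming the users columns from the users_export view above are the public ones):

-- Run as the db_backup role, so that row-level security filters the rows.
\copy crate_owners (crate_id, owner_id, created_at, updated_at, owner_kind) TO 'crate_owners.csv' WITH CSV HEADER
\copy users (id, gh_login, name, gh_avatar, gh_id) TO 'users.csv' WITH CSV HEADER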

This approach has several advantages over what I proposed before.

  • The configuration format makes it very easy to see what columns and tables are exported, and which ones are kept private. It is a lot more readable than the psql script in my previous proposal.
  • The set of columns in the configuration can be easily compared to the set of columns in the database. If there is any difference, we can make a test fail with a nice error message, forcing everyone who adds a column to decide whether it should be exported or not. This decision can be made by simply adding the column to the configuration.
  • We can more easily control the database dump from Rust code.

Data integrity

The problem with data integrity mentioned above has an easy solution in Postgres that I wasn't aware of. A transaction can use the isolation level SERIALIZABLE READ ONLY DEFERRABLE to get the guarantees of a serializable transaction, but without the overhead and without any serialization errors. The transaction might block initially until a suitable snapshot can be created, but after that point it only operates on that consistent snapshot. This mode is specifically intended for database backups.
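
A sketch of how the export could use this mode (the same \copy commands as in the export script, abbreviated here):

-- All \copy commands inside this transaction operate on one consistent snapshot.
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE READ ONLY DEFERRABLE;
\copy crates (id, name, updated_at, created_at, downloads, description, homepage, documentation, readme, textsearchable_index_col, license, repository, max_upload_size) TO 'crates.csv' WITH CSV HEADER
\copy versions (id, crate_id, num, updated_at, created_at, downloads, features, yanked, license, crate_size, published_by) TO 'versions.csv' WITH CSV HEADER
-- ... remaining tables ...
COMMIT;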

bors added a commit that referenced this issue Sep 30, 2019
Prototype: Public database dumps

This is an unfinished prototype implementation of the design I proposed to implement #630 (see #630 (comment) and #630 (comment)). I am submitting this for review to gather some feedback on the basic approach before spending more time on this.

This PR adds a background task to create a database dump. The task can be triggered with the `enqueue-job` binary, so it is easy to schedule in production using Heroku Scheduler.

### Testing instructions

To create a dump:

1. Start the background worker:

        cargo run --bin background-worker

1. Trigger a database dump:

        cargo run --bin enqueue-job dump_db

    The resulting tarball can be found in `./local_uploads/db-dump.tar.gz`.

To re-import the dump:

1. Unpack the tarball:

        tar xzf local_uploads/db-dump.tar.gz

1. Create a new database:

        createdb test_import_dump

1. Run the Diesel migrations for the new DB:

        diesel migration run --database-url=postgres:///test_import_dump

1. Import the dump

        cd DUMP_DIRECTORY
        psql test_import_dump < import.sql

(Depending on your local PostgreSQL setup, in particular the permissions for your user account, you may need different commands and URIs than given above.)

### Author's notes

* The background task executes `psql` in a subprocess to actually create the dump. One reason for this approach is its simplicity – the `\copy` convenience command issues a suitable `COPY TO STDOUT` SQL command and streams the result directly to a local file. Another reason is that I couldn't figure out how to do this at all in Rust with a Diesel `PgConnection`. There doesn't seem to be a way to run raw SQL with full access to the result.

* The unit test to verify that the column visibility information in `dump_db.toml` is up to date compares the information in that file to the current schema of the test database. Diesel does not provide any schema reflection functionality, so we query the actual database instead. This test may spuriously fail or succeed locally if you still have some migrations from unmerged branches applied to your test database. On Travis this shouldn't be a problem, since I believe we always start with a fresh database there. (My preferred solution for this problem would be for Diesel to provide some way to introspect the information in `schema.rs`.)

### Remaining work

* [x] Address TODOs in the source code. The most significant one is to update the `Uploader` interface to accept streamed data instead of a `Vec<u8>`. Currently the whole database dump needs to be loaded into memory at once.

* ~~Record the URLs of uploaded tarballs in the database, and provide an API endpoint to download them.~~ Decided to only store latest at a known URL

* [x] Devise a scheme for cleaning up old dumps from S3. The easiest option is to only keep the latest dump.

* [x] Somewhere in the tar file, note the date and time the dump was generated

* [x] Verify that `dump-db.toml` is correct, i.e. that we don't leak any data we don't want to leak. Done via manual inspection. ~~One idea to do so is to reconstruct dumps from the information available via the API and compare to information in a test dump in the staging environment. This way we could verify what additional information will be made public.~~

* [x] The code needs some form of integration test. Idea from #1629: exporting some data, then trying to re-import it in a clean database.

* [x] Implement and document a way of re-importing the dumps to the database, e.g. to allow local testing of crates.io with realistic data.

* [x] Rebase and remove commits containing the first implementation

* [x] Document the existence of this dump, how often it's regenerated, and that only the most recent dump is available (maybe in the crawler policy/crawler blocked error message?)

* [x] Include the commit hash of the crates.io version that created the dump in the tarball
@carols10cents
Member

This is done! Please see the documentation for the database dumps at https://crates.io/data-access

@carols10cents carols10cents changed the title Provide a databse snapshot to facilitate development. Provide a database snapshot to facilitate development. Dec 2, 2019