Provide a database snapshot to facilitate development. #630
Comments
I recall there being an issue for creating sanitized SQL dumps of the database. Such a dump would be useful for this as well. Of course, now I cannot find the issue.
#410 is the closest one I know of; going to close that in favor of this one since we want to keep the index small. Notable info from that issue is that rubygems.org provides a sanitized snapshot, so I'll probably look into their infrastructure to see if there's something we could reuse. The easiest thing to do sanitization-wise is probably to just not dump the users and follows tables. Actually, we should probably make this a whitelist of tables to dump instead of a blacklist, so that we don't leak data by adding a table that should be stripped out and forgetting to add it to the blacklist.
That is the one I had in mind, thanks! When we have a game plan for this, I'd love to help.
Would it be possible to use the crates.io-index repository to create a quick development database? This would make it a lot harder to accidentally leak any private data.
Do people still want this? Not having to scrape the crates.io API for things like downloads and crate owners would be great. I'll happily throw together a PR. Edit: Would it be easier for you to just have a script that dumps to a static file, gets run by a cron job, and is stored in a static location? Or have an API endpoint that serves it somehow? I don't know much about how Heroku and the rest of your infrastructure actually work. Upload it to a particular S3 location or something?
OK, after a little research it looks like Heroku lets you make backups, pull backups to a local machine or push them to Heroku, and so on, but not do what we want directly, which is to be very careful to dump only certain tables to a backup and then make it easily public. Seems like the easy way would be to use …
Given the recent discussion about squashing the index, I took a look at this issue. Here is a suggestion for a design.

### Goals
### Data to export and privacy considerations

We should only export data that users expect to be public, which is roughly the data that is already exposed via the API. Some of the tables should not be exported at all, for some tables we can only export a subset of the columns, and for some tables we should even filter the rows.

Here is a prototype suggestion for the data to export, in the form of a `psql` script:

```sql
-- Only select crate owners that have not been deleted.
CREATE TEMPORARY VIEW crate_owners_export AS (
SELECT
crate_id, owner_id, created_at, updated_at, owner_kind
FROM crate_owners
WHERE NOT deleted
);
-- Only select users who are publicly visible through public activity.
-- This query can be simplified by introducing a Boolean `public` column that is
-- flipped to `true` when a user has their first public activity.
CREATE TEMPORARY VIEW users_export AS (
SELECT
id, gh_login, name, gh_avatar, gh_id
FROM users
WHERE
id IN (
SELECT owner_id AS user_id FROM crate_owners_export WHERE owner_kind = 0
UNION
SELECT published_by AS user_id FROM versions
)
);
-- \copy statements can't be broken up into multiple lines.
\copy badges (crate_id, badge_type, attributes) TO 'badges.csv' WITH CSV HEADER
\copy categories (id, category, slug, description, crates_cnt, created_at, path) TO 'categories.csv' WITH CSV HEADER
\copy (SELECT * FROM crate_owners_export) TO 'crate_owners.csv' WITH CSV HEADER
\copy crates (id, name, updated_at, created_at, downloads, description, homepage, documentation, readme, textsearchable_index_col, license, repository, max_upload_size) TO 'crates.csv' WITH CSV HEADER
\copy crates_categories (crate_id, category_id) TO 'crates_categories.csv' WITH CSV HEADER
\copy crates_keywords (crate_id, keyword_id) TO 'crates_keywords.csv' WITH CSV HEADER
\copy dependencies (id, version_id, crate_id, req, optional, default_features, features, target, kind) TO 'dependencies.csv' WITH CSV HEADER
\copy keywords (id, keyword, crates_cnt, created_at) TO 'keywords.csv' WITH CSV HEADER
\copy metadata (total_downloads) TO 'metadata.csv' WITH CSV HEADER
\copy readme_renderings (version_id, rendered_at) TO 'readme_renderings.csv' WITH CSV HEADER
\copy reserved_crate_names (name) TO 'reserved_crate_names.csv' WITH CSV HEADER
\copy teams (id, login, github_id, name, avatar) TO 'teams.csv' WITH CSV HEADER
\copy (SELECT * FROM users_export) TO 'users.csv' WITH CSV HEADER
\copy version_authors (id, version_id, name) TO 'version_authors.csv' WITH CSV HEADER
\copy version_downloads (version_id, downloads, counted, date) TO 'version_downloads.csv' WITH CSV HEADER
\copy versions (id, crate_id, num, updated_at, created_at, downloads, features, yanked, license, crate_size, published_by) TO 'versions.csv' WITH CSV HEADER
```

These CSV dumps can be bundled in a single tarball or zip archive, together with an import script that allows re-importing them.

The script above explicitly lists the names of all tables and columns to export, in order to make exporting the data from newly introduced columns a conscious decision. As an additional mechanism to protect against accidental data leaks, we could run the script as a user who only has permission to access the public columns (and we could even consider enabling row security for the tables that we filter by row).

No data from any other table is exported by the above script.
### Data consistency

The above `psql` script copies each table in a separate statement, so the resulting CSV files are not guaranteed to form a consistent snapshot of the database. It should be possible to mitigate this issue by carefully selecting the order for the table export.

### Alternatives considered

#### SQL dumps

#### JSON dumps

The JSON schema exported by the …
This is an interesting point that hasn't been raised before; I agree with your thoughts and your implementation here.
I agree that managing a list of allowed columns is the right direction. One concern I have is how we manage updates to this script if we add columns to the database that SHOULD be exported. I haven't been able to come up with a reasonable solution that isn't labor-intensive and/or brittle.
As mentioned above, the tarball with CSV files should also contain an SQL script to re-import the data. A basic test we can run in CI is exporting some data, then trying to re-import it into a clean database. This will catch some of the failure modes of forgetting to export a required column (e.g. if the column is not nullable, or if some database-level constraints are unmet without the new values). In other cases, the result of not exporting a column is simply that some data that should be in the exports is missing. This generally won't break existing use cases of the dumps, since clients obviously don't rely on a column that has never been exported in a database dump. In these cases eventually someone will complain that they want to get that data, but that seems tolerable to me. There are some failure modes that aren't easy to catch, e.g. when the import succeeds but does not maintain some invariants that are not enforced at the database level. I expect these failure modes to be extremely rare. (Adding new public columns is rare by itself.)
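To make the re-import script concrete, here is a minimal sketch of what the bundled `import.sql` could look like (the file names, the explicit transaction, and the table order are illustrative assumptions, not the final design):

```sql
-- Hypothetical import.sql: load the CSV dumps into an empty database that
-- already has the schema, e.g. created by running the Diesel migrations.
BEGIN;
-- Tables must be imported in an order that satisfies foreign-key constraints.
\copy keywords (id, keyword, crates_cnt, created_at) FROM 'keywords.csv' WITH CSV HEADER
\copy crates (id, name, updated_at, created_at, downloads, description, homepage, documentation, readme, textsearchable_index_col, license, repository, max_upload_size) FROM 'crates.csv' WITH CSV HEADER
\copy crates_keywords (crate_id, keyword_id) FROM 'crates_keywords.csv' WITH CSV HEADER
-- ... remaining tables ...
COMMIT;
```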
Or a better solution: we add a test that builds the set of all columns on all tables, subtracts the set of columns contained in the export, and compares the result to the known set of private columns and tables. If we add a new column, we either need to add it to the export script or to the known set of private columns in the test.
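As an illustration of that idea, the test could enumerate the live schema with a query along these lines and then compare the result against the union of the exported columns and the known-private columns (the query is a sketch, not part of the proposal):

```sql
-- Enumerate every (table, column) pair in the public schema. The test
-- subtracts the columns listed in the export script and checks that the
-- remainder exactly matches the known set of private columns and tables.
SELECT table_name, column_name
FROM information_schema.columns
WHERE table_schema = 'public'
ORDER BY table_name, column_name;
```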
Frankly, if I have a tool that imports this dump, when you add fields to the database and forget to update the dumping system to include them, that's a nice soft error since it will not make my tool break. It's just also a silent error.
I was just about to suggest that! It makes the silent error noisy.
I suggest two improvements over my previous proposal.

### Using row-level security instead of temporary views

A variation of the above approach would be to use row-level security instead of temporary views to filter rows. Specifically, this would involve the following steps:
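A minimal sketch of what such a setup could look like in Postgres, using `crate_owners` as an example (the `dump_user` role name, the password placeholder, and the exact policy are illustrative assumptions, not part of the original proposal):

```sql
-- Hypothetical role used only for creating the public dumps.
CREATE ROLE dump_user LOGIN PASSWORD 'changeme';

-- Column-level privileges: grant SELECT only on the public columns.
GRANT SELECT (crate_id, owner_id, created_at, updated_at, owner_kind)
    ON crate_owners TO dump_user;

-- Row-level security: dump_user will only see rows allowed by a policy.
ALTER TABLE crate_owners ENABLE ROW LEVEL SECURITY;
CREATE POLICY export_crate_owners ON crate_owners
    FOR SELECT TO dump_user
    USING (NOT deleted);
```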
After these steps, the backup user can only see the rows we want to export, so the `\copy` commands can read from the tables directly instead of from temporary views. The visibility of the individual columns could then be defined directly in the Rust source, e.g.:

```rust
enum Visibility {
Private,
Public,
}
use Visibility::*;
static VISIBILITY: &[(&str, &[(&str, Visibility)])] = &[
(
"api_tokens",
&[
("id", Private),
("user_id", Private),
...
],
),
(
"crate_owners",
&[
("crate_id", Public),
("owner_id", Public),
("created_at", Public),
("created_by", Private),
("deleted", Private),
("updated_at", Public),
("owner_kind", Public),
],
),
...
];
```

This approach has several advantages over what I proposed before.
### Data integrity

The problem with data integrity mentioned above has an easy solution in Postgres that I hadn't been aware of: a transaction run at the isolation level `REPEATABLE READ` sees a single consistent snapshot of the database for all of its queries.
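As an illustration (not part of the original comment), the `\copy` statements from the export script could be wrapped in one read-only transaction at that isolation level, so that every exported CSV file is taken from the same snapshot:

```sql
-- Run the whole export inside a single snapshot so that the exported tables
-- are consistent with each other.
BEGIN ISOLATION LEVEL REPEATABLE READ, READ ONLY;
\copy keywords (id, keyword, crates_cnt, created_at) TO 'keywords.csv' WITH CSV HEADER
\copy crates_keywords (crate_id, keyword_id) TO 'crates_keywords.csv' WITH CSV HEADER
-- ... all other \copy statements from the export script ...
COMMIT;
```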
Prototype: Public database dumps

This is an unfinished prototype implementation of the design I proposed to implement #630 (see #630 (comment) and #630 (comment)). I am submitting this for review to gather some feedback on the basic approach before spending more time on this.

This PR adds a background task to create a database dump. The task can be triggered with the `enqueue-job` binary, so it is easy to schedule in production using Heroku Scheduler.

### Testing instructions

To create a dump:

1. Start the background worker:

       cargo run --bin background-worker

1. Trigger a database dump:

       cargo run --bin enqueue-job dump_db

The resulting tarball can be found in `./local_uploads/db-dump.tar.gz`.

To re-import the dump:

1. Unpack the tarball:

       tar xzf local_uploads/db-dump.tar.gz

1. Create a new database:

       createdb test_import_dump

1. Run the Diesel migrations for the new DB:

       diesel migration run --database-url=postgres:///test_import_dump

1. Import the dump:

       cd DUMP_DIRECTORY
       psql test_import_dump < import.sql

(Depending on your local PostgreSQL setup, in particular the permissions for your user account, you may need different commands and URIs than given above.)

### Author's notes

* The background task executes `psql` in a subprocess to actually create the dump. One reason for this approach is its simplicity – the `\copy` convenience command issues a suitable `COPY TO STDOUT` SQL command and streams the result directly to a local file. Another reason is that I couldn't figure out how to do this at all in Rust with a Diesel `PgConnection`. There doesn't seem to be a way to run raw SQL with full access to the result.
* The unit test to verify that the column visibility information in `dump_db.toml` is up to date compares the information in that file to the current schema of the test database. Diesel does not provide any schema reflection functionality, so we query the actual database instead. This test may spuriously fail or succeed locally if you still have some migrations from unmerged branches applied to your test database. On Travis this shouldn't be a problem, since I believe we always start with a fresh database there. (My preferred solution for this problem would be for Diesel to provide some way to introspect the information in `schema.rs`.)

### Remaining work

* [x] Address TODOs in the source code. The most significant one is to update the `Uploader` interface to accept streamed data instead of a `Vec<u8>`. Currently the whole database dump needs to be loaded into memory at once.
* ~~Record the URLs of uploaded tarballs in the database, and provide an API endpoint to download them.~~ Decided to only store the latest dump at a known URL.
* [x] Devise a scheme for cleaning up old dumps from S3. The easiest option is to only keep the latest dump.
* [x] Somewhere in the tar file, note the date and time the dump was generated.
* [x] Verify that `dump-db.toml` is correct, i.e. that we don't leak any data we don't want to leak. Done via manual inspection. ~~One idea to do so is to reconstruct dumps from the information available via the API and compare to information in a test dump in the staging environment. This way we could verify what additional information will be made public.~~
* [x] The code needs some form of integration test. Idea from #1629: exporting some data, then trying to re-import it in a clean database.
* [x] Implement and document a way of re-importing the dumps to the database, e.g. to allow local testing of crates.io with realistic data.
* [x] Rebase and remove commits containing the first implementation.
* [x] Document the existence of this dump, how often it's regenerated, and that only the most recent dump is available (maybe in the crawler policy/crawler blocked error message?).
* [x] Include the commit hash of the crates.io version that created the dump in the tarball.
This is done! Please see the documentation for the database dumps at https://crates.io/data-access
It would be nice to have a snapshot of (a subset of) the database that someone wishing to contribute could load locally to test out the code on real data. You can fake crates, dependencies, downloads, etc., but that only gets you so far.