Move index updates off the web server #1588
☔ The latest upstream changes (presumably #1529) made this pull request unmergeable. Please resolve the merge conflicts.
You'll want to add `background-worker` and `monitor` to `bin_names` in `script/ci/prune-cache.sh`. This will prune the final build artifacts for the binaries to keep our cache sizes down on CI.
I see `background_worker` in the `Procfile`, but how (and how often) does `monitor` get run?
On each index operation, the background job clones the index from GitHub into a temporary directory. I'm a bit worried about this from two perspectives: GitHub might throttle us if we clone too frequently, and I'm not sure how much space Heroku provides within the temporary directory as we accumulate many checkouts in a 24-hour period. We could query the GitHub API (`curl -H "Accept: application/vnd.github.v3.sha" https://api.github.com/repos/rust-lang/crates.io-index/commits/master` returns the current SHA) to see if we already have the latest commit, and do the equivalent of a `fetch` and `reset` if not.
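For illustration only, the fetch-and-reset alternative mentioned here might look roughly like the sketch below using the git2 crate; the path and refspec handling are assumptions, not code from this PR.

```rust
// Hedged sketch: refresh an existing checkout instead of re-cloning.
use git2::{Repository, ResetType};
use std::path::Path;

fn refresh_checkout(checkout: &Path) -> Result<(), git2::Error> {
    let repo = Repository::open(checkout)?;

    // Fetch the latest master into the remote-tracking branch.
    let mut origin = repo.find_remote("origin")?;
    origin.fetch(&["+refs/heads/master:refs/remotes/origin/master"], None, None)?;

    // Hard-reset the working tree to whatever origin/master points at now.
    let head = repo.refname_to_id("refs/remotes/origin/master")?;
    let target = repo.find_object(head, None)?;
    repo.reset(&target, ResetType::Hard, None)?;
    Ok(())
}
```

Comparing the SHA returned by the GitHub API against the local `refs/remotes/origin/master` before fetching would let the job skip the network round trip entirely when nothing has changed.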
Finally, the retry semantics allow for jobs to be executed out of order. The most common source of errors will be network connectivity to GitHub. After an outage, new requests will complete successfully while old jobs are waiting for their retry time to expire. If someone yanks a crate during an outage and then unyanks it a few minutes later after connectivity recovers, then the index operations will complete out of order and the index state (yanked) will not match the database (unyanked). I think there needs to be some mechanism to guarantee that operations on the same `(crate_name, version)` complete in order.
It gets run by any cron-like tool, which is Heroku Scheduler for us. I figure we'll start at every 5 minutes and tweak if needed. It is not configured in the repository. I disagree that throttling is going to be a concern from cloning, but you make a good point on disk space.
This is intentional, but that's a good point that yanking/unyanking quickly can cause the index to not match the database. I need to give this some more thought. I don't think I agree that we need to guarantee in-order execution for operations affecting the same `(crate_name, version)`.
This is blocked on #1607
Explicitly share a single connection in tests

This is required to move forward on #1588; I have spun it off into a standalone PR for ease of review, since it's isolated and a change that makes sense on its own. This requires #1609 and includes those commits.

Right now the way our test suite works is that we set the database connection pool size to 1, and tell that single connection to start a test transaction. This works, but has the caveat that only a single reference to the connection can be live at any given time. This becomes a problem when we add background jobs into the mix, which will always want to grab two connections from the pool (and we still need them to be the same connection). The code that we end up running looks something like:

```
// In request
let conn = pool.get();
insert_background_job(&conn)?;

// In middleware that runs jobs
for _ in 0..job_count {
    thread::spawn(|| {
        let conn = pool.get();
        lock_job(&conn);

        // in actual job
        let conn = pool.get();
        do_stuff(&conn);
        unlock_or_delete_job(&conn);
    });
}
```

Note that the worker and the job itself both need a connection. In order to make this work, we need to change our pool to actually just be a single connection given out multiple times in test mode. I've chosen to use a reentrant mutex for this for two reasons:

1. We need to satisfy the `Sync` bound on `App`, even though we're only running in a single thread in tests.
2. The background worker actually does end up running in another thread, so we do actually need to be thread safe.

The result of this is that we can have the connection checked out more than once at the same time, but we still can't be more relaxed about when it gets dropped in our test code, since we need it to be unlocked if we try to get a connection from another thread as the result of running a background job. Our tests shouldn't know or care about whether background jobs were run async or not, so they should assume every request might run one.

The mutex guard holds a reference to the mutex, so this introduces a lifetime constraint that hasn't existed since the Diesel port. The return type of `req.db_conn()` is now effectively `&PgConnection` instead of `Box<PgConnection>`. Since the method exists on `Request`, this means it gets treated as an immutable borrow of the whole request. The fixes are all pretty simple, and boil down to either cloning the app and getting the DB directly from there, taking immutable references instead of mutable ones, or dropping the connection when we're done with it.
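As a rough sketch of that idea (assuming the parking_lot crate; `TestPool` is an illustrative name, not the actual crates.io type), the test-mode pool can hand the same connection out any number of times on one thread:

```rust
use std::sync::Arc;

use parking_lot::{ReentrantMutex, ReentrantMutexGuard};

// One shared "connection" handed out repeatedly; `T` would be `PgConnection`.
struct TestPool<T> {
    conn: Arc<ReentrantMutex<T>>,
}

impl<T> Clone for TestPool<T> {
    fn clone(&self) -> Self {
        TestPool { conn: Arc::clone(&self.conn) }
    }
}

impl<T> TestPool<T> {
    fn new(conn: T) -> Self {
        TestPool { conn: Arc::new(ReentrantMutex::new(conn)) }
    }

    // The same thread may call `get` while it already holds a guard, which is
    // exactly what happens when a request enqueues a job and the middleware
    // then runs that job on the same connection.
    fn get(&self) -> ReentrantMutexGuard<'_, T> {
        self.conn.lock()
    }
}
```

A guard held on another thread still blocks, which is why the tests have to drop their connection before a background job runs on the worker thread.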
(force-pushed from 4b5b9c4 to 5400248)
@jtgeibel I've updated the yanking logic to handle database updates in the background job instead of on the web server. I have not changed the clone-into-a-temp-dir behavior, and I don't think we need to in this PR.

The concern about disk space is actually already handled: we're using `TempDir`, which deletes the directory when it's dropped. I do want to change this to reuse the same checkout over multiple runs for other reasons, but I think there are enough tradeoffs/discussions for that to be its own PR. The concern about disk usage is resolved, so I don't think it needs to block this on its own.

So I believe this is good to go if you want to give it a last look. If it is good to merge, I'd appreciate a comment letting me know but holding off on merging for now. I consider this a pretty high-risk PR, so I want to make sure it's deployed by itself, when I have plenty of time to do some final checks on staging and have overridden myself to be on call.
☔ The latest upstream changes (presumably #1623) made this pull request unmergeable. Please resolve the merge conflicts.
(force-pushed from 09deccf to 8fc4b32)
☔ The latest upstream changes (presumably #1618) made this pull request unmergeable. Please resolve the merge conflicts.
@sgrif the updated yank logic looks good to me. Sorry I missed your ping on this last week.
I thought I looked into the drop behavior of `TempDir`, but I guess not, since it is right there in the documentation. This looks good to me as well.
I'll let you merge this when you're ready to deploy.
Going to do some final testing on staging and deploy this next week
Staging revealed a few issues which I've pushed fixes for. I'm confident that this actually works; I'm going to quickly test on staging with the full index instead of an empty one before merging.
☔ The latest upstream changes (presumably #1650) made this pull request unmergeable. Please resolve the merge conflicts.
Note: This is intended to possibly be extracted into a library, so the docs are written as if this were its own library. Cargo-specific code (besides the use of `CargoResult`) will go in its own module for the same reason.

This adds an MVP background queueing system intended to be used in place of the "try to commit 20 times" loop in `git.rs`. This is a fairly simple queue that is intended to be "the easiest thing that fits our needs, with the least operational impact". There are a few reasons I've opted to go with our own queueing system here, rather than an existing solution like Faktory or Beanstalkd:

- We'd have to write the majority of this code ourselves no matter what.
- Client libraries for Beanstalkd don't deal with the actual job logic; only `storage.rs` would get replaced.
- The only client for Faktory that exists made some really odd API choices. Faktory also hasn't seen a lot of use in the wild yet.
- I want to limit the number of services we have to manage. We have extremely limited ops bandwidth today, and every new part of the stack we have to manage is a huge cost. Right now we only have our server and PG. I'd like to keep it that way for as long as possible.

This system takes advantage of the `SKIP LOCKED` feature in PostgreSQL 9.5 to handle all of the hard stuff for us. We use PG's row locking to treat a row as "currently being processed", which means we don't have to worry about returning it to the queue if the power goes out on one of our workers.

This queue is intended only for jobs with "at least once" semantics. That means the entire job has to be idempotent. If the entire job completes successfully, but the power goes out before we commit the transaction, we will run the whole thing again.

The code today also makes a few additional assumptions based on our current needs. We expect all jobs to complete successfully the first time, and the most likely reason a job would fail is an incident at GitHub, hence the extremely high retry timeout. I'm also assuming that all jobs will eventually complete, and that any job failing N (likely 5) times is an event that should page whoever is on call. (Paging is not part of this PR.) Finally, it's unlikely that this queue will be appropriate for high-throughput use cases, since it requires one PG connection per worker (a real connection; adding PgBouncer wouldn't help here). Right now our only background work is something that comes in on average every 5 minutes, but if we start moving more code to be run here we may want to revisit this in the future.
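A minimal sketch of the `SKIP LOCKED` pattern described above, assuming Diesel 1.x and a deliberately simplified `background_jobs` table; this is not the actual queue code, and the job body is elided:

```rust
#[macro_use]
extern crate diesel;

use diesel::prelude::*;
use diesel::sql_types::BigInt;

#[derive(QueryableByName)]
struct NextJob {
    #[sql_type = "BigInt"]
    id: i64,
}

fn run_next_job(conn: &PgConnection) -> QueryResult<()> {
    conn.transaction(|| {
        // Each worker grabs a different row; a locked row is simply skipped.
        // The row lock lives as long as this transaction, so a crashed worker
        // releases its job automatically and it will be picked up again.
        let job = diesel::sql_query(
            "SELECT id FROM background_jobs ORDER BY id LIMIT 1 FOR UPDATE SKIP LOCKED",
        )
        .load::<NextJob>(conn)?
        .into_iter()
        .next();

        if let Some(job) = job {
            // ... run the (idempotent) job here ...

            // Deleting the row commits together with the lock release, so the
            // job only leaves the queue if it actually finished.
            diesel::sql_query("DELETE FROM background_jobs WHERE id = $1")
                .bind::<BigInt, _>(job.id)
                .execute(conn)?;
        }
        Ok(())
    })
}
```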
The `Runner` is basically the main interface into the job queue, and stores the environment, thread pool, and database connection pool. We technically don't need a thread pool since our jobs are so infrequent that we don't need more than one thread, but our tests are simplified by having it, and a general-purpose lib will need it.

The other oddity is that we will want to panic instead of returning a result if updating or fetching a job fails for some reason. This isn't something we expect to occur unless there's an issue in the DB, so this seems fine. A thread panicking won't kill the runner. We are also only panicking if changes to the `background_jobs` table fail, not if the job itself fails.
This fundamentally changes the workflow for all operations we perform involving git, so that they are not performed on the web server and do not block the response. This will improve the response times of `cargo publish`, and make the publish process more resilient, reducing the likelihood of an inconsistency occurring such as the index getting updated but not our database.

Previously, our workflow looked something like this:

- When the server boots, do a full clone of the index into a known location
- Some request comes in that needs to update the index
- Database transaction is opened
- Local checkout is modified, we attempt to commit & push (note: this involves a mutex to avoid contention with another request to update the index on the same server)
- If push fails, we fetch, `reset --hard`, and try again up to 20 times
- Database transaction is committed
- We send a successful response

The reason for the retry logic is that we have more than one web server, meaning no server can be sure that its local checkout is actually up to date. There's also a major opportunity for an inconsistent state to be reached here. If the power goes out, the server is restarted, something crashes, etc., in between the index being updated and the database transaction being committed, we will never retry it.

The new workflow looks like this:

- Some request comes in that needs to update the index
- A job is queued in the database to update the index at some point in the future
- We send a successful response
- A separate process pulls the job out of the database
- A full clone of the index is performed into a temporary directory
- The new checkout is modified, committed, and pushed
- If the push succeeds, the job is removed from the database
- If the push fails, the job is marked as failed and will be retried at some point in the future

While a background worker can be spread across multiple machines and/or threads, we will be able to avoid the race conditions that were previously possible by ensuring that we only have one worker with one thread that handles index updates. Right now that's easy since index updates are the only background job we have, but as we add more we will need to add support for multiple queues to account for this.

I've opted to do a fresh checkout in every job, rather than relying on some state that was set up when the machine booted. This is mostly for simplicity's sake. It also means that if we need to scale to multiple threads/processes for other jobs, we can punt the multi-queue enhancement for a while if we wish. However, it does mean the job will take a bit longer to run. If this turns out to be a problem, it's easy to address.

This should eliminate the opportunity for the index to enter an inconsistent state from our database -- or at least they should become eventually consistent. If the power goes out before the job is committed as done, it is assumed the job failed and it will be retried. The job itself is idempotent, so even if the power goes out after the index is updated, the retry should succeed.

One other side effect of this change is that when `cargo publish` returns with an exit status of 0, that does not mean that your crate/new version is immediately available for use -- if you try to point to it in Cargo.toml seconds after publishing, you may get an error that it could not find that version. This was technically already true, since neither S3 nor GitHub guarantee that uploads/pushes are immediately visible. However, this does increase the timescale beyond the delay we would have seen there. In most cases it should be under 10 seconds, and at most a minute.

One enhancement that will come as a followup, but is not included in this PR, is a UI to see the status of your upload. This is definitely nice to have, but is not something I think is necessary for this feature to land. The time it would take to navigate to that UI is going to be longer than the time it takes the background job to run in most cases. That enhancement is something I think can go hand in hand with rust-lang#1503 (which incidentally becomes much easier to implement with this PR, since a "staging" publish just skips queuing the background job, and the only thing the button to full publish needs to do is queue the job).

This setup does assume that all background jobs *must* eventually succeed. If any job fails, the index is in an inconsistent state with our database, and we are having an outage of some kind. Due to the nature of our background jobs, this likely means that GitHub is down, or there is a bug in our code. Either way, we page whoever is on-call, since it means publishing is broken. Since publishing crates is such an infrequent event, I've set the thresholds to be extremely low.
These binaries are pretty simple (and a bit spaghetti) for the time being, since we have so little actually using the background job framework. The runner is meant to be run on a worker dyno continuously, while the monitor should be run by a cron-like tool roughly every 5 minutes. We should move `update-downloads` to be scheduled as well, since it spends so little time actually working.
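For context, the check the monitor performs is conceptually something like the sketch below, assuming Diesel 1.x; the table layout and the `MAX_JOB_AGE_MINUTES` variable name are illustrative only, since the thread above merely says the threshold is configurable by env var.

```rust
#[macro_use]
extern crate diesel;

use std::env;

use diesel::prelude::*;
use diesel::sql_types::{Bool, Integer};

#[derive(QueryableByName)]
struct StaleCheck {
    #[sql_type = "Bool"]
    stale: bool,
}

// Returns true if any queued job is older than the configured threshold,
// in which case the monitor would page whoever is on call.
fn jobs_are_stale(conn: &PgConnection) -> QueryResult<bool> {
    let minutes: i32 = env::var("MAX_JOB_AGE_MINUTES")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(15);

    diesel::sql_query(
        "SELECT EXISTS(
            SELECT 1 FROM background_jobs
            WHERE created_at < NOW() - INTERVAL '1 minute' * $1
        ) AS stale",
    )
    .bind::<Integer, _>(minutes)
    .load::<StaleCheck>(conn)
    .map(|rows| rows.into_iter().next().map(|r| r.stale).unwrap_or(false))
}
```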
Now that we're wrapping the connection in our own smart pointer that doesn't implement `Connection`, we need some explicit derefs (adding the manual `Connection` impl isn't worth avoiding these `*`s). We need to be able to clone the connection pool (only in tests, but this also only requires modifying the test variant), so we need the `Arc`. Similarly, since in tests the connection is a re-entrant mutex, we can't grab the connection before spawning the worker thread. The lock isn't `Send`, and that's for a very good reason. So we instead need to clone a handle to the pool and grab the connection on the thread we intend to use it on.
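In other words (a hedged sketch using Diesel's r2d2 pool rather than the test pool), the worker thread receives a cloned pool handle and checks the connection out itself:

```rust
use std::thread;

use diesel::r2d2::{ConnectionManager, Pool};
use diesel::PgConnection;

fn spawn_worker(pool: &Pool<ConnectionManager<PgConnection>>) -> thread::JoinHandle<()> {
    // The pool handle is `Clone + Send`; the checked-out connection (or the
    // re-entrant lock guard in tests) is not, so it must be acquired on the
    // worker thread itself.
    let handle = pool.clone();
    thread::spawn(move || {
        let conn = handle
            .get()
            .expect("couldn't check out a connection for the worker");
        // ... lock and run a job using `&*conn` ...
        drop(conn);
    })
}
```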
Previously, we'd immediately update the database and then queue the index update to be run later. Jobs aren't guaranteed to run in the order they were queued (though in practice they likely will). This means that if someone yanked and unyanked a crate in rapid succession, the database may end up permanently out of sync with the index. With this change, the final result may still be out of order, but the database will be in sync with the index whatever the outcome is. Since rapidly yanking and unyanking isn't something we expect users to do, and even if they did jobs should be running in order under normal circumstances, I don't think we need to do anything to ensure more consistency than this.
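Illustratively (not the actual crates.io code, and with a deliberately minimal `table!` definition), the job re-reads the current state rather than carrying "yank" or "unyank" as its payload:

```rust
#[macro_use]
extern crate diesel;

use diesel::prelude::*;

table! {
    versions (id) {
        id -> Integer,
        yanked -> Bool,
    }
}

// Whichever conflicting job happens to run last re-reads this flag, so the
// index converges to whatever the database currently says.
fn current_yanked_state(conn: &PgConnection, version_id: i32) -> QueryResult<bool> {
    use self::versions::dsl::*;
    versions.find(version_id).select(yanked).first(conn)
}
```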
When testing in staging, the runner started crashing on boot. This was because we only set the connection pool size to 1 when we meant to set it to 2 (one for the runner, one for the worker thread). So on each loop, if all jobs took more than 1 second to run, the runner would crash. This fixes the pool size, and also no longer returns an error if no database connection could be retrieved from the pool.
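A sketch of the corrected configuration, assuming Diesel's r2d2 integration (the function and variable names are placeholders):

```rust
use diesel::r2d2::{ConnectionManager, Pool};
use diesel::PgConnection;

fn build_runner_pool(database_url: &str) -> Pool<ConnectionManager<PgConnection>> {
    Pool::builder()
        // One connection for the runner's own bookkeeping plus one for the
        // single worker thread; a pool of 1 starves one of the two.
        .max_size(2)
        .build(ConnectionManager::new(database_url))
        .expect("failed to build database connection pool")
}
```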
Right now we check if a crate is already yanked/unyanked before enqueuing the job at all. This means that if you try to undo a yank before the yank finishes running, we won't enqueue the unyank at all. This isn't an expected scenario, and we still might not do the unyank if the jobs are run out of order, but this means that we should always end up in the expected state under normal operation.
(force-pushed from aa3712b to 9d38d19)
The time a full clone takes caused too long a delay when publishing a crate. This instead performs a full clone when the runner boots, and then locks and does `git fetch && git reset --hard origin/master` at the top of every job. The reset means we can (somewhat) scale to multiple runners, and retains unwind safety in the environment. Note that there can still be a delay measured in minutes if a publish is performed right after the server boots, as the job runner starting up does not block the web server from starting.
The delay for publishes was too long with the full index, so I've set it up to maintain a local checkout. This eliminates the delay once the runner has booted, but there may still be a delay if a publish is done immediately after the server boots, as the web server now boots instantly rather than waiting for the clone to complete.
Going to do one more test in staging and then deploy this. @bors: r+
📌 Commit 6777978 has been approved by
Move index updates off the web server

This fundamentally changes our workflow for publishing, yanking, and unyanking crates. Rather than synchronously updating the index when the request comes in (and potentially retrying multiple times since we have multiple web servers that can create a race condition), we instead queue the update to be run on another machine at some point in the future. This will improve the resiliency of index updates -- specifically letting us avoid the case where the index has been updated, but something happened to the web server before the database transaction committed.

This setup assumes that all jobs *must* complete within a short timeframe, or something is seriously wrong. The only background jobs we have right now are index updates, which are extremely low volume. If a job fails, it most likely means that GitHub is down, or a bug has made it to production which is preventing publishing and/or yanking. For these reasons, this PR includes a monitor binary which will page whoever is on call with extremely low thresholds (defaults to paging if a job has been in the queue for 15 minutes, configurable by env var). The runner is meant to be run on a dedicated worker, while the monitor should be run by some cron-like tool on a regular interval (Heroku scheduler for us).

One side effect of this change is that `cargo publish` returning with a 0 exit status does not mean that the crate can immediately be used. This has always technically been true, since S3 and GitHub both can have delays before they update as well, but it's going to consistently be a bit longer with this PR. It should only be a few seconds the majority of the time, and no more than a minute in the worst case.

One enhancement I'd like to make, which is not included in this PR, is a UI to show the status of a publish. I did not include it here, as this PR is already huge, and I do not think that feature is strictly required to land this. In the common case, it will take longer to navigate to that UI than it will take for the job to complete. This enhancement will also go nicely with work on staging publishes if we want to add those (see #1503). There are also some low hanging fruit we can tackle to lower the job's running time if we feel it's necessary.

As for the queue itself, I've chosen to implement one here based on PostgreSQL's row locking. There are a few reasons for this vs something like RabbitMQ or Faktory. The first is operational. We still have a very small team, and very limited ops bandwidth. If we can avoid introducing another piece to our stack, that is a win both in terms of the amount of work our existing team has to do, and making it easy to grow the team (by lowering the number of technologies one person has to learn). The second reason is that using an existing queue wouldn't actually reduce the amount of code required by that much. The majority of the code here is related to actually running jobs, not interacting with PostgreSQL or serialization. The only Rust libraries that exist for this are low level bindings to other queues, but the majority of the "job" infrastructure would still be needed.

The queue code is intended to eventually be extracted to a library. This portion of the code is the `background` module, and is why a lot of the code in that module is written a bit more generically than crates.io specifically needs. It's still a bit too coupled to crates.io to be extracted right now, though -- and I'd like to have it in the wild for a bit before extracting it.
The `background_jobs` module is our code for interacting with this "library".
💔 Test failed - checks-travis
@bors r+
📌 Commit 1ab2c08 has been approved by
☀️ Test successful - checks-travis