Move our index updates to be run in background jobs
This fundamentally changes the workflow for all operations we perform
involving git, so that they are not performed on the web server and do
not block the response. This will improve the response times of `cargo
publish`, and make the publish process more resilient, reducing the
likelihood of an inconsistency such as the index getting updated but
not our database.
Previously, our workflow looked something like this:
- When the server boots, do a full clone of the index into a known
location
- Some request comes in that needs to update the index
- Database transaction is opened
- Local checkout is modified, and we attempt to commit & push (note:
this involves a mutex to avoid contention with another request to
update the index on the same server)
- If push fails, we fetch, `reset --hard`, and try again up to 20 times
- Database transaction is committed
- We send a successful response
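
To make that concrete, here is a rough sketch (not the actual crates.io
code) of the shape of that mutex-guarded commit/push/retry loop,
shelling out to the `git` CLI. The helper names and the closure-based
`modify` hook are made up for illustration:

```rust
use std::process::Command;
use std::sync::Mutex;

// Hypothetical process-wide lock, standing in for the mutex that serialized
// index pushes from concurrent requests on the same web server.
static INDEX_LOCK: Mutex<()> = Mutex::new(());

/// Apply `modify` to the long-lived checkout, commit, and push, retrying up
/// to 20 times when another server pushes first.
fn commit_and_push_with_retries<F>(checkout: &str, modify: F) -> Result<(), String>
where
    F: Fn(&str) -> Result<(), String>,
{
    let _guard = INDEX_LOCK.lock().unwrap();

    for _attempt in 0..20 {
        // Re-apply the index change on every attempt, since a failed attempt
        // resets the checkout back to the remote's state.
        modify(checkout)?;
        git(checkout, &["add", "-A"])?;
        git(checkout, &["commit", "-m", "Update crate"])?;
        if git(checkout, &["push", "origin", "master"]).is_ok() {
            return Ok(());
        }
        // Another web server won the race: discard our commit, catch up, retry.
        git(checkout, &["fetch", "origin"])?;
        git(checkout, &["reset", "--hard", "origin/master"])?;
    }
    Err("gave up after 20 push attempts".into())
}

fn git(dir: &str, args: &[&str]) -> Result<(), String> {
    let status = Command::new("git")
        .current_dir(dir)
        .args(args)
        .status()
        .map_err(|e| e.to_string())?;
    if status.success() {
        Ok(())
    } else {
        Err(format!("git {:?} failed", args))
    }
}
```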
The reason for the retry logic is that we have more than one web server,
meaning no server can be sure that its local checkout is actually up to
date. There's also a major opportunity for an inconsistent state to be
reached here: if the power goes out, the server is restarted, or
something crashes between the index being updated and the database
transaction being committed, the update will never be retried.
The new workflow looks like this:
- Some request comes in that needs to update the index
- A job is queued in the database to update the index at some point in
the future.
- We send a successful response
- A separate process pulls the job out of the database
- A full clone of the index is performed into a temporary directory
- The new checkout is modified, committed, and pushed
- If push succeeds, job is removed from database
- If push fails, job is marked as failed and will be retried at some
point in the future
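
A rough sketch of the enqueue/worker shape follows. This uses SQLite via
the `rusqlite` crate purely to keep the example self-contained; the real
implementation runs against Postgres, and the table, column, and helper
names here are illustrative, not the real schema:

```rust
use rusqlite::{params, Connection, OptionalExtension, Result};

/// Create the (assumed) jobs table; the real schema will differ.
fn setup(conn: &Connection) -> Result<()> {
    conn.execute(
        "CREATE TABLE IF NOT EXISTS background_jobs (
             id       INTEGER PRIMARY KEY AUTOINCREMENT,
             job_type TEXT NOT NULL,
             data     TEXT NOT NULL,
             retries  INTEGER NOT NULL DEFAULT 0
         )",
        params![],
    )?;
    Ok(())
}

/// Queue an index update from the request handler. The request only pays for
/// an INSERT; all of the git work happens later in the worker.
fn enqueue_index_update(conn: &Connection, krate: &str, vers: &str) -> Result<()> {
    conn.execute(
        "INSERT INTO background_jobs (job_type, data) VALUES ('add_crate', ?1 || '#' || ?2)",
        params![krate, vers],
    )?;
    Ok(())
}

/// One iteration of the worker loop: take the oldest job, run it, and either
/// delete it (success) or bump its retry count so it runs again later (failure).
fn run_next_job(conn: &Connection) -> Result<()> {
    let job: Option<(i64, String)> = conn
        .query_row(
            "SELECT id, data FROM background_jobs ORDER BY id LIMIT 1",
            params![],
            |row| Ok((row.get(0)?, row.get(1)?)),
        )
        .optional()?;

    let Some((id, data)) = job else {
        return Ok(()); // nothing queued right now
    };

    match perform_index_update(&data) {
        Ok(()) => {
            conn.execute("DELETE FROM background_jobs WHERE id = ?1", params![id])?;
        }
        Err(_) => {
            conn.execute(
                "UPDATE background_jobs SET retries = retries + 1 WHERE id = ?1",
                params![id],
            )?;
        }
    }
    Ok(())
}

/// Stand-in for the real job body: clone the index, add the version, push.
fn perform_index_update(_data: &str) -> std::result::Result<(), String> {
    Ok(())
}
```

The important property is that queuing is just a database write in the
request path; everything involving git moves into the worker.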
While background workers can be spread across multiple machines and/or
threads, we can avoid the race conditions that were previously possible
by ensuring that only one worker, with one thread, handles index
updates. Right now that's easy, since index updates are the only
background job we have, but as we add more we will need to add support
for multiple queues to account for this.
I've opted to do a fresh checkout in every job, rather than relying on
some state that was set up when the machine booted. This is mostly for
simplicity's sake. It also means that if we need to scale to multiple
threads/processes for other jobs, we can punt the multi-queue
enhancement for a while if we wish. However, it does mean the job will
take a bit longer to run. If this turns out to be a problem, it's easy
to address.
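
A minimal sketch of that per-job checkout, assuming the `tempfile` crate
and the `git` CLI (the helper names are made up; the real code may do
this differently):

```rust
use std::path::Path;
use std::process::Command;
use tempfile::TempDir;

/// Clone the index into a throwaway directory for a single job run.
/// No on-disk state survives between jobs, so there is nothing to go
/// stale and nothing to coordinate between workers.
fn fresh_index_checkout(index_url: &str) -> Result<TempDir, String> {
    let dir = TempDir::new().map_err(|e| e.to_string())?;
    git(dir.path(), &["clone", index_url, "."])?;
    Ok(dir) // dropping the TempDir deletes the checkout
}

fn git(dir: &Path, args: &[&str]) -> Result<(), String> {
    let status = Command::new("git")
        .current_dir(dir)
        .args(args)
        .status()
        .map_err(|e| e.to_string())?;
    if status.success() {
        Ok(())
    } else {
        Err(format!("git {:?} failed", args))
    }
}
```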
This should eliminate the opportunity for the index to get into an
inconsistent state relative to our database -- or at least the two
should become eventually consistent. If the power goes out before the
job is committed
as done, it is assumed the job failed and it will be retried. The job
itself is idempotent, so even if the power goes out after the index is
updated, the retry should succeed.
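
As a sketch of what makes the retry safe (illustrative only -- the real
index files hold one JSON object per version per line, and the real job
also has to commit and push), the core "check before you append" shape
looks like:

```rust
use std::fs::{self, OpenOptions};
use std::io::Write;
use std::path::Path;

/// Append a version's metadata line to its index file, but only if it is not
/// already there. Running this twice has the same effect as running it once,
/// which is what lets a retried job succeed after a partial failure.
fn add_version_line(index_file: &Path, line: &str) -> std::io::Result<bool> {
    let existing = match fs::read_to_string(index_file) {
        Ok(contents) => contents,
        Err(e) if e.kind() == std::io::ErrorKind::NotFound => String::new(),
        Err(e) => return Err(e),
    };

    // Already recorded by a previous (possibly interrupted) attempt.
    if existing.lines().any(|l| l == line) {
        return Ok(false);
    }

    let mut file = OpenOptions::new().create(true).append(true).open(index_file)?;
    writeln!(file, "{line}")?;
    Ok(true)
}
```

If an earlier attempt appended the line and pushed but died before the
job was marked done, the retried job's fresh clone already contains the
line, so there is nothing new to commit.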
One other side effect of this change is that when `cargo publish`
returns with an exit status of 0, that does not mean that your crate/new
version is immediately available for use -- if you try to point to it in
Cargo.toml seconds after publishing, you may get an error saying that
version could not be found. This was technically already true, since
neither S3 nor GitHub guarantees that uploads/pushes are immediately
visible.
However, this does increase the timescale beyond the delay we would have
seen there. In most cases it should be under 10 seconds, and at most a
minute.
One enhancement that will come as a follow-up, but is not included in
this PR, is a UI to see the status of your upload. This is definitely
nice to have, but is not something I think is necessary for this feature
to land. In most cases, the time it would take to navigate to that UI
is longer than the time it takes the background job to run.
That enhancement is something I think can go hand in hand with rust-lang#1503
(which incidentally becomes much easier to implement with this PR, since
a "staging" publish just skips queuing the background job, and the only
thing the button to fully publish needs to do is queue the job).
This setup does assume that all background jobs *must* eventually
succeed. If any job fails, the index is in an inconsistent state with
our database, and we are having an outage of some kind. Due to the
nature of our background jobs, this likely means that GitHub is down, or
there is a bug in our code. Either way, we page whoever is on-call,
since it means publishing is broken. Since publishing crates is such an
infrequent event, I've set the alerting thresholds extremely low.