Prototype using persistent volumes for storage of user data #8104

Closed
kylos101 opened this issue Feb 8, 2022 · 15 comments
Labels
team: workspace Issue belongs to the Workspace team

Comments

@kylos101
Contributor

kylos101 commented Feb 8, 2022

Is your feature request related to a problem? Please describe

This is a test-and-learn exercise, so we can explore using persistent volumes to store /workspace data. https://www.notion.so/gitpod/Ensure-durability-and-availability-of-user-workspace-files-9edff6bbf87248d5ac73a7d4548ee4b3

Describe the behaviour you'd like

Happy path:
Store working copy files, /workspace, on a distinct persistent volume. When a workspace is stopped, its data must then be backed up and the persistent volume removed.

Questions:

  1. If a node powering a workspace is evicted from the cluster unexpectedly, test that we can back up the user's data to object storage.
  2. How might this look as a new component whose sole purpose is to back up data from persistent volumes?
  3. Would these persistent volumes tolerate a cluster failure?
  4. How will this impact our ability to start a workspace that was previously shut down, where the files still exist in the persistent volume? Can we restore the state of the workspace from the persistent volume (the latest data) rather than from object storage (the previous backup)?
@kylos101 kylos101 moved this to Scheduled in 🌌 Workspace Team Feb 8, 2022
@sagor999 sagor999 moved this from Scheduled to In Progress in 🌌 Workspace Team Feb 9, 2022
@sagor999
Contributor

sagor999 commented Feb 10, 2022

https://github.com/kubernetes-sigs/gcp-compute-persistent-disk-csi-driver
This driver is working in a local k3s cluster in the workspace-preview project. It supports all GCP PD types (standard, balanced, ssd), and it creates and deletes PVCs and provisions GCP disks without issues. I am testing only zonal PVCs of size 30G, since regional PVCs are much more expensive and not needed in our case; a regional PVC would only be beneficial if one of the zones fails and becomes unavailable. A sketch of the storage class and claim is below.
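For reference, this is roughly what the objects under test look like. A minimal sketch, assuming the GCP PD CSI driver is installed under its documented provisioner name; all object names and the zone are illustrative:

```yaml
# StorageClass backed by the GCP PD CSI driver; `type` may be
# pd-standard, pd-balanced, or pd-ssd.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: workspace-pd-balanced
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-balanced
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer # provision when a pod is scheduled
allowedTopologies:                      # keep the disk zonal, not regional
- matchLabelExpressions:
  - key: topology.gke.io/zone
    values: ["us-west1-a"]
---
# One 30Gi zonal claim per workspace.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: workspace-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: workspace-pd-balanced
  resources:
    requests:
      storage: 30Gi
```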

Pod creation speed:
local path: takes about 1 second to create the pod and start pulling the image.
gcp-csi: takes about 15 seconds to provision the PVC and reach the image-pulling phase, roughly the same for all GCP disk types.

Ran another test: set up the storage class to create the PVC immediately instead of waiting for the first consumer, so that the PVC exists ahead of time. That did not improve startup time; still 15 seconds.
It seems csi-attacher takes about 9 seconds to attach the PVC to the pod, which is the time GCP needs to attach the volume to the node itself. (That cannot be reduced, and the timing may vary with external factors; if GCP is under extra load, the operation might take longer.)
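The storage-class change in that second test amounts to a single field. A sketch (illustrative name):

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: workspace-pd-balanced-immediate
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-balanced
# Provision the disk as soon as the PVC is created, rather than when the
# first consuming pod is scheduled. This only moves the ~2s provisioning
# step earlier; the ~9s attach still happens at pod start.
volumeBindingMode: Immediate
```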

With the objective of keeping workspace startup time under 40 seconds, allocating 15-20 seconds for volume attachment is too much.

There is an easy workaround for this, though. It works like this:
We create a pod (one that consumes no resources) whose only job is to create and attach the PVC, absorbing the attach time. We create these ahead of time (though we shouldn't simply create the maximum number per node, as each allocates a PD and that incurs cost).
We then create the actual workspace pod when needed, attaching the same PVC. As long as the workspace pod has an affinity to run on the same node as our temp pod, it will mount the PVC just fine (even with ReadWriteOnce) and will not pay the attach penalty (verified in the test); see the sketch below.
It does make our setup a bit more complicated. But it gives us great flexibility in not being bound to a specific cloud (some clouds may not provide the big local ssd drives we use right now), it removes the local-ssd point of failure when a node dies, and it is completely transparent for self-hosted installations as well. (I believe a self-hosted installation using PVCs instead of local disk today incurs that ~20-second startup penalty per workspace.)
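A minimal sketch of the two pods involved, reusing the claim from earlier; the pod names, image choices, and node placeholder are illustrative:

```yaml
# Temp pod: consumes (almost) no resources; exists only to make the CSI
# driver attach the volume to a node ahead of time.
apiVersion: v1
kind: Pod
metadata:
  name: pvc-warmer
spec:
  containers:
  - name: pause
    image: k8s.gcr.io/pause:3.6
    volumeMounts:
    - name: workspace
      mountPath: /workspace
  volumes:
  - name: workspace
    persistentVolumeClaim:
      claimName: workspace-pvc
---
# Workspace pod: pinned to the node the warmer landed on, so the
# already-attached volume mounts without the ~9s attach step.
apiVersion: v1
kind: Pod
metadata:
  name: workspace
spec:
  nodeName: <node-the-warmer-landed-on>  # chosen by us, not the scheduler
  containers:
  - name: workspace
    image: gitpod/workspace-full         # illustrative image
    volumeMounts:
    - name: workspace
      mountPath: /workspace
  volumes:
  - name: workspace
    persistentVolumeClaim:
      claimName: workspace-pvc
```

Both pods can mount the same ReadWriteOnce claim because RWO restricts access to a single node, not a single pod.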

Alternative solution:
Currently our nodes are provisioned using local ssd (hence if a node dies, the data dies with it). We could instead use PD, though pd-ssd is quite a bit more expensive (local ssd is about 8¢ per GB per month, pd-ssd about 18¢ depending on the zone).

If we instead use Ceph/Rook or anything else, it will still incur the same volume-attachment penalty.

So we have the following options:

  1. Use a PVC with a temp pod that absorbs the attachment penalty ahead of time.
  2. Use local ssd, but risk losing customer data on node failure. (Maybe we could build some sort of distributed storage on top of local ssd that replicates data to other nodes, but that seems complex.)
  3. Use PD disks instead of local disk and keep the same setup as now, but incur extra cost (and probably lower IOPS).

Maybe there is some other alternative that I am not seeing?

All solutions that avoid local ssd will incur extra cost, though it might be mitigated. Right now each node allocates 2x375GB of local ssd, i.e. 2x375x$0.08 = $60 per month. The PVC solution for a fully maxed-out node running 20 workspaces on pd-balanced disks: 20x30GBx$0.12 = $72 per month. But during low peak we would use fewer PVCs, unlike local disks, which burn cost at a constant rate no matter what.

@sagor999
Contributor

Personally I would vote for the option of using a PVC with a pod that absorbs the attachment-time cost. It adds a bit of complexity, since each ws-daemon now needs to keep track of these pods and create them ahead of time, but that doesn't seem like too much, and to me the pros outweigh the cons.

@princerachit
Contributor

Thanks for the in-depth analysis and testing @sagor999! I have a few questions:

  1. Did you test what would happen if the node is deleted for some reason?
  2. Does the PVC auto-detach? How do we attach it to a new node for backup -> use the persistentVolume of the pod spec?

@sagor999
Contributor

  1. If the node is deleted, the PVC will stay. Even if the cluster is deleted, the PD disk can remain, and later on we can create a PVC out of it for data recovery (a sketch of that recovery binding is below).

  2. Yes, if the node is gone, the CSI driver will detach the volume. It can then be mounted by any other pod running on a different node.
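To make the recovery path concrete: a surviving PD disk can be bound back into a cluster by hand-creating a PersistentVolume that points at the existing disk, plus a claim for the backup pod to mount. A minimal sketch, assuming the GCP PD CSI driver; the disk path and all names are placeholders:

```yaml
# Pre-provisioned PV pointing at a disk that outlived its node/cluster.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: recovered-workspace-pv
spec:
  capacity:
    storage: 30Gi
  accessModes: ["ReadWriteOnce"]
  persistentVolumeReclaimPolicy: Retain  # never delete the disk with the claim
  storageClassName: ""                   # static binding, no provisioner
  csi:
    driver: pd.csi.storage.gke.io
    volumeHandle: projects/<project>/zones/<zone>/disks/<disk-name>
    fsType: ext4
---
# Claim that binds to the PV above; a backup pod mounts this claim.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: recovered-workspace-pvc
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: ""
  volumeName: recovered-workspace-pv
  resources:
    requests:
      storage: 30Gi
```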

@atduarte atduarte added the team: workspace Issue belongs to the Workspace team label Feb 10, 2022
@csweichel
Contributor

The PVC approach indeed looks promising. Do I understand correctly that we'd need to delete the temp pod to free up the PV for the workspace? If so, that would incur the deletion time of that pod, which may include time for the CSI driver to detach the PV.

Are there other means to pre-provision/pre-attach the PVs to nodes? I'm asking because we just got rid of "ghosts" which we'd essentially need to reintroduce otherwise.

@Furisto
Member

Furisto commented Feb 10, 2022

@sagor999 Do you know what the impact on read/write performance would be?

@sagor999
Contributor

One other disadvantage of the PVC-with-temp-pod approach:
we would need to decide which node the ws gets scheduled to :( since we have to put the PVC into the pod spec ahead of time.
That applies to the one ws = one PVC approach.

@csweichel yes, we would need to delete the temp pod once the ws has terminated; until then, the temp pod eats up those costs.

As for other means: we could go the route of allocating a disk per node instead of local ssd, which might actually cost about the same as what we have now. We use two 375G local disks; instead we could use one network-attached ssd disk (Google already provides reliability for those, afaik) at a similar cost, with no changes to our setup. And if the node dies, the disk is still there and can be re-attached to a different node.

@Furisto no, I did not run that test just yet, but it will be slower than local ssd for sure, just not sure by how much.

@sagor999
Contributor

https://github.com/gitpod-io/ops/blob/ff7e87b7a7ced2425cd9110160270369be003ee5/deploy/workspace/cluster-up.sh#L263
It doesn't seem like we are actually using local-ssd drives at all; we are attaching two pd-balanced disks to the node.
We create them with auto-delete=yes, probably so we don't have to worry about cleaning them up afterwards, but that has the side effect that when a node is deleted due to a health-check failure, those disks are killed as well.

But that also means pd-balanced disks seem to be sufficient for our workspace clusters. (I was under the impression that we had to use local-ssd for perf and IOPS.)

@sagor999
Contributor

sagor999 commented Feb 10, 2022

So, a tl;dr of sorts:
Currently we use two pd-balanced disks in raid0 for workspace storage, both set to auto-delete=true. Some problems with the existing approach:

  1. If a node dies, the disks get deleted and data is lost. (We can improve this by setting auto-delete to false and adding a reaper process that deletes disks once they have been properly backed up.)
  2. All workspaces share the same disk (a noisy neighbour can potentially reduce IOPS for all other workspaces).
  3. If we want to back up more than just the /workspace folder, the current approach has some limitations (not sure which).
  4. Not sure how this approach of attaching disks translates to the self-hosted solution.

Potential solutions:
A. Use PVC per workspace.
Pros:

  1. Only paying for disk usage that is actually needed.
  2. More flexibility in deciding which disk type to use (standard, balanced, ssd).
  3. Easier backup recovery: if a workspace or node fails, the PVC can easily be re-attached to a different node.

Cons:

  1. The attach/detach time penalty is significant (~15-20 seconds added to the start and stop of a ws).
    a. This can be improved by using a temp pod (ghost) that attaches the PVC in advance, but that means we have to decide ahead of time which node the ws will run on, instead of letting the scheduler do it.

B. Update current pd-balanced disks to use auto-delete=false
Pros:

  1. Keeps the current approach.

Cons:

  1. The current approach uses two disks in raid0 (not sure why; I think that is not needed).
  2. Requires a new service that monitors whether a node was shut down cleanly and deletes its disks, or, if it was not, spins up a new node that attaches those disks and performs backup recovery.
  3. Not sure how well this approach works in the self-hosted scenario.

@kylos101
Contributor Author

kylos101 commented Feb 10, 2022

Thank you for the tl;dr @sagor999 ! 🚀

The attach/detach time penalty is significant (~15-20 seconds added to the start and stop of a ws)

Does Google have any way for us to get priority and reduce this time, like reserving a certain amount of disks or paying for the storage ahead of time?

Regarding the initial questions:

  1. If a node powering a workspace is evicted from the cluster unexpectedly, test that we can back up the user's data to object storage.

In hindsight, we know we can do this; I did it with Mo a while back during a separate incident.

  2. How might this look as a new component whose sole purpose is to back up data from persistent volumes?

It would be good to get your thoughts on the design for how /workspace backup would work with a PVC approach. Would we use what we have now? Would we build something new? What might the contract look like for the API? Perhaps just this (taken from ws-manager):

    // backupWorkspace backs up a running workspace
    rpc BackupWorkspace(BackupWorkspaceRequest) returns (BackupWorkspaceResponse) {}

  3. Would these persistent volumes tolerate a cluster failure?

If yes, how would the user get access to their data again?

  4. How will this impact our ability to start a workspace that was previously shut down, but whose files on disk were never uploaded to object storage?

In other words, would it still make sense to restore from object storage (which could be old), instead of restoring from the disk (the working copy, which is more recent)?

@sagor999
Contributor

sagor999 commented Feb 10, 2022

Does Google have any way for us to get priority and reduce this time, like reserving a certain amount of disks or paying for the storage ahead of time?

No. Allocating the disk takes about 2 seconds; it is attaching that disk to the node that takes a long time. One solution is what I proposed above: a temp pod that has the PVC attached, with the workspace pod then using the same PVC. But that puts us back in the business of scheduling pods onto specific nodes (instead of letting the scheduler do it).

It would be good to get your thoughts on the design for how /workspace backup would work with a PVC approach.

The same way as it is right now; nothing would change in any drastic way. We might add our own finalizer to the PVC to make sure we back it up before it disappears into the ether (a sketch is below).
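A sketch of what that might look like. The finalizer name is hypothetical; a controller of ours would remove it only after the backup has landed in object storage, and Kubernetes blocks deletion of the claim until every finalizer is gone:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: workspace-pvc
  finalizers:
  - gitpod.io/workspace-backup  # hypothetical; our controller removes this
                                # once the backup is safely in object storage
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: workspace-pd-balanced
  resources:
    requests:
      storage: 30Gi
```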

Would these persistent volumes tolerate a cluster failure? If yes, how would the user get access to their data again?

Yes. The disks would still be in GCP. We would need tooling that recovers them automatically by creating a PVC binding for them (as sketched above) and running a backup.

In other words, would it still make sense to restore from object storage (which could be old), instead of restoring from the disk (the working copy, which is more recent)?

This depends on the timing. If we have some automated process, I assume we would have to wait for it to complete before the user regains access. I don't think we would be able to give the user direct access to the disk (and it is not great UX for users to recover their files on their own anyway; we definitely need an automatic process for this).

@sagor999 sagor999 moved this from In Progress to Scheduled in 🌌 Workspace Team Feb 10, 2022
@kylos101
Contributor Author

Do we know how many disks we can mount to our nodes? I'm seeing 16 here, and 127 here. 🤔 I figure it falls somewhere in between, but am unsure.

@sagor999
Contributor

On GCP the limit seems to be 128 disks per VM:
https://cloud.google.com/compute/docs/disks/#pdnumberlimits

@csweichel
Contributor

@sagor999 Option B (Update current pd-balanced disks to use auto-delete=false) looks appealing to me. It requires very few changes to our current setup, and building the service that looks after those disks should not be too involved for the GCP case.

The self-hosted drawback is real though. However, this may be a question of time: if we can move with this approach quickly, we have the issue solved in SaaS. It will take some time until SH installations reach that size, which gives us time to investigate building that "backup service" or other alternatives.

Note: it would also mean that we really need to clean up our state management of ws-daemon :) Right now it's leaking workspace state on the disk.

@sagor999
Contributor

Closing this issue, as the PVC approach is out of scope for now. We will concentrate on the approach described in #8202.

Repository owner moved this from Scheduled to Done in 🌌 Workspace Team Feb 14, 2022