+I need temporary (but long-term) access to a massive amount of GPU-enabled infrastructure to train a foundation model. I want to be able to “fire-and-forget” my ML job into this environment, which means submitting it via TorchX either directly to MCAD (using the MCAD-Kubernetes scheduler) or to a Ray cluster (using the Ray scheduler). Because of the size and cost of this job, it has already been well tested and validated, so access to Jupyter notebooks is unnecessary. I would prefer to write my job as a Bash script leveraging the CodeFlare CLI, or as a Python script leveraging the CodeFlare SDK. I need the ability to monitor the job while it is running, as well as access to all of its artifacts once it is complete. I also want to see where my jobs are in the current MCAD queue and the progress of all my current jobs visualized in a simple dashboard.