CodeFlareSDK_Design_Doc.md (+6 −6)
@@ -27,7 +27,7 @@ In order to achieve this we need the capacity to:
 * Generate valid AppWrapper yaml files based on user provided parameters
 * Get, list, watch, create, update, patch, and delete AppWrapper custom resources on a Kubernetes cluster
 * Get, list, watch, create, update, patch, and delete RayCluster custom resources on a Kubernetes cluster.
-* Expose a secure route to the Ray Dashboard endpoint.
+* Expose a secure route to the Ray Dashboard endpoint.
 * Define, submit, monitor and cancel Jobs submitted via TorchX. TorchX jobs must support both Ray and MCAD-Kubernetes scheduler backends.
 * Provide means of authenticating to a Kubernetes cluster
@@ -37,17 +37,17 @@ In order to achieve this we need the capacity to:
 In order to create these framework clusters, we will start with a template AppWrapper yaml file with reasonable defaults that will generate a valid RayCluster via MCAD.

-Users can customize their AppWrapper by passing their desired parameters to `ClusterConfig()` and applying that configuration when initializing a `Cluster()` object. When a `Cluster()` is initialized, it will update the AppWrapper template with the user’s specified requirements, and save it to the current working directory.
+Users can customize their AppWrapper by passing their desired parameters to `ClusterConfig()` and applying that configuration when initializing a `Cluster()` object. When a `Cluster()` is initialized, it will update the AppWrapper template with the user’s specified requirements, and save it to the current working directory.

 Our aim is to simplify the process of generating valid AppWrappers for RayClusters, so we will strive to find the appropriate balance between ease of use and exposing all possible AppWrapper parameters, and we will find this balance through user feedback.

 With a valid AppWrapper, we will use the Kubernetes python client to apply the AppWrapper to our Kubernetes cluster via a call to `cluster.up()`.

-We will also use the Kubernetes python client to get information about both the RayCluster and AppWrapper custom resources to monitor the status of our Framework Cluster.
+We will also use the Kubernetes python client to get information about both the RayCluster and AppWrapper custom resources to monitor the status of our Framework Cluster via `cluster.status()` and `cluster.details()`.

 The RayCluster deployed on your Kubernetes cluster can be interacted with in two ways: either through an interactive session via `ray.init()` or through the submission of batch jobs.

-Finally we will use the Kubernetes python client to delete the AppWrapper via `Cluster.down()`
+Finally, we will use the Kubernetes python client to delete the AppWrapper via `cluster.down()`.
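The lifecycle described in this hunk (configure, `up()`, monitor via `status()`, `down()`) can be sketched with a toy stand-in. The real classes live in the CodeFlare SDK and talk to the Kubernetes API; the parameter names below are illustrative assumptions, not the SDK's actual signature.

```python
# Toy stand-in for the described lifecycle. The real SDK creates/deletes an
# AppWrapper via the Kubernetes Python client instead of mutating local state.
from dataclasses import dataclass

@dataclass
class ClusterConfig:
    name: str
    namespace: str = "default"
    num_workers: int = 1  # illustrative parameter name

class Cluster:
    def __init__(self, config: ClusterConfig):
        self.config = config
        self._state = "absent"

    def up(self) -> None:
        # real SDK: apply the generated AppWrapper to the cluster
        self._state = "ready"

    def status(self) -> str:
        # real SDK: read RayCluster/AppWrapper custom resource status
        return self._state

    def down(self) -> None:
        # real SDK: delete the AppWrapper custom resource
        self._state = "absent"
```

Usage mirrors the prose above: `cluster = Cluster(ClusterConfig(name="demo"))`, then `cluster.up()`, `cluster.status()`, and finally `cluster.down()`.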
### Training Jobs:
@@ -57,7 +57,7 @@ Users can define their jobs with `DDPJobDefinition()` providing parameters for t
 Once a job is defined it can be submitted to the Kubernetes cluster to be run via `job.submit()`. If `job.submit()` is left empty the SDK will assume the Kubernetes-MCAD scheduler is being used. If a RayCluster is specified, like `job.submit(cluster)`, then the SDK will assume that the Ray scheduler is being used and submit the job to that RayCluster.

-After the job is submitted, a user can monitor its progress via `job.status()` and `job.logs()` to retrieve the status and logs output by the job. At any point the user can also call `.cancel()` to stop the job.
+After the job is submitted, a user can monitor its progress via `job.status()` and `job.logs()` to retrieve the status and logs output by the job. At any point the user can also call `job.cancel()` to stop the job.
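The scheduler-selection rule just described (an empty `job.submit()` implies the Kubernetes-MCAD backend; `job.submit(cluster)` implies the Ray scheduler) boils down to a small dispatch. `pick_scheduler` is a hypothetical helper for illustration, not a real SDK function.

```python
# Sketch of the described dispatch rule; names are illustrative only.
from typing import Any, Optional

def pick_scheduler(cluster: Optional[Any] = None) -> str:
    # No cluster argument -> assume the Kubernetes-MCAD scheduler backend.
    # A RayCluster argument -> submit to that cluster's Ray scheduler.
    return "ray" if cluster is not None else "kubernetes_mcad"
```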
### Authentication:
@@ -93,7 +93,7 @@ We will rely on the kubernetes cluster’s default security, where users cannot
 * Unit testing for all SDK functionality
 * Integration testing of SDK interactions with OpenShift and Kubernetes
-* System tests of SDK as part of the entire CodeFlare stack for main scenarios
+* System tests of SDK as part of the entire CodeFlare stack for main scenarios
 * Unit testing, integration testing, and system testing approaches
 * Unit testing will occur with every PR.
 * For system testing we can leverage [current e2e](https://github.com/project-codeflare/codeflare-operator/tree/main/test/e2e) tests from the operator repo.
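To give a flavor of the per-PR unit tests planned above, here is a pytest-style sketch. The helper under test is hypothetical, standing in for SDK logic that normalizes a user-supplied cluster name; plain asserts are used so it also runs standalone.

```python
# Hypothetical helper: normalize a name to a DNS-1123-friendly label,
# standing in for real SDK name-handling logic.
def raycluster_name(cluster_name: str) -> str:
    return cluster_name.lower().replace("_", "-")

# pytest-style unit test: discovered by `pytest` via the test_ prefix.
def test_raycluster_name_is_normalized():
    assert raycluster_name("My_Cluster") == "my-cluster"
```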