Commit e6baaaa — Introduce ADR for Project CodeFlare test strategy

Signed-off-by: Karel Suta <[email protected]>
# Define testing strategy for Project CodeFlare

|                |                          |
| -------------- | ------------------------ |
| Date           | 06/14/2023               |
| Scope          |                          |
| Status         | Proposed                 |
| Authors        | [Karel Suta](@sutaakar), [Antonin Stefanutti](@astefanutti) |
| Supersedes     | N/A                      |
| Superseded by  | N/A                      |
| Issues         |                          |
| Other docs     | [PCF-ADR-0003](https://github.com/project-codeflare/adr/blob/main/PCF-ADR-0003-codeflare-release-process.md) |
## What

This ADR introduces an overview of the testing strategy and approach for Project CodeFlare.

## Why

The various components of Project CodeFlare currently handle their testing in an individual and uncoordinated way. Additionally, there are no tests verifying the interactions and compatibility between the different components.
This ADR aims to document a unified approach to testing the various components, extending the testing section of the [Release process ADR](https://github.com/project-codeflare/adr/blob/main/PCF-ADR-0003-codeflare-release-process.md#testing).
## Goals

* Establish a common test approach for CodeFlare components
* Define the testing scope and priorities
* Analyze and decide on the test automation environment and tools

## Non-Goals

* Cover certification testing
* Cover non-functional testing (performance, security, etc.)
## How

### Scope

High priority components:
* CodeFlare operator
* KubeRay
* MCAD
* ODH Distributed Workloads Component

Lower priority components:
* CodeFlare SDK
* MCAD dashboard
* InstaScale
### Test Approach

#### Test levels and requirements:

PR automated check:
* Test execution speed is critical to provide fast feedback (less than 30 minutes)
* Focus on functional testing; avoid long-running tests with lower value
* Compose the tests (first execute unit tests; once they pass, run e2e tests)
* Use parallelization where possible to speed up execution

Nightly build check (to be implemented once nightly builds are available):
* Relaxed test execution speed requirements (e.g., 2 hours)
* Run the complete test suite
* Also test the integration with the ODH Distributed Workloads Component

Release build check:
* Test speed is important to be able to provide a build fast (less than 1 hour)
* Focus on functional testing; avoid long-running tests with lower value (covered by nightly checks)
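The PR-level requirements above (unit tests gating e2e tests, a strict time budget) could be sketched as a GitHub Actions workflow. The job names and `make` targets below are illustrative assumptions, not the project's actual CI configuration:

```
# Illustrative sketch of a PR check workflow (job names and make targets assumed)
name: PR checks

on:
  pull_request:

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    timeout-minutes: 30        # fast-feedback budget from this ADR
    steps:
      - uses: actions/checkout@v3
      - run: make test-unit    # assumed Makefile target

  e2e-tests:
    needs: unit-tests          # e2e runs only once unit tests pass
    runs-on: ubuntu-latest
    timeout-minutes: 30
    steps:
      - uses: actions/checkout@v3
      - run: make test-e2e     # assumed Makefile target
```

Splitting the stages with `needs` keeps the cheap unit tests as a gate, so a failing PR gets feedback without paying for the e2e run.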
#### Testing levels and types of testing:

| Test type/development cycle | Feature developer | PR automated checks | PR review by reviewer | Nightly automated checks | Release automated checks |
| ---------------------------- | ----------------- | -------------------- | --------------------- | ------------------------- | ------------------------ |
| Manual testing | Yes | No | Yes | No | No |
| Unit testing | Yes | Yes | No | Yes | Yes |
| End to end testing | Yes | Yes | No | Yes | Yes |
| Integration testing | Optional | Yes | No | Yes | Yes |
| Upgrade testing | Optional | Yes | No | Yes | Yes |

Glossary:

Manual testing - Manual verification of the functionality.

Unit testing - Unit tests covering the base functionality.

End to end testing - Testing including all CodeFlare components, running in a cloud.

Integration testing - Testing of the interactions between ODH components and CodeFlare components in a cloud.

Upgrade testing - Testing of component upgrades (deploying a new operator version) in a cloud.
### Test Environment

The test environment depends on the resource requirements of the executed tests. For unit tests we will leverage the GitHub Actions runner. For end-to-end/integration/upgrade testing we will currently use a KinD cluster running on the GitHub Actions runner. In case the resources provided by the GitHub Actions runner are not sufficient, we can consider alternatives like [Testing Farm](https://docs.testing-farm.io/general/0.1/index.html).
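As a sketch of what the e2e environment could look like, a minimal KinD cluster configuration is shown below; the node layout is an illustrative assumption, not a prescribed topology:

```
# Illustrative KinD configuration for e2e/integration runs on a CI runner
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
  - role: worker      # a single worker keeps the footprint within runner limits
```

Keeping the cluster small matters here, since GitHub Actions runners offer limited CPU and memory.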
### Defect Management

#### Bug logging

Bugs found by testing are reported as GitHub issues in the respective repository. Every bug description has to contain the steps to reproduce the issue.

#### Bug fixing

Bugs are fixed with a similar approach to new features - planned as part of sprint planning and fixed through PRs. If possible, the fix contains test coverage, preventing the issue from happening again.
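The steps-to-reproduce requirement could be enforced mechanically with a GitHub issue form. The template below is an illustrative sketch of such a form, not an existing file in the repositories:

```
# Illustrative sketch of .github/ISSUE_TEMPLATE/bug_report.yaml (assumed, not existing)
name: Bug report
description: Report a defect found while testing a CodeFlare component
labels: [bug]
body:
  - type: textarea
    id: steps
    attributes:
      label: Steps to reproduce
      description: Exact steps, including component versions and cluster setup
    validations:
      required: true   # every bug description must contain reproduction steps
  - type: textarea
    id: expected
    attributes:
      label: Expected vs. actual behavior
```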
### Test Suite

#### E2E Test Cases

##### Setup

The following steps must be executed once, with a **cluster admin role**, as prerequisites to running the e2e test cases:

1. Provision a test Kubernetes cluster (resp. an OpenShift cluster)
2. Build the CodeFlare operator container image:
    * Clone the CodeFlare operator source code repository
    * Checkout the branch to be tested, e.g., the PR feature branch: `git clone --branch mytag0.1 --depth 1 https://example.com/my/repo.git`
    * Build the CodeFlare operator container image at the HEAD revision
    * Push the CodeFlare operator container image into a test container image registry (resp. the OpenShift cluster internal image registry)
3. Install the CodeFlare stack components into the test cluster:
    * Configure the CodeFlare operator deployment with Kustomize, to use the previously built container image
    * Create the codeflare-system Namespace (resp. Project)
    * Deploy the CodeFlare operator using Kustomize

      Note: the installation using OLM is covered as part of the installation / upgrade test cases
    * Deploy the KubeRay operator
    * Create a default MCAD resource
    * Grant the MCAD controller ServiceAccount edit permission on RayCluster resources
    * Create a default InstaScale resource
4. Wait until all the components are ready

The e2e test cases can now be executed, with a **standard user role**, ideally in parallel, to speed up the execution and shorten the feedback loop as much as possible.

> **Note**
> The test cases should define lean / minimal compute resources, relative to the provisioned cluster, so that the parallel execution / throughput of the batch jobs submitted via the MCAD scheduling queue is maximized.
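The Kustomize configuration step above could look like the following sketch; the overlay path, image name, and registry are illustrative assumptions, not the repository's actual layout:

```
# Illustrative kustomization.yaml sketch pointing the operator deployment
# at the freshly built test image (paths and image names assumed)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: codeflare-system
resources:
  - ../../config/default              # assumed location of the operator manifests
images:
  - name: codeflare-operator          # image name referenced in the manifests (assumed)
    newName: registry.example.com/codeflare-operator
    newTag: pr-test
```

The `images` transformer rewrites the image reference without patching the Deployment manifest directly, which keeps the test overlay independent of the base.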
##### Submit a Sample PyTorch Job in a managed Ray Cluster

###### Description

Submit a test PyTorch batch job to a Ray cluster managed by MCAD, in a user tenant, and assert successful completion of the job. Shutdown the Ray cluster, and assert the successful freeing of resources.

###### Scenario

1. Create a test Namespace (resp. Project)
2. Create a test AppWrapper resource with the following specifications: https://github.com/project-codeflare/multi-cluster-app-dispatcher/blob/14d569bec1cd016dd41352e3c026f461d851f480/doc/usage/examples/kuberay/config/aw-raycluster.yaml
3. Wait until the Ray cluster is ready
4. Submit the test batch job to the Ray cluster
5. Wait until the job has completed
6. Assert the job status is successful
7. Delete the test AppWrapper resource
8. Assert all the resources have been successfully freed
##### Submit a Sample PyTorch Job directly via MCAD

Lower priority.

###### Description

Submit a test PyTorch batch job to the MCAD scheduler, in a user tenant. Assert the successful execution of the job, and the clean-up of resources upon the batch job completion.

###### Scenario

1. Create a test Namespace (resp. Project)
2. Create a test AppWrapper resource with the following specifications:
```
apiVersion: mcad.ibm.com/v1beta1
kind: AppWrapper
spec:
  resources:
    GenericItems:
    - allocated: 0
      generictemplate:
        apiVersion: v1
        kind: Pod
        metadata:
        spec:
          containers:
          - command:
            - bash
            - '-c'
            - >-
              torchrun --rdzv_backend c10d --rdzv_endpoint
              $TORCHX_RANK0_HOST:49782 --rdzv_id 'test-job'
              --nnodes 1 --nproc_per_node 1 --node_rank '0' --tee 3 --role
              '' mnist.py
            env:
            - name: TORCHX_TRACKING_EXPERIMENT_NAME
              value: default-experiment
            - name: LOGLEVEL
              value: WARNING
            - name: TORCHX_JOB_ID
              value: 'kubernetes_mcad://torchx/test-job'
            - name: TORCHX_RANK0_HOST
              value: localhost
            - name: TORCHX_MCAD_MNIST_0_HOSTS
              value: test-job
            image: 'quay.io/michaelclifford/mnist-test:latest'
            name: test-job
            ports:
            - containerPort: 29500
              name: c10d
            resources:
              limits:
                cpu: 1000m
                memory: 4000M
              requests:
                cpu: 900m
                memory: 2976M
            volumeMounts:
            - mountPath: /dev/shm
              name: dshm
          hostname: test-job
          restartPolicy: Never
          subdomain: test-job
          volumes:
          - emptyDir:
              medium: Memory
            name: dshm
      priority: 0
      priorityslope: 0
      replicas: 1
    - allocated: 0
      generictemplate:
        apiVersion: v1
        kind: Service
        metadata:
          name: test-job
        spec:
          clusterIP: None
          ports:
          - port: 29500
            protocol: TCP
            targetPort: 29500
          publishNotReadyAddresses: true
          selector:
            appwrapper.mcad.ibm.com: test-job
          sessionAffinity: None
          type: ClusterIP
  schedulingSpec:
    requeuing:
      growthType: exponential
      maxNumRequeuings: 0
      maxTimeInSeconds: 0
      numRequeuings: 0
      timeInSeconds: 300
```
3. Wait until the test job has completed
4. Assert the job status is successful
5. Delete the test AppWrapper resource
6. Assert all the resources have been successfully freed
#### OLM Installation / Upgrade Test Cases

##### Setup

The following steps must be executed once, with a **cluster admin role**, as prerequisites to running the OLM installation / upgrade test cases:

1. Provision a test Kubernetes cluster (resp. an OpenShift cluster)
2. For a Kubernetes cluster only: install OLM by executing the following commands:
```
$ kubectl apply -f https://github.com/operator-framework/operator-lifecycle-manager/releases/download/v0.24.0/crds.yaml
$ kubectl apply -f https://github.com/operator-framework/operator-lifecycle-manager/releases/download/v0.24.0/olm.yaml
```
3. Build the CodeFlare operator OLM bundle:
    * Clone the CodeFlare operator source code repository
    * Checkout the branch to be tested, e.g., the PR feature branch: `git clone --branch mytag0.1 --depth 1 https://example.com/my/repo.git`
    * Build the CodeFlare operator container image at the HEAD revision
    * Push the CodeFlare operator container image into a test container image registry (resp. the OpenShift cluster internal image registry)
    * Build the OLM bundle image
    * Push the OLM bundle image into a test container image registry (resp. the OpenShift cluster internal image registry)
4. Build the OLM index image:
    * Build the OLM index/catalog image with OPM from the latest Operator Hub community catalog (resp. OpenShift catalog), e.g.:
```
$ mkdir catalog
$ opm render registry.redhat.io/redhat/redhat-operator-index:<OCP_VERSION> -o yaml > catalog/bundles.yaml
$ opm render $BUNDLE_IMAGE > catalog/codeflare-operator-bundle.yaml
$ sedtemp=$(mktemp sed-template-XXX.sed)
$ cat << EOF > ${sedtemp}
/- name: codeflare-operator.v${PREVIOUS_VERSION}/ {
  p;
  n;
  / replaces:/ {
    p;
    n;
    /name: alpha$/ {
      i- name: codeflare-operator.v${PREVIOUS_VERSION}
      i\ \ replaces: codeflare-operator.v${PREVIOUS_VERSION}
      p;
      d;
    }
  }
}
p;
EOF
$ sed -i -n -f ${sedtemp} ${CATALOG_DIR}/bundles.yaml
$ rm -f ${sedtemp}
$ opm validate catalog
$ opm generate dockerfile catalog
$ podman build . -f catalog.Dockerfile -t <CATALOG_IMG>
```
> **Note**
> The above should be done according to https://github.com/operator-framework/operator-sdk/issues/5832 when available.

> **Note**
> The channel must be adapted according to https://github.com/project-codeflare/codeflare-operator/issues/126.

* Push the OLM index/catalog image that’s been built at the previous step to the test container image registry (resp. the OpenShift cluster internal container image registry)

The test cases can now be executed, with a **cluster admin role**, in sequence, as the upgrade test case depends on the installation one.
##### Installation of the CodeFlare Operator with OLM

###### Description

Install the latest released version of the CodeFlare operator using OLM, and assert the successful deployment of the operator. Run a smoke test, to make sure the current version of the operator is working as expected.

###### Scenario

1. Create a codeflare-system test Namespace (resp. Project)
2. Create a CatalogSource resource with the following specifications:
```
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: test-catalog
spec:
  displayName: CodeFlare OLM Upgrade Tests
  image: registry.redhat.io/redhat/redhat-operator-index
  publisher: CodeFlare Team
  sourceType: grpc
```
3. Create a Subscription resource that points to the CatalogSource resource created at the previous step, e.g.:
```
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: codeflare-operator
spec:
  channel: alpha
  installPlanApproval: Automatic
  name: codeflare-operator
  source: test-catalog
  startingCSV: codeflare-operator.<LATEST_VERSION>
```
4. Run a smoke test, e.g., by running one of the e2e test cases, to make sure the current version of the operator is working correctly.
##### Upgrade of the CodeFlare Operator installed via OLM

###### Description

Add a newer version of the CodeFlare operator bundle, replacing the latest released version from the channel subscribed to during the Installation of the CodeFlare Operator with OLM test case. Assert the successful upgrade of the operator to the newer version, and run a smoke test, to make sure that the newer version is working as expected.

###### Scenario

> **Note**
> Make sure to execute the Installation of the CodeFlare Operator with OLM test case first.

1. In the same codeflare-system Namespace (resp. Project), update the existing test-catalog CatalogSource resource, by pointing to the new OLM index image, e.g.:
```
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: test-catalog
spec:
  displayName: CodeFlare OLM Upgrade Tests
  image: <CATALOG_IMG>
  publisher: CodeFlare Team
  sourceType: grpc
```
2. Assert a new ClusterServiceVersion resource has been created for the newer operator version, and that it eventually reaches the CSVPhaseSucceeded phase
3. Run a smoke test, e.g., by running one of the e2e test cases, to make sure the current version of the operator is working correctly.
## Open Questions

1. Add observability test cases
2. Add integration test cases

## Alternatives

We didn't consider any other alternatives.
## Stakeholder Impacts

| Group | Key Contacts | Date | Impacted? |
| ----------------------------- | ------------------ | ---------- | --------- |
| CodeFlare SDK | Mustafa Eyceoz | | yes |
| MCAD | Abhishek Malvankar | | yes |
| InstaScale | Abhishek Malvankar | | yes |
| CodeFlare Operator | Anish Asthana | | yes |

## Reviews

Reviews on the pull request will suffice for the approval process. At least 2 approvals are required prior to this ADR being merged. The ADR must also remain open for at least one week.
