
Ray cluster headgroup resources #190


Closed
tedhtchang opened this issue Jul 1, 2023 · 4 comments · Fixed by #354

@tedhtchang
Member

Should we allow configurable [headgroup resources]? (For development on a laptop with 8 CPU x 16 GB RAM.) The current values are:

limits:
  cpu: 2
  memory: "8G"
  nvidia.com/gpu: 0
requests:
  cpu: 2
  memory: "8G"
  nvidia.com/gpu: 0

The resource allocation with only the codeflare-stack (without any ODH components) was:

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                2445m (31%)   1100m (14%)
  memory             8442Mi (55%)  768Mi (5%)

Creating a cluster on OpenShift Local on an 8 CPU x 16 GB workstation fails with insufficient resources:

from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration  # import path assumed from the SDK docs of that release

cluster = Cluster(ClusterConfiguration(namespace="default", name="torch", min_worker=1, max_worker=1, min_cpus=1, max_cpus=1, min_memory=1, max_memory=1, gpu=0, instascale=False))

I0630 22:41:52.423038       1 queuejob_controller_ex.go:1009] [getAggAvaiResPri] cpu 5365.00, memory 7098806272.00, GPU 0 available resources to schedule
I0630 22:41:52.423066       1 queuejob_controller_ex.go:1260] [ScheduleNext] XQJ torch with resources cpu 3000.00, memory 9000000000.00, GPU 0 to be scheduled on aggregated idle resources cpu 5365.00, memory 7098806272.00, GPU 0
I0630 22:41:52.423204       1 queuejob_controller_ex.go:1336] [ScheduleNext] HOL Blocking by torch for 163.595µs activeQ=false Unsched=true &qj=0xc0007df900 Version=61385 Status={Pending:0 Running:0 Succeeded:0 Failed:0 MinAvailable:0 CanRun:false IsDispatched:false State:Pending Message: SystemPriority:9 QueueJobState:HeadOfLine ControllerFirstTimestamp:2023-06-30 22:40:12.010146 +0000 UTC ControllerFirstDispatchTimestamp:0001-01-01 00:00:00 +0000 UTC FilterIgnore:true Sender:before ScheduleNext - setHOL Local:false Conditions:[{Type:Init Status:True LastUpdateMicroTime:2023-06-30 22:40:12.010149 +0000 UTC LastTransitionMicroTime:2023-06-30 22:40:12.01015 +0000 UTC Reason: Message:} {Type:Queueing Status:True LastUpdateMicroTime:2023-06-30 22:40:12.010603 +0000 UTC LastTransitionMicroTime:2023-06-30 22:40:12.010605 +0000 UTC Reason:AwaitingHeadOfLine Message:} {Type:HeadOfLine Status:True LastUpdateMicroTime:2023-06-30 22:40:12.082065 +0000 UTC LastTransitionMicroTime:2023-06-30 22:40:12.082067 +0000 UTC Reason:FrontOfQueue. Message:} {Type:Backoff Status:True LastUpdateMicroTime:2023-06-30 22:40:32.322476 +0000 UTC LastTransitionMicroTime:2023-06-30 22:40:32.322478 +0000 UTC Reason:AppWrapperNotRunnable. Message:Insufficient resources to dispatch AppWrapper.}] PendingPodConditions:[]}
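
Reading the log above: MCAD reports 5365 millicores and about 7.1 GB of memory schedulable, while the AppWrapper asks for 3000 millicores and 9 GB (the hard-coded 2 CPU / 8G head plus one 1 CPU / 1G worker), so the memory request alone cannot be met. A quick sanity check of that arithmetic:

# Figures taken from the MCAD log above; the head numbers are the hard-coded
# head group shown earlier, the worker numbers come from the ClusterConfiguration.
available_cpu_m = 5365                         # millicores reported schedulable
available_mem = 7_098_806_272                  # bytes reported schedulable

requested_cpu_m = 2000 + 1000                  # 2 CPU head + 1 CPU worker
requested_mem = 8_000_000_000 + 1_000_000_000  # 8G head + 1G worker

print("cpu fits:   ", requested_cpu_m <= available_cpu_m)   # True
print("memory fits:", requested_mem <= available_mem)       # False -> AppWrapperNotRunnable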

My OpenShift Local config:

crc config view
- consent-telemetry                     : yes
- cpus                                  : 8
- disk-size                             : 80
- memory                                : 16000
- network-mode                          : user
- pull-secret-file                      : /home/tedchang/secret.json
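
Until the head group is configurable in the SDK, one possible stopgap is to patch the AppWrapper YAML the SDK generates before it is applied. This is only a sketch under assumptions not stated in this issue: that the generated file sits at ./torch.yaml (named after the cluster) and that the head container resources live under a KubeRay headGroupSpec; it searches for that key rather than hard-coding the full AppWrapper path.

import yaml

# Smaller head resources for a laptop-sized install (illustrative values).
SMALL_HEAD = {"cpu": 1, "memory": "2G", "nvidia.com/gpu": 0}

def patch_head_resources(node):
    """Recursively find headGroupSpec sections and shrink their container resources."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "headGroupSpec" and isinstance(value, dict):
                containers = value.get("template", {}).get("spec", {}).get("containers", [])
                for container in containers:
                    container["resources"] = {
                        "requests": dict(SMALL_HEAD),
                        "limits": dict(SMALL_HEAD),
                    }
            else:
                patch_head_resources(value)
    elif isinstance(node, list):
        for item in node:
            patch_head_resources(item)

with open("torch.yaml") as f:          # file name assumed; the SDK names it after the cluster
    appwrapper = yaml.safe_load(f)

patch_head_resources(appwrapper)

with open("torch.yaml", "w") as f:
    yaml.safe_dump(appwrapper, f)

The patched file can then be applied with oc apply -f torch.yaml.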
@MichaelClifford
Collaborator

Yes, we should definitely make the head group configurable.

@astefanutti
Contributor

astefanutti commented Aug 24, 2023

It's blocking #292. Let's add it to the backlog.

@roytman
Contributor

roytman commented Sep 23, 2023

I'm joining this issue, but with the opposite use case: we run a big Ray cluster (216 workers), and it requires a head node with at least 9 CPU cores and 64G of memory.

@MichaelClifford MichaelClifford self-assigned this Sep 23, 2023
@MichaelClifford MichaelClifford mentioned this issue Sep 24, 2023
@dimakis dimakis moved this from Todo to In Progress in Project CodeFlare Sprint Board Sep 25, 2023
@github-project-automation github-project-automation bot moved this from In Progress to Done in Project CodeFlare Sprint Board Sep 25, 2023
@MichaelClifford
Collaborator

@roytman #354 should address this issue for you.
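
For reference, a usage sketch of the head group options that #354 proposes on ClusterConfiguration. The parameter names head_cpus, head_memory, and head_gpus are assumed from that PR; check the released SDK for the exact names and units (memory is taken to be in G here):

from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration

# Laptop-sized head for OpenShift Local (values illustrative).
laptop = Cluster(ClusterConfiguration(
    name="torch",
    namespace="default",
    head_cpus=1,       # assumed parameter names from #354
    head_memory=2,
    head_gpus=0,
    min_worker=1, max_worker=1,
    min_cpus=1, max_cpus=1,
    min_memory=1, max_memory=1,
    gpu=0,
    instascale=False,
))

# Larger head for a big cluster like the 216-worker case above.
big = Cluster(ClusterConfiguration(
    name="big-ray",
    namespace="default",
    head_cpus=9,
    head_memory=64,
    head_gpus=0,
    min_worker=216, max_worker=216,
))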
