
Ray cluster headgroup resources #190


Closed
tedhtchang opened this issue Jul 1, 2023 · 4 comments · Fixed by #354

@tedhtchang
Member

Should we allow configurable [headgroup resources]? (For development on a laptop with 8 CPU x 16 GB RAM.) The current values are:

limits:
  cpu: 2
  memory: "8G"
  nvidia.com/gpu: 0
requests:
  cpu: 2
  memory: "8G"
  nvidia.com/gpu: 0

The resource allocation with only the codeflare-stack (without any ODH components) was:

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests      Limits
  --------           --------      ------
  cpu                2445m (31%)   1100m (14%)
  memory             8442Mi (55%)  768Mi (5%)

Creating a cluster on OpenShift Local on an 8 CPU x 16 GB workstation fails with insufficient resources:

from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration  # import path assumed from the SDK docs of that release

cluster = Cluster(ClusterConfiguration(namespace="default", name="torch", min_worker=1, max_worker=1, min_cpus=1, max_cpus=1, min_memory=1, max_memory=1, gpu=0, instascale=False))

I0630 22:41:52.423038       1 queuejob_controller_ex.go:1009] [getAggAvaiResPri] cpu 5365.00, memory 7098806272.00, GPU 0 available resources to schedule
I0630 22:41:52.423066       1 queuejob_controller_ex.go:1260] [ScheduleNext] XQJ torch with resources cpu 3000.00, memory 9000000000.00, GPU 0 to be scheduled on aggregated idle resources cpu 5365.00, memory 7098806272.00, GPU 0
I0630 22:41:52.423204       1 queuejob_controller_ex.go:1336] [ScheduleNext] HOL Blocking by torch for 163.595µs activeQ=false Unsched=true &qj=0xc0007df900 Version=61385 Status={Pending:0 Running:0 Succeeded:0 Failed:0 MinAvailable:0 CanRun:false IsDispatched:false State:Pending Message: SystemPriority:9 QueueJobState:HeadOfLine ControllerFirstTimestamp:2023-06-30 22:40:12.010146 +0000 UTC ControllerFirstDispatchTimestamp:0001-01-01 00:00:00 +0000 UTC FilterIgnore:true Sender:before ScheduleNext - setHOL Local:false Conditions:[{Type:Init Status:True LastUpdateMicroTime:2023-06-30 22:40:12.010149 +0000 UTC LastTransitionMicroTime:2023-06-30 22:40:12.01015 +0000 UTC Reason: Message:} {Type:Queueing Status:True LastUpdateMicroTime:2023-06-30 22:40:12.010603 +0000 UTC LastTransitionMicroTime:2023-06-30 22:40:12.010605 +0000 UTC Reason:AwaitingHeadOfLine Message:} {Type:HeadOfLine Status:True LastUpdateMicroTime:2023-06-30 22:40:12.082065 +0000 UTC LastTransitionMicroTime:2023-06-30 22:40:12.082067 +0000 UTC Reason:FrontOfQueue. Message:} {Type:Backoff Status:True LastUpdateMicroTime:2023-06-30 22:40:32.322476 +0000 UTC LastTransitionMicroTime:2023-06-30 22:40:32.322478 +0000 UTC Reason:AppWrapperNotRunnable. Message:Insufficient resources to dispatch AppWrapper.}] PendingPodConditions:[]}
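
Reading the log above: MCAD reports 5365 millicores and about 7.1 GB of memory schedulable, while the AppWrapper asks for 3000 millicores and 9 GB (the hard-coded 2 CPU / 8G head plus one 1 CPU / 1G worker), so the memory request alone cannot be met. A quick sanity check of that arithmetic:

# Figures taken from the MCAD log above; the head numbers are the hard-coded
# head group shown earlier, the worker numbers come from the ClusterConfiguration.
available_cpu_m = 5365                         # millicores reported schedulable
available_mem = 7_098_806_272                  # bytes reported schedulable

requested_cpu_m = 2000 + 1000                  # 2 CPU head + 1 CPU worker
requested_mem = 8_000_000_000 + 1_000_000_000  # 8G head + 1G worker

print("cpu fits:   ", requested_cpu_m <= available_cpu_m)   # True
print("memory fits:", requested_mem <= available_mem)       # False -> AppWrapperNotRunnable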

My OpenShift Local config:

crc config view
- consent-telemetry                     : yes
- cpus                                  : 8
- disk-size                             : 80
- memory                                : 16000
- network-mode                          : user
- pull-secret-file                      : /home/tedchang/secret.json
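
Until the head group is configurable in the SDK, one possible stopgap is to patch the AppWrapper YAML the SDK generates before it is applied. This is only a sketch under assumptions not stated in this issue: that the generated file sits at ./torch.yaml (named after the cluster) and that the head container resources live under a KubeRay headGroupSpec; it searches for that key rather than hard-coding the full AppWrapper path.

import yaml

# Smaller head resources for a laptop-sized install (illustrative values).
SMALL_HEAD = {"cpu": 1, "memory": "2G", "nvidia.com/gpu": 0}

def patch_head_resources(node):
    """Recursively find headGroupSpec sections and shrink their container resources."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "headGroupSpec" and isinstance(value, dict):
                containers = value.get("template", {}).get("spec", {}).get("containers", [])
                for container in containers:
                    container["resources"] = {
                        "requests": dict(SMALL_HEAD),
                        "limits": dict(SMALL_HEAD),
                    }
            else:
                patch_head_resources(value)
    elif isinstance(node, list):
        for item in node:
            patch_head_resources(item)

with open("torch.yaml") as f:          # file name assumed; the SDK names it after the cluster
    appwrapper = yaml.safe_load(f)

patch_head_resources(appwrapper)

with open("torch.yaml", "w") as f:
    yaml.safe_dump(appwrapper, f)

The patched file can then be applied with oc apply -f torch.yaml.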
@MichaelClifford
Collaborator

Yes, we should definitely make the head group configurable.

@astefanutti
Contributor

astefanutti commented Aug 24, 2023

It's blocking #292. Let's add it to the backlog.

@roytman
Contributor

roytman commented Sep 23, 2023

I'm joining this issue, but with the opposite use case: we run a big Ray cluster (216 workers), and it requires a head node with at least 9 CPU cores and 64G of memory.

@MichaelClifford MichaelClifford self-assigned this Sep 23, 2023
@MichaelClifford MichaelClifford mentioned this issue Sep 24, 2023
@dimakis dimakis moved this from Todo to In Progress in Project CodeFlare Sprint Board Sep 25, 2023
@github-project-automation github-project-automation bot moved this from In Progress to Done in Project CodeFlare Sprint Board Sep 25, 2023
@MichaelClifford
Collaborator

@roytman #354 should address this issue for you.
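
For reference, a usage sketch of the head group options that #354 proposes on ClusterConfiguration. The parameter names head_cpus, head_memory, and head_gpus are assumed from that PR; check the released SDK for the exact names and units (memory is taken to be in G here):

from codeflare_sdk.cluster.cluster import Cluster, ClusterConfiguration

# Laptop-sized head for OpenShift Local (values illustrative).
laptop = Cluster(ClusterConfiguration(
    name="torch",
    namespace="default",
    head_cpus=1,       # assumed parameter names from #354
    head_memory=2,
    head_gpus=0,
    min_worker=1, max_worker=1,
    min_cpus=1, max_cpus=1,
    min_memory=1, max_memory=1,
    gpu=0,
    instascale=False,
))

# Larger head for a big cluster like the 216-worker case above.
big = Cluster(ClusterConfiguration(
    name="big-ray",
    namespace="default",
    head_cpus=9,
    head_memory=64,
    head_gpus=0,
    min_worker=216, max_worker=216,
))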
