Skip to content

Move CUDA stuff to an extension #4499

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 57 commits into
base: main
Choose a base branch
from
Open

Conversation

michel2323
Copy link
Collaborator

@michel2323 michel2323 commented May 12, 2025

This PR isolates CUDA into src/arch_cuda.jl. This removes any direct CUDA calls in the remaining Oceananigans code base. That feel can either serve as a template for a new GPU architecture or for a future CUDA extension. @vchuravy

Closes #3481

@glwagner
Copy link
Member

Possibly, we should simply implement a CUDA extension in this PR with appropriate organization of the code and get on with the breaking change!

tl;dr then after this is merged, anybody doing computations on nvidia GPU has to write

using Oceananigans
using CUDA

@glwagner
Copy link
Member

@simone-silvestri curious to hear your thoughts

@simone-silvestri
Copy link
Collaborator

I think it's a good idea. It provides templates to add new architectures and makes the code completely architecture agnostic. the extra using CUDA is a small price to pay.

@navidcy navidcy added the GPU 👾 Where Oceananigans gets its powers from label May 13, 2025
@michel2323 michel2323 force-pushed the ms/ka branch 2 times, most recently from 69eb545 to 59a441d Compare May 16, 2025 16:08
@glwagner
Copy link
Member

@michel2323 let us know when this is ready for prime time

Comment on lines -288 to +287
isnothing(devices) ? device!(node_rank % ndevices()) : device!(devices[node_rank+1])
isnothing(devices) ? device!(child_architecture, node_rank % ndevices(child_architecture)) : device!(child_architecture, devices[node_rank+1])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@michel2323
Copy link
Collaborator Author

michel2323 commented Jun 11, 2025

@glwagner For the failing tests, we have 4 in total

  • oceanangians-distributed: I think it can't find the commit because it's run from a fork.
  • cpu-turbulence-closure-tests: Bus error. No idea man.
  • gpu-multi-region-tests: That's the hard one I have to sit on. I dived into it and it's definitely my changes. However, I don't understand what is different at runtime that triggers this.
  • Documentation something.

@michel2323
Copy link
Collaborator Author

I think the main problem is that I haven't figured out what getdevice actually means across all objects where it is implemented. In particular, there's a bunch of getdevice(somearray). @vchuravy How would that look like with KA? Do the arrays know on which device they are?

@vchuravy
Copy link
Collaborator

Do the arrays know on which device they are?

To my knowledge that's an ill-formed query.

@michel2323
Copy link
Collaborator Author

michel2323 commented Jun 12, 2025

Do the arrays know on which device they are?

To my knowledge that's an ill-formed query.

@simone-silvestri @glwagner How do you want to proceed with these? Can this be rewritten to only use stuff from Architectures?

https://github.com/michel2323/Oceananigans.jl/blob/2e5f75498e8fa7896a91241351c0e2bac9904adc/src/Utils/multi_region_transformation.jl#L54-L70

validate_boundary_condition_architecture(::CuArray, ::GPU, bc, side) = nothing

validate_boundary_condition_architecture(::CuArray, ::CPU, bc, side) =
throw(ArgumentError("$side $bc must use `Array` rather than `CuArray` on CPU architectures!"))

validate_boundary_condition_architecture(::Array, ::GPU, bc, side) =
throw(ArgumentError("$side $bc must use `CuArray` rather than `Array` on GPU architectures!"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this reads like a CUDA-specific message? Is the error message wrongly CUDA-specific here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
throw(ArgumentError("$side $bc must use `CuArray` rather than `Array` on GPU architectures!"))
throw(ArgumentError("$side $bc must use `GPUArray` rather than `Array` on GPU architectures!"))

?

@navidcy
Copy link
Member

navidcy commented Jul 6, 2025

I fixed the docs.

The last error seems to be coming from MultiRegion. The cubed sphere simulation fails at the run!(simulation) when it tries to write output. Seems that the error is coming from an iterator? I wasn't able to figure it out.

Here's an MWE

using Oceananigans
grid = ConformalCubedSphereGrid(CPU(); panel_size = (18, 18, 9), z = (0, 1), radius = 1, horizontal_direction_halo = 6)
model = HydrostaticFreeSurfaceModel(; grid,
                                    momentum_advection = WENOVectorInvariant(order=5),
                                    tracer_advection = WENO(order=5),
                                    free_surface = SplitExplicitFreeSurface(grid; substeps=12),
                                    coriolis = HydrostaticSphericalCoriolis(eltype(grid)),
                                    tracers = :b,
                                    buoyancy = BuoyancyTracer())
simulation = Simulation(model, Δt=60, stop_time=600)

simulation.output_writers[:fields] = JLD2Writer(model, fields(model);
                                                schedule = IterationInterval(2),
                                                filename = "cubed_sphere_output",
                                                verbose = false,
                                                overwrite_existing = true)

run!(simulation)

@navidcy
Copy link
Member

navidcy commented Jul 6, 2025

julia> using Oceananigans
[ Info: Oceananigans will use 12 threads

julia> grid = ConformalCubedSphereGrid(CPU(); panel_size = (18, 18, 9), z = (0, 1), radius = 1, horizontal_direction_halo = 6)
ConformalCubedSphereGrid{Float64, Oceananigans.Grids.FullyConnected, Oceananigans.Grids.FullyConnected, Bounded} partitioned on CPU():
├── grids: 18×18×9 OrthogonalSphericalShellGrid{Float64, Oceananigans.Grids.FullyConnected, Oceananigans.Grids.FullyConnected, Bounded} on CPU with 6×6×6 halo and with precomputed metrics
├── partitioning: CubedSpherePartition with (1 region in each panel)
├── connectivity: CubedSphereConnectivity
└── devices: (CPU(), CPU(), CPU(), CPU(), CPU(), CPU())

julia> model = HydrostaticFreeSurfaceModel(; grid,
                                           momentum_advection = WENOVectorInvariant(order=5),
                                           tracer_advection = WENO(order=5),
                                           free_surface = SplitExplicitFreeSurface(grid; substeps=12),
                                           coriolis = HydrostaticSphericalCoriolis(eltype(grid)),
                                           tracers = :b,
                                           buoyancy = BuoyancyTracer())
HydrostaticFreeSurfaceModel{CPU, MultiRegionGrid}(time = 0 seconds, iteration = 0)
├── grid: 18×18×9 ConformalCubedSphereGrid{Float64, Oceananigans.Grids.FullyConnected, Oceananigans.Grids.FullyConnected, Bounded} on CPU with 6×6×6 halo
├── timestepper: QuasiAdamsBashforth2TimeStepper
├── tracers: b
├── closure: Nothing
├── buoyancy: BuoyancyTracer with ĝ = NegativeZDirection()
├── free surface: SplitExplicitFreeSurface with gravitational acceleration 9.80665 m s⁻²
│   └── substepping: FixedSubstepNumber(8)
├── advection scheme:
│   ├── momentum: MultiRegionObject{NTuple{6, WENOVectorInvariant{3, 3, Float64, Oceananigans.Advection.OnlySelfUpwinding{Centered{2, Float64, Centered{1, Float64, Nothing}}, Oceananigans.Advection.FunctionStencil{typeof(Oceananigans.Advection.divergence_smoothness)}, Oceananigans.Advection.FunctionStencil{typeof(Oceananigans.Advection.divergence_smoothness)}, Oceananigans.Advection.FunctionStencil{typeof(Oceananigans.Advection.u_smoothness)}, Oceananigans.Advection.FunctionStencil{typeof(Oceananigans.Advection.v_smoothness)}}, WENO{3, Float64, Float32, Nothing, WENO{2, Float64, Float32, Nothing, UpwindBiased{1, Float64, Nothing, Centered{1, Float64, Nothing}}, Centered{1, Float64, Nothing}}, Centered{2, Float64, Centered{1, Float64, Nothing}}}, Oceananigans.Advection.VelocityStencil, WENO{3, Float64, Float32, Nothing, WENO{2, Float64, Float32, Nothing, UpwindBiased{1, Float64, Nothing, Centered{1, Float64, Nothing}}, Centered{1, Float64, Nothing}}, Centered{2, Float64, Centered{1, Float64, Nothing}}}, WENO{3, Float64, Float32, Nothing, WENO{2, Float64, Float32, Nothing, UpwindBiased{1, Float64, Nothing, Centered{1, Float64, Nothing}}, Centered{1, Float64, Nothing}}, Centered{2, Float64, Centered{1, Float64, Nothing}}}, WENO{3, Float64, Float32, Nothing, WENO{2, Float64, Float32, Nothing, UpwindBiased{1, Float64, Nothing, Centered{1, Float64, Nothing}}, Centered{1, Float64, Nothing}}, Centered{2, Float64, Centered{1, Float64, Nothing}}}, Oceananigans.Advection.OnlySelfUpwinding{Centered{2, Float64, Centered{1, Float64, Nothing}}, Oceananigans.Advection.FunctionStencil{typeof(Oceananigans.Advection.divergence_smoothness)}, Oceananigans.Advection.FunctionStencil{typeof(Oceananigans.Advection.divergence_smoothness)}, Oceananigans.Advection.FunctionStencil{typeof(Oceananigans.Advection.u_smoothness)}, Oceananigans.Advection.FunctionStencil{typeof(Oceananigans.Advection.v_smoothness)}}}}, NTuple{6, CPU}, KernelAbstractions.CPU}
│   └── b: WENO{3, Float64, Float32}(order=5)
└── coriolis: HydrostaticSphericalCoriolis{Oceananigans.Advection.EnstrophyConserving{Float64}, Float64}

julia> simulation = Simulation(model, Δt=60, stop_time=600)
Simulation of HydrostaticFreeSurfaceModel{CPU, MultiRegionGrid}(time = 0 seconds, iteration = 0)
├── Next time step: 1 minute
├── Elapsed wall time: 0 seconds
├── Wall time per iteration: NaN days
├── Stop time: 10 minutes
├── Stop iteration: Inf
├── Wall time limit: Inf
├── Minimum relative step: 0.0
├── Callbacks: OrderedDict with 4 entries:
│   ├── stop_time_exceeded => 4
│   ├── stop_iteration_exceeded => -
│   ├── wall_time_limit_exceeded => e
│   └── nan_checker => }
├── Output writers: OrderedDict with no entries
└── Diagnostics: OrderedDict with no entries

julia> simulation.output_writers[:fields] = JLD2Writer(model, fields(model);
                                                       schedule = IterationInterval(2),
                                                       filename = "cubed_sphere_output",
                                                       verbose = false,
                                                       overwrite_existing = true)
JLD2Writer scheduled on IterationInterval(2):
├── filepath: cubed_sphere_output.jld2
├── 7 outputs: (u, v, w, b, η, U, V)
├── array type: Array{Float32}
├── including: [:grid, :coriolis, :buoyancy, :closure]
├── file_splitting: NoFileSplitting
└── file size: 1.9 MiB

julia> run!(simulation)
[ Info: Initializing simulation...
ERROR: MethodError: no method matching MultiRegionObject(::NTuple{6, Array{Float32, 3}})

Closest candidates are:
  MultiRegionObject(::KernelAbstractions.Backend, ::Tuple, ::Tuple)
   @ Oceananigans ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl/src/Utils/multi_region_transformation.jl:25
  MultiRegionObject(::KernelAbstractions.Backend, Any...; devices)
   @ Oceananigans ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl/src/Utils/multi_region_transformation.jl:18
  MultiRegionObject(::Oceananigans.Architectures.AbstractArchitecture, ::Tuple, ::Tuple)
   @ Oceananigans ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl/src/Utils/multi_region_transformation.jl:34
  ...

Stacktrace:
  [1] convert_output(mo::MultiRegionObject{…}, writer::JLD2Writer{…})
    @ Oceananigans.MultiRegion ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl/src/MultiRegion/multi_region_output_writers.jl:54
  [2] fetch_and_convert_output(output::Field{…}, model::HydrostaticFreeSurfaceModel{…}, writer::JLD2Writer{…})
    @ Oceananigans.OutputWriters ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl/src/OutputWriters/fetch_output.jl:40
  [3] (::Oceananigans.OutputWriters.var"#36#37"{JLD2Writer{}, HydrostaticFreeSurfaceModel{}})(::Tuple{Symbol, Field{…}})
    @ Oceananigans.OutputWriters ./none:0
  [4] iterate
    @ ./generator.jl:47 [inlined]
  [5] merge(a::@NamedTuple{}, itr::Base.Generator{Base.Iterators.Zip{Tuple{…}}, Oceananigans.OutputWriters.var"#36#37"{JLD2Writer{…}, HydrostaticFreeSurfaceModel{…}}})
    @ Base ./namedtuple.jl:360
  [6] NamedTuple
    @ ./namedtuple.jl:151 [inlined]
  [7] macro expansion
    @ ./timing.jl:395 [inlined]
  [8] write_output!(writer::JLD2Writer{…}, model::HydrostaticFreeSurfaceModel{…})
    @ Oceananigans.OutputWriters ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl/src/OutputWriters/jld2_writer.jl:253
  [9] write_output!(writer::JLD2Writer{…}, sim::Simulation{…})
    @ Oceananigans.Simulations ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl/src/Simulations/simulation.jl:252
 [10] initialize!(sim::Simulation{…})
    @ Oceananigans.Simulations ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl/src/Simulations/run.jl:243
 [11] time_step!(sim::Simulation{…})
    @ Oceananigans.Simulations ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl/src/Simulations/run.jl:136
 [12] run!(sim::Simulation{…}; pickup::Bool)
    @ Oceananigans.Simulations ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl/src/Simulations/run.jl:105
 [13] run!(sim::Simulation{…})
    @ Oceananigans.Simulations ~/Library/CloudStorage/OneDrive-TheUniversityofMelbourne/Documents/Research/Oceananigans.jl/src/Simulations/run.jl:92
 [14] top-level scope
    @ REPL[6]:1
Some type information was truncated. Use `show(err)` to see complete types.
``

@navidcy navidcy self-requested a review July 6, 2025 23:41
Comment on lines +5 to +6
# This should be deprectated. Calls GPU() which is only
# defined when CUDA is loaded and maps to CUDAGPU()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think at the moment only the NCDatasetExt uses this method?
do we really need it? cc @tomchor

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it depends how you defined need lol. This just spits out versioninfo() with the GPU info at bottom. It's definitely good to have that information in NetCDF files after the runs, although it's not essential. That said, it's a small piece of code so I don't know if we'd gain much by removing it.

It's also used in benchmark/src/Benchmarks.jl btw.

@navidcy
Copy link
Member

navidcy commented Jul 6, 2025

I added the backend as first arg in the MultiRegionObject constructor; see

https://github.com/michel2323/Oceananigans.jl/blob/05b1d9927275d7ec74fce2f7a7afab2769bc21d5/src/MultiRegion/multi_region_output_writers.jl#L56

Is this the right thing to do?

Now there is a FieldTimeSeries-related error is further down in the test_multi_region_cubed_sphere.jl...

@navidcy
Copy link
Member

navidcy commented Jul 7, 2025

Providing CPU() to MultiRegionObject constructor seems to do the job...

convert_output(mo::MultiRegionObject, writer) =
    MultiRegionObject(CPU(), Tuple(convert(writer.array_type, obj) for obj in mo.regional_objects))

But is this the right thing to do? Not sure.

@navidcy
Copy link
Member

navidcy commented Jul 7, 2025

All tests on tartarus pass!
I'd like to see the distributed CI pass as well tho...

@navidcy
Copy link
Member

navidcy commented Jul 7, 2025

Distributed CI passes 🎉

@michel2323
Copy link
Collaborator Author

Providing CPU() to MultiRegionObject constructor seems to do the job...

convert_output(mo::MultiRegionObject, writer) =
    MultiRegionObject(CPU(), Tuple(convert(writer.array_type, obj) for obj in mo.regional_objects))

But is this the right thing to do? Not sure.

I think the more general form is

 convert_output(mo::MultiRegionObject, writer) =
    MultiRegionObject(
        architecture_from_type(writer.array_type),
        Tuple(convert(writer.array_type, obj) for obj in mo.regional_objects)
    )

I had to add:

architecture_from_type(type::Type{<:AbstractArray}) = architecture(type())

@glwagner Opinions?

@michel2323
Copy link
Collaborator Author

@navidcy Thank you so much for the review and fixes!

@navidcy
Copy link
Member

navidcy commented Jul 7, 2025

hi @michel2323,
I doubt that the architecture_from_type method did the job -- the tests still fail.

That's because Array type cannot be instantiated like Array() I believe...

What do we want here? We want it to return the architecture that corresponds to the outer type of the writer.array_type, correct? E.g., if this is Array{Float32} we want CPU() and if it's CuArray{Float64} we want GPU(), etc?

@navidcy
Copy link
Member

navidcy commented Jul 7, 2025

hi @michel2323, I doubt that the architecture_from_type method did the job -- the tests still fail.

That's because Array type cannot be instantiated like Array() I believe...

What do we want here? We want it to return the architecture that corresponds to the outer type of the writer.array_type, correct? E.g., if this is Array{Float32} we want CPU() and if it's CuArray{Float64} we want GPU(), etc?

ff957c8 attempts to resolve this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
extensions 🧬 GPU 👾 Where Oceananigans gets its powers from
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Move CUDA into an extension
6 participants