RFC for adding PTX and AMDGPU targets #1641
@TheAustinSeven Thanks for starting a discussion on this topic. I think the "Detailed Design" section is not detailed enough 😄. I have a few questions:

- How is this new target meant to be used? By calling …
- Motivation related: How would one use these targets to build GPU programs? Hypothetical workflow: Use a … The (partial) definition of the new …
- Are there going to be multiple PTX targets to cover different GPU architectures and/or compute capabilities? (This is related to the previous question.)
- How does one access stuff like …
- Is any part of Rust "fundamentally" not translatable to PTX?
- Annoying implementation details: Can I …

I know nothing about AMDGPU so I can't comment on it. I personally would like to see some form of PTX target land in the compiler just to let people play with it. Then, they can tell us what else they need to build useful programs, libraries, etc. FWIW, AFAIK, the precedent is that new targets have not required an RFC to land in the compiler.

cc @rust-lang/tools
You bring up some great points and I will update the RFC to address as many as I can. I think that the new target here does merit an RFC because it is a fundamentally different target from all of the targets that exist now. Real quick, to give you an idea of what I will add: …
In terms of implementation, LLVM provides intrinsics named llvm.cuda.syncthreads and llvm.nvvm.read.ptx.sreg.* (see http://llvm.org/docs/NVPTXUsage.html#target-intrinsics). How to expose them to Rust is a different matter, though.
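For concreteness, here is one plausible way those intrinsics could be surfaced, assuming rustc's existing trick of binding an extern declaration to an LLVM intrinsic via `#[link_name]` also works for the NVPTX backend (untested here; these declarations are a sketch, not a committed design):

```rust
extern "C" {
    // Block-wide barrier, the equivalent of CUDA's __syncthreads().
    #[link_name = "llvm.cuda.syncthreads"]
    fn syncthreads();
    // Special-register read: index of the current thread within its block.
    #[link_name = "llvm.nvvm.read.ptx.sreg.tid.x"]
    fn read_tid_x() -> i32;
}
```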
Big 👍 for getting Rust onto GPUs! This is something I've also wanted several times already. However, I have doubts whether we can just slap a new target on it and be done with it. For starters, AFAIK the backends have very few users besides the respective device manufacturers (I only know of Google using PTX and Mesa using AMDGPU), and they all compile the same C-like language as the manufacturers do. So I would be surprised if we didn't run into stupid, unnecessary, but very real limitations in those backends as soon as we throw the LLVM IR we currently generate at them. Furthermore, GPUs have historically been pretty restrictive targets (e.g., no indirect calls = no function pointers and no trait objects). While this has improved (PTX 2.1 supports indirect calls), I bet that many people who want to write some Rust for some GPU will run into a limitation of this kind. The RFC wisely presents itself as an incremental step on a long road, but I think we should think sooner rather than later about how to model the various subsets of functionality. Fortunately, this question is already being discussed: https://internals.rust-lang.org/t/pre-rfc-a-vision-for-platform-architecture-configuration-specific-apis/3502
Also, I just skimmed the GCN 3 ISA and the SPIR-V spec, and I can't find any evidence that indirect calls are supported (and I seem to recall the same was true of OpenCL 2.0 last time I checked). I guess it's possible that Nvidia has started supporting indirect calls but AMD hasn't. This would make the whole "account for device limitations" thing more urgent, since you probably can't even compile libcore without generating indirect calls, or at least vtables containing function pointers (though I suppose if you never use those code paths and LTO removes them, it might work).
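To make the restriction concrete, here is a toy example (not from the RFC) of ordinary Rust that lowers to an indirect call, next to the generic version that stays statically dispatched:

```rust
trait Op {
    fn apply(&self, x: f32) -> f32;
}

struct Scale(f32);
impl Op for Scale {
    fn apply(&self, x: f32) -> f32 {
        x * self.0
    }
}

// Trait-object dispatch: loads a function pointer from the vtable and
// calls it indirectly, exactly what these ISAs may not support.
fn run_dyn(op: &Op, x: f32) -> f32 {
    op.apply(x)
}

// Generic dispatch: monomorphized into a direct call; fine everywhere.
fn run_static<T: Op>(op: &T, x: f32) -> f32 {
    op.apply(x)
}
```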
What about Intel GPUs (two thirds of the GPU market share)? What about mobile GPUs, where the market is divided among three or four different vendors? Also, supporting GPUs means: …
At this point there's so much special casing that in my opinion the only sane conclusion is that the Rust language isn't made for running on a GPU. A much saner alternative, however, would be to write a plugin that can parse and compile an individual Rust function (or multiple individual functions). Yet another thing that is blocked on plugins being stable.
While this may not enable people to write software like Nvidia's VRWorks or other graphics libraries, it is hard to argue that there is no use in adding these GPU targets. GPUs are currently used in many fields that don't involve graphics (machine learning, simulations, other HPC). I don't think this RFC should die on the idea that if it can't run on every GPU ever produced, then it isn't worth anything at all. Adding the AMDGPU and PTX targets allows people (like myself) who are involved in HPC to write Rust instead of C++ variants. As for the plugin, this would be the best first step in that direction. I would love to add a SPIR-V backend; however, there is not currently a SPIR-V backend in LLVM, so that would be a Herculean task suited to a team of people intimately familiar with the SPIR-V language. As I said in my RFC, I attempted to adapt the LLVM-SPIRV Bi-directional Translator, but it is so tightly intertwined with the OpenCL C architecture that it was impossible to run Rust through it. I understand the concerns that you bring up, but I think we are thinking of different directions that people would take this in. I am thinking of HPC and you are thinking of graphics.
Just because I mentioned textures doesn't mean I'm thinking graphics.
Sorry for the misunderstanding. I think that while the PTX and AMDGPU targets may start off without much of the support that CUDA or OpenCL C currently have, over time these could be added with language patches and external libraries. I honestly think the place for most of this should lie outside the core language, but the compiler itself should support compiling to these targets.
This looks fine to me. I don't think it necessarily even requires an RFC since we add experimental architecture support all the time - as long as it doesn't have a significant maintenance impact. Since it's not obvious what form Rust-on-GPU will take there's going to be a lot of experimenting, and hopefully out of tree. The way I would expect this to proceed is for somebody to do the initial work in a fork to get a feel for what modifications are required, then upstream the basic target definitions and whatever compiler changes are necessary to get code gen working, then continue experimenting with the (presumably weird) library ecosystem out of tree.
Thank you @tomaka for spelling out the (worst-case) implications of a GPU target. I agree that more restricted (older/smaller) GPUs are very interesting and useful targets. I don't think any of these issues need to be dealbreakers; after all, Rust is also quite viable for CPU-based embedded platforms with peculiar restrictions. Yes, you can't use all the fancy things that "normal Rust" has, but you still have a very nice language and can use any library that restricts itself to the capabilities of the platform. Or, put differently: nobody in their right mind would try to compile a complete Rust application with I/O, dynamic memory management, etc. to PTX and run it on a Titan. GPU backends would be used to implement computations in Rust that would otherwise be written in any of the other GPU-targeting languages. There's still a whole host-side application submitting the kernel to the device. Whether the crate boundary is the right unit of compilation for the kernels is relatively unimportant at this stage. All the headaches we're discussing here remain when you compile individual functions; the only difference is whether the end user puts the GPU-side code into the same file or in a different directory. To come back to this RFC specifically, I think first targeting feature-rich GPGPU devices is a good start, and that weaker devices should probably be added later as different targets. These targets are easier to hammer out since they require less design work (how to deal with all the restrictions) and less implementation work (there are ready-made LLVM backends), and they allow people who only care about "compute capability X.Y" to use recursion and function pointers. Finally, here's my take on the restrictions and how to deal with them (a small code sketch follows the list):
Well, use …
This is already an issue for operating systems and some embedded programs written in Rust; how do they handle it? Anyway, depending on what instruction set you target, there may be ways to kill the thread/warp/wavefront/workgroup (I know this is true in GCN 3 and all versions of PTX). If nothing like that exists, the target is probably so primitive that recursion and indirect calls aren't allowed, meaning you can just inline the program into a single function with static memory allocation and terminate via an early return. It's not ideal, but we'll make do.
Yeah, not having such a fundamental language feature feels weird. Then again, people have long discussed the possibility of targets without floating point support, and disabling the primitive float types on those targets is a credible possibility. So I think that a target where you can't create trait objects is perfectly fine, and one can write a lot of very useful Rust code without any trait objects.
Similar issue to trait objects, see above. (And note that as with trait objects, everything using generics is still statically dispatched.)
Well, duh. See above for disabling language features.
Would they? The current model is that the compiler provides intrinsics and some attributes (i.e., those are available even without any crates, though you have to declare intrinsics to use them), and crates layered on top provide friendlier APIs.
This certainly requires design work, but isn't particularly hard. Some intrinsics and attributes and perhaps lang items, with a little wrapper library for the intrinsics, would be my guess.
I am not quite sure what this entails. Don't many natural targets already provide this in some form, especially if you compile to a reasonably high-level language like GLSL or OpenCL C or even SPIR-V? (Man, I should really go and properly learn SPIR-V instead of spot-checking the spec whenever I need to know something.) Regardless, this is just a little support library (right?) if the underlying primitives are exposed with intrinsics etc.; perhaps it has to be written, but it's just code.
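Pulling the strands of this list together (the sketch referenced above), here is a hedged picture of what the "little wrapper library" plus a kernel might look like. The intrinsic names come from LLVM's NVPTX documentation; everything else, including how the entry point would be marked and whether `#[link_name]` binding works on this backend, is invented for illustration:

```rust
#![no_std]

extern "C" {
    #[link_name = "llvm.nvvm.read.ptx.sreg.tid.x"]
    fn tid_x() -> i32; // threadIdx.x
    #[link_name = "llvm.nvvm.read.ptx.sreg.ctaid.x"]
    fn ctaid_x() -> i32; // blockIdx.x
    #[link_name = "llvm.nvvm.read.ptx.sreg.ntid.x"]
    fn ntid_x() -> i32; // blockDim.x
}

/// Global thread index: blockIdx.x * blockDim.x + threadIdx.x.
fn global_id() -> isize {
    unsafe { (ctaid_x() * ntid_x() + tid_x()) as isize }
}

/// Element-wise c[i] = a[i] + b[i], one element per thread.
#[no_mangle]
pub unsafe fn add(a: *const f32, b: *const f32, c: *mut f32, n: isize) {
    let i = global_id();
    if i < n {
        *c.offset(i) = *a.offset(i) + *b.offset(i);
    }
}
```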
But why? From my experience, the biggest safety issue when writing a GPU program in general is the interface between the CPU and GPU. In other words, the CPU and GPU have to agree about how the data is aligned in memory. This RFC doesn't tackle that at all. On the other hand, a plugin that is part of a library, and that doesn't require a new compilation target, would.
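To illustrate the layout-agreement problem @tomaka points at, a minimal sketch: both sides must see the same field order, sizes, and alignment, and `#[repr(C)]` opts out of Rust's unspecified default layout so a CUDA/OpenCL kernel reading the same buffer agrees with the host. (The struct itself is invented for illustration.)

```rust
#[repr(C)]
pub struct Particle {
    pub pos: [f32; 3], // offset 0, 12 bytes
    pub mass: f32,     // offset 12; no implicit padding so far
    pub vel: [f32; 3], // offset 16
    pub pad: f32,      // explicit tail padding keeps the size a round 32 bytes
}
```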
In order to build a plugin that would do this, you would need a Rust compiler capable of compiling to these targets anyway. You can't have a plugin that allows Rust on the GPU without a new compilation target. I would just like to clarify that this RFC is intended to add the ability to compile to GPU targets, but the expected use case would not extend beyond compiling several individual functions to PTX or AMDGPU. At that point there could be a library that adds everything extra you might want in such a target, but due to the limitations that such a target imposes, I don't think it is very realistic to think that such a task can be accomplished effectively without adding the target to the language.
I think that there are several reasons to use Rust. The first is that it is much easier to write both sides of a GPU-CPU program in the same language. There are plenty of reasons to want a Rust GPU backend, and of course there will be limitations, just as there are with OpenCL C and CUDA, but I don't think any of those limitations are deal-breakers.
I like Rust the language. Type inference, traits, macros, unboxed closures, and generics are all useful and require zero runtime support (and I must admit, I just plain find Rust more aesthetic, which can make the difference between me having enough motivation for a hobby project or not). I also like its standard library, but the core language plus selected bits and pieces from `std` already covers a lot.
If it takes in Rust code (in whatever chunk size, crate or function) and spits out code that runs on GPUs, then it's by definition a compiler targeting GPUs. Again, it doesn't really matter for the compiler where the Rust code comes from. Whether you collect the code to be compiled from a whole crate or from individual functions makes little difference.
The advantage of a plugin compared to a new compilation target is that a plugin could operate on a domain-specific and well-defined language that looks like Rust but is not exactly Rust, where things like panics and virtual function calls trigger compilation errors and where the additional required capabilities (like textures) would exist. Just like the people who designed CUDA and OpenCL, who chose not to use the C language but a domain-specific, well-defined language that looks like C but is not exactly C.
In addition to this, an important point is that PTX, SPIR, and SPIR-V are only intermediate representations. The consequence is that it's possible (and most importantly, sane) to translate the MIR output of the compiler directly to PTX/SPIR/SPIR-V. The only thing a plugin would need to do is run the Rust parser and trans, check that the code doesn't contain anything invalid, and then translate to PTX/SPIR/SPIR-V instead of translating to LLVM IR. This is exactly the kind of thing that a plugin could perform.
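A toy model of that check-then-translate pipeline, to make the idea concrete. Every type here is invented for illustration; a real plugin would consume rustc's actual MIR rather than this stand-in:

```rust
// Stand-in for a MIR statement; just enough shape for the example.
enum Stmt {
    Assign,
    DirectCall,
    IndirectCall, // function pointers / trait objects
    Panic,
}

// Reject constructs the device can't express; otherwise emit IR text.
fn translate(body: &[Stmt]) -> Result<String, &'static str> {
    let mut out = String::new();
    for stmt in body {
        match *stmt {
            Stmt::IndirectCall => return Err("target does not support indirect calls"),
            Stmt::Panic => return Err("target has no panic mechanism"),
            Stmt::Assign | Stmt::DirectCall => out.push_str("; ...emit PTX/SPIR-V here...\n"),
        }
    }
    Ok(out)
}
```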
Good point. If it does turn out that significant language features would have to be added, one should rather design a dedicated language. This seems far from certain to me, though. A laundry list of platform-specific intrinsics and attributes (perhaps with a convenience wrapper, analogous to the little wrapper library suggested above) might well suffice. I think part of the reason why OpenCL C and CUDA are separate languages is that Khronos/Nvidia don't have the power to add things to the C/C++ standards, so they create something that could either be called "a language inspired by C/C++" or "normal C/C++ with some parts ripped out and lots of compiler extensions". The latter perspective is particularly strong for CUDA, which is basically a beefed-up clang with some compiler extensions for both host- and client-side code. But the C and C++ standards don't even support common restrictions and extensions needed by CPU programs (e.g., embedded programs, operating systems), such as … with the consequence that compilers add some or all of these restrictions and extensions without standard support (occasionally a de facto standard emerges, but not always). Rust, on the other hand, already supports these restrictions and extensions, so why can't it also directly support another use case that was previously always shoved into third-party dialects? So much for the philosophical reasons why I think GPUs make fine compilation targets. I also have some technical points to nit-pick:
This is already the case when you compile a …

[0]: Killing the thread would be enough for soundness, but killing more would probably be fine and perhaps even helpful.
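Since the footnote turns on what "panic" can even mean on such a target, here is a hedged sketch (assuming the `panic_fmt` lang item as it existed at the time, and that `core::intrinsics::abort` lowers to a trap-style instruction such as PTX's `trap`) of panics bottoming out in thread death:

```rust
#![no_std]
#![feature(lang_items, core_intrinsics)]

use core::fmt;
use core::intrinsics;

// All panics funnel through this lang item on a no_std target; aborting
// here is the "kill the thread" behavior discussed in the footnote.
#[lang = "panic_fmt"]
extern "C" fn panic_fmt(_msg: fmt::Arguments, _file: &'static str, _line: u32) -> ! {
    unsafe { intrinsics::abort() }
}
```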
I don't think this is accurate. Yes, the code is also optimized by the driver, but in every toolchain I'm aware of, the PTX/SPIR-V/etc. is the output of a middle-end that performs extensive optimizations like LLVM does (in fact, all the toolchains I know of literally use LLVM; that's why the PTX and AMDGPU backends exist). Cutting out these optimizations is likely to degrade performance, because the optimizer in the driver is almost certainly written to take pretty good IR and turn it into slightly better machine code.
I should add that I'm not really opposed to the CUDA-style model; it may very well be the best interface. I just believe it's best to think of the contents of a … Neither do I think we need to go through LLVM. MIR optimizations are coming, and targeting OpenCL 1.0 or GLSL could be an option for reaching older devices. For experimenting with PTX, however, LLVM is quicker to get started with and may also be the best option in the long run (e.g., Google uses it).
See rust-lang/rust#34195 for a minimal implementation that generates PTX from Rust code. A few design questions arose while testing; see the PR for details.
That looks good! I think that you nailed almost exactly what I was talking about on the PTX side of this RFC. I have been thinking quite a bit about this lately, and the more I think about it, the more I think that @brson was right. We should do a significant amount of experimentation and testing (especially testing) in a fork and upstream some of the changes somewhere down the line. @japaric, if you are interested in working together to flesh this out, drop me an email at [email protected]. I don't know if we should close this RFC, but I do think that the changes should be done in the way that @brson suggested.
Please drop a link here once you have something to show. I won't be much help with testing for lack of access to Nvidia chips, but at the very least I'd like to follow the design work (I imagine there's significant overlap with other GPU targets, to which I could and would like to contribute).
I'm going to go ahead and close this. I think the immediate way forward is clear and doesn't require an approved RFC. Go ahead and start proving it out of tree and even upstream the basic definitions. When we get to a point where GPU support is going to impact the language definition and there's a clear design for how, then let's do another RFC.
PTX and AMDGPU targets can be added fairly easily.