-
Notifications
You must be signed in to change notification settings - Fork 2
API redesign with async support #18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Experiments with Rust Futures Implemented derive for RustToCudaAsync Implemented async kernel launch Fixed RustToCudaAsync derive LaunchPackage with non-mut Stream Moved stream to be an explicit kernel argument Updated ExchangeWrapperOn[Device|Host]Async::move_to_stream Upgraded to fixed RustaCuda Added scratch-space methods for uni-directional CudaExchangeItem Added unsafe-aliasing API to SplitSlideOverCudaThreads[Const|Dynamic]Stride Extended the CudaExchangeItem API with scratch and uMaybeUninit Rename SplitSliceOverCudaThreads[Const|Dynamic]Strude::alias_[mut_]unchecked Implemented #[cuda(crate)] and #[kernel(crate)] attributes Added simple thread-block shared memory support Fixed device utils doc tests Convert cuda thread-block-shared memory address to generic First steps towards better shared memory, including dynamic Revert derive changes + R2C-based approach start Some progress on shared slices Backup of progress on compile-time PTX checking Clean up the PTX JIT implementation Add convenience functions for ThreadBlockShared arrays Improve and fix CI Remove broken ThreadBlockShared RustToCuda impl Refactor kernel trait generation to push more safety constraints to the kernel definition Fixed SomeCudaAlloc import Added error handling to the compile-time PTX checking Add PTX lint parsing, no actual support yet Added lint checking support to monomorphised kernel impls Improve kernel checking + added cubin dump lint Fix kernel macro config parsing Explicitly fitting Device[Const|Mut]Ref into device registers Switched one std:: to core:: Remove register-sized CUDA kernel args check, unnecessary since rust-lang/rust#94703 Simplified the kernel parameter layout extraction from PTX Fix up rebase issues Install CUDA in all CI steps Use CStr literals Simplify and document the safety traits Fix move_to_cuda bound Fix clippy for 1.76 Cleaned up the rust-cuda device macros with better print The implementation still uses String for dynamic formatting, which currently pulls in loads of formatting and panic machinery. While a custom String type that pre-allocated the exact format String length can avoid some of that, the formatting machinery even for e.g. usize is still large. If `format_args!` is ever optimised for better inlining, the more verbose and lower-level implementation could be reconsidered. Switch to using more vprintf in embedded CUDA kernel Make print example fully executable Clean up the print example ptr_from_ref is stable from 1.76 Exit on CUDA panic instead of abort to allow the host to handle the error Backup of early progress for switching from kernel traits to functions More work into kernel functions instead of traits Eliminate almost all ArgsTrait usages Some refactoring of the async kernel func type + wrap code Early sketch of extracting type wrapping from macro into types and traits Early work towards using trait for kernel type wrap, ptx jit workaround missing Lift complete CPU kernel wrapper from proc macro into public functions Add async launch helper Further cleanup of the new kernel param API Start cleaning up the public API Allow passing ThreadBlockShared to kernels again Remove unsound mutable lending to CUDA for now Allow passing ThreadBlockSharedSlice to kernel for dynamic shared memory Begin refactoring the public API with device feature Refactoring to prepare for better module structure Extract kernel module just for parameters Add RustToCuda impls for &T, &mut T, &[T], and &mut [T] where T: RustToCuda Large restructuring of the module layout for rust-cuda Split rust-cuda-kernel off from rust-cuda-derive Update codecov action to handle rust-cuda-kernel Fix clippy lint Far too much time spent getting rid of DeviceCopy More refactoring and auditing kernel param bounds First exploration towards a stricter async CUDA API More experiments with async API Further API experimentation Further async API experimentation Further async API design work Add RustToCudaAsync impls for &T and &[T], but not &mut T or &mut [T] Add back mostly unchanged exchange wrapper + buffer with RustToCudaAsync impls Add back mostly unchanged anti-aliasing types with RustToCudaAsync impls Progress on replacing ...Async with Async<...> Seal more implementation details Further small API improvements Add AsyncProj helper API struct for async projections Disable async derive in examples for now Implement RustToCudaAsync derive impls Further async API improvements to add drop behaviour First sketch of the safety constraints of a new NoSafeAliasing trait First steps towards reintroducing LendToCudaMut Fix no-std Box import for LendRustToCuda derive Re-add RustToCuda implementation for Final Remove redundant RustToCudaAsyncProxy More progress on less 'static bounds on kernel params Further investigation of less 'static bounds Remove 'static bounds from LendToCuda ref kernel params Make CudaExchangeBuffer Sync Make CudaExchangeBuffer Sync v2 Add AsyncProj proj_ref and proj_mut convenience methods Add RustToCudaWithPortableBitCloneSemantics adapter Fix invalid const fn bounds Add Deref[Mut] to the adapters Fix pointer type inference error Try removing __rust_cuda_ffi_safe_assert module Ensure async launch mutable borrow safety with barriers on use and stream move Fix uniqueness guarantee for Stream using branded types Try without ref proj Try add extract ref Fix doc link clean up kernel signature check Some cleanup before merging Fix some clippy lints, add FIXMEs for others Add docs for rust-cuda-derive Small refactoring + added docs for rust-cuda-kernel Bump MSRV to 1.77-nightly Try trait-based kernel signature check Try naming host kernel layout const Try match against byte literal for faster comparison Try with memcmp intrinsic Try out experimental const-type-layout with compression Try check Try check again
Codecov ReportAttention: Patch coverage is
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## main #18 +/- ##
==========================================
- Coverage 58.39% 0.00% -58.40%
==========================================
Files 48 33 -15
Lines 3653 3290 -363
==========================================
- Hits 2133 0 -2133
- Misses 1520 3290 +1770 ☔ View full report in Codecov by Sentry. |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
No description provided.