
test(autoware_tensorrt_plugins): add reference kernel tests #12561

Open
mojomex wants to merge 7 commits into autowarefoundation:main from
mojomex:test/tensorrt-reference-kernels


@mojomex
Contributor

@mojomex mojomex commented May 8, 2026

Summary

Adds tests that compare CUDA kernel output against known-good CPU references for autoware_tensorrt_plugins:

  • verifies argsort output against a CPU stable-sort reference
  • verifies unique values, inverse indices, and counts against a CPU reference
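
For readers unfamiliar with this style of test, a CPU reference can be sketched roughly as below. This is a minimal illustration, not the code in this PR: the function names (`argsort_reference`, `unique_reference`), the `UniqueResult` struct, the `int64_t` element type, and the assumption that unique values come out in ascending order are all hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <numeric>
#include <vector>

// Hypothetical CPU argsort reference: returns indices ordered by value.
// std::stable_sort keeps equal values in their original input order.
std::vector<std::int64_t> argsort_reference(const std::vector<std::int64_t> & input)
{
  std::vector<std::int64_t> indices(input.size());
  std::iota(indices.begin(), indices.end(), std::int64_t{0});
  std::stable_sort(indices.begin(), indices.end(), [&input](std::int64_t a, std::int64_t b) {
    return input[static_cast<std::size_t>(a)] < input[static_cast<std::size_t>(b)];
  });
  return indices;
}

// Hypothetical result layout for the unique reference.
struct UniqueResult
{
  std::vector<std::int64_t> values;   // distinct values, ascending (assumed ordering)
  std::vector<std::int64_t> inverse;  // inverse[i] is the position of input[i] in values
  std::vector<std::int64_t> counts;   // counts[j] is the number of occurrences of values[j]
};

UniqueResult unique_reference(const std::vector<std::int64_t> & input)
{
  // std::map iterates in ascending key order, which yields sorted unique values.
  std::map<std::int64_t, std::int64_t> histogram;
  for (const std::int64_t value : input) {
    ++histogram[value];
  }

  UniqueResult result;
  for (const auto & entry : histogram) {
    result.values.push_back(entry.first);
    result.counts.push_back(entry.second);
  }

  // Map each input element back to its slot in the sorted unique values.
  result.inverse.reserve(input.size());
  for (const std::int64_t value : input) {
    const auto it = std::lower_bound(result.values.begin(), result.values.end(), value);
    result.inverse.push_back(it - result.values.begin());
  }
  return result;
}
```

A test would then copy the kernel output back to the host and compare it element by element against these vectors.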

Compatibility note

While validating the test, the unique counts check exposed an existing sentinel write bug, so this PR includes the minimal fix needed for the new test to pass: write the final range sentinel to range_ptr + num_out, not range_ptr + num_out * sizeof(int64_t) (range_ptr is an int64_t*).
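
To illustrate why the extra `sizeof` factor is wrong: arithmetic on a typed pointer is already scaled by the element size, so multiplying the index by `sizeof(int64_t)` again addresses an element far past the intended slot. The sketch below is host-only and uses hypothetical helper names (`write_sentinel`, `buggy_element_offset`); it is not the plugin code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: writes the sentinel into the slot directly after
// num_out entries. range_ptr[num_out] is the same as *(range_ptr + num_out).
void write_sentinel(std::int64_t * range_ptr, std::size_t num_out, std::int64_t sentinel)
{
  range_ptr[num_out] = sentinel;
}

// Element index that the buggy expression range_ptr + num_out * sizeof(int64_t)
// would have addressed: the index is scaled by the element size a second time.
std::size_t buggy_element_offset(std::size_t num_out)
{
  return num_out * sizeof(std::int64_t);
}
```

For example, with `num_out == 3` the correct slot is element 3, whereas the buggy expression targets element `3 * sizeof(int64_t)`, which lies outside a buffer of `num_out + 1` entries.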

See

for a standalone version of that fix.

Validation

colcon build --packages-up-to autoware_tensorrt_plugins && \
colcon test-result --delete-yes && \
colcon test --packages-select autoware_tensorrt_plugins --event-handlers console_cohesion+ && \
colcon test-result --verbose

Look out for:

[==========] Running 2 tests from 1 test suite.
1: [----------] Global test environment set-up.
1: [----------] 2 tests from ReferenceKernelsTest
1: [ RUN      ] ReferenceKernelsTest.ArgsortMatchesCpuReference
1: [       OK ] ReferenceKernelsTest.ArgsortMatchesCpuReference (277 ms)
1: [ RUN      ] ReferenceKernelsTest.UniqueMatchesCpuReference
1: [       OK ] ReferenceKernelsTest.UniqueMatchesCpuReference (1 ms)
1: [----------] 2 tests from ReferenceKernelsTest (278 ms total)
1: 
1: [----------] Global test environment tear-down
1: [==========] 2 tests from 1 test suite ran. (278 ms total)
1: [  PASSED  ] 2 tests.

@github-actions github-actions Bot added the component:perception Advanced sensor data processing and environment understanding. (auto-assigned) label May 8, 2026
@github-actions

github-actions Bot commented May 8, 2026

Thank you for contributing to the Autoware project!

🚧 If your pull request is in progress, switch it to draft mode.

Please ensure:

@mojomex mojomex self-assigned this May 8, 2026
Co-authored-by: Copilot <copilot@github.com>
Signed-off-by: Max SCHMELLER <max.schmeller@tier4.jp>
@mojomex mojomex added the run:build-and-test-differential Mark to enable build-and-test-differential workflow. (used-by-ci) label May 8, 2026
mojomex and others added 2 commits May 8, 2026 15:21
Co-authored-by: Copilot <copilot@github.com>
Signed-off-by: Max SCHMELLER <max.schmeller@tier4.jp>
@mojomex mojomex force-pushed the test/tensorrt-reference-kernels branch from 30a2371 to fd53532 Compare May 8, 2026 06:26
Signed-off-by: Max SCHMELLER <max.schmeller@tier4.jp>
@mojomex mojomex marked this pull request as ready for review May 8, 2026 06:29
@mojomex mojomex requested a review from manato May 8, 2026 07:05
Co-authored-by: Copilot <copilot@github.com>
Signed-off-by: Max SCHMELLER <max.schmeller@tier4.jp>
@codecov

codecov Bot commented May 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 0.22%. Comparing base (54af299) to head (5002ac5).

❗ There is a different number of reports uploaded between BASE (54af299) and HEAD (5002ac5): HEAD has 1 fewer upload than BASE.

| Flag  | BASE (54af299) | HEAD (5002ac5) |
|-------|----------------|----------------|
| daily | 1              | 0              |

Additional details and impacted files
@@             Coverage Diff             @@
##             main   #12561       +/-   ##
===========================================
- Coverage   18.64%    0.22%   -18.42%     
===========================================
  Files        1918       97     -1821     
  Lines      131362     3500   -127862     
  Branches    44502       25    -44477     
===========================================
- Hits        24489        8    -24481     
+ Misses      86760     3491    -83269     
+ Partials    20113        1    -20112     
| Flag       | Coverage Δ              |
|------------|-------------------------|
| daily      | ?                       |
| full-suite | 0.22% <ø> (-18.42%) ⬇️ |


Contributor

@manato manato left a comment


@mojomex
Thank you very much for improving the unit tests. I left a few small suggestions regarding test strictness. I would appreciate it if you would consider them!

Comment on lines +70 to +76
    copy_to_device(input_d.get(), input);

    ASSERT_EQ(
      argsort(
        input_d.get(), output_d.get(), workspace_d.get(), input.size(),
        get_argsort_workspace_size(input.size()), stream.get()),
      cudaSuccess);

Suggested change

    - copy_to_device(input_d.get(), input);
    - ASSERT_EQ(
    -   argsort(
    -     input_d.get(), output_d.get(), workspace_d.get(), input.size(),
    -     get_argsort_workspace_size(input.size()), stream.get()),
    -   cudaSuccess);
    + cudaEvent_t copy_event;
    + cudaEventCreate(&copy_event);
    + copy_to_device(input_d.get(), input);
    + cudaEventRecord(copy_event, 0);  // record event on the default stream
    + cudaStreamWaitEvent(stream.get(), copy_event, cudaEventWaitDefault);
    + ASSERT_EQ(
    +   argsort(
    +     input_d.get(), output_d.get(), workspace_d.get(), input.size(),
    +     get_argsort_workspace_size(input.size()), stream.get()),
    +   cudaSuccess);

According to the CUDA documentation on the synchronization behavior of synchronous memcpy calls:

For transfers from pageable host memory to device memory, a stream sync is performed before the copy is initiated. The function will return once the pageable buffer has been copied to the staging memory for DMA transfer to device memory, but the DMA to final destination may not have completed.

Since input is pageable host memory, it would be better to insert explicit synchronization before operating on input_d.

Alternatively, we can skip this kind of explicit synchronization if copy_to_device takes a CUDA stream as an argument and performs cudaMemcpyAsync inside.

DeviceBuffer<std::uint8_t> workspace_d(get_unique_workspace_size(input.size()));

copy_to_device(input_d.get(), input);


Same as the case in argsort_kernel_test.cpp: it would be better to insert explicit synchronization here.


Labels

component:perception Advanced sensor data processing and environment understanding. (auto-assigned) run:build-and-test-differential Mark to enable build-and-test-differential workflow. (used-by-ci)

Projects

Status: To Triage


2 participants