
Fix NPU segfault with two concurrent inference pipelines#795

Open
oonyshch wants to merge 2 commits into main from fix/ITEP-88326-npu-concurrent-pipelines

Conversation

@oonyshch
Contributor

@oonyshch oonyshch commented Apr 21, 2026

Description

When two gvadetect pipelines target device=NPU simultaneously, the NPU VCL compiler crashes with ConvertVPUMI37XX2ELF failed: bad optional access, followed by SIGSEGV.
The root cause is a race condition in ov::Core::compile_model().
A secondary issue is that gvafpscounter accesses its global fps_counters map without holding channels_mutex in several functions, which is a data race when multiple pipelines run in the same process.

Note: batch sizes of 16 and 32 still fail due to an NPU VCL compiler limitation (the ConvertVPUMI37XX2ELF pass), but this PR enables batch sizes up to 8 for the YOLO model.

Reproduction

Single pipeline:

MODEL_PATH=/path/to/yolo11s/INT8/yolo11s.xml
VIDEO_PATH=/path/to/1192116-sd_640_360_30fps.mp4

gst-launch-1.0 \
  filesrc location=$VIDEO_PATH \
  ! qtdemux ! h264parse ! vaapidecodebin \
  ! vapostproc ! vapostproc \
  ! 'video/x-raw(memory:VAMemory)' \
  ! vapostproc \
  ! 'video/x-raw(memory:VAMemory)' \
  ! gvadetect model=$MODEL_PATH pre-process-backend=va device=NPU batch-size=8 nireq=6 \
  ! gvafpscounter \
  ! fakesink sync=false

Double pipeline:

MODEL_PATH=/path/to/yolo11s/INT8/yolo11s.xml
VIDEO_PATH=/path/to/1192116-sd_640_360_30fps.mp4

gst-launch-1.0 \
  filesrc location=$VIDEO_PATH \
  ! qtdemux ! h264parse ! vaapidecodebin \
  ! vapostproc ! vapostproc \
  ! 'video/x-raw(memory:VAMemory)' \
  ! vapostproc \
  ! 'video/x-raw(memory:VAMemory)' \
  ! gvadetect model=$MODEL_PATH pre-process-backend=va device=NPU batch-size=8 nireq=6 \
  ! gvafpscounter \
  ! fakesink sync=false \
  filesrc location=$VIDEO_PATH \
  ! qtdemux ! h264parse ! vaapidecodebin \
  ! vapostproc ! vapostproc \
  ! 'video/x-raw(memory:VAMemory)' \
  ! vapostproc \
  ! 'video/x-raw(memory:VAMemory)' \
  ! gvadetect model=$MODEL_PATH pre-process-backend=va device=NPU batch-size=8 nireq=6 \
  ! gvafpscounter \
  ! fakesink sync=false

Metrics:

| batch-size | Single pipeline | Dual in-process |
| ---------- | --------------- | --------------- |
| 1          | 90.49 fps       | 88.95 fps       |
| 2          | 29.61 fps       | 29.21 fps       |
| 4          | 31.18 fps       | 31.05 fps       |
| 8          | 26.22 fps       | 25.77 fps       |

How Has This Been Tested?

Locally on ARL-H with Intel Core Ultra 9 285H, NPU driver v1.30, OpenVINO 2026.1.

Checklist:

  • I agree to use the MIT license for my code changes.
  • I have not introduced any 3rd party components incompatible with MIT.
  • I have not included any company confidential information, trade secret, password or security token.
  • I have performed a self-review of my code.

@oonyshch oonyshch marked this pull request as draft April 21, 2026 23:09
@oonyshch oonyshch force-pushed the fix/ITEP-88326-npu-concurrent-pipelines branch from 0d3d675 to 13d69e0 Compare April 21, 2026 23:29
@oonyshch oonyshch marked this pull request as ready for review April 21, 2026 23:32
@oonyshch oonyshch force-pushed the fix/ITEP-88326-npu-concurrent-pipelines branch 2 times, most recently from ebee3f4 to fb8d164 Compare April 22, 2026 09:29
Comment thread src/gst/elements/gvafpscounter/fpscounter_c.cpp
Contributor

@walidbarakat walidbarakat left a comment


Overall, I think we shouldn't handle the batch size ourselves, since the NPU plugin is already taking care of it:

NPU plugin batching


I understand that when users don't reuse the same model-instance-id (which goes against the DLStreamer performance guide), concurrent access to NPU plugin compilation can be an issue. But why do we need the locks in the FPS counter functionality?

Another question: how was this bug discovered? I tried to run these repro pipelines on my ARL machine and got no reproduction.

Note: please keep the commits atomic and focused on the change needed to resolve the bug.

if (_app_context) {
    try {
        dlstreamer::D3D11ContextPtr d3d11_ctx = dlstreamer::D3D11Context::create(_app_context);
        std::lock_guard<std::mutex> lock(compile_mutex());
Contributor


Why do we need to create a lock for the remote context for GPU?

if (_app_context) {
    try {
        dlstreamer::VAAPIContextPtr vaapi_ctx = dlstreamer::VAAPIContext::create(_app_context);
        std::lock_guard<std::mutex> lock(compile_mutex());
Contributor


Why do we need to create a lock for the remote context for GPU?

Comment on lines +1082 to +1090
std::lock_guard<std::mutex> lock(compile_mutex());
if (_openvino_context) {
    _compiled_model = core().compile_model(_model, _openvino_context->remote_context(), ov_params);
} else {
    std::string formatted_device = _device;
    if (_batch_timeout > -1) {
        formatted_device = fmt::format("BATCH:{}({})", _device, _auto_batch_num_requests);
    }
    _compiled_model = core().compile_model(_model, formatted_device, ov_params);
Contributor


Here the lock is taken for every device and mode; please narrow it down to NPU only.

_batch_size = core().get_property(config.device(), ov::optimal_batch_size);
} catch (...) {
_batch_size = 1; // Fallback if optimal batch size property is not supported
_batch_size = 1;
Contributor


Why delete the comment here?

Serialize ov::Core::compile_model via compile_mutex to prevent NPU
plugin init races. Guard gvafpscounter global state with channels_mutex.