
fix(rpc): Improve input validation and error handling #13069


Open · wants to merge 3 commits into master from fix/tensor-ggml-type

Conversation

thevilledev

Fixes #13067

The rpc-server was vulnerable to Denial of Service attacks via several RPC commands (SET_TENSOR, GRAPH_COMPUTE, etc.). Malformed messages could trigger failed assertions (e.g., invalid ggml_type) or out-of-bounds reads/writes leading to GGML_ABORT calls, crashing the server process.

This PR introduces robust input validation and replaces abort() calls with graceful error handling:

  • Type Validation: deserialize_tensor now checks if the tensor->type is within the valid GGML_TYPE_COUNT range before calling ggml_new_tensor_4d. Returns nullptr on invalid type.
  • Bounds Checks: Replaced GGML_ABORT in set_tensor, set_tensor_hash, and get_tensor handlers with error logging and returning false when data/offset parameters are out of buffer bounds.
  • Size Checks: Added safe arithmetic checks (for overflow) in graph_compute when calculating required message sizes based on client-provided n_nodes and n_tensors. Returns early if the reported sizes conflict with the actual message size or would lead to overflow.
  • Error Propagation:
    • create_node now checks for nullptr return values from deserialize_tensor and its recursive calls, propagating nullptr upwards on failure. Uses find instead of at for safer map access.
    • copy_tensor now checks for nullptr from deserialize_tensor and sets the response status to failure if deserialization or bounds checks fail.
    • graph_compute now checks for nullptr return from create_node and returns failure status correctly. The final return value now reflects the actual computation status.

@github-actions github-actions bot added the ggml changes relating to the ggml tensor library for machine learning label Apr 22, 2025
lexasub (Contributor) commented Apr 22, 2025

In my opinion it may affect performance; maybe use a feature flag (via cmake)?

    auto it_ptr = tensor_ptrs.find(id);
    if (it_ptr == tensor_ptrs.end()) {
        GGML_LOG_ERROR("[%s] tensor id %" PRIu64 " not found in provided tensors\n", __func__, id);
        return nullptr;
    }
Collaborator:
Using nullptr as a return value for both valid and invalid input is wrong; we should either prevent calling create_node with id=0 or find another way to signal invalid input.

Author:

Fair point! I have an alternative in mind (which I'll push in a moment):

  • graph_compute checks whether the input id was non-zero when create_node returns nullptr, which distinguishes genuine failures from intentional null links.
  • create_node avoids recursive calls for zero IDs and propagates nullptr unambiguously on failure during recursion.

    size_t required_size_before_tensors = min_header_size;

    // Check for overflow before adding nodes_array_bytes
    if (SIZE_MAX - required_size_before_tensors < nodes_array_bytes) {
        return false;
    }
Collaborator:

I don't quite understand the new checks which are being added; could you please provide an example where the existing checks fail to detect invalid input?

Author:

So from the old implementation:

  • Read n_nodes
  • Calculate offset O1 = sizeof(uint32_t) + n_nodes * sizeof(uint64_t)
  • Read n_tensors from input.data() + O1
  • Calculate offset O2 = O1 + sizeof(uint32_t) + n_tensors * sizeof(rpc_tensor)
  • Read tensors array from input.data() + O1 + sizeof(uint32_t)

A large n_nodes value combined with a small input.size() would read beyond the actual input buffer.

I'll try to summarise the new approach:

  • Reads n_nodes
  • Calculates nodes_array_bytes = n_nodes * sizeof(uint64_t)
  • Calculates required_size_before_tensors = min_header_size + nodes_array_bytes + n_tensors_field_size
    • With overflow checks for the additions, although overflow is very unlikely
  • Crucially: Checks if (input.size() < required_size_before_tensors) before attempting to read n_tensors
    • This prevents the out-of-bounds read when trying to access n_tensors if n_nodes was large.
  • Reads n_tensors
  • Calculates tensors_array_bytes = n_tensors * sizeof(rpc_tensor)
  • Calculates required_total_size = required_size_before_tensors + tensors_array_bytes
    • With overflow check
  • Checks if (input.size() < required_total_size) before attempting to access the tensors data.

This is quite extensive and might be overly cautious, but happy to hear alternatives!

removed comments and unnecessary returns

Signed-off-by: Ville Vesilehto <[email protected]>
rpc_server::create_node could previously return nullptr if the input ID
was 0 (valid) or if an internal error (deserialization, recursion
failure) occurred (invalid). This ambiguity made error handling
difficult for the caller (`graph_compute`).

This commit clarifies the meaning of nullptr:
- `graph_compute` now checks if the input 'id' was non-zero when
  `create_node` returns nullptr, correctly identifying failures
  versus intentional null links.
- `create_node` avoids recursive calls for zero IDs and propagates
  nullptr unambiguously on failure during recursion.

Signed-off-by: Ville Vesilehto <[email protected]>
@thevilledev thevilledev force-pushed the fix/tensor-ggml-type branch from bef194d to 604f0a0 on April 23, 2025 at 18:00

thevilledev commented Apr 23, 2025

In my opinion it may affect performance; maybe use a feature flag (via cmake)?

I believe it would be interesting to see what the performance impact of this change is. I'm new to the project, so pointers are welcome if there's a test suite available that would show that.

Slightly off-topic but related: I think there are plenty of opportunities for similar improvements in the RPC server, from invalid tensor operations to crashes via deep recursion in create_node, which I would also like to fix. I'd like to work on those one change at a time, though.

I think putting multiple critical fixes behind a feature flag would be counterintuitive. I'd rather build benchmark tooling (if needed) and iterate on the fixes so the performance hit is minimal.

Successfully merging this pull request may close these issues.

Misc. bug: RPC server crash on SET_TENSOR with invalid ggml_type