-
Notifications
You must be signed in to change notification settings - Fork 11.5k
fix(rpc): Improve input validation and error handling #13069
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
on my opinion it may affects to perfomance, may be use feature flag (via cmake)? |
auto it_ptr = tensor_ptrs.find(id); | ||
if (it_ptr == tensor_ptrs.end()) { | ||
GGML_LOG_ERROR("[%s] tensor id %" PRIu64 " not found in provided tensors\n", __func__, id); | ||
return nullptr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using nullptr
as return value for both valid and invalid input is wrong; we should either prevent calling create_node
with id=0
or find another way to signal invalid input
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point! I have an alternative in mind (which I'll push in a moment):
graph_compute
checks if the input 'id' was non-zero whencreate_node
returns nullptr. Hopefully identifies failures versus intentional null links.create_node
avoids recursive calls for zero IDs and propagates nullptr unambiguously on failure during recursion.
size_t required_size_before_tensors = min_header_size; | ||
|
||
// Check for overflow before adding nodes_array_bytes | ||
if (SIZE_MAX - required_size_before_tensors < nodes_array_bytes) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't quite understand the new checks which are being added; could you please provide an example where the existing checks fail to detect invalid input?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So from the old implementation:
- Read
n_nodes
- Calculate offset
O1 = sizeof(uint32_t) + n_nodes * sizeof(uint64_t)
- Read
n_tensors
frominput.data() + O1
- Calculate offset
O2 = O1 + sizeof(uint32_t) + n_tensors * sizeof(rpc_tensor)
- Read tensors array from
input.data() + O1 + sizeof(uint32_t)
A large n_nodes
value combined with a small input.size()
would read beyond the actual input buffer.
I'll try to summarise the new approach:
- Reads
n_nodes
- Calculates
nodes_array_bytes = n_nodes * sizeof(uint64_t)
- Calculates
required_size_before_tensors = min_header_size + nodes_array_bytes + n_tensors_field_size
- With overflow checks for the additions, although overflow is very unlikely
- Crucially: Checks
if (input.size() < required_size_before_tensors)
before attempting to readn_tensors
- This prevents the out-of-bounds read when trying to access
n_tensors
ifn_nodes
was large.
- This prevents the out-of-bounds read when trying to access
- Reads
n_tensors
- Calculates
tensors_array_bytes = n_tensors * sizeof(rpc_tensor)
- Calculates
required_total_size = required_size_before_tensors + tensors_array_bytes
- With overflow check
- Checks
if (input.size() < required_total_size)
before attempting to access the tensors data.
This is quite extensive and might be overly cautious, but happy to hear alternatives!
The `rpc-server` was vulnerable to Denial of Service attacks via several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed messages could trigger failed assertions (e.g., invalid `ggml_type`) or out-of-bounds reads/writes leading to `GGML_ABORT` calls, crashing the server process. This PR introduces robust input validation and replaces `abort()` calls with graceful error handling: - **Type Validation:** `deserialize_tensor` now checks if the `tensor->type` is within the valid `GGML_TYPE_COUNT` range *before* calling `ggml_new_tensor_4d`. Returns `nullptr` on invalid type. - **Bounds Checks:** Replaced `GGML_ABORT` in `set_tensor`, `set_tensor_hash`, and `get_tensor` handlers with error logging and returning `false` when data/offset parameters are out of buffer bounds. - **Size Checks:** Added safe arithmetic checks (for overflow) in `graph_compute` when calculating required message sizes based on client-provided `n_nodes` and `n_tensors`. Returns early if the reported sizes conflict with the actual message size or would lead to overflow. - **Error Propagation:** - `create_node` now checks for `nullptr` return values from `deserialize_tensor` and its recursive calls, propagating `nullptr` upwards on failure. Uses `find` instead of `at` for safer map access. - `copy_tensor` now checks for `nullptr` from `deserialize_tensor` and sets the response status to failure if deserialization or bounds checks fail. - `graph_compute` now checks for `nullptr` return from `create_node` and returns failure status correctly. The final return value now reflects the actual computation status. These changes improve the RPC server's resilience against malformed client requests, preventing crashes and ensuring errors are handled more gracefully. Signed-off-by: Ville Vesilehto <[email protected]>
removed comments and unnecessary returns Signed-off-by: Ville Vesilehto <[email protected]>
rpc_server::create_node could previously return nullptr if the input ID was 0 (valid) or if an internal error (deserialization, recursion failure) occurred (invalid). This ambiguity made error handling difficult for the caller (`graph_compute`). This commit clarifies the meaning of nullptr: - `graph_compute` now checks if the input 'id' was non-zero when `create_node` returns nullptr, correctly identifying failures versus intentional null links. - `create_node` avoids recursive calls for zero IDs and propagates nullptr unambiguously on failure during recursion. Signed-off-by: Ville Vesilehto <[email protected]>
bef194d
to
604f0a0
Compare
I believe it would be interesting to see what the performance impact of this change is. I'm new to the project so pointers welcome if there's a test suite available which would show that. Slightly off-topic but related: I think there's plenty of opportunities for similar improvements in the RPC server. From invalid tensor operations to crashing via deep recursion in I think multiple critical fixes behind a feature flag would be counterintuitive. Rather build bench tooling (if needed) and iterate on the fixes so there's minimal performance hit. |
Fixes #13067
The
rpc-server
was vulnerable to Denial of Service attacks via several RPC commands (SET_TENSOR
,GRAPH_COMPUTE
, etc.). Malformed messages could trigger failed assertions (e.g., invalidggml_type
) or out-of-bounds reads/writes leading toGGML_ABORT
calls, crashing the server process.This PR introduces robust input validation and replaces
abort()
calls with graceful error handling:deserialize_tensor
now checks if thetensor->type
is within the validGGML_TYPE_COUNT
range before callingggml_new_tensor_4d
. Returnsnullptr
on invalid type.GGML_ABORT
inset_tensor
,set_tensor_hash
, andget_tensor
handlers with error logging and returningfalse
when data/offset parameters are out of buffer bounds.graph_compute
when calculating required message sizes based on client-providedn_nodes
andn_tensors
. Returns early if the reported sizes conflict with the actual message size or would lead to overflow.create_node
now checks fornullptr
return values fromdeserialize_tensor
and its recursive calls, propagatingnullptr
upwards on failure. Usesfind
instead ofat
for safer map access.copy_tensor
now checks fornullptr
fromdeserialize_tensor
and sets the response status to failure if deserialization or bounds checks fail.graph_compute
now checks fornullptr
return fromcreate_node
and returns failure status correctly. The final return value now reflects the actual computation status.