Skip to content

Bug: ai-load-balancer plugin hangs on requests without body (GET/HEAD/DELETE), causing 504 timeout #3740

@Thomas-Eliot

Description

@Thomas-Eliot

Environment

  • Higress Gateway version: 2.1.11 (also reproducible on latest code in main branch)
  • Plugin: ai-load-balancer
  • Affected lb_policy: global_least_request, endpoint_metrics, prefix_cache
  • Not affected: cluster_metrics

Description

When ai-load-balancer plugin is enabled on a route, any request without a body (e.g. GET, HEAD, DELETE) will hang indefinitely until idle timeout, resulting in a 504 or downstream disconnect (DC).

The root cause is that 3 out of 4 lb_policy implementations return types.HeaderStopIteration in HandleHttpRequestHeaders, deferring all load-balancing logic (upstream host selection + proxywasm.ResumeHttpRequest()) to HandleHttpRequestBody. However, for requests with no body, the OnHttpRequestBody callback is never invoked by Envoy, so the request is paused forever.

Affected Code

  • global_least_request/lb_policy.goHandleHttpRequestHeaders returns HeaderStopIteration
  • endpoint_metrics/lb_policy.goHandleHttpRequestHeaders returns HeaderStopIteration
  • prefix_cache/lb_policy.goHandleHttpRequestHeaders returns HeaderStopIteration

All three rely on HandleHttpRequestBody to call proxywasm.ResumeHttpRequest(), which never happens for bodiless requests.

cluster_metrics/lb_policy.go is not affected because it completes host selection in HandleHttpRequestHeaders and returns ActionContinue.

Steps to Reproduce

  1. Configure a route with ai-load-balancer plugin enabled (any of the 3 affected policies)
  2. Send a GET request through Higress to that route:
    curl -i "http://<higress-gateway>/some-path?param=value" -H "host: example.com"
  3. The request hangs until timeout

Evidence from Envoy Debug Logs

# Request arrives with end_stream=true in headers (no body)
request headers complete (end_stream=true):
':path', '/sink?messageID=8c13992ee64e4dcaaf8f496c1387a9b5'
':method', 'GET'

# ~10 seconds later, stream reset due to client disconnect
stream reset: reset reason: connection termination

# ai-load-balancer reports error because HandleHttpRequestBody was never called
wasm log higress-system.ai-load-balancer-1.0.0: [ai-load-balancer] get host_selected failed

Access log confirms:

{
  "method": "GET",
  "path": "/sink?messageID=8c13992ee64e4dcaaf8f496c1387a9b5",
  "response_code": "0",
  "response_flags": "DC",
  "upstream_host": "-",
  "response_code_details": "downstream_remote_disconnect",
  "duration": "10320"
}

Suggested Fix

In HandleHttpRequestHeaders, detect whether the request has a body. If not, perform the load-balancing logic directly in the headers phase and return ActionContinue instead of HeaderStopIteration.

For example, check the :method header or use the endOfStream flag (if available via the wrapper framework):

func (lb GlobalLeastRequestLoadBalancer) HandleHttpRequestHeaders(ctx wrapper.HttpContext) types.Action {
    method, _ := proxywasm.GetHttpRequestHeader(":method")
    if method == "GET" || method == "HEAD" || method == "DELETE" || method == "OPTIONS" {
        // No body expected, perform LB logic here and continue
        // ... (same logic as HandleHttpRequestBody)
        return types.ActionContinue
    }
    return types.HeaderStopIteration
}

Workaround

Disable the ai-load-balancer plugin for routes that receive bodiless requests (GET/HEAD/DELETE).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions