tl;dr: From what I can tell, mod_proxy_cluster tries to select a node based on the longest context match. However, very rarely, this does not work. Maybe a race condition due to missing locks?
Our Problem
We have a mod_proxy_cluster loadbalancer, where two nodes register, each with two contexts:
node1: /
node1: /onlyOnNode1
node2: /
node2: /onlyOnNode2
When I send a request for /onlyOnNode1 to Apache, this request is sometimes mistakenly forwarded to node2 by mod_proxy_cluster.
The system will run fine for days or weeks, and only then will the problematic behavior occur.
Once the problematic behavior occurs, it doesn't seem to go away until at least Apache or the affected Tomcat is restarted.
However, because the problematic state is very rare, I haven't been able to investigate it much.
Looking at the code
From what I've been able to tell from the code, this part determines how much of the URL each context matches and tracks the longest match:
mod_proxy_cluster/native/common/common.c
Lines 426 to 435 in 3cd7ad6
```c
        if (strncmp(uri, context->context, len) == 0) {
            if (uri[len] == '\0' || uri[len] == '/' || len == 1) {
                status[j] = context->status;
                length[j] = len;
                if (len > max) {
                    max = len;
                }
            }
        }
    }
```
So in my case, given the `context_table` above, `max` should be set to `strlen("/onlyOnNode1") == 12`, and the `length` array should look like this: `1, 12, 1, 0`.
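To convince myself of this, here is a minimal standalone sketch (not the module code; the hard-coded context list, `uri`, and array names are mine) that mimics the matching loop above for the four contexts from this issue:

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Hypothetical stand-ins for the context table of this issue. */
    const char *contexts[] = { "/", "/onlyOnNode1", "/", "/onlyOnNode2" };
    const char *uri = "/onlyOnNode1";
    int length[4] = { 0, 0, 0, 0 };
    int max = 0;

    for (int j = 0; j < 4; j++) {
        size_t len = strlen(contexts[j]);
        /* Same prefix test and boundary check as in common.c */
        if (strncmp(uri, contexts[j], len) == 0) {
            if (uri[len] == '\0' || uri[len] == '/' || len == 1) {
                length[j] = (int)len;
                if ((int)len > max) {
                    max = (int)len;
                }
            }
        }
    }

    /* Expected output: max=12 length=1,12,1,0 */
    printf("max=%d length=%d,%d,%d,%d\n",
           max, length[0], length[1], length[2], length[3]);
    return 0;
}
```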
Then, we do this:
mod_proxy_cluster/native/common/common.c
Lines 440 to 480 in 3cd7ad6
```c
/* find the best matching contexts */
nbest = 1;
for (j = 0; j < sizecontext; j++) {
    if (length[j] == max) {
        nbest++;
    }
}
best = apr_palloc(r->pool, sizeof(node_context) * nbest);
nbest = 0;
for (j = 0; j < sizecontext; j++) {
    if (length[j] == max) {
        contextinfo_t *context;
        int ok = 0;
        context = &context_table->context_info[j];
        /* Check status */
        switch (status[j]) {
        case ENABLED:
            ok = 1;
            break;
        case DISABLED:
            /* Only the request with sessionid ok for it */
            if (hassession_byname(r, context->node, route, node_table)) {
                ok = 1;
            }
            break;
        }
        if (ok) {
            best[nbest].node = context->node;
            best[nbest].context = context->id;
            nbest++;
        }
    }
}
if (nbest == 0) {
    return NULL;
}
best[nbest].node = -1;
/* Save the result */
apr_table_setn(r->notes, "proxy-context", (char *)best);
return best;
```
Here, we ignore any context whose corresponding `length` entry isn't `== max == 12`, so the only one remaining would be node1: /onlyOnNode1.
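For illustration, here is a continuation of the sketch above (again made-up arrays and values, not the module code; in particular the ENABLED/DISABLED constants and the node numbering are mine) showing that the `length[j] == max` filter plus the status check leaves exactly one candidate, node1's /onlyOnNode1 context:

```c
#include <stdio.h>

/* Hypothetical status values, only for this sketch. */
enum { ENABLED = 1, DISABLED = 2 };

int main(void)
{
    int length[4] = { 1, 12, 1, 0 };   /* result of the previous sketch      */
    int status[4] = { ENABLED, ENABLED, ENABLED, ENABLED };
    int node[4]   = { 1, 1, 2, 2 };    /* owning node of each context        */
    int max = 12;
    int best[4] = { 0 };
    int nbest = 0;

    for (int j = 0; j < 4; j++) {
        /* Keep only the longest matches whose context is usable. */
        if (length[j] == max && status[j] == ENABLED) {
            best[nbest++] = node[j];
        }
    }

    /* Expected output: nbest=1 best[0]=node1 */
    printf("nbest=%d best[0]=node%d\n", nbest, best[0]);
    return 0;
}
```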
But this can't have happened, otherwise I wouldn't be here.
If my context didn't pass the status check on lines 456-465, we should end up here instead:
```c
ap_log_rerror(APLOG_MARK, APLOG_TRACE6, 0, r, "proxy_cluster_trans: DECLINED %s uri: %s unparsed_uri: %s",
```
But this didn't happen either. Instead, the request reaches Tomcat node2.
Any ideas how this could happen?
Possible race condition?
One idea is that, for some reason, maybe a race condition, the entry node1: /onlyOnNode1 is missing completely from the context table. In that case, the code might select node2: / instead, depending on the order of the context table.
However, I wasn't able to make sense of the locking code and how it protects (calls to) read_context_table, or whether such locking is actually missing.
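To illustrate the suspicion, here is a sketch (purely speculative, with made-up names; not the module code) of what the matching loop would compute if the node1: /onlyOnNode1 entry were temporarily absent from the context table: `max` drops to 1, both / contexts tie as "best", and node2 becomes a legitimate pick depending on table order and load balancing:

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* node1:/onlyOnNode1 deliberately absent from this hypothetical table. */
    const char *contexts[] = { "/", "/", "/onlyOnNode2" };
    int node[] = { 1, 2, 2 };
    const char *uri = "/onlyOnNode1";
    int length[3] = { 0, 0, 0 };
    int max = 0;

    for (int j = 0; j < 3; j++) {
        size_t len = strlen(contexts[j]);
        if (strncmp(uri, contexts[j], len) == 0 &&
            (uri[len] == '\0' || uri[len] == '/' || len == 1)) {
            length[j] = (int)len;
            if ((int)len > max) {
                max = (int)len;
            }
        }
    }

    /* Both "/" contexts tie at max == 1, so node2 is a valid candidate. */
    for (int j = 0; j < 3; j++) {
        if (length[j] == max) {
            printf("candidate: node%d context %s\n", node[j], contexts[j]);
        }
    }
    return 0;
}
```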
MCM Status Servlet
Current dump from the MCM servlet (we only enabled it after running into this issue a few times; we haven't seen the issue since):
```
Node myapp1 (https://myapp1.mycompany:8443):
Balancer: myapp,LBGroup: ,Flushpackets: Off,Flushwait: 10000,Ping: 10000000,Smax: -1,Ttl: 60000000,Status: OK,Elected: 34929,Read: 967163,Transferred: 590290274,Connected: 0,Load: 100
Virtual Host 1:
  Contexts:
    /, Status: ENABLED Request: 0
    /onlyOnNode1, Status: ENABLED Request: 0
  Aliases:
    localhost

Node myapp2 (https://myapp2.mycompany:8443):
Balancer: myapp,LBGroup: ,Flushpackets: Off,Flushwait: 10000,Ping: 10000000,Smax: -1,Ttl: 60000000,Status: OK,Elected: 22738,Read: 679877,Transferred: 371948068,Connected: 0,Load: 100
Virtual Host 1:
  Contexts:
    /, Status: ENABLED Request: 0
    /onlyOnNode2, Status: ENABLED Request: 0
  Aliases:
    localhost
```
Enabling tracing does not seem to print anything about the context selection, and it doesn't seem to be possible to restrict tracing to the mod_proxy_cluster module.