Skip to content

Longest prefix match for contexts sometimes doesn't work #353

@alshain

Description

@alshain

tl;dr; From what I can tell, mod_proxy_cluster tries to select a node based on the longest context match. However, very rarely, this does not work. Maybe a race condition due to missing locks?

Our Problem

We have a mod_proxy_cluster loadbalancer, where two nodes register, each with two contexts:

node1: /
node1: /onlyOnNode1
node2: /
node2: /onlyOnNode2

When I send a request /onlyOnNode1 to Apache, this request is sometimes mistakenly forwarded to node2 by mod_proxy_cluster.

The system will run fine for days or weeks, and only then will the problematic behavior occur.
Once the problematic behavior occurs, it doesn't seem to go away until restarting at least Apache or the Tomcat.
However, because the problematic state is very rare, I haven't been able to investigate it much.

Looking at the code

From what I've been able from the code, this part here searches all longest matches for a URL:

if (strncmp(uri, context->context, len) == 0) {
if (uri[len] == '\0' || uri[len] == '/' || len == 1) {
status[j] = context->status;
length[j] = len;
if (len > max) {
max = len;
}
}
}
}

So in my case assuming the context_table given above, max should be set to strlen(/onlyOnNode1) == 12, and the length array should look like this: 1, 12, 1, 0 .

Then, we do this:

/* find the best matching contexts */
nbest = 1;
for (j = 0; j < sizecontext; j++) {
if (length[j] == max) {
nbest++;
}
}
best = apr_palloc(r->pool, sizeof(node_context) * nbest);
nbest = 0;
for (j = 0; j < sizecontext; j++) {
if (length[j] == max) {
contextinfo_t *context;
int ok = 0;
context = &context_table->context_info[j];
/* Check status */
switch (status[j]) {
case ENABLED:
ok = 1;
break;
case DISABLED:
/* Only the request with sessionid ok for it */
if (hassession_byname(r, context->node, route, node_table)) {
ok = 1;
}
break;
}
if (ok) {
best[nbest].node = context->node;
best[nbest].context = context->id;
nbest++;
}
}
}
if (nbest == 0) {
return NULL;
}
best[nbest].node = -1;
/* Save the result */
apr_table_setn(r->notes, "proxy-context", (char *)best);
return best;

Here, we ignore any context whose corresponding length-array-entry isn't == max == 12, so the only one remaining would be node1: /onlyOnNode1.

But this can't have happened, otherwise I wouldn't be here.
If my context dind't pass the status check on lines 456-465, we should end up here instead:

ap_log_rerror(APLOG_MARK, APLOG_TRACE6, 0, r, "proxy_cluster_trans: DECLINED %s uri: %s unparsed_uri: %s",

But this didn't happen either. Instead we get to a request on Tomcat Node2.

Any ideas how this could happen?

Possible race condition?

One idea is, that for some reason, maybe a race condition, the entry node1: /onlyOnNode1 is missing completely from the context table. In that case, it might select node2: / instead depending on the order of the context-table.

I wasn't able to make sense of the locking code, however, and how it influences (calls to) read_context_table or maybe it's actually missing?

MCM Status Servlet

Current dump from the MCM servlet (only enabled after we ran into this issue a few times, haven't seen it since):

Node myapp1 (https://myapp1.mycompany:8443):
Enable Contexts Disable Contexts Stop Contexts
Balancer: myapp,LBGroup: ,Flushpackets: Off,Flushwait: 10000,Ping: 10000000,Smax: -1,Ttl: 60000000,Status: OK,Elected: 34929,Read: 967163,Transferred: 590290274,Connected: 0,Load: 100
Virtual Host 1:
Contexts:
/, Status: ENABLED Request: 0 Disable Stop
/onlyOnNode1, Status: ENABLED Request: 0 Disable Stop
Aliases:
localhost

Node myapp2 (https://myapp2.mycompany:8443):
Enable Contexts Disable Contexts Stop Contexts
Balancer: myapp,LBGroup: ,Flushpackets: Off,Flushwait: 10000,Ping: 10000000,Smax: -1,Ttl: 60000000,Status: OK,Elected: 22738,Read: 679877,Transferred: 371948068,Connected: 0,Load: 100
Virtual Host 1:
Contexts:
/, Status: ENABLED Request: 0 Disable Stop
/onlyOnNode2, Status: ENABLED Request: 0 Disable Stop
Aliases:
localhost

Enabling tracing does not seem to print anything about the context selection. Also it doesn't seem to be possible to restrict tracing to the mod_proxy_cluster module.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions