
fix: retry N1N2MessageTransfer with AMF re-discovery on stale endpoint#549

Draft
donivtech wants to merge 1 commit into omec-project:main from donivtech:fix/stale-amf-client-rel310

Conversation

@donivtech

Summary

Fixes #548.

After an AMF pod restart (rolling update, OOM, node drain, etc.), the SMF continues sending N1N2MessageTransfer to the dead pod IP. The UE never receives the PDU Session Establishment Accept, T3580 expires, and the only known recovery is to restart the SMF pod. See #548 for the full root-cause analysis.

This PR adds a defensive retry path that fails fast on a dead AMF endpoint, queries NRF directly (bypassing the cache), and tries every other AMF candidate until one responds. It also rebuilds the per-SMContext CommunicationClient after MongoDB recovery so post-SMF-restart sessions don't see a nil client.

What changed

consumer/nf_management.go — new SendN1N2TransferWithRediscovery(ctx, smContext, n1n2Request):

  • First attempt with context.WithTimeout(ctx, 5*time.Second) so a dead endpoint fails inside the T3580 window (~16s) instead of the kernel TCP timeout (~60s+).
  • On failure, fetches all AMF candidates from NRF directly (skipping the cache, which would return the same stale entry).
  • Iterates through candidates whose NfInstanceId differs from the one that just failed; succeeds on the first live AMF, or returns the last error if all are dead.

context/sm_context.go — new (*SMContext).RebuildCommunicationClient() that reconstructs the HTTP client from the stored AMFProfile. Called from context/db.go after loading an SMContext from MongoDB so a recovered context has a usable client.

producer/pdu_session.go, producer/callback.go, pfcp/handler/handler.go, pfcp/message/send.go — the four direct call sites that previously did smContext.CommunicationClient.N1N2MessageCollectionDocumentApi.N1N2MessageTransfer(...) now go through the wrapper. Inline Namf_Communication.NewAPIClient(...) construction in producer/pdu_session.go is replaced by a call to RebuildCommunicationClient().

Diff stat: 7 files, +134 / −23.

The happy path is unchanged — when the cached client succeeds (the common case) the wrapper returns immediately with no extra NRF roundtrip and no retry.

Why iterate through every NRF candidate

Originally I picked one alternative AMF (the first one with a NfInstanceId different from the failed one) and retried once. That isn't enough when NRF holds multiple stale entries — observed live, NRF had three AMF profiles, two dead and one live, and the single-retry heuristic landed on a dead one and gave up. Iterating every candidate handles arbitrary NRF pollution at the cost of 5s × N_dead recovery time, which is still well within the T3580 retransmission window for any realistic count.

Why bypass the NRF cache on re-discovery

The SMF NRF cache (1-minute TTL, 15-minute eviction sweep) is keyed in part by TargetNfInstanceId. A targeted lookup by the old ServingNfId returns the stale cached entry. Because rediscoverAMF only runs after a confirmed failure, going straight to NRF is the right behaviour — we already know the cached value was wrong.

Verification

A/B tested on two RKE2 clusters with UERANSIM gNB+UE:

| Scenario | Stock rel-3.1.0 SMF | This PR |
| --- | --- | --- |
| Baseline PDU session | success | success (unchanged) |
| `kubectl delete pod -l app=amf`, then re-establish | T3580 expires 5×, procedure failure | Accept on first or second attempt, ~5–11s |
| Multiple stale NRF entries | permanent failure | iterates through dead entries, succeeds on the live AMF |

Captured SMF log when retry triggers:

N1N2 transfer initiated → old AMF 10.x.y.OLD (dead)
[5s timeout] N1N2Transfer failed (... i/o timeout), attempting AMF re-discovery
AMF re-discovery: querying NRF directly (bypassing cache)
AMF re-discovery retry 1: trying NfInstanceId 8e2dfca4-... → succeeded
N1N2 Transfer completed

Scope and follow-ups

This is a defensive SMF-side fix. The underlying NRF stale-entry accumulation (no preStop deregistration, no heartbeat-based TTL) and the AMF-side reuse of stale NfId/RegisterIPv4 from MongoDB on restart are tracked separately and need their own fixes. The change here works regardless of whether those land — and since pod restarts are routine in any K8s environment, hardening the SMF against stale endpoints seems valuable on its own.

I left three open questions in #548 for the maintainers (overall framing, single-PR vs split, test expectations). Happy to adjust based on your preference. If you'd like unit tests for RebuildCommunicationClient and the candidate-iteration logic, I can add them in a follow-up commit.

Test plan

  • go build ./...
  • go vet ./...
  • go test ./... — all packages pass
  • gofmt -d on changed files — clean
  • Live A/B test on two RKE2 clusters (UERANSIM)
  • Live verification on a third cluster after image deploy

Commit message:

After an AMF pod restart, SMF continues sending N1N2MessageTransfer
to the old AMF pod IP, causing PDU Session Establishment to fail
with T3580 expiry. This happens because:

1. CommunicationClient bakes in the AMF pod IP at session creation
   and is never refreshed.
2. CommunicationClient is not JSON-serializable, so it's nil after
   SMF recovers SMContext from MongoDB.
3. NRF accumulates stale NF registrations across pod restarts and
   the NRF cache (15min TTL) returns stale entries on re-discovery.
4. The HTTP client has no timeout, so TCP connect to a dead pod IP
   hangs for 60s+ — longer than the T3580 window.

Fix:

- Add SendN1N2TransferWithRediscovery() that wraps all N1N2 calls
  with a 5-second context timeout. On failure, it re-discovers the
  AMF by querying NRF directly (bypassing cache), prefers an AMF
  with a different NfInstanceId than the failed one, and retries.

- Add RebuildCommunicationClient() on SMContext to reconstruct the
  HTTP client from the stored AMFProfile after DB recovery.

- Replace all 5 direct CommunicationClient.N1N2MessageTransfer call
  sites with the retry wrapper.

Verified on live cluster: after AMF pod kill, N1N2 transfer fails
in 5s, re-discovers new AMF, retries successfully. Total recovery
time 5.3 seconds vs permanent failure without the fix.

Signed-off-by: Vinod Patmanathan <vinod.patmanathan@forsway.com>


Successfully merging this pull request may close these issues.

Bug: PDU Session Establishment fails after AMF pod restart due to stale CommunicationClient
