fix: retry N1N2MessageTransfer with AMF re-discovery on stale endpoint #549
Draft
donivtech wants to merge 1 commit into
Conversation
After an AMF pod restart, SMF continues sending N1N2MessageTransfer to the old AMF pod IP, causing PDU Session Establishment to fail with T3580 expiry. This happens because:

1. CommunicationClient bakes in the AMF pod IP at session creation and is never refreshed.
2. CommunicationClient is not JSON-serializable, so it's nil after SMF recovers SMContext from MongoDB.
3. NRF accumulates stale NF registrations across pod restarts, and the NRF cache (15 min TTL) returns stale entries on re-discovery.
4. The HTTP client has no timeout, so a TCP connect to a dead pod IP hangs for 60 s+, longer than the T3580 window.

Fix:

- Add SendN1N2TransferWithRediscovery(), which wraps all N1N2 calls with a 5-second context timeout. On failure, it re-discovers the AMF by querying NRF directly (bypassing the cache), prefers an AMF with a different NfInstanceId than the failed one, and retries.
- Add RebuildCommunicationClient() on SMContext to reconstruct the HTTP client from the stored AMFProfile after DB recovery.
- Replace all 5 direct CommunicationClient.N1N2MessageTransfer call sites with the retry wrapper.

Verified on live cluster: after an AMF pod kill, the N1N2 transfer fails in 5 s, re-discovers the new AMF, and retries successfully. Total recovery time 5.3 seconds vs permanent failure without the fix.

Signed-off-by: Vinod Patmanathan <vinod.patmanathan@forsway.com>
Summary
Fixes #548.
After an AMF pod restart (rolling update, OOM, node drain, etc.), the SMF continues sending `N1N2MessageTransfer` to the dead pod IP. The UE never receives the PDU Session Establishment Accept, T3580 expires, and the only known recovery is to restart the SMF pod. See #548 for the full root-cause analysis.

This PR adds a defensive retry path that fails fast on a dead AMF endpoint, queries NRF directly (bypassing the cache), and tries every other AMF candidate until one responds. It also rebuilds the per-`SMContext` `CommunicationClient` after MongoDB recovery so post-SMF-restart sessions don't see a `nil` client.

What changed

- `consumer/nf_management.go` — new `SendN1N2TransferWithRediscovery(ctx, smContext, n1n2Request)`. Each attempt runs under `context.WithTimeout(ctx, 5*time.Second)` so a dead endpoint fails inside the T3580 window (~16 s) instead of the kernel TCP timeout (~60 s+). On failure it queries NRF directly and iterates over candidates whose `NfInstanceId` differs from the one that just failed; it succeeds on the first live AMF, or returns the last error if all are dead. (A sketch follows this list.)
- `context/sm_context.go` — new `(*SMContext).RebuildCommunicationClient()` that reconstructs the HTTP client from the stored `AMFProfile` (also sketched below). Called from `context/db.go` after loading an SMContext from MongoDB so a recovered context has a usable client.
- `producer/pdu_session.go`, `producer/callback.go`, `pfcp/handler/handler.go`, `pfcp/message/send.go` — the four direct call sites that previously did `smContext.CommunicationClient.N1N2MessageCollectionDocumentApi.N1N2MessageTransfer(...)` now go through the wrapper. The inline `Namf_Communication.NewAPIClient(...)` construction in `producer/pdu_session.go` is replaced by a call to `RebuildCommunicationClient()`.
- Diff stat: 7 files, +134 / −23.
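For concreteness, here is a minimal sketch of what the wrapper could look like. It assumes free5GC/SD-Core-style openapi clients; `discoverAMFsDirect` and `orderCandidates` are hypothetical helpers sketched further down, and fields like `NrfUri` and `AMFProfile.NfInstanceId` are assumptions, not necessarily the code in this PR.

```go
// Sketch only: helper names (discoverAMFsDirect, orderCandidates) and the
// SMContext fields used here are illustrative, not the PR's actual code.
// Assumed imports: "context", "errors", "fmt", "time", the openapi
// Namf_Communication and models packages, and the SMF context package
// aliased as smfctx.
func SendN1N2TransferWithRediscovery(
	ctx context.Context,
	smContext *smfctx.SMContext,
	req models.N1N2MessageTransferRequest,
) error {
	attempt := func(client *Namf_Communication.APIClient) error {
		// Fail fast: a dead pod IP must error inside the T3580 window
		// (~16 s) rather than hang for the kernel TCP timeout (~60 s+).
		attemptCtx, cancel := context.WithTimeout(ctx, 5*time.Second)
		defer cancel()
		_, _, err := client.N1N2MessageCollectionDocumentApi.
			N1N2MessageTransfer(attemptCtx, smContext.Supi, req)
		return err
	}

	// Happy path: the cached client succeeds; no NRF roundtrip, no retry.
	if smContext.CommunicationClient != nil {
		if err := attempt(smContext.CommunicationClient); err == nil {
			return nil
		}
	}

	// Confirmed failure (or nil client after DB recovery): query NRF
	// directly, bypassing the cache, and walk every candidate in turn.
	profiles, err := discoverAMFsDirect(ctx, smContext.NrfUri)
	if err != nil {
		return fmt.Errorf("AMF re-discovery failed: %w", err)
	}
	lastErr := errors.New("NRF returned no AMF candidates")
	for _, profile := range orderCandidates(profiles, smContext.AMFProfile.NfInstanceId) {
		smContext.AMFProfile = profile
		smContext.RebuildCommunicationClient()
		if lastErr = attempt(smContext.CommunicationClient); lastErr == nil {
			return nil
		}
	}
	return lastErr // every candidate was dead
}
```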
The happy path is unchanged — when the cached client succeeds (the common case) the wrapper returns immediately with no extra NRF roundtrip and no retry.
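A corresponding sketch of the DB-recovery helper, again assuming free5GC-style openapi client construction; the `sbiUriFromProfile` helper is hypothetical:

```go
// Sketch only (context/sm_context.go): CommunicationClient is not
// JSON-serializable, so it comes back nil when the SMContext is reloaded
// from MongoDB; rebuild it from the persisted AMF profile.
func (smContext *SMContext) RebuildCommunicationClient() {
	configuration := Namf_Communication.NewConfiguration()
	configuration.SetBasePath(sbiUriFromProfile(smContext.AMFProfile))
	smContext.CommunicationClient = Namf_Communication.NewAPIClient(configuration)
}
```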
Why iterate through every NRF candidate
Originally I picked one alternative AMF (the first one with a `NfInstanceId` different from the failed one) and retried once. That isn't enough when NRF holds multiple stale entries — observed live, NRF had three AMF profiles, two dead and one live, and the single-retry heuristic landed on a dead one and gave up. Iterating over every candidate handles arbitrary NRF pollution at the cost of `5s × N_dead` recovery time, which is still well within the T3580 retransmission window for any realistic count.
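As an illustration of that preference ordering, a candidate sort along these lines would try fresh `NfInstanceId`s first while keeping the failed identity as a last resort; `orderCandidates` is a hypothetical helper name, one plausible reading of "prefers a different NfInstanceId":

```go
// Sketch only: order NRF candidates so AMFs whose NfInstanceId differs from
// the failed instance are tried first; the failed identity (if NRF still
// returns it) goes to the back of the list as a last resort.
func orderCandidates(profiles []models.NfProfile, failedNfId string) []models.NfProfile {
	var preferred, fallback []models.NfProfile
	for _, p := range profiles {
		if p.NfInstanceId != failedNfId {
			preferred = append(preferred, p)
		} else {
			fallback = append(fallback, p)
		}
	}
	return append(preferred, fallback...)
}
```

Why bypass the NRF cache on re-discovery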
The SMF NRF cache (1-minute TTL, 15-minute eviction sweep) is keyed in part by `TargetNfInstanceId`. A targeted lookup by the old `ServingNfId` returns the stale cached entry. Because `rediscoverAMF` only runs after a confirmed failure, going straight to NRF is the right behaviour — we already know the cached value was wrong.
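A cache-bypassing lookup is just a plain NF-discovery query against NRF itself. A minimal sketch, assuming the free5GC-style `Nnrf_NFDiscovery` openapi client (import paths and the options type vary by fork):

```go
// Sketch only: query NRF directly for AMF instances instead of consulting
// the SMF's local discovery cache. After a confirmed send failure we already
// know the cached entry is stale, so the extra roundtrip is justified.
func discoverAMFsDirect(ctx context.Context, nrfUri string) ([]models.NfProfile, error) {
	cfg := Nnrf_NFDiscovery.NewConfiguration()
	cfg.SetBasePath(nrfUri)
	client := Nnrf_NFDiscovery.NewAPIClient(cfg)

	result, _, err := client.NFInstancesStoreApi.SearchNFInstances(
		ctx, models.NfType_AMF, models.NfType_SMF,
		&Nnrf_NFDiscovery.SearchNFInstancesParamOpts{})
	if err != nil {
		return nil, err
	}
	return result.NfInstances, nil
}
```

Verification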
A/B tested on two RKE2 clusters with UERANSIM gNB+UE:

- baseline: stock `rel-3.1.0` SMF; patched: this branch
- fault injection: `kubectl delete pod -l app=amf`, then re-establish the PDU session

On the patched SMF the retry path triggered as designed: the N1N2 transfer failed in 5 s, the SMF re-discovered the new AMF, and the retry succeeded, for a total recovery time of 5.3 seconds versus permanent failure on the baseline.
Scope and follow-ups
This is a defensive SMF-side fix. The underlying NRF stale-entry accumulation (no preStop deregistration, no heartbeat-based TTL) and the AMF-side reuse of stale `NfId`/`RegisterIPv4` from MongoDB on restart are tracked separately and need their own fixes. The change here works regardless of whether those land — and since pod restarts are routine in any K8s environment, hardening the SMF against stale endpoints seems valuable on its own.

I left three open questions in #548 for the maintainers (overall framing, single-PR vs split, test expectations). Happy to adjust based on your preference. If you'd like unit tests for `RebuildCommunicationClient` and the candidate-iteration logic, I can add them in a follow-up commit.

Test plan
- `go build ./...`
- `go vet ./...`
- `go test ./...` — all packages pass
- `gofmt -d` on changed files — clean
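For reference, the candidate-iteration unit test offered above could be as small as this sketch; it exercises the hypothetical `orderCandidates` helper from earlier and assumes the standard `testing` package plus the openapi `models` types:

```go
// Sketch only: asserts that a candidate with a fresh NfInstanceId is tried
// before the instance that just failed.
func TestOrderCandidatesPrefersFreshInstance(t *testing.T) {
	profiles := []models.NfProfile{
		{NfInstanceId: "amf-dead"},
		{NfInstanceId: "amf-live"},
	}
	ordered := orderCandidates(profiles, "amf-dead")
	if ordered[0].NfInstanceId != "amf-live" {
		t.Fatalf("expected the non-failed AMF first, got %q", ordered[0].NfInstanceId)
	}
}
```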