
fix: retry N1N2MessageTransfer with AMF re-discovery on stale endpoint#549

Draft
donivtech wants to merge 1 commit into omec-project:main from donivtech:fix/stale-amf-client-rel310

Conversation

@donivtech

Summary

Fixes #548.

After an AMF pod restart (rolling update, OOM, node drain, etc.), the SMF continues sending N1N2MessageTransfer to the dead pod IP. The UE never receives the PDU Session Establishment Accept, T3580 expires, and the only known recovery is to restart the SMF pod. See #548 for the full root-cause analysis.

This PR adds a defensive retry path that fails fast on a dead AMF endpoint, queries NRF directly (bypassing the cache), and tries every other AMF candidate until one responds. It also rebuilds the per-SMContext CommunicationClient after MongoDB recovery so post-SMF-restart sessions don't see a nil client.

What changed

consumer/nf_management.go — new SendN1N2TransferWithRediscovery(ctx, smContext, n1n2Request):

  • First attempt with context.WithTimeout(ctx, 5*time.Second) so a dead endpoint fails inside the T3580 window (~16s) instead of the kernel TCP timeout (~60s+).
  • On failure, fetches all AMF candidates from NRF directly (skipping the cache, which would return the same stale entry).
  • Iterates through candidates whose NfInstanceId differs from the one that just failed; succeeds on the first live AMF, or returns the last error if all are dead.

context/sm_context.go — new (*SMContext).RebuildCommunicationClient() that reconstructs the HTTP client from the stored AMFProfile. Called from context/db.go after loading an SMContext from MongoDB so a recovered context has a usable client.

producer/pdu_session.go, producer/callback.go, pfcp/handler/handler.go, pfcp/message/send.go — the four direct call sites that previously did smContext.CommunicationClient.N1N2MessageCollectionDocumentApi.N1N2MessageTransfer(...) now go through the wrapper. Inline Namf_Communication.NewAPIClient(...) construction in producer/pdu_session.go is replaced by a call to RebuildCommunicationClient().

Diff stat: 7 files, +134 / −23.

The happy path is unchanged — when the cached client succeeds (the common case) the wrapper returns immediately with no extra NRF roundtrip and no retry.

Why iterate through every NRF candidate

Originally I picked one alternative AMF (the first one with a NfInstanceId different from the failed one) and retried once. That isn't enough when NRF holds multiple stale entries — observed live, NRF had three AMF profiles, two dead and one live, and the single-retry heuristic landed on a dead one and gave up. Iterating every candidate handles arbitrary NRF pollution at the cost of 5s × N_dead recovery time, which is still well within the T3580 retransmission window for any realistic count.

Why bypass the NRF cache on re-discovery

The SMF NRF cache (1-minute TTL, 15-minute eviction sweep) is keyed in part by TargetNfInstanceId. A targeted lookup by the old ServingNfId returns the stale cached entry. Because rediscoverAMF only runs after a confirmed failure, going straight to NRF is the right behaviour — we already know the cached value was wrong.

Verification

A/B tested on two RKE2 clusters with UERANSIM gNB+UE:

| Scenario | Stock rel-3.1.0 SMF | This PR |
| --- | --- | --- |
| Baseline PDU session | success | success (unchanged) |
| `kubectl delete pod -l app=amf`, then re-establish | T3580 expires 5×, procedure failure | Accept on first or second attempt, ~5–11s |
| Multiple stale NRF entries | permanent failure | iterates through dead entries, succeeds on the live AMF |

Captured SMF log when retry triggers:

N1N2 transfer initiated → old AMF 10.x.y.OLD (dead)
[5s timeout] N1N2Transfer failed (... i/o timeout), attempting AMF re-discovery
AMF re-discovery: querying NRF directly (bypassing cache)
AMF re-discovery retry 1: trying NfInstanceId 8e2dfca4-... → succeeded
N1N2 Transfer completed

Scope and follow-ups

This is a defensive SMF-side fix. The underlying NRF stale-entry accumulation (no preStop deregistration, no heartbeat-based TTL) and the AMF-side reuse of stale NfId/RegisterIPv4 from MongoDB on restart are tracked separately and need their own fixes. The change here works regardless of whether those land — and since pod restarts are routine in any K8s environment, hardening the SMF against stale endpoints seems valuable on its own.

I left three open questions in #548 for the maintainers (overall framing, single-PR vs split, test expectations). Happy to adjust based on your preference. If you'd like unit tests for RebuildCommunicationClient and the candidate-iteration logic, I can add them in a follow-up commit.

Test plan

  • go build ./...
  • go vet ./...
  • go test ./... — all packages pass
  • gofmt -d on changed files — clean
  • Live A/B test on two RKE2 clusters (UERANSIM)
  • Live verification on a third cluster after image deploy

Commit message:

After an AMF pod restart, SMF continues sending N1N2MessageTransfer
to the old AMF pod IP, causing PDU Session Establishment to fail
with T3580 expiry. This happens because:

1. CommunicationClient bakes in the AMF pod IP at session creation
   and is never refreshed.
2. CommunicationClient is not JSON-serializable, so it's nil after
   SMF recovers SMContext from MongoDB.
3. NRF accumulates stale NF registrations across pod restarts and
   the NRF cache (15min TTL) returns stale entries on re-discovery.
4. The HTTP client has no timeout, so TCP connect to a dead pod IP
   hangs for 60s+ — longer than the T3580 window.

Fix:

- Add SendN1N2TransferWithRediscovery() that wraps all N1N2 calls
  with a 5-second context timeout. On failure, it re-discovers the
  AMF by querying NRF directly (bypassing cache), prefers an AMF
  with a different NfInstanceId than the failed one, and retries.

- Add RebuildCommunicationClient() on SMContext to reconstruct the
  HTTP client from the stored AMFProfile after DB recovery.

- Replace all 5 direct CommunicationClient.N1N2MessageTransfer call
  sites with the retry wrapper.

Verified on live cluster: after AMF pod kill, N1N2 transfer fails
in 5s, re-discovers new AMF, retries successfully. Total recovery
time 5.3 seconds vs permanent failure without the fix.

Signed-off-by: Vinod Patmanathan <vinod.patmanathan@forsway.com>


Successfully merging this pull request may close these issues.

Bug: PDU Session Establishment fails after AMF pod restart due to stale CommunicationClient
