Skip to content

NRG: Decouple Raft transport layer#8181

Open
sciascid wants to merge 1 commit into
mainfrom
daniele/raft-transport
Open

NRG: Decouple Raft transport layer#8181
sciascid wants to merge 1 commit into
mainfrom
daniele/raft-transport

Conversation

@sciascid
Copy link
Copy Markdown
Contributor

The Raft implementation was tightly coupled to the server's internal client and send queue for Raft RPC communication. This made it difficult to test scenarios such as network partitions deterministically.

Introduce a raftTransport abstraction and keep the existing internal client/send queue behavior as the default transport. Tests can now inject a mock transport to simulate partitions and observe message delivery.

@sciascid sciascid requested a review from a team as a code owner May 18, 2026 09:39
@sciascid sciascid force-pushed the daniele/raft-transport branch from f3a15ab to 106a8cf Compare May 18, 2026 09:43
@sciascid
Copy link
Copy Markdown
Contributor Author

sciascid commented May 18, 2026

Verified that there are no additional per message allocations using this benchmark:

  func BenchmarkNRGPropose(b *testing.B) {
          c := createJetStreamClusterExplicit(b, "R3S", 3)
          defer c.shutdown()

          rg := c.createMemRaftGroup("BENCH", 3, newStateAdder)
          leader := rg.waitOnLeader()
          require_NotNil(b, leader)

          var buf [binary.MaxVarintLen64]byte
          n := binary.PutVarint(buf[:], 1)
          entry := buf[:n]
          node := leader.node()

          var m0, m1 runtime.MemStats
          b.ReportAllocs()
          runtime.ReadMemStats(&m0)
          b.ResetTimer()
          for i := 0; i < b.N; i++ {
                  require_NoError(b, node.Propose(entry))
          }
          b.StopTimer()
          runtime.ReadMemStats(&m1)
          b.ReportMetric(float64(m1.Mallocs-m0.Mallocs), "allocs_total")
  }

Without refactoring:
BenchmarkNRGPropose-16 1000000 309.3 ns/op 3718936 allocs_total 336 B/op 3 allocs/op

With refactoring:
BenchmarkNRGPropose-16 1000000 310.2 ns/op 3735642 allocs_total 337 B/op 3 allocs/op

The benchmark is not included in the PR, maybe not worth keeping.

Copy link
Copy Markdown
Member

@MauriceVanVeen MauriceVanVeen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Member

@MauriceVanVeen MauriceVanVeen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be a nil deref

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0xfc62be]

goroutine 92605 [running]:
github.com/nats-io/nats-server/v2/server.(*raft).GetTrafficAccountName(0xc003a43c08)
	/home/runner/work/nats-server/nats-server/server/raft.go:667 +0x9e

@sciascid sciascid force-pushed the daniele/raft-transport branch from 106a8cf to 627ea0e Compare May 18, 2026 11:45
@sciascid
Copy link
Copy Markdown
Contributor Author

There seems to be a nil deref

Fixed.

@sciascid
Copy link
Copy Markdown
Contributor Author

@claude review

@synadia-claude-reviewer
Copy link
Copy Markdown

synadia-claude-reviewer Bot commented May 18, 2026

Claude finished @sciascid's task in 3m 10s —— View job


Code Review

  • Gather context
  • Read changed files
  • Review code and provide feedback

Overall this is a clean, well-structured refactor. The raftTransport interface is minimal and sensible, the mock implementation is clever (reusing Sublist for subscription tracking), and the benchmark confirms there are no allocation regressions. A few things worth discussing:


1. n.snapshotting = false removed from Reset() (potential bug)

server/raft.go:2258 — The removal of n.snapshotting = false from Reset() (and the corresponding test assertion) is unrelated to the transport refactor. If Reset() is called while an async snapshot checkpoint is in flight, the flag stays true. The checkpoint goroutine holds a reference to the node via c.n and continues running. It will check c.n.snapshotting (lines 1465, 1485, 1520) — and since the flag was never cleared, it will proceed to call c.InstallSnapshot() on an already-reset node where n.applied=0. Additionally, any subsequent call to createSnapshotCheckpointLocked will return errSnapInProgress indefinitely, blocking all future snapshots.

Was this removal intentional? The original comment (// Abort any inflight async snapshot checkpoint.) makes its purpose clear, and removing it without a compelling reason seems risky.

Fix this →


2. Hub lock held during subscription callbacks (latent deadlock risk)

server/raft_transport_helpers_test.go:84raftTransportHub.publish holds h.mu.Lock() while calling sub.icb(...). Currently safe because all Raft handlers (e.g. handleVoteRequest, handleAppendEntry) just push to queues and return. But if any callback ever calls t.hub.publish (reentrant), it would deadlock since sync.Mutex is non-reentrant. Worth adding a comment documenting this invariant — the lock must be released before any code that could re-enter publish.

// NOTE: sub.icb must not call publish (directly or indirectly), as the
// hub lock is held here and sync.Mutex is not reentrant.
sub.icb(sub, nil, transport.acc, subject, reply, msg)

3. defaultTransport.Publish panics if called before Reset()

server/raft_transport.go:97 — Before this PR, sendRPC/sendReply had nil guards on sq. Now defaultTransport.Publish calls t.sq.send(...) directly; if Publish is ever called before Reset() initializes t.sq, this panics rather than silently dropping the message. In practice recreateInternalSubsLockedReset() is called early, so this is unlikely to matter — but it's a subtle behavioral change from fail-silent to panic. If this is intentional (fail-fast is better), it'd be worth a comment in Publish noting the precondition.


4. Nit: indirect nil return in GetTrafficAccountName

server/raft.go:668return (*Account)(nil).GetName() is technically correct since GetName() is nil-safe and returns "n/a", but it's an unusual idiom. A direct return "n/a" (or _EMPTY_) would be clearer and avoids depending on the nil-receiver behavior of GetName.


Minor observations

  • The Transport field on RaftConfig being a raftTransportFactory (not an raftTransport directly) is a good design choice — it allows each startRaftNode call to create an independent transport instance.
  • mockTransport.Subscribe correctly returns the exact pointer so that Unsubscribe can call t.sub.Remove(sub) with the same pointer — no issue there.
  • The setAfterMsgHook doc comment correctly explains that the hook is called after the hub lock is released, and that the sending node remains locked. The test using atomic.Int64 for the hook counter is the right call given this constraint.
  • TestNRGPartitionedPeerRemove is a solid addition — a concrete, deterministic test for a previously hard-to-test scenario.

@sciascid sciascid force-pushed the daniele/raft-transport branch from 627ea0e to beae436 Compare May 18, 2026 12:15
The Raft implementation was tightly coupled to the server's internal
client and send queue for Raft RPC communication. This made it difficult
to test scenarios such as network partitions deterministically.

Introduce a raftTransport abstraction and keep the existing internal
client/send queue behavior as the default transport.
Tests can now inject a mock transport to simulate partitions and
observe message delivery.

Signed-off-by: Daniele Sciascia <daniele@nats.io>
@sciascid sciascid force-pushed the daniele/raft-transport branch from beae436 to 1f992f5 Compare May 18, 2026 12:29
@sciascid
Copy link
Copy Markdown
Contributor Author

@claude

  1. huh?! this PR made no changes raft.Reset. The branch was not fully up-to-date with main... maybe that was confusing. Rebased on top of latest main.
  2. This is documented in the comment for setAfterMsgHook. The is no real risk here, this is testing code only. The reason for keeping the lock during callback is that it allows us to freeze raft exactly after a given message, that actually helps writing deterministic tests.
  3. Fixed.
  4. This is to make sure we return the same value returned by GetName().

@synadia-claude-reviewer
Copy link
Copy Markdown

synadia-claude-reviewer Bot commented May 18, 2026

Claude finished @sciascid's task in 1m 20s —— View job


Response to feedback

  • Gather context
  • Verify point 1: Reset() snapshotting flag — confirmed present in rebased branch
  • Verify point 2: Hub lock during callbacks — intentional, well-documented
  • Verify point 3: defaultTransport.Publish nil guard — fixed
  • Verify point 4: GetTrafficAccountName nil return — reasoning confirmed valid

Thanks for the clarifications and the rebase. All four points check out:

1. n.snapshotting = false in Reset()
Confirmed — n.snapshotting = false is present at server/raft.go:2262 with the original comment // Abort any inflight async snapshot checkpoint.. The previous review was working off a stale pre-rebase diff. My concern was a false positive.

2. Hub lock held during subscription callbacks
The code is clear on this: the hub lock is released (h.mu.Unlock() at line 104) before the afterMsgHook is called, and the setAfterMsgHook doc comment explicitly documents that the hub is unlocked during the hook. The lock is held only during the sub.icb dispatch to allow deterministic test freezes. This is testing code with a well-understood, documented invariant. No issue.

3. defaultTransport.Publish nil guard
Fixed — server/raft_transport.go:97-100 now has if t.sq == nil { return }, restoring the fail-silent behavior for the pre-Reset() case.

4. (*Account)(nil).GetName() in GetTrafficAccountName
The intent is clear: delegate to GetName()'s nil-receiver behavior rather than hardcode the sentinel string, so this function stays consistent if GetName() ever changes. Valid reasoning.

The PR looks good.

@sciascid
Copy link
Copy Markdown
Contributor Author

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1f992f5cf8

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread server/raft.go
if factory == nil {
factory = defaultRaftTransport
}
n.t = factory(s, n)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reject nil raft transport from factory

After assigning n.t = factory(s, n), initialization proceeds into RecreateInternalSubs, which calls methods on n.t (for example Account, Reset, and later Publish) without a nil guard. If a custom RaftConfig.Transport factory returns nil (a realistic mistake when injecting test transports), startup will panic instead of returning a regular init error, making failures harder to diagnose and recover from.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Providing a custom factory is for testing only, so there's not much risk in that. Should a factory method returnsnil, it means that it needs to be fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants