Skip to content

Conversation

@IronJam11
Copy link

What was wrong?

Issue #1118

When using async topic validators in GossipSub, the pubsub service crashes with KeyError in mcache.shift().

Root cause: Race condition in MessageCache.put()

  1. Two async validators process the same message concurrently
  2. Both call mcache.put(msg) after validation completes
  3. put() has two operations:
    • self.msgs[mid] = msg - overwrites if exists (dict count stays 1)
    • self.history[0].append(...) - always appends (list count becomes 2)
  4. Result: 1 entry in msgs, but 2 entries in history for the same message ID
  5. When shift() rotates the history window, it tries to pop() the same mid twice
  6. Second pop raises KeyError because message was already removed

How was it fixed?

1. Duplicate detection in put()

Added early return if message ID already exists:

def put(self, msg: rpc_pb2.Message) -> None:
    mid: tuple[bytes, bytes] = (msg.seqno, msg.from_id)
    if mid in self.msgs:
        return  # Already cached, don't create duplicate history entry
    self.msgs[mid] = msg
    self.history[0].append(CacheEntry(mid, msg.topicIDs))

2. Defensive pop in shift()

Changed from pop(mid) to pop(mid, None) for additional safety:

self.msgs.pop(entry.mid, None)  # Won't raise KeyError if missing

Why this works in Trio: Since put() is synchronous (no await points), each call runs atomically. First call adds to both msgs and history, second call finds mid in self.msgs and returns early. No duplicate history entries can be created.

Files changed:

  • libp2p/pubsub/mcache.py - Added duplicate check + defensive pop
  • tests/core/pubsub/test_mcache_race_condition.py - Tests that fail before fix, pass after

Impact:

  • ✅ Fixes production crashes with async validators
  • ✅ Minimal performance impact (single dict lookup per put())
  • ✅ Backwards compatible (no API changes)
  • ✅ No side effects (messages cached exactly once)

To-Do

  • Clean up commit history
  • Add or update documentation related to these changes
  • Add entry to the release notes

Cute Animal Picture

capybara

Before fix:
Screenshot 2026-01-25 at 4 04 43 PM

After fix:
Screenshot 2026-01-25 at 4 06 07 PM

@IronJam11
Copy link
Author

Also please note: they might be pyrefly errors that are pre-existing in the repo

@seetadev
Copy link
Contributor

@acul71 , @sumanjeet0012, @Winter-Soren : CI/CD failure in this issue is not related to this PR. Please visit the logs: =========================== short test summary info ============================
FAILED tests/core/pubsub/test_dummyaccount_demo.py::test_set_then_send_from_five_diff_nodes_five_nodes_ring_topography
====== 1 failed, 1828 passed, 11 skipped, 3 warnings in 618.21s (0:10:18) ======
py310-core: exit 1 (620.13 seconds) /home/runner/work/py-libp2p/py-libp2p> pytest -n auto --timeout=1200 tests/core pid=2640
py310-core: FAIL code 1 (642.65=setup[19.32]+cmd[0.10,3.10,620.13] seconds)
evaluation failed :( (644.47 seconds)
Error: Process completed with exit code 1

Wish if you could open a new PR for arriving at a good conclusion on this issue.

@yashksaini-coder
Copy link
Contributor

@acul71 , @sumanjeet0012, @Winter-Soren : CI/CD failure in this issue is not related to this PR. Please visit the logs: =========================== short test summary info ============================
FAILED tests/core/pubsub/test_dummyaccount_demo.py::test_set_then_send_from_five_diff_nodes_five_nodes_ring_topography
====== 1 failed, 1828 passed, 11 skipped, 3 warnings in 618.21s (0:10:18) ======
py310-core: exit 1 (620.13 seconds) /home/runner/work/py-libp2p/py-libp2p> pytest -n auto --timeout=1200 tests/core pid=2640
py310-core: FAIL code 1 (642.65=setup[19.32]+cmd[0.10,3.10,620.13] seconds)
evaluation failed :( (644.47 seconds)
Error: Process completed with exit code 1

Wish if you could open a new PR for arriving at a good conclusion on this issue.

#1164

Facing similar issue in one of my PRs, whoever is working on providing the fix please also verify the ci failure address in the issue #1164

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants