Race condition in kanban _migrate_add_optional_columns on gateway startup #21374

@omar-elmountassir

Bug

On gateway startup, the kanban dispatcher crashes with:

sqlite3.OperationalError: duplicate column name: consecutive_failures

Root Cause

Two async tasks are created concurrently in the gateway (gateway/run.py):

  • Line 3335: asyncio.create_task(self._kanban_notifier_watcher())
  • Line 3341: asyncio.create_task(self._kanban_dispatcher_watcher())

Both watchers call _kb.connect(board=slug) → _migrate_add_optional_columns(conn) via asyncio.to_thread().
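To illustrate why two asyncio tasks can end up inside the same blocking code at the same time, here is a minimal sketch. It is not the gateway's code: first_tick, main, and run_demo are hypothetical names, and the barrier only stands in for the blocking connect and migration call.

```python
import asyncio
import threading


def first_tick(barrier, seen):
    # Hypothetical stand-in for the blocking connect + migration call.
    seen.append(threading.get_ident())
    # The barrier only releases when both calls are in flight at once,
    # so passing it shows the two calls really overlap.
    barrier.wait()


async def main():
    barrier = threading.Barrier(2, timeout=5)
    seen = []
    # Same shape as the two create_task() calls described above.
    notifier = asyncio.create_task(asyncio.to_thread(first_tick, barrier, seen))
    dispatcher = asyncio.create_task(asyncio.to_thread(first_tick, barrier, seen))
    await asyncio.gather(notifier, dispatcher)
    return seen


def run_demo():
    # Returns the worker thread ids that executed first_tick.
    return asyncio.run(main())
```

Calling run_demo() returns two different thread ids, which is the precondition for the race described below: asyncio.to_thread() hands each call to a worker thread in the default executor, so module-level state shared between them is no longer protected by the event loop's single-threaded execution.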

The _INITIALIZED_PATHS set (module-level, kanban_db.py:~917) is used as a cache to skip re-initialization, but it is not thread-safe. When both threads race on the first tick:

  1. Thread A checks needs_init = resolved not in _INITIALIZED_PATHS → True
  2. Thread B checks needs_init = resolved not in _INITIALIZED_PATHS → True (set not yet updated by A)
  3. Both threads run _migrate_add_optional_columns()
  4. Both read cols via PRAGMA table_info(tasks) — neither sees consecutive_failures yet
  5. Thread A succeeds with ALTER TABLE tasks ADD COLUMN consecutive_failures ...
  6. Thread B crashes with duplicate column name: consecutive_failures
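The interleaving in steps 4 to 6 can be replayed deterministically with two SQLite connections in a single thread. This is a sketch under assumptions: the tasks table layout, the column definition, and the helper names (column_names, replay_race) are made up for illustration and are not taken from kanban_db.py.

```python
import os
import sqlite3
import tempfile

# Assumed column definition; the real DDL in kanban_db.py may differ.
ADD_COLUMN = (
    "ALTER TABLE tasks ADD COLUMN consecutive_failures INTEGER NOT NULL DEFAULT 0"
)


def column_names(conn):
    # The column name is the second field of each PRAGMA table_info row.
    return {row[1] for row in conn.execute("PRAGMA table_info(tasks)")}


def replay_race():
    """Replay steps 4-6 with two connections; return B's error text, or None."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "kanban.db")
        conn_a = sqlite3.connect(path)  # plays "Thread A"
        conn_b = sqlite3.connect(path)  # plays "Thread B"
        try:
            conn_a.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY)")
            conn_a.commit()

            # Step 4: both sides inspect the schema before either alters it.
            a_needs_column = "consecutive_failures" not in column_names(conn_a)
            b_needs_column = "consecutive_failures" not in column_names(conn_b)
            assert a_needs_column and b_needs_column

            # Step 5: A adds the column successfully.
            conn_a.execute(ADD_COLUMN)
            conn_a.commit()

            # Step 6: B acts on its stale check and fails.
            try:
                conn_b.execute(ADD_COLUMN)
            except sqlite3.OperationalError as exc:
                return str(exc)
            return None
        finally:
            conn_a.close()
            conn_b.close()
```

replay_race() should return the same duplicate column name message quoted at the top of this report, because B's decision to run ALTER TABLE is based on a schema read that A has since invalidated.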

The error is caught by the outer exception handler (gateway/run.py:3889), so the gateway keeps running, but that kanban dispatcher tick is lost.

Reproduction

Start the gateway with a fresh or existing kanban.db that already has consecutive_failures in the schema (i.e., after a previous successful migration). The race window is tight but triggers reliably on startup when both watchers hit their first tick close together.

Environment

  • Hermes v0.6+ (was 323 commits behind; since updated to latest main, bbff2f6)
  • Python 3.14, SQLite 3.x
  • Linux (NixOS)

Suggested Fix

Either:

  1. Quick fix: Wrap each ALTER TABLE in _migrate_add_optional_columns with try/except sqlite3.OperationalError catching only duplicate column errors. Other errors still propagate.
  2. Proper fix: Use a threading.Lock around the needs_init check + migration block in connect(), or use CREATE TABLE IF NOT EXISTS style guards.
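A rough sketch of how the two options could look together, assuming a simplified connect() that takes a file path. The names _INIT_LOCK and add_column_if_missing, the table layout, and the column definition are hypothetical; only _INITIALIZED_PATHS mirrors a name from the actual module.

```python
import sqlite3
import threading

_INIT_LOCK = threading.Lock()   # hypothetical name for the proposed lock
_INITIALIZED_PATHS = set()      # mirrors the module-level cache described above


def add_column_if_missing(conn, table, column, ddl):
    """Option 1: tolerate losing the race on ALTER TABLE."""
    try:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {ddl}")
    except sqlite3.OperationalError as exc:
        # Swallow only the duplicate-column case; re-raise everything else.
        if "duplicate column name" not in str(exc).lower():
            raise


def connect(path):
    """Option 2: make the needs_init check and the migration one atomic step."""
    conn = sqlite3.connect(path)
    with _INIT_LOCK:
        if path not in _INITIALIZED_PATHS:
            # Assumed minimal schema, for illustration only.
            conn.execute("CREATE TABLE IF NOT EXISTS tasks (id INTEGER PRIMARY KEY)")
            add_column_if_missing(
                conn, "tasks", "consecutive_failures", "INTEGER NOT NULL DEFAULT 0"
            )
            conn.commit()
            _INITIALIZED_PATHS.add(path)
    return conn
```

The two options are complementary rather than exclusive. The lock only serializes threads within one process, so if a second process can open the same kanban.db, the tolerant ALTER TABLE from option 1 is still what keeps the migration idempotent.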

I can submit a PR for option 1 or 2 if desired. Thanks!

Labels: P3 (Low: cosmetic, nice to have), comp/plugins (Plugin system and bundled plugins), type/bug (Something isn't working)
