feat: stakeholder interview subagents (4-method post-simulation surveys) #643

Open
ChristianMoellmann wants to merge 26 commits into 666ghj:main from ChristianMoellmann:feat/interview-subagents
@ChristianMoellmann

Summary

New subsystem for interrogating simulated stakeholders after the OASIS simulation completes: four deterministic instrument runners + a cross-method synthesiser, exposed via a new Flask blueprint and a Vue Step4b with d3 visualisations.

The four interview subagents:

  • Longitudinal — 12-item Likert administered pre-OASIS (T0) and post-OASIS (T1) to measure opinion drift induced by simulated peer interaction
  • Diversity — 24-statement Q-sort + 6 multi-dim Likert axes → PCA + k-means → stakeholder typology
  • Delphi — 3 rounds (open → rate → revise with anonymised group medians) → convergence metrics
  • Scenario — 4 future scenarios × 4 dimensions (desirability, plausibility, group-impact, fairness) → polarity matrix

InterviewOrchestrator fans out subagents in parallel after COMPLETED; InterviewSynthesizer aggregates into a Markdown report + tidy CSV with an auto-emitted Limitations section. Auto-trigger on SimulationManager lifecycle hooks (READY → T0; COMPLETED → T1 + others).
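
The parallel fan-out described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `fan_out`, the result-dict shape, and the use of `ThreadPoolExecutor` are assumptions, with `max_workers` standing in for the `INTERVIEW_MAX_WORKERS` config key.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict


def fan_out(subagents: Dict[str, Callable[[], dict]], max_workers: int = 4) -> Dict[str, dict]:
    """Run each subagent callable in a worker thread (hypothetical helper).

    A failure in one subagent is recorded under its name and does not
    prevent the remaining subagents from finishing.
    """
    results: Dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fn): name for name, fn in subagents.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = {"status": "ok", "data": future.result()}
            except Exception as exc:  # isolate the failure and keep going
                results[name] = {"status": "failed", "error": repr(exc)}
    return results
```

Recording failures per subagent, rather than letting the first exception propagate, is one way to let a synthesiser report on whichever methods did complete.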

Design and plan documents

  • Spec: docs/superpowers/specs/2026-05-23-stakeholder-interview-subagents-design.md
  • Plan: docs/superpowers/plans/2026-05-23-stakeholder-interview-subagents.md (21 bite-sized TDD tasks)

Notable changes to existing files

  • ZepGraphMemoryUpdater gains add_text_episode(graph_id, text) for direct text writes (bypasses the AgentActivity/batch path)
  • OasisProfileGenerator now writes source_entity_uuid to both reddit_profiles.json and twitter_profiles.csv (additive column on Twitter CSV; non-breaking)
  • SimulationManager lifecycle hooks (register_on_ready, register_on_completed) are class-level so they survive across instances
  • SimulationRunner exposes _on_completed_callbacks for the runner→manager bridge
  • New deps: PyYAML, scikit-learn, scipy, numpy, pandas
  • New config keys: INTERVIEW_MAX_TOKENS_PER_RUN, INTERVIEW_MAX_WORKERS, INTERVIEW_DEFAULT_LANGUAGE, LLM_STUB_MODE, UPLOADS_DIR
  • LLM stub mode in LLMClient for deterministic CI runs covering all four subagents
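
To illustrate why class-level hooks survive across instances, here is a small sketch. `HookRegistry` and `notify_ready` are hypothetical stand-ins; only the names `register_on_ready` and `_on_ready_hooks` come from this PR, and their real signatures may differ.

```python
import logging
from typing import Callable, ClassVar, List

logger = logging.getLogger(__name__)


class HookRegistry:
    """Hypothetical stand-in for the hook storage on SimulationManager."""

    # Class attribute: one list shared by every instance of the class.
    _on_ready_hooks: ClassVar[List[Callable[[str], None]]] = []

    @classmethod
    def register_on_ready(cls, hook: Callable[[str], None]) -> None:
        cls._on_ready_hooks.append(hook)

    def notify_ready(self, simulation_id: str) -> None:
        # Iterate over a copy so a hook may register further hooks safely.
        for hook in list(type(self)._on_ready_hooks):
            try:
                hook(simulation_id)
            except Exception:
                logger.exception("on_ready hook failed for %s", simulation_id)
```

If the list were created in `__init__` instead, a hook registered at app startup would be invisible to a manager constructed later inside a request handler.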

Stats

  • 23 commits, 64 files, ~3,800 LoC
  • 55 backend tests (53 unit + 2 integration), all passing
  • Frontend npm run build clean
  • All instruments bilingual DE/EN (German default, since the seed corpus is German fisheries discourse)

Test plan

  • cd backend && uv run pytest -q → 55/55 passing
  • cd frontend && npm run build → clean
  • Real-LLM smoke run against anthropic/claude-haiku-4-5 via OpenRouter: 3 personas (Thünen-style fisheries scientist, German Bundesrat, ICES) on a real simulation produced distinctly differentiated in-character German Likert responses + ~80-word open comments. Cost ~$0.02, wall time 10s.
  • Scale-up to all 23 agents on a real simulation (estimated ~$1–2, ~15 min)
  • Review by domain expert of the German Likert items and the 4 scenarios (currently drafted, not derived from a validated instrument)

Known follow-ups (none blocking)

  • INTERVIEW_MAX_TOKENS_PER_RUN defined but enforcement not implemented (config-only)
  • §8 instrument-health plausibility flags from the spec not implemented in synthesiser
  • LLM transport-layer retries (network 502s); schema-retry exists, transport-retry does not
  • In-app nav link to Step4b — currently reachable only by direct URL /interview/:simulationId
  • Polling loop in Step4bInterviews.vue is unbounded
  • Wilcoxon signed-rank promised in spec §5.1 but not yet implemented (scipy imported, unused)
  • instruments_used.json written at orchestrator-level rather than per-run-id directory

How to try locally

git checkout feat/interview-subagents
cd backend && uv sync --python 3.12 && uv run pytest -q
# Integration run with the stub LLM (free, deterministic):
LLM_STUB_MODE=true uv run pytest -m integration

🤖 Generated with Claude Code

ChristianMoellmann and others added 26 commits May 23, 2026 10:53
Approved design for a four-subagent post-simulation interview system
(Longitudinal, Diversity, Delphi, Scenario) over MiroFish-simulated
German fisheries stakeholders, with cross-method synthesiser. Includes
architecture, instrument design, data flow, API surface, error handling,
validation, testing, and methodological caveats.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bite-sized TDD plan covering 21 tasks across 7 phases: setup → foundation
(models, YAML loader, LLM stub, base interviewer) → 4 subagents
(longitudinal, diversity Q-sort+PCA, Delphi 3-round, scenario) → storage
+ Zep writer → orchestrator + sim lifecycle hooks + synthesiser →
Flask /api/interview blueprint → end-to-end integration test → Vue Step4b
with d3 visualisations. Each task lists exact files, failing test code,
implementation code, run commands, and commit message.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…anguage, stub mode)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… hash freezing

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ting and schema retry

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…A/k-means typology

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nvergence metrics

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…olarity matrix

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nd latest pointer

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… per-agent + aggregate episodes

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-out, isolated failures

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…Manager

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…limitations section

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…vices to interview subsystem

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…c + CSV export

Add /api/interview blueprint with POST pre/post/rerun, GET status/results/synthesis/export.csv endpoints. Background tasks tracked by UUID in module-level dict. Add register_blueprints() helper to api/__init__.py and wire app factory through it. Add UPLOADS_DIR to Config with env-override default.
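
As an illustration of background tasks tracked by UUID in a module-level dict, a stdlib-only sketch might look like this. The names `_TASKS`, `start_task`, and `get_task`, the status strings, and the lock are assumptions for the example, not the blueprint's actual code.

```python
import threading
import uuid
from typing import Any, Callable, Dict, Optional

# Module-level registry: task id -> status record (hypothetical shape).
_TASKS: Dict[str, Dict[str, Any]] = {}
_TASKS_LOCK = threading.Lock()


def start_task(fn: Callable[[], Any]) -> str:
    """Run fn in a daemon thread and return a UUID the caller can poll."""
    task_id = str(uuid.uuid4())
    with _TASKS_LOCK:
        _TASKS[task_id] = {"status": "running", "result": None, "error": None}

    def _run() -> None:
        try:
            update = {"status": "done", "result": fn()}
        except Exception as exc:
            update = {"status": "failed", "error": repr(exc)}
        with _TASKS_LOCK:
            _TASKS[task_id].update(update)

    threading.Thread(target=_run, daemon=True).start()
    return task_id


def get_task(task_id: str) -> Optional[Dict[str, Any]]:
    """Return a copy of the task record, or None for an unknown id."""
    with _TASKS_LOCK:
        entry = _TASKS.get(task_id)
        return dict(entry) if entry is not None else None
```

A status endpoint could then return `get_task(task_id)` as JSON. Note that a registry like this lives in one process only, so it would not be shared across multiple server workers.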

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…for all 4 subagents

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lient, i18n keys

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…, Delphi, scenario polarity, synthesis

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ner→Manager on COMPLETED

- Add backend/app/services/interviews/lifecycle.py with install_hooks() that
  registers on_ready (pre-survey) and on_completed (post-survey + synthesis)
  daemon-thread callbacks on a SimulationManager.
- Add SimulationRunner.register_on_completed() / _fire_on_completed() so
  external callbacks can be notified when _monitor_simulation transitions to
  COMPLETED (both exit-code-0 path and simulation_end event path).
- Wire both in app/__init__.py: create singleton SimulationManager, install
  lifecycle hooks, and register its _notify_on_completed with SimulationRunner.
- Add test_lifecycle.py: verifies install_hooks registers one callable for each
  of ready and completed.
- All 40 unit tests + 2 integration tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…on runs (C1-C5)

Five tightly coupled fixes for bugs that were causing the interview subsystem to
silently degrade in production:

- C1+C2: `_build_orchestrator` now resolves `graph_id` from
  `SimulationManager().get_simulation(sim_id).graph_id` (the real persisted
  state) instead of a `graph_id.txt` that nothing in the codebase writes.
  `ZepGraphMemoryUpdater(graph_id=...)` is now called with the correct
  argument; the bare `try/except Exception` that was swallowing the
  TypeError is replaced with a narrow fallback that logs explicitly.
- C3: `SimulationManager._on_ready_hooks` / `_on_completed_hooks` are now
  class-level (mirroring `SimulationRunner._on_completed_callbacks`).
  Hooks registered at app startup now survive across the per-request
  `SimulationManager()` instances created by the Flask API, so the T0
  longitudinal auto-survey actually fires.
- C4: `ZepGraphMemoryUpdater` gains an explicit `add_text_episode(graph_id, text)`
  method for synchronous text writes. `InterviewZepWriter._emit` no longer
  silently falls back to a dict-shaped `add_activity` call that the real
  implementation rejects (its `add_activity` requires an `AgentActivity`
  dataclass).
- C5: `FileSystemPersonaProvider.agent_to_entity()` builds an
  `{agent_id: zep_entity_uuid}` map from the persisted profile files; the map
  is now passed to `ZepMemoryProvider` so `get_entity_with_context` is called
  with real Zep UUIDs instead of `str(agent_id)`. To make this work,
  `OasisProfileGenerator._save_reddit_json` and `_save_twitter_csv` now persist
  `source_entity_uuid` (Reddit JSON: optional field; Twitter CSV: appended
  column).
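
The C5 mapping could be sketched along the following lines. This is a guess at the shape, not the PR's implementation: the function takes file contents rather than paths, `id_key="user_id"` is a hypothetical name for the agent-id field, and "Reddit takes precedence" is read here as per-agent precedence.

```python
import csv
import io
import json
from typing import Dict, Optional


def agent_to_entity(
    reddit_json_text: Optional[str],
    twitter_csv_text: Optional[str],
    id_key: str = "user_id",  # assumed field name, not confirmed by the PR
) -> Dict[int, str]:
    """Build {agent_id: zep_entity_uuid} from persisted profile contents."""
    mapping: Dict[int, str] = {}
    # Read the Twitter CSV first so Reddit entries can override it below.
    if twitter_csv_text:
        for row in csv.DictReader(io.StringIO(twitter_csv_text)):
            entity = (row.get("source_entity_uuid") or "").strip()
            if entity:
                mapping[int(row[id_key])] = entity
    if reddit_json_text:
        for profile in json.loads(reddit_json_text):
            entity = profile.get("source_entity_uuid")  # optional field
            if entity:
                mapping[int(profile[id_key])] = entity
    return mapping
```

Profiles written before this change carry no `source_entity_uuid`, so they simply produce no entry in the map.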

Tests: 51 unit + 2 integration pass (was 40 + 2). New tests lock in each fix:
- `test_hooks_survive_across_instances` (C3)
- `test_build_orchestrator_reads_graph_id_from_state` (C1+C2+C5)
- `test_build_orchestrator_falls_back_when_state_missing` (C1+C2)
- `test_emit_uses_add_text_episode_with_graph_id`,
  `test_emit_raises_when_updater_lacks_add_text_episode`,
  `test_real_updater_exposes_add_text_episode` (C4)
- `test_agent_to_entity_from_reddit_json`,
  `test_agent_to_entity_empty_when_no_field`,
  `test_agent_to_entity_falls_back_to_twitter_csv`,
  `test_agent_to_entity_reddit_takes_precedence` (C5)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds SchemaValidationFailure exception carrying both retry attempts' raw
output, so audit.jsonl preserves what the model actually said when an
agent's response can't be coerced into the instrument schema. Lets us
diagnose persona-vs-format failures without re-running. Two new tests.
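
A minimal sketch of an exception that carries the raw output of every attempt is shown below. Only the name `SchemaValidationFailure` comes from this commit; the attributes, `to_audit_record`, and the `ask_with_schema_retry` helper are assumptions made for illustration.

```python
import json
from typing import Callable, List


class SchemaValidationFailure(Exception):
    """Raised when no attempt yields a schema-valid response (sketch)."""

    def __init__(self, agent_id: int, raw_outputs: List[str]):
        super().__init__(
            f"agent {agent_id}: no valid response after {len(raw_outputs)} attempts"
        )
        self.agent_id = agent_id
        self.raw_outputs = raw_outputs  # what the model actually said, per attempt

    def to_audit_record(self) -> dict:
        # JSON-serialisable record, e.g. one line of an audit.jsonl file.
        return {
            "event": "schema_validation_failure",
            "agent_id": self.agent_id,
            "raw_outputs": self.raw_outputs,
        }


def ask_with_schema_retry(
    agent_id: int,
    ask: Callable[[], str],
    validate: Callable[[dict], dict],
    attempts: int = 2,
) -> dict:
    """Call ask() up to `attempts` times; validate() should raise ValueError on bad data."""
    raw_outputs: List[str] = []
    for _ in range(attempts):
        raw = ask()
        raw_outputs.append(raw)
        try:
            return validate(json.loads(raw))
        except (ValueError, KeyError, TypeError):
            continue  # json.JSONDecodeError is a ValueError subclass
    raise SchemaValidationFailure(agent_id, raw_outputs)
```

Keeping the raw strings on the exception means the caller can log them once, at the point where the failure is handled.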

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Real LLMs (observed with anthropic/claude-haiku-4-5 on a 23-agent run)
sometimes return Likert values as JSON strings ('3' not 3). The 4 subagent
validators rejected this with isinstance(v, int), losing ~30% of agents at
N=23. Added a shared coerce_int helper in base.py that accepts ints and
numeric strings, rejects bools/floats/garbage, and is now used by:

- Longitudinal: response values 1-5
- Diversity: Q-sort placements -3..+3 and 6 Likert axes 1-7
- Delphi: R2 and R3 importance/plausibility 1-5
- Scenario: 4 dimensions 1-7

Validators now coerce in place so downstream code sees ints regardless of
the wire format. Added 8 tests (4 unit on coerce_int + 4 per-subagent
contract tests showing stringified values are accepted).
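
A helper with the behaviour described above might look like the sketch below. The signature is an assumption: the PR names `coerce_int` but does not show it, and the `lo`/`hi` range check is added here for illustration.

```python
import re

_INT_PATTERN = re.compile(r"[+-]?[0-9]+")


def coerce_int(value, lo: int, hi: int) -> int:
    """Accept ints and numeric strings; reject bools, floats, and garbage.

    Hypothetical signature: the lo/hi bounds are an illustrative addition.
    """
    if isinstance(value, bool):
        # bool is a subclass of int, so it must be excluded explicitly.
        raise ValueError(f"boolean is not a valid rating: {value!r}")
    if isinstance(value, int):
        result = value
    elif isinstance(value, str) and _INT_PATTERN.fullmatch(value.strip()):
        result = int(value.strip())
    else:
        raise ValueError(f"not an integer rating: {value!r}")
    if not lo <= result <= hi:
        raise ValueError(f"rating {result} outside [{lo}, {hi}]")
    return result
```

For example, `coerce_int("3", 1, 5)` yields `3`, while `coerce_int(True, 1, 5)`, `coerce_int(3.0, 1, 5)`, and `coerce_int("3.0", 1, 5)` all raise `ValueError`.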

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>