feat(server): DM safety rails for proactive mentor sends #1259

@FelixTJDietrich

Part of #1204.

What ships

Four stacked safety rails enforced server-side before any proactive mentor DM leaves the application:

  1. Per-user min-interval. At least 24 hours must pass between any two proactive DMs to the same user. The floor is relaxed to 18 hours for DAILY cadence users, so the daily cadence can fire on consecutive weekdays without tripping it. The interval is measured against the last proactive DM, not the last mentor turn; on-demand mentor (slash command, App Home CTA) is excluded.
  2. Per-workspace daily DM cap. Default 50 proactive DMs per workspace per UTC day across mentor + reflection + practice-finding delivery. The cap is the shared envelope that #1260 (feat(server): shared notification budget across mentor and finding delivery) divides across surfaces; this sub-issue enforces it. Workspace-admin override is allowed but logged in the integration audit log (#1218, feat(server): append-only integration audit log).
  3. Global kill switch. Spring property hephaestus.mentor.slack.dm.enabled. Off → every proactive DM is rejected with a structured INFO log carrying the rejection reason. On-demand mentor (slash command, App Home CTA) honors the same flag — the entire Slack DM mentor surface is one switch.
  4. Dry-run mode. Per-workspace boolean. When on, all proactive DMs route to the workspace admin's DM with a "would have been sent to " envelope instead of the user. Dry-run does not consume the daily cap and does not reset the per-user interval clock; it is for pilot tuning.

A DmSafetyRailsGuard Spring bean is the single checkpoint. Every proactive DM dispatch flows through guard.check(workspaceId, userId, surface) -> Decision, where Decision is ALLOW / DENY(reason) / DRY_RUN_DELIVER(adminUserId). The shared budget in #1260 calls the guard; the scheduler in #1258 does not call it directly.
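The decision flow described above can be sketched in plain Java. This is a minimal illustration, not the actual bean: the `State` snapshot record, the class name `DmGuardSketch`, and the evaluation order (kill switch, then dry-run, then min-interval, then daily cap) are assumptions made for the example. The reason names and the 24 h / 18 h floors come from this issue.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

final class DmGuardSketch {
    // Rejection reasons named in the acceptance criteria.
    enum Reason { KILL_SWITCH, MIN_INTERVAL, WORKSPACE_DAILY_CAP }

    // ALLOW / DENY(reason) / DRY_RUN_DELIVER(adminUserId)
    sealed interface Decision {}
    record Allow() implements Decision {}
    record Deny(Reason reason) implements Decision {}
    record DryRunDeliver(String adminUserId) implements Decision {}

    // Hypothetical snapshot of what the guard would read for one (workspace, user) pair.
    record State(boolean dmEnabled,
                 Optional<String> dryRunAdminId,     // present when dry-run is on
                 boolean dailyCadence,               // user is on the DAILY cadence
                 Optional<Instant> lastProactiveDm,  // last allowed proactive DM
                 int sentTodayInWorkspace,
                 int dailyCap) {}

    static Decision check(State s, Instant now) {
        // Rail 3: the global kill switch wins over everything else.
        if (!s.dmEnabled()) {
            return new Deny(Reason.KILL_SWITCH);
        }
        // Rail 4 (assumed position): dry-run reroutes to the admin without touching cap or interval.
        if (s.dryRunAdminId().isPresent()) {
            return new DryRunDeliver(s.dryRunAdminId().get());
        }
        // Rail 1: 24 h floor, relaxed to 18 h for DAILY cadence users.
        Duration floor = s.dailyCadence() ? Duration.ofHours(18) : Duration.ofHours(24);
        if (s.lastProactiveDm().isPresent()
                && Duration.between(s.lastProactiveDm().get(), now).compareTo(floor) < 0) {
            return new Deny(Reason.MIN_INTERVAL);
        }
        // Rail 2: per-workspace daily cap.
        if (s.sentTodayInWorkspace() >= s.dailyCap()) {
            return new Deny(Reason.WORKSPACE_DAILY_CAP);
        }
        return new Allow();
    }
}
```

Keeping `check` a pure function of a state snapshot and a clock value makes each rail easy to unit-test in isolation; a real bean would load that state from the database and configuration instead.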

Why

Slack-channel reputation is a one-strike resource. A runaway loop, a misconfigured cadence, or a bug in the practice-finding delivery path could flood a workspace and turn Hephaestus into the bot admins mute. Stacking four independent rails means no single bug or misconfiguration can produce a flood; the kill switch is the operational lever; dry-run is the pilot lever; the cap is the steady-state ceiling; the interval is the per-user dignity floor.

Acceptance criteria

  • DmSafetyRailsGuard bean exists; the only call path for proactive DM dispatch (mentor + reflection + practice-finding delivery in feat(server): formative-feedback DM flow with provenance line #1269) goes through guard.check(...)
  • An ArchUnit test asserts no proactive-DM code path bypasses the guard (mentor + reflection + practice-finding delivery packages are checked)
  • Per-user min-interval is enforced; an integration test fires two DMs 23 h apart and asserts the second is denied with MIN_INTERVAL reason; a DAILY-cadence user with 19 h gap is allowed
  • Per-workspace daily cap is enforced; an integration test fires 51 DMs in one UTC day to distinct users in the same workspace (so the per-user min-interval does not fire first) and asserts the 51st is denied with WORKSPACE_DAILY_CAP reason
  • The kill switch (hephaestus.mentor.slack.dm.enabled) at off-state rejects all proactive DMs with KILL_SWITCH reason; on-demand mentor (slash command, App Home CTA) returns "Mentor temporarily unavailable"; an integration test toggles the property and asserts both paths
  • Dry-run mode routes DMs to the workspace admin's DM with the "would have been sent to " envelope; an integration test asserts the toggle does not consume the daily cap or the per-user interval and that the admin gets exactly one DM per would-be send
  • Workspace-admin overrides to the daily cap write a row to the integration audit log (feat(server): append-only integration audit log #1218); the row carries the previous + new value, the admin user id, and the timestamp
  • Per-decision metrics are exposed at /actuator/integrations (feat(server): per-integration health endpoint and structured-log MDC #1217): allow, deny_min_interval, deny_daily_cap, deny_kill_switch, dry_run_deliver
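The five per-decision metric names in the last criterion map one-to-one onto guard outcomes. The sketch below shows that mapping with plain in-memory counters; the class name `DmDecisionMetricsSketch` and the `Outcome` enum are hypothetical, and a real implementation would presumably publish through whatever metrics registry backs /actuator/integrations rather than a local map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

final class DmDecisionMetricsSketch {
    // One outcome per metric name listed in the acceptance criteria.
    enum Outcome {
        ALLOW("allow"),
        DENY_MIN_INTERVAL("deny_min_interval"),
        DENY_DAILY_CAP("deny_daily_cap"),
        DENY_KILL_SWITCH("deny_kill_switch"),
        DRY_RUN_DELIVER("dry_run_deliver");

        final String metricName;

        Outcome(String metricName) {
            this.metricName = metricName;
        }
    }

    // Thread-safe counters keyed by metric name.
    private final Map<String, LongAdder> counters = new ConcurrentHashMap<>();

    // Called once per guard decision.
    void record(Outcome outcome) {
        counters.computeIfAbsent(outcome.metricName, k -> new LongAdder()).increment();
    }

    // Current count for one outcome; 0 if it has never been recorded.
    long count(Outcome outcome) {
        LongAdder adder = counters.get(outcome.metricName);
        return adder == null ? 0L : adder.sum();
    }
}
```

Recording exactly one outcome per `guard.check(...)` call keeps the five counters summing to the total number of proactive dispatch attempts, which is a cheap sanity check on the rails.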

Tests to write

  • DmSafetyRailsGuardTest (unit) — each rail in isolation + interaction (min-interval + daily cap stacking).
  • DmSafetyRailsGuardIT — end-to-end from scheduler enqueue → guard → adapter; asserts the daily cap is per-workspace, not per-user.
  • A kill-switch toggle test that flips the Spring property at runtime.
  • A dry-run-mode test asserting admin DM receipt and absence of user-side DM.

Implementation notes

  • Min-interval state lives in a proactive_dm_audit(user_id, workspace_id, surface, sent_at, decision, reason) table; the guard's read path queries the most recent allowed row for (user_id, workspace_id). Denied attempts are written too — they are the evidence the rails are working and feed the metrics.
  • The kill switch is a @Value-injected Spring property on a @RefreshScope bean so an operator can flip it without restart (the /actuator/refresh endpoint, which Spring Cloud contributes to Actuator, is gated). A property change is logged at WARN level with the previous + new value.
  • The daily cap counts per UTC calendar day (resetting at 00:00 UTC) for simplicity; per-workspace timezone caps are out of scope for v1 and would require a workspace-tz column that the registry from #1215 (feat(server): workspace_integration schema + IntegrationCredentialProvider SPI) does not own today.
  • Dry-run admin selection: the workspace admin is the user that installed the Slack integration (workspace_integration.installed_by from feat(server): append-only integration audit log #1218). If that user is no longer in the workspace, dry-run is degraded to "ALLOW with DRY_RUN flag in audit but no admin DM" and an admin-action row is created (feat(server): per-integration health endpoint and structured-log MDC #1217) — better degraded than off.
  • The 50-DM default is conservative; pilot workspaces have telemetry to tune it before non-pilot defaults change.
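The two read paths implied by these notes (the most recent allowed row per user, and the number of allowed rows in the current UTC day per workspace) can be sketched over an in-memory list standing in for the proactive_dm_audit table. The record fields mirror the column list above; the `"ALLOW"` decision string and the class and method names are assumptions for illustration.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

final class ProactiveDmAuditSketch {
    // Mirrors proactive_dm_audit(user_id, workspace_id, surface, sent_at, decision, reason).
    record AuditRow(String userId, String workspaceId, String surface,
                    Instant sentAt, String decision, String reason) {}

    // Min-interval read path: most recent allowed row for (user_id, workspace_id).
    static Optional<Instant> lastAllowedFor(List<AuditRow> rows, String userId, String workspaceId) {
        return rows.stream()
                .filter(r -> r.userId().equals(userId) && r.workspaceId().equals(workspaceId))
                .filter(r -> r.decision().equals("ALLOW")) // denied and dry-run rows do not move the clock
                .map(AuditRow::sentAt)
                .max(Comparator.naturalOrder());
    }

    // Daily-cap read path: allowed rows for the workspace since 00:00 UTC today.
    static long allowedTodayInWorkspace(List<AuditRow> rows, String workspaceId, Instant now) {
        Instant dayStart = now.truncatedTo(ChronoUnit.DAYS); // Instant truncation is UTC-based
        return rows.stream()
                .filter(r -> r.workspaceId().equals(workspaceId))
                .filter(r -> r.decision().equals("ALLOW"))
                .filter(r -> !r.sentAt().isBefore(dayStart))
                .count();
    }
}
```

In a database-backed version both lookups would be single indexed queries; an index on (workspace_id, user_id, sent_at) is one plausible choice, though the issue does not specify one.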

Dependencies

Depends on #1258. Depends on #1217. Depends on #1218. Blocks #1260. Blocks #1269.

Metadata

Assignees

No one assigned

    Labels

    application-server (Spring Boot server: APIs, business logic, database), feature (New feature or enhancement), priority:high (Address this sprint - Significant impact)
