Credential leakage in agent output: systemic failure to redact secrets in chat and reasoning blocks #20785

@frogwraps

Summary

Agents running Hermes consistently leak credential values in chat output and reasoning/thinking blocks. Despite having credential-safety instructions, the model writes literal passwords, API keys, and tokens into responses -- especially when explaining what it "fixed."

This is a systemic, recurring failure -- not a one-time mistake.

Root Causes

  1. Credentials stored in agent memory get quoted verbatim when referenced in explanations
  2. "Meta-discussion" is the most dangerous trigger -- an agent that just fixed a credential exposure will write the value again to "show what it fixed"
  3. Reasoning/thinking blocks leak credentials in interfaces where they are visible (Discord, Telegram, CLI)
  4. Post-hoc instructions are not sufficient -- the model violates them in the same turn when explaining past actions

Impact

  • Severity: Critical. Actual passwords, API keys, and auth tokens written into chat transcripts.
  • Requires credential rotation each time it occurs.
  • Erodes user trust permanently.

Proposed Solutions

Short-term (output pipeline)

  • Pre-delivery regex scrubber: scan model output for patterns matching known credential formats and auto-redact before delivery
  • Think-block redaction: apply same filtering to reasoning blocks, not just final output
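A minimal sketch of what such a scrubber could look like, applied to both the final output and the reasoning block. The patterns, the `scrub` and `scrub_message` names, and the `[REDACTED]` placeholder are all assumptions for illustration, not part of Hermes; a real pattern set would need to be broader and tested against false positives.

```python
import re

# Illustrative patterns only (assumed, not exhaustive).
CREDENTIAL_PATTERNS = [
    re.compile(r"\bsk-[A-Za-z0-9_-]{20,}"),                   # "sk-" prefixed API keys
    re.compile(r"\bgh[pousr]_[A-Za-z0-9]{36,}"),              # GitHub-style token prefixes
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),                      # AWS access key IDs
    re.compile(r"(?i)\bBearer\s+[A-Za-z0-9._~+/-]{20,}=*"),   # bearer tokens
]

# "password: value", "api_key=value" and similar key/value pairs.
KEY_VALUE = re.compile(
    r"(?i)\b(password|passwd|secret|token|api[_-]?key)(\s*[:=]\s*)(\S+)"
)

REDACTED = "[REDACTED]"


def scrub(text: str) -> str:
    """Replace anything that looks like a credential with a placeholder."""
    for pattern in CREDENTIAL_PATTERNS:
        text = pattern.sub(REDACTED, text)
    # Keep the key name and separator, redact only the value.
    return KEY_VALUE.sub(lambda m: m.group(1) + m.group(2) + REDACTED, text)


def scrub_message(final_text, reasoning_text=None):
    """Scrub the final output and, if present, the reasoning block."""
    scrubbed_reasoning = scrub(reasoning_text) if reasoning_text is not None else None
    return scrub(final_text), scrubbed_reasoning
```

Pattern matching only catches credentials with a recognizable shape, so an arbitrary password quoted in free text would pass through. That gap is what the medium-term proposal is meant to cover.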

Medium-term (framework)

  • Taint tracking: mark strings originating from credential sources (.env, credential pool) and refuse to emit tainted values in user-facing output
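One way to approximate this is sketched below. It is a simplified stand-in: it registers secret values as they are loaded and blocks any output containing one as an exact substring. It does not follow data flow, so encoded or transformed values would not be caught. `TaintRegistry`, `TaintedOutputError`, and the `min_length` cutoff are hypothetical names and choices, not existing Hermes APIs.

```python
class TaintedOutputError(Exception):
    """Raised when user-facing output contains a registered secret value."""


class TaintRegistry:
    def __init__(self, min_length: int = 6):
        self._values = set()
        # Very short values (e.g. "1") would match almost any text, so skip them.
        self._min_length = min_length

    def mark(self, value: str) -> str:
        """Register a value that came from a credential source."""
        if len(value) >= self._min_length:
            self._values.add(value)
        return value

    def mark_env_text(self, env_text: str) -> None:
        """Register every value found in .env-style KEY=VALUE text."""
        for line in env_text.splitlines():
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            _, value = line.split("=", 1)
            self.mark(value.strip().strip("'\""))

    def find_leaks(self, text: str) -> list:
        """Return the registered values that appear verbatim in text."""
        return [v for v in self._values if v in text]

    def check_output(self, text: str) -> str:
        """Return text unchanged, or refuse if it carries a tainted value."""
        if self.find_leaks(text):
            # The error message deliberately omits the leaked value.
            raise TaintedOutputError("output contains a tainted credential value")
        return text
```

Unlike format-based regexes, this catches a secret of any shape, provided it was registered when it was loaded.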

Long-term (architecture)

  • Abstract credentials: agents should never hold credential VALUES in context -- only opaque references resolved by the tool layer at execution time
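The idea can be sketched as follows, assuming a hypothetical `{{cred:NAME}}` reference syntax and hypothetical `CredentialVault` and `run_tool` helpers. The model only ever sees and emits the reference string; the tool layer substitutes the real value immediately before the call.

```python
import re

# Hypothetical reference syntax: {{cred:NAME}}
REF_PATTERN = re.compile(r"\{\{cred:([A-Za-z0-9_]+)\}\}")


class CredentialVault:
    """Holds secret values outside the model's context."""

    def __init__(self, secrets: dict):
        self._secrets = dict(secrets)

    def reference(self, name: str) -> str:
        """Return the opaque reference that is safe to place in context."""
        if name not in self._secrets:
            raise KeyError(name)
        return "{{cred:%s}}" % name

    def resolve(self, text: str) -> str:
        """Substitute real values for references. Call only in the tool layer."""
        return REF_PATTERN.sub(lambda m: self._secrets[m.group(1)], text)


def run_tool(vault: CredentialVault, tool, args: dict):
    """Resolve references in string arguments, then invoke the tool."""
    resolved = {
        key: vault.resolve(value) if isinstance(value, str) else value
        for key, value in args.items()
    }
    return tool(**resolved)
```

Under this design, a model that quotes its context verbatim can at worst print `{{cred:DB_PASSWORD}}`. Tool results can still echo a secret back, so they would need scrubbing before re-entering context.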

Note

Users are trying to solve this with custom skills and system prompt hardening -- none of it works because the model violates its own instructions. The fix must be in the output pipeline, not the prompt.

Metadata

Assignees

No one assigned

    Labels

    • P0: Critical (data loss, security, crash loop)
    • comp/agent: Core agent loop, run_agent.py, prompt builder
    • comp/gateway: Gateway runner, session dispatch, delivery
    • type/security: Security vulnerability or hardening
