Summary
Agents running Hermes consistently leak credential values in chat output and reasoning/thinking blocks. Despite having credential-safety instructions, the model writes literal passwords, API keys, and tokens into responses -- especially when explaining what it "fixed."
This is a systemic, recurring failure -- not a one-time mistake.
Root Cause
- Credentials stored in agent memory get quoted verbatim when referenced in explanations
- "Meta-discussion" is the most dangerous trigger -- an agent that just fixed a credential exposure will write the value again to "show what it fixed"
- Reasoning/thinking blocks leak credentials in interfaces where they are visible (Discord, Telegram, CLI)
- Post-hoc instructions are not sufficient -- the model violates them in the same turn when explaining past actions
Impact
- Severity: Critical. Actual passwords, API keys, and auth tokens are written into chat transcripts.
- Requires credential rotation each time it occurs.
- Erodes user trust permanently.
Proposed Solutions
Short-term (output pipeline)
- Pre-delivery regex scrubber: scan model output for patterns matching known credential formats and auto-redact before delivery
- Think-block redaction: apply the same filtering to reasoning blocks, not just final output (a sketch covering both follows this list)
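A minimal sketch of what the scrubber could look like, assuming a response object with separate output and reasoning fields. The pattern list, `scrub`, and `deliver` are illustrative names, not an existing API, and the regexes cover only a few common key formats:

```python
import re

# Hypothetical patterns -- tune to the credential formats your deployment
# actually uses. These cover a few common shapes, not an exhaustive set.
CREDENTIAL_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),                        # OpenAI-style API keys
    re.compile(r"ghp_[A-Za-z0-9]{36}"),                        # GitHub personal access tokens
    re.compile(r"AKIA[0-9A-Z]{16}"),                           # AWS access key IDs
    re.compile(r"(?i)(password|passwd|secret)\s*[:=]\s*\S+"),  # key=value leaks
]

def scrub(text: str) -> str:
    """Redact anything matching a known credential pattern."""
    for pattern in CREDENTIAL_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text

def deliver(response: dict) -> dict:
    # Scrub BOTH channels: the final answer and the reasoning block,
    # since thinking output is visible in Discord/Telegram/CLI surfaces.
    return {
        "reasoning": scrub(response.get("reasoning", "")),
        "output": scrub(response.get("output", "")),
    }
```

A scrubber like this is necessarily best-effort: it only catches credentials whose shape it already knows, which is why it pairs with the taint-tracking and abstraction work below.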
Medium-term (framework)
- Taint tracking: mark strings originating from credential sources (.env, credential pool) and refuse to emit tainted values in user-facing output (sketch below)
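A sketch of a registry-based flavor of taint tracking, assuming the framework can intercept both credential loads and user-facing emission; `register_credential` and `emit_to_user` are hypothetical hook names:

```python
# Values read from credential sources (.env, credential pool) are
# registered at load time; emission checks output against the registry.
TAINT_REGISTRY: set[str] = set()

def register_credential(value: str) -> str:
    """Call at the boundary where a credential enters the process."""
    if value:
        TAINT_REGISTRY.add(value)
    return value

def emit_to_user(text: str) -> str:
    """Refuse to emit any user-facing output containing a tainted value.
    Substring matching survives concatenation and templating, which a
    str-subclass taint wrapper would not."""
    for secret in TAINT_REGISTRY:
        if secret in text:
            raise PermissionError("output contains a credential value; refusing to emit")
    return text
```

Raising instead of silently redacting makes the failure visible to the framework, which can then retry the turn with the value stripped from context.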
Long-term (architecture)
- Abstract credentials: agents should never hold credential VALUES in context, only opaque references resolved by the tool layer at execution time (sketch below)
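One way the reference scheme could look, assuming a `{{cred:name}}` handle syntax and a vault lookup inside the tool layer; all names here are illustrative:

```python
import os
import re

# The agent's context only ever holds opaque handles like
# "{{cred:db_password}}"; the value is resolved here, at execution time.
CRED_REF = re.compile(r"\{\{cred:([A-Za-z0-9_]+)\}\}")

def lookup_secret(name: str) -> str:
    """Stand-in for a vault or credential-pool lookup."""
    value = os.environ.get(name.upper())
    if value is None:
        raise KeyError(f"unknown credential reference: {name}")
    return value

def resolve_refs(arg: str) -> str:
    """Substitute real values immediately before tool execution.
    The resolved string goes to the external tool only; tool results
    echoed back into model context must keep the opaque form."""
    return CRED_REF.sub(lambda m: lookup_secret(m.group(1)), arg)
```

With this split the model cannot leak what it never saw: an explanation of what it "fixed" can only ever quote the opaque handle.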
Note
Users are trying to solve this with custom skills and system prompt hardening -- none of it works because the model violates its own instructions. The fix must be in the output pipeline, not the prompt.