A prompt injection detection layer for LLM apps and agents. Routes untrusted user input through a small, sandboxed sacrificial model before it reaches the main system, and returns a verdict (block, flag, or pass) with a risk score and per-layer trace. Designed to sit in front of an existing stack without replacing it. Intended for developers who can afford ~250ms of preflight latency and want inbound detection that goes beyond static regex rules.
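A minimal sketch of the intended integration pattern: untrusted input is preflighted before it ever reaches the main model. The `Verdict` dataclass, `check_input` helper, and `call_main_llm` stub below are hypothetical stand-ins for illustration, not the library's actual API; only the field names (`safe`, `action`, `summary`, `canary_risk_score`) mirror the verdict described in this README.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the library's verdict type; field names
# mirror the ones described in this README.
@dataclass
class Verdict:
    safe: bool
    action: str          # "block", "flag", or "pass"
    summary: str
    canary_risk_score: float

def check_input(text: str) -> Verdict:
    # Placeholder preflight: a real pipeline runs a structural filter,
    # then probes a sandboxed sacrificial model, before returning.
    if "ignore previous instructions" in text.lower():
        return Verdict(False, "block", "structural pattern matched", 0.9)
    return Verdict(True, "pass", "no signals", 0.0)

def call_main_llm(text: str) -> str:
    # Stand-in for the existing production stack, which stays unchanged.
    return f"(main model answers: {text!r})"

def handle_request(user_text: str) -> str:
    verdict = check_input(user_text)   # ~250ms preflight budget
    if not verdict.safe:
        return f"rejected: {verdict.summary}"
    return call_main_llm(user_text)
```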
- runs a two-layer check: structural filter first, canary probe second
- returns a `PipelineVerdict` with `safe`, `action`, `summary`, `canary_risk_score`, and a per-layer `LayerResult` list
- blocks input when structural patterns match in `block` or `full` mode
- never blocks in `advisory` mode — returns a `SecurityAdvisory` with severity and signal list instead
- skips the canary probe when the structural filter fires and `skip_canary_if_structural_blocks` is True
- runs the structural filter against base64-decoded, hex-decoded, ROT13-decoded, and reversed variants of the input
- detects null bytes, Unicode homoglyphs, zero-width characters, and RTL override codepoints in layer 1
- accepts an optional `judge_model` to replace the regex `BehavioralAnalyzer` with an LLM classifier
- supports `provider="openai"` for any OpenAI-compatible API (OpenAI, MiniMax, Together, Groq, etc.)
- writes per-check audit records to `canary-audit.jsonl` and `canary-alerts.jsonl` when `audit_log_dir` is set
- stores input as a SHA-256 hash in audit logs, never as raw text
- applies trust tiers (TRUSTED / KNOWN / UNKNOWN) when used via `CanaryGuard`
- rate-limits override grants to 5 per hour per caller via `CanaryGuard`
- fires `on_block`, `on_flag`, and `on_pass` callbacks per verdict; callback exceptions are caught and logged so they never crash the pipeline
- fails open when the canary model is unavailable — returns `safe=True` rather than dropping traffic
- exposes a persistent HTTP server (`little-canary serve`) with `POST /check` and `GET /health` endpoints
- reports canary model availability and current pipeline mode via `health_check()`
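The obfuscation-handling bullets above can be sketched as a small structural scan: the same pattern check runs over base64-decoded, hex-decoded, ROT13-decoded, and reversed variants of the input, alongside a lightweight probe for null bytes and zero-width or RTL-override codepoints. This is an illustrative reimplementation of the idea, not the library's code; the regex and signal names are made up for the example.

```python
import base64
import codecs
import re

# Toy pattern standing in for the structural filter's rule set.
INJECTION_RE = re.compile(r"ignore (all )?previous instructions", re.IGNORECASE)

# A few zero-width and RTL-override codepoints of the kind layer 1 flags.
SUSPECT_CODEPOINTS = {"\u200b", "\u200c", "\u200d", "\ufeff", "\u202d", "\u202e"}

def decoded_variants(text: str):
    """Yield the input plus best-effort decoded forms of it."""
    yield text
    yield text[::-1]                        # reversed
    yield codecs.decode(text, "rot13")      # ROT13 (non-letters pass through)
    try:
        yield base64.b64decode(text, validate=True).decode("utf-8", "ignore")
    except ValueError:
        pass                                # not valid base64
    try:
        yield bytes.fromhex(text).decode("utf-8", "ignore")
    except ValueError:
        pass                                # not valid hex

def structural_scan(text: str) -> list[str]:
    """Return the list of structural signals found in the input."""
    signals = []
    if "\x00" in text:
        signals.append("null byte")
    if any(cp in text for cp in SUSPECT_CODEPOINTS):
        signals.append("zero-width or RTL-override codepoint")
    for variant in decoded_variants(text):
        if INJECTION_RE.search(variant):
            signals.append("injection pattern (possibly encoded)")
            break
    return signals
```

Running the pattern over every variant is what lets a reversed or ROT13-wrapped payload trip the same rule as its plain form.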
It does not:

- protect downstream tool calls or agent actions — it only checks inbound user input
- guarantee detection of all prompt injection variants or function as a formal security control
- remove or sanitize flagged input before returning it — `safe_input` is empty on block, unchanged otherwise
- store raw input text anywhere — audit logs are hash-only
- execute or forward the canary's response — it is observed and discarded
- replace the production LLM or modify the application's generation logic
- require Ollama — it also works with any OpenAI-compatible endpoint
- support concurrent requests from multiple threads without an external lock
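Since concurrent use is out of scope, a caller that shares one pipeline across threads needs to provide its own serialization. A minimal sketch, assuming a hypothetical `check()` entry point on the pipeline object (the `DummyPipeline` below is a stand-in for demonstration):

```python
import threading

class LockedChecker:
    """Serializes access to a pipeline object that is not thread-safe."""

    def __init__(self, pipeline):
        self._pipeline = pipeline
        self._lock = threading.Lock()

    def check(self, text: str):
        # One check at a time; concurrent callers queue on the lock.
        with self._lock:
            return self._pipeline.check(text)

# Hypothetical stand-in pipeline so the example is self-contained.
class DummyPipeline:
    def check(self, text: str):
        return {"safe": True, "action": "pass", "input": text}
```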