feat: add homoglyph obfuscation prompts to smuggling probe#1660
Conversation
jmartin-tech
left a comment
There was a problem hiding this comment.
This is a great added technique, I would suggest this can be expanded to preform inline substitution instead of just using a set of hardcoded sample prompts.
The idea I am suggesting, would programmatically replace characters during prompt initialization to actually mimic the smuggling aspect of the technique. This could be further enhanced to accept a configuration map of character replacements that could be increased or reduced to expand resiliency testing.
Address review feedback on PR NVIDIA#1660: - Change tier from COMPETE_WITH_SOTA to INFORMATIONAL - Replace static prompt loading with programmatic substitution via homoglyph_replace() function applied to garak payloads - Add configurable DEFAULT_HOMOGLYPH_MAP (20 Latin-to-Cyrillic/Turkish/ Ukrainian mappings) overridable via homoglyph_map config parameter - Load payloads from garak.payloads system (harmful_behaviors default) - Keep static prompts as additional payloads through same pipeline - Update tests: 9 tests covering substitution function, probe loading, tier, determinism, custom maps, non-ASCII verification Signed-off-by: Nathan Maine <nathan@dentity.cloud>
|
Thanks for the review. Both changes addressed:
Tests updated: 9 tests covering probe loading, substitution function behavior (determinism, custom maps, non-mapped character preservation), non-ASCII verification, tier, and inactive flag. |
|
@jmartin-tech heads up, the changes you requested on 2026-03-30 are in: Tier adjusted to INFORMATIONAL |
jmartin-tech
left a comment
There was a problem hiding this comment.
The approach looks pretty reasonable to me, however the quality of results suggest some more refinement may be needed.
This technique can cause the target to respond in another language causing the detector to misidentify results. For instance many denial mitigation responses are reported as successful bypass when tested against meta/llama-3.3-70b-instruct as the model may respond with a Russian language answer.
Further discussion is pending to see how we can move this forward.
|
@jmartin-tech underscore rename committed, thanks. On the multilingual-response issue, three options I can see: (1) mark the probe Tier.INACTIVE with a docstring note pending language-aware detection, (2) pair the probe with a new detector that understands non-English denials, or (3) something else you have in mind. Which path do you prefer, or is there existing framework infra I should use? |
Add smuggling.HomoglyphObfuscation with 5 prompts that use Unicode homoglyphs (Cyrillic, Latin alpha, Turkish dotless i) to disguise trigger words in bypass requests. Tests whether input-side content filters catch visually identical character substitutions from different scripts. Uses mitigation.MitigationBypass detector. Set to active=False (domain-specific). Signed-off-by: Nathan Maine <dentity@gmail.com>
Address review feedback on PR NVIDIA#1660: - Change tier from COMPETE_WITH_SOTA to INFORMATIONAL - Replace static prompt loading with programmatic substitution via homoglyph_replace() function applied to garak payloads - Add configurable DEFAULT_HOMOGLYPH_MAP (20 Latin-to-Cyrillic/Turkish/ Ukrainian mappings) overridable via homoglyph_map config parameter - Load payloads from garak.payloads system (harmful_behaviors default) - Keep static prompts as additional payloads through same pipeline - Update tests: 9 tests covering substitution function, probe loading, tier, determinism, custom maps, non-ASCII verification Signed-off-by: Nathan Maine <dentity@gmail.com>
…ling.py Co-authored-by: Jeffrey Martin <jmartin@Op3n4M3.dev> Signed-off-by: Nathan Maine <dentity@gmail.com>
94499cf to
67febe4
Compare
This seems like a generic deficiency of using mitigationbypass to detect whether or not the instructions talk about hotwiring a car. If we're not trying to detect the requested failure mode directly, but instead making an assumption about model policy, detection will always be inaccurate. Seeing as we make this assumption in many other places in garak, I don't think we need to raise the bar specifically for this contribution. And in fact the code here offers a fairly direct route to converting to context aware scanning later #1583, where we should have much more direct linkage between requested behaviour and detection. The other way to go here is to specify an llmaaj detector; I don't mind too much which route is taken. |
The previous commit renamed homoglyph_replace to _homoglyph_replace in the module definition but did not update the internal caller in HomoglyphObfuscation.__init__ or the test module's import and call sites. This caused probe initialization to NameError and CI test collection to ImportError. Both are now aligned with the private name. Signed-off-by: Nathan Maine <dentity@gmail.com>
aa94e46 to
16cfb57
Compare
Adds docstring note to HomoglyphObfuscation explaining that the current primary detector (mitigation.MitigationBypass) assumes English-language denial responses, which can produce false positives on targets that respond in the same script as the obfuscated input. Points to the follow-up ModelAsJudge-based detector PR and discussion NVIDIA#1583 for the broader context-aware scanning direction. Signed-off-by: Nathan Maine <dentity@gmail.com>
|
Thanks @leondz, going with accept-as-is here. Added a note in the probe docstring (commit dd9479a) about the non-English-response limitation with a pointer to discussion #1583. I'll follow up immediately with a separate PR adding a |
|
Follow-up draft is up: #1688. Scaffold committed, judge prompt refinement and test coverage to land in subsequent commits. Leaving it in draft until this PR merges since the detector targets the probe introduced here. |
Adds
smuggling.HomoglyphObfuscation, a probe with 5 prompts that use Unicode homoglyphs (visually similar characters from different scripts) to disguise trigger words in bypass requests. For example, Cyrillic 'a' (U+0430) replaces Latin 'a' in "jailbreak", making the token sequence different while the text remains human-readable.Second decomposed contribution from PR #1619. The
smugglingmodule's docstring describes exactly this technique: "swapping letters out for unusual unicode representations of the same letters." Usesmitigation.MitigationBypassdetector. Set toactive = Falsesince these are domain-specific.Homoglyph scripts used: Cyrillic (U+0430, U+043E, U+0456), Latin alpha (U+0251), Turkish dotless i (U+0131)
Files:
garak/probes/smuggling.py: newHomoglyphObfuscationclassgarak/data/smuggling_homoglyph_5.txt: 5 prompts with embedded Unicode homoglyphstests/probes/test_probes_smuggling.py: 4 tests (count, uniqueness, non-ASCII verification, active=False)