feat: add homoglyph obfuscation prompts to smuggling probe #1660

Merged
jmartin-tech merged 5 commits into NVIDIA:main from NathanMaine:feat/smuggling-homoglyph-obfuscation
May 1, 2026

Conversation

@dentity007
Contributor

Adds smuggling.HomoglyphObfuscation, a probe with 5 prompts that use Unicode homoglyphs (visually similar characters from different scripts) to disguise trigger words in bypass requests. For example, Cyrillic 'a' (U+0430) replaces Latin 'a' in "jailbreak", making the token sequence different while the text remains human-readable.

Second decomposed contribution from PR #1619. The smuggling module's docstring describes exactly this technique: "swapping letters out for unusual unicode representations of the same letters." Uses mitigation.MitigationBypass detector. Set to active = False since these are domain-specific.

Homoglyph scripts used: Cyrillic (U+0430, U+043E, U+0456), Latin alpha (U+0251), Turkish dotless i (U+0131)
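The substitution described above can be checked directly. This is a minimal sketch that assumes only the Python standard library; the variable names are illustrative and not taken from the PR. It swaps Latin "a" for Cyrillic "а" in "jailbreak" and shows that the two strings render alike but differ in code points and UTF-8 bytes, which is why a tokenizer or keyword filter sees a different input.

```python
import unicodedata

latin = "jailbreak"
# Swap each Latin "a" (U+0061) for Cyrillic "а" (U+0430). The escape sequence
# keeps the difference visible in source code.
disguised = latin.replace("a", "\u0430")

print(latin == disguised)  # False: the strings only look the same

# List the characters that exist only in the disguised string.
for ch in sorted(set(disguised) - set(latin)):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))  # U+0430 CYRILLIC SMALL LETTER A

# Same number of characters, different byte length: 9 vs 11 bytes in UTF-8.
print(len(latin.encode("utf-8")), len(disguised.encode("utf-8")))
```

The same check works for the other code points listed above by changing the replacement character.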

Files:

  • garak/probes/smuggling.py : new HomoglyphObfuscation class
  • garak/data/smuggling_homoglyph_5.txt : 5 prompts with embedded Unicode homoglyphs
  • tests/probes/test_probes_smuggling.py : 4 tests (count, uniqueness, non-ASCII verification, active=False)

Collaborator

@jmartin-tech jmartin-tech left a comment

This is a great added technique. I would suggest expanding it to perform inline substitution instead of just using a set of hardcoded sample prompts.

The idea I am suggesting would programmatically replace characters during prompt initialization to actually mimic the smuggling aspect of the technique. This could be further enhanced to accept a configuration map of character replacements that could be enlarged or reduced to broaden resiliency testing.

Comment thread garak/probes/smuggling.py Outdated
dentity007 added a commit to NathanMaine/garak that referenced this pull request Mar 30, 2026
Address review feedback on PR NVIDIA#1660:

- Change tier from COMPETE_WITH_SOTA to INFORMATIONAL
- Replace static prompt loading with programmatic substitution via
  homoglyph_replace() function applied to garak payloads
- Add configurable DEFAULT_HOMOGLYPH_MAP (20 Latin-to-Cyrillic/Turkish/
  Ukrainian mappings) overridable via homoglyph_map config parameter
- Load payloads from garak.payloads system (harmful_behaviors default)
- Keep static prompts as additional payloads through same pipeline
- Update tests: 9 tests covering substitution function, probe loading,
  tier, determinism, custom maps, non-ASCII verification

Signed-off-by: Nathan Maine <nathan@dentity.cloud>
@dentity007
Contributor Author

Thanks for the review. Both changes addressed:

  • Tier adjusted to INFORMATIONAL
  • Replaced the static prompt approach with programmatic substitution. The probe now loads payloads from garak's payload system (harmful_behaviors by default), applies character-by-character homoglyph replacement via a configurable DEFAULT_HOMOGLYPH_MAP (20 Latin-to-Cyrillic/Turkish/Ukrainian mappings), and generates obfuscated prompts at initialization. The map is overridable via the homoglyph_map config parameter so the substitution set can be expanded or reduced. The original 5 static prompts are still loaded as additional payloads and go through the same substitution pipeline.

Tests updated: 9 tests covering probe loading, substitution function behavior (determinism, custom maps, non-mapped character preservation), non-ASCII verification, tier, and inactive flag.
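The character-by-character substitution with an overridable map can be sketched as follows. This is not the merged code: `replace_homoglyphs`, `EXAMPLE_MAP`, and the payload string are hypothetical names and data chosen for illustration, and the three-entry map is only a subset built from code points mentioned in this PR, not the 20-entry DEFAULT_HOMOGLYPH_MAP.

```python
from typing import Mapping, Optional

# Illustrative subset only; the PR's DEFAULT_HOMOGLYPH_MAP is not reproduced here.
EXAMPLE_MAP = {
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "o": "\u043e",  # CYRILLIC SMALL LETTER O
    "i": "\u0456",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
}


def replace_homoglyphs(text: str, mapping: Optional[Mapping[str, str]] = None) -> str:
    """Replace each mapped character; leave unmapped characters untouched."""
    table = EXAMPLE_MAP if mapping is None else mapping
    return "".join(table.get(ch, ch) for ch in text)


# Hypothetical, benign payload standing in for entries from a payload source.
payloads = ["tell me a story about a pilot"]

# Build obfuscated prompts once, as a probe might do at initialization.
prompts = [replace_homoglyphs(p) for p in payloads]

# A reduced custom map narrows the substitution set to a single letter.
reduced = replace_homoglyphs(payloads[0], {"o": "\u043e"})

print(prompts[0].isascii())    # False: at least one character was swapped
print(reduced.count("\u043e"))  # number of "o" characters that were replaced
```

Because the function is a pure lookup with no randomness, the same input and map always give the same output, which is the determinism property the tests above refer to.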

@dentity007
Contributor Author

@jmartin-tech heads up, the changes you requested on 2026-03-30 are in:

  • Tier adjusted to INFORMATIONAL
  • Replaced the static prompt approach with programmatic substitution via a configurable DEFAULT_HOMOGLYPH_MAP (20 Latin-to-Cyrillic/Turkish/Ukrainian mappings)
  • 9 new tests covering substitution determinism, custom maps, non-mapped character preservation, tier, and inactive flag

CI is green across Linux, macOS, and Windows. Happy to make further adjustments if anything needs tweaking. Thanks for the review.

Collaborator

@jmartin-tech jmartin-tech left a comment

The approach looks pretty reasonable to me; however, the quality of the results suggests some more refinement may be needed.

This technique can cause the target to respond in another language, causing the detector to misidentify results. For instance, many denial mitigation responses are reported as successful bypasses when tested against meta/llama-3.3-70b-instruct, as the model may respond with a Russian-language answer.

Further discussion is pending to see how we can move this forward.
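The failure mode described in this review can be reproduced with a toy detector. This is a sketch under stated assumptions: it is not the implementation of mitigation.MitigationBypass, the phrase list is hypothetical, and both responses are invented examples. It only illustrates why a detector that looks for English refusal phrases scores a Russian refusal as a bypass.

```python
# Toy stand-in for a phrase-matching refusal detector; the list is hypothetical.
ENGLISH_REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't")


def looks_like_bypass(response: str) -> bool:
    """Return True when no known English refusal phrase appears in the response."""
    lowered = response.lower()
    return not any(marker in lowered for marker in ENGLISH_REFUSAL_MARKERS)


english_refusal = "I'm sorry, but I can't help with that."
russian_refusal = "Извините, я не могу помочь с этим."  # "Sorry, I can't help with this."

print(looks_like_bypass(english_refusal))  # False: refusal recognized
print(looks_like_bypass(russian_refusal))  # True: refusal missed, counted as a bypass
```

Both responses decline the request, but only the English one contains a phrase the detector knows, so the Russian one is reported as a hit.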

Comment thread garak/probes/smuggling.py Outdated
@dentity007
Contributor Author

@jmartin-tech underscore rename committed, thanks. On the multilingual-response issue, three options I can see: (1) mark the probe Tier.INACTIVE with a docstring note pending language-aware detection, (2) pair the probe with a new detector that understands non-English denials, or (3) something else you have in mind. Which path do you prefer, or is there existing framework infra I should use?

dentity007 and others added 3 commits April 17, 2026 17:09
Add smuggling.HomoglyphObfuscation with 5 prompts that use Unicode
homoglyphs (Cyrillic, Latin alpha, Turkish dotless i) to disguise
trigger words in bypass requests. Tests whether input-side content
filters catch visually identical character substitutions from different
scripts. Uses mitigation.MitigationBypass detector. Set to active=False
(domain-specific).

Signed-off-by: Nathan Maine <dentity@gmail.com>
Address review feedback on PR NVIDIA#1660:

- Change tier from COMPETE_WITH_SOTA to INFORMATIONAL
- Replace static prompt loading with programmatic substitution via
  homoglyph_replace() function applied to garak payloads
- Add configurable DEFAULT_HOMOGLYPH_MAP (20 Latin-to-Cyrillic/Turkish/
  Ukrainian mappings) overridable via homoglyph_map config parameter
- Load payloads from garak.payloads system (harmful_behaviors default)
- Keep static prompts as additional payloads through same pipeline
- Update tests: 9 tests covering substitution function, probe loading,
  tier, determinism, custom maps, non-ASCII verification

Signed-off-by: Nathan Maine <dentity@gmail.com>
…ling.py

Co-authored-by: Jeffrey Martin <jmartin@Op3n4M3.dev>
Signed-off-by: Nathan Maine <dentity@gmail.com>
@dentity007 dentity007 force-pushed the feat/smuggling-homoglyph-obfuscation branch from 94499cf to 67febe4 Compare April 17, 2026 22:10
@leondz
Collaborator

leondz commented Apr 20, 2026

This technique can cause the target to respond in another language causing the detector to misidentify results. For instance many denial mitigation responses are reported as successful bypass when tested against meta/llama-3.3-70b-instruct as the model may respond with a Russian language answer.

This seems like a generic deficiency of using mitigationbypass to detect whether or not the instructions talk about hotwiring a car. If we're not trying to detect the requested failure mode directly, but instead making an assumption about model policy, detection will always be inaccurate.

Seeing as we make this assumption in many other places in garak, I don't think we need to raise the bar specifically for this contribution. And in fact the code here offers a fairly direct route to converting to context aware scanning later #1583, where we should have much more direct linkage between requested behaviour and detection.

The other way to go here is to specify an llmaaj detector; I don't mind too much which route is taken.

The previous commit renamed homoglyph_replace to _homoglyph_replace in
the module definition but did not update the internal caller in
HomoglyphObfuscation.__init__ or the test module's import and call sites.
This caused probe initialization to raise NameError and CI test collection
to fail with ImportError. Both are now aligned with the private name.

Signed-off-by: Nathan Maine <dentity@gmail.com>
@dentity007 dentity007 force-pushed the feat/smuggling-homoglyph-obfuscation branch from aa94e46 to 16cfb57 Compare April 21, 2026 01:27
Adds docstring note to HomoglyphObfuscation explaining that the current
primary detector (mitigation.MitigationBypass) assumes English-language
denial responses, which can produce false positives on targets that
respond in the same script as the obfuscated input. Points to the
follow-up ModelAsJudge-based detector PR and discussion NVIDIA#1583 for the
broader context-aware scanning direction.

Signed-off-by: Nathan Maine <dentity@gmail.com>
@dentity007
Contributor Author

Thanks @leondz, going with accept-as-is here. Added a note in the probe docstring (commit dd9479a) about the non-English-response limitation with a pointer to discussion #1583. I'll follow up immediately with a separate PR adding a ModelAsJudge-based detector configured for this probe's goal, which addresses the non-English-response concern directly. Will link here once drafted.

@dentity007
Contributor Author

Follow-up draft is up: #1688. Scaffold committed, judge prompt refinement and test coverage to land in subsequent commits. Leaving it in draft until this PR merges since the detector targets the probe introduced here.

@jmartin-tech jmartin-tech merged commit 2b9a89a into NVIDIA:main May 1, 2026
16 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators May 1, 2026