Moonshine: repetition guard fails on hyphenated patterns and clause-level loops #12

@shuv1337

Summary

The Moonshine backend's _guard_repetition() fails to catch three classes of
hallucinated repetition that occur in practice (hyphenated tokens, clause-level
loops, and repeated digits inside a single token), producing garbage output
that can be hundreds of characters long.

Reproduction

Run the roundtrip benchmark with difficult phrases:

.venv312/bin/python scripts/tts_roundtrip.py \
  --asr-backend moonshine --moonshine-model-name moonshine/base \
  --phrase "The sixth sick sheik's sixth sheep's sick" \
  --phrase "We still have issues with recording cutting out on long sentences, and we need deterministic regression tests to catch regressions before they ship"

Observed output

Hyphenated repetition (base model, tongue twister):

REF: The sixth sick sheik's sixth sheep's sick
HYP: The six-six-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-...
similarity: 0.067

Clause-level repetition (base model, long sentence):

REF: We still have issues with recording cutting out on long sentences...
HYP: We still have issues with recording cutting out on long sentences. and we still have issues with recording cutting out on long sentences, and we still have issues with recording cutting out on long sentences, and we need, and we need, and we need, and we need, and we need, and we need, and
similarity: 0.342

Numeric hallucination (tiny model):

REF: Invoice 4827 totals one hundred and fifty three dollars and twelve cents
HYP: Invoice 4827 totals 153 dollars and 12700000000000000000000000000000000000000000000000000000...
similarity: 0.232

Root cause

The repetition guard in asr_moonshine.py:_guard_repetition() has three blind spots:

  1. Hyphenated tokens bypass word-level n-gram detection.
    "hake-hake-hake-hake" is a single word when split by spaces, so the
    n-gram loop (which splits on whitespace) never sees a repeating pattern.
    The word-count cap (_MAX_WORDS_PER_SEC = 6.0) can't fire either because
    there's only 1 "word".

  2. Clause-level repetition exceeds the n-gram window (max 4 words).
    A repeating unit like "and we still have issues with recording cutting
    out on long sentences" is 12 words. The guard only checks patterns of
    1–4 words, so the full-clause loop is invisible to it.

  3. Repeated digits/characters within a single token (e.g. 127000000...)
    are not words at all — they're character-level repetition inside one token.
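
To illustrate the three blind spots, here is a minimal sketch of a whitespace-based n-gram check. has_word_ngram_loop() is a hypothetical stand-in written for this issue, not the actual _guard_repetition() code. The 1-4 word window and the threshold of 4 come from the values quoted in this issue; treating the threshold as "4 consecutive copies" is an assumption.

```python
def has_word_ngram_loop(text: str, max_n: int = 4, threshold: int = 4) -> bool:
    """Hypothetical stand-in for a whitespace-based n-gram repetition check.

    Returns True if some n-gram of 1..max_n words occurs `threshold` or more
    times back to back.
    """
    words = text.split()
    for n in range(1, max_n + 1):
        for start in range(len(words) - n * threshold + 1):
            unit = words[start:start + n]
            if all(
                words[start + k * n:start + (k + 1) * n] == unit
                for k in range(1, threshold)
            ):
                return True
    return False


# Inputs shaped like the observed outputs above.
hyphenated = "The six-six-" + "hake-" * 30
clause = "and we still have issues with recording cutting out on long sentences, " * 3
digits = "Invoice 4827 totals 153 dollars and 127" + "0" * 50

print(len(hyphenated.split()))            # 2 (the whole hyphenated run is one "word")
print(has_word_ngram_loop(hyphenated))    # False
print(has_word_ngram_loop(clause))        # False (12-word unit, window stops at 4)
print(has_word_ngram_loop(digits))        # False (repetition is inside one token)
print(has_word_ngram_loop("no no no no no"))  # True (positive control)
```

All three problem inputs pass through undetected, while a plain word-level loop is caught.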

Relevant code

  • shuvoice/asr_moonshine.py:_guard_repetition() (line ~266)
  • _REPETITION_THRESHOLD = 4 (line 61)
  • _MAX_WORDS_PER_SEC = 6.0 (line 60)

Proposed fixes

1. Character-level repetition detection

Before the word-level checks, scan for any character or short substring
(1-10 characters) that repeats six or more times consecutively:

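import re

# Catch "hake-hake-hake..." and "000000000...": a unit of 1-10 characters
# followed by at least 5 more consecutive copies of itself (6+ in total).
_CHAR_REPEAT_RE = re.compile(r"(.{1,10}?)\1{5,}")

char_rep = _CHAR_REPEAT_RE.search(text)
if char_rep:
    # Keep one copy of the repeated unit and drop everything after it.
    text = text[:char_rep.start() + len(char_rep.group(1))]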
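
For reference, the same check wrapped in a function and run against inputs shaped like the observed outputs. truncate_char_repetition() is a hypothetical helper name, not an existing function in the repo.

```python
import re

# Unit of 1-10 characters followed by at least 5 more copies of itself.
_CHAR_REPEAT_RE = re.compile(r"(.{1,10}?)\1{5,}")


def truncate_char_repetition(text: str) -> str:
    """Hypothetical helper: cut the text after the first copy of a repeated unit."""
    match = _CHAR_REPEAT_RE.search(text)
    if match is None:
        return text
    return text[:match.start() + len(match.group(1))]


hyp_hyphen = "The six-six-" + "hake-" * 30
hyp_digits = "Invoice 4827 totals 153 dollars and 127" + "0" * 60
clean = "Invoice 4827 totals 153 dollars and 12 cents"

print(truncate_char_repetition(hyp_hyphen))  # The six-six-hake
print(truncate_char_repetition(hyp_digits))  # Invoice 4827 totals 153 dollars and 1270
print(truncate_char_repetition(clean) == clean)  # True
```

Note that the digit case still ends in a wrong number ("1270"); the check bounds the damage but does not recover the correct transcript.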

2. Expand n-gram window or use substring matching

Either increase the max pattern length from 4 to ~15 words, or use a
suffix-based approach that detects when the last N words appeared earlier
in the output.
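
A minimal sketch of the suffix/substring variant, assuming a 6-word window and punctuation-insensitive comparison (both are illustrative choices, and truncate_clause_loop() is a hypothetical name). It cuts the text at the point where a 6-word sequence first repeats an earlier, non-overlapping one.

```python
import string


def truncate_clause_loop(text: str, window: int = 6) -> str:
    """Hypothetical helper: cut where a `window`-word sequence first repeats."""
    words = text.split()
    # Compare case-insensitively and ignore surrounding punctuation, so that
    # "sentences." and "sentences," count as the same word.
    norm = [w.strip(string.punctuation).lower() for w in words]
    seen = {}  # normalized window -> index of its first occurrence
    for i in range(len(norm) - window + 1):
        key = tuple(norm[i:i + window])
        first = seen.setdefault(key, i)
        # Require a non-overlapping repeat; overlapping matches are left to
        # the short n-gram and character-level checks.
        if first != i and i - first >= window:
            return " ".join(words[:i])
    return text


hyp = (
    "We still have issues with recording cutting out on long sentences. "
    "and we still have issues with recording cutting out on long sentences, "
    "and we still have issues with recording cutting out on long sentences, "
    "and we need, and we need, and we need, and"
)
print(truncate_clause_loop(hyp))
# We still have issues with recording cutting out on long sentences. and
```

The trade-off is false positives: a speaker who genuinely repeats a 6-word phrase gets truncated, so the window size would need tuning against real transcripts.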

3. Output length cap relative to input duration

The existing _MAX_WORDS_PER_SEC cap limits word count but not character
count, so one very long token passes it. Add a parallel character-count cap:

max_chars = max(100, int(audio_seconds * 40))  # ~40 chars/sec generous cap
if len(text) > max_chars:
    text = text[:max_chars].rsplit(' ', 1)[0]
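
The same cap as a function, to show how it behaves on the runaway tokens from this issue. cap_chars() and its parameter names are hypothetical; the 40 chars/sec rate and the floor of 100 are the values from the snippet above.

```python
def cap_chars(
    text: str,
    audio_seconds: float,
    chars_per_sec: float = 40.0,
    floor: int = 100,
) -> str:
    """Hypothetical helper: cap output length relative to audio duration."""
    max_chars = max(floor, int(audio_seconds * chars_per_sec))
    if len(text) <= max_chars:
        return text
    # Cut at the last space inside the budget to avoid ending mid-word.
    return text[:max_chars].rsplit(" ", 1)[0]


hyp_digits = "Invoice 4827 totals 153 dollars and 127" + "0" * 200
hyp_hyphen = "The six-six-" + "hake-" * 60

print(cap_chars(hyp_digits, audio_seconds=3.0))  # Invoice 4827 totals 153 dollars and
print(cap_chars(hyp_hyphen, audio_seconds=3.0))  # The
```

Because the cut falls back to the last space, a runaway token is dropped whole. In the hyphenated case that leaves only "The", so the cap is better treated as a backstop behind the character-level check than as a replacement for it.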

Acceptance criteria

  • Tongue twister phrase does not produce >50 characters of repeated pattern
  • Long sentence does not produce clause-level repetition loops
  • Numeric/digit repetition is truncated (e.g. 000000...)
  • Existing test_asr.py tests still pass
  • Add regression test cases for each repetition class

Labels: bug
