Moonshine: repetition guard fails on hyphenated patterns and clause-level loops

## Summary

The Moonshine backend's `_guard_repetition()` fails to catch three classes of
hallucinated repetition that occur in practice, producing garbage output that
can run to hundreds of characters.

## Reproduction

Run the roundtrip benchmark with difficult phrases:

```bash
.venv312/bin/python scripts/tts_roundtrip.py \
  --asr-backend moonshine --moonshine-model-name moonshine/base \
  --phrase "The sixth sick sheik's sixth sheep's sick" \
  --phrase "We still have issues with recording cutting out on long sentences, and we need deterministic regression tests to catch regressions before they ship"
```

### Observed output

**Hyphenated repetition (base model, tongue twister):**
```
REF: The sixth sick sheik's sixth sheep's sick
HYP: The six-six-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-hake-...
similarity: 0.067
```

**Clause-level repetition (base model, long sentence):**
```
REF: We still have issues with recording cutting out on long sentences...
HYP: We still have issues with recording cutting out on long sentences. and we still have issues with recording cutting out on long sentences, and we still have issues with recording cutting out on long sentences, and we need, and we need, and we need, and we need, and we need, and we need, and
similarity: 0.342
```

**Numeric hallucination (tiny model):**
```
REF: Invoice 4827 totals one hundred and fifty three dollars and twelve cents
HYP: Invoice 4827 totals 153 dollars and 12700000000000000000000000000000000000000000000000000000...
similarity: 0.232
```

## Root cause

The repetition guard in `asr_moonshine.py:_guard_repetition()` has three blind spots:

1. **Hyphenated tokens bypass word-level n-gram detection.**
   `"hake-hake-hake-hake"` is a single word when split by spaces, so the
   n-gram loop (which splits on whitespace) never sees a repeating pattern.
   The word-count cap (`_MAX_WORDS_PER_SEC = 6.0`) can't fire either because
   the entire run counts as a single word.

2. **Clause-level repetition exceeds the n-gram window (max 4 words).**
   A repeating unit like `"and we still have issues with recording cutting out
   on long sentences"` is ~11 words. The guard only checks patterns of 1–4
   words, so the full-clause loop is invisible to it.

3. **Character-level repetition inside a single token is never inspected.**
   A run of repeated digits such as `127000000...` is one token, so neither
   the n-gram check nor the word-count cap sees it.
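
The blind spots above can be illustrated with a small stand-in for a whitespace-based n-gram check. This is a sketch, not the code in `asr_moonshine.py`: the function name `has_word_ngram_loop` is hypothetical, and it assumes the threshold of 4 means four consecutive copies of a unit.

```python
def has_word_ngram_loop(text: str, max_n: int = 4, threshold: int = 4) -> bool:
    """Return True if any 1..max_n word unit occurs `threshold` times in a row."""
    words = text.lower().split()
    for n in range(1, max_n + 1):
        # Only start positions with room for `threshold` copies of the unit.
        for start in range(len(words) - n * threshold + 1):
            unit = words[start:start + n]
            if all(
                words[start + k * n:start + (k + 1) * n] == unit
                for k in range(1, threshold)
            ):
                return True
    return False


hyphenated = "The six-six-" + "-".join(["hake"] * 30)
digits = "dollars and 127" + "0" * 50
clause = "and we still have issues with recording cutting out on long sentences, " * 4
short_loop = "and we need, " * 6

print(len(hyphenated.split()))                # 2
print(has_word_ngram_loop(hyphenated))        # False: the loop hides in one token
print(has_word_ngram_loop(digits))            # False: the digit run is one token
print(has_word_ngram_loop(clause))            # False: 12-word unit, window is 4
print(has_word_ngram_loop(clause, max_n=12))  # True once the window is wide enough
print(has_word_ngram_loop(short_loop))        # True: 3-word unit fits the window
```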

### Relevant code

- `shuvoice/asr_moonshine.py` — `_guard_repetition()` (line ~266)
- `_REPETITION_THRESHOLD = 4` (line 61)
- `_MAX_WORDS_PER_SEC = 6.0` (line 60)

## Proposed fixes

### 1. Character-level repetition detection
Before the word-level checks, scan for any character or short substring
repeating more than N times consecutively:

```python
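import re

# Catch "hake-hake-hake..." and "000000000...": a unit of 1-10 characters
# followed by at least 5 more consecutive copies of itself.
char_rep = re.search(r'(.{1,10}?)\1{5,}', text)
if char_rep:
    # Keep everything up to and including the first copy of the unit.
    text = text[:char_rep.start() + len(char_rep.group(1))]
```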
```
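
To see what this pattern does to strings shaped like the observed output, here is a self-contained sketch. The wrapper name `truncate_char_loops` is hypothetical and the inputs are shortened reconstructions of the hypotheses above, not new benchmark results.

```python
import re

# Same pattern as the proposal: a 1-10 character unit followed by at
# least 5 more consecutive copies of itself.
_CHAR_REPEAT = re.compile(r"(.{1,10}?)\1{5,}")


def truncate_char_loops(text: str) -> str:
    """Cut the text after the first copy of the earliest repeating unit."""
    match = _CHAR_REPEAT.search(text)
    if match is None:
        return text
    return text[: match.start() + len(match.group(1))]


twister = "The six-six-" + "hake-" * 30
invoice = "totals 153 dollars and 127" + "0" * 40

print(truncate_char_loops(twister))  # The six-six-hake
print(truncate_char_loops(invoice))  # totals 153 dollars and 1270
```

Note that the leftmost match for the tongue twister starts at the hyphen before the first `hake`, so the detected unit is `-hake` rather than `hake-`. The truncation keeps one copy of the looping unit; whether to drop that copy as well is a separate design decision.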

### 2. Expand n-gram window or use substring matching
Either increase the max pattern length from 4 to ~15 words, or use a
suffix-based approach that detects when the last N words appeared earlier
in the output.
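
One possible shape for the wider-window variant is sketched below. The names `truncate_word_loops`, `max_n`, and `min_repeats` are hypothetical, and comparing on punctuation-stripped, lowercased words is an added assumption so that `sentences.` and `sentences,` count as the same word.

```python
import string


def truncate_word_loops(text: str, max_n: int = 15, min_repeats: int = 3) -> str:
    """Cut the text after the first copy of the earliest looping word unit."""
    tokens = text.split()
    # Compare on a normalized form so punctuation and case do not hide a loop.
    norm = [t.lower().strip(string.punctuation) for t in tokens]
    for start in range(len(norm)):
        for n in range(1, max_n + 1):
            unit = norm[start:start + n]
            if len(unit) < n:
                break  # not enough words left for a unit of this length
            repeats = 1
            pos = start + n
            # Count back-to-back copies of the unit.
            while norm[pos:pos + n] == unit:
                repeats += 1
                pos += n
            if repeats >= min_repeats:
                return " ".join(tokens[:start + n])
    return text


hyp = (
    "We still have issues with recording cutting out on long sentences. "
    + "and we still have issues with recording cutting out on long sentences, " * 2
    + "and we need, " * 6
    + "and"
)
print(truncate_word_loops(hyp))
# We still have issues with recording cutting out on long sentences. and
```

The trailing `and` survives because the earliest looping unit found is the 12-word span starting at the first word. This sketch only addresses word-level loops; the hyphenated and digit cases still need the character-level check.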

### 3. Output length cap relative to input duration
The existing `_MAX_WORDS_PER_SEC` cap limits word count but not character
count, so a single oversized token passes it. Add a parallel character-count cap:

```python
max_chars = max(100, int(audio_seconds * 40))  # ~40 chars/sec generous cap
if len(text) > max_chars:
    text = text[:max_chars].rsplit(' ', 1)[0]
```

## Acceptance criteria

- [ ] Tongue twister phrase does not produce >50 characters of repeated pattern
- [ ] Long sentence does not produce clause-level repetition loops
- [ ] Numeric/digit repetition is truncated (e.g. `000000...`)
- [ ] Existing `test_asr.py` tests still pass
- [ ] Add regression test cases for each repetition class
