
Encoding mixups: stop assuming UTF-8 by default and record encoding decisions at lifecycle boundaries #644

@robertDouglass

Description


Problem

We hit encoding failures on Windows where content ended up mixed between Windows-1252 (cp1252) and UTF-8. The deeper problem is broader than one code path: Spec Kitty appears biased toward assuming UTF-8 across many reads and writes instead of first establishing which encoding the incoming content actually uses.

This showed up in a Windows + Gemini workflow, but the issue is not Gemini-specific and not limited to one artifact. Charter content, mission artifacts, generated markdown, templates, and other persisted text can all be affected if the system decodes too early under a UTF-8 assumption.

Core issue

The system needs an explicit encoding contract, not scattered best-effort assumptions.

Right now the repo already has some validation/sanitization utilities, but the lifecycle still appears to have gaps around:

  • detecting the source encoding when content is first ingested or generated,
  • recording the encoding decision or normalization decision as provenance/metadata,
  • re-checking that contract at important lifecycle boundaries,
  • failing clearly when content is mixed, ambiguous, or already corrupted,
  • avoiding silent propagation of mis-decoded text into downstream prompts and artifacts.
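A minimal sketch of what a detection chokepoint could look like in Python. This is illustrative only: `decode_ingested_text` is a hypothetical name, not an existing Spec Kitty API, and the fallback order (strict UTF-8, then cp1252, then fail loudly) is one reasonable policy, not the only one.

```python
def decode_ingested_text(raw: bytes) -> tuple[str, str]:
    """Decode raw bytes, returning (text, detected_encoding).

    Hypothetical chokepoint: try strict UTF-8 first (with BOM
    handling), fall back to cp1252, and raise instead of silently
    guessing on anything else.
    """
    if raw.startswith(b"\xef\xbb\xbf"):  # UTF-8 BOM
        return raw[3:].decode("utf-8"), "utf-8-sig"
    try:
        return raw.decode("utf-8", errors="strict"), "utf-8"
    except UnicodeDecodeError:
        pass
    try:
        return raw.decode("cp1252", errors="strict"), "cp1252"
    except UnicodeDecodeError as exc:
        raise ValueError(
            f"undecodable content at byte {exc.start}"
        ) from exc
```

The key property is that every ingestion path funnels through one function whose decision is observable, rather than each call site doing its own `errors="replace"` decode.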

Why this matters

If content is decoded under the wrong assumption once, corruption spreads. By the time a user notices garbled characters, the bad text may already be embedded in charter state, mission files, prompts, logs, or synced artifacts.

Windows-originated content makes this easier to trigger because cp1252 is still common in some editors, shells, copy/paste paths, and generated output. But the real bug is the product-level assumption that UTF-8 can be treated as the default truth without first detecting and recording the contract.
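The asymmetry is worth spelling out, because it explains why the corruption is often silent. A stdlib-only demonstration:

```python
# cp1252 bytes decoded as UTF-8: hard failure (é is 0xE9, which is
# an invalid UTF-8 sequence here), so at least the error is visible.
raw_cp1252 = "café".encode("cp1252")
try:
    raw_cp1252.decode("utf-8")
except UnicodeDecodeError as exc:
    print(f"invalid UTF-8 at byte {exc.start}")

# UTF-8 bytes decoded as cp1252: no error at all, just silently
# corrupted text ("cafÃ©") that can propagate into downstream
# prompts, artifacts, and logs.
raw_utf8 = "café".encode("utf-8")
print(raw_utf8.decode("cp1252"))
```

The second direction is the dangerous one: nothing throws, so without an explicit contract the garbled text is persisted as if it were correct.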

Requested behavior

  1. Introduce a general encoding-detection chokepoint for externally sourced or newly ingested text content.
  2. Record the decided encoding or normalization result in provenance/metadata where the lifecycle depends on it.
  3. Re-validate that contract at critical boundaries such as charter load/compile and mission begin/start.
  4. If content is mixed or ambiguous, fail with a targeted diagnostic that says what was detected, where, and how to repair it.
  5. Normalize persisted markdown/text to UTF-8 only after the source encoding decision is known.
  6. Audit broad UTF-8 assumptions so the system does not silently mis-decode content before validation happens.
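Steps 2 and 3 could be sketched as a provenance record written at normalization time and re-checked at lifecycle boundaries. Everything here is hypothetical (the dataclass, field names, and digest-based check are assumptions for illustration, not Spec Kitty internals):

```python
from dataclasses import dataclass
import hashlib


@dataclass(frozen=True)
class EncodingProvenance:
    source: str             # path or logical artifact name
    detected_encoding: str  # what the detection chokepoint decided
    normalized_to: str      # normalization target, e.g. "utf-8"
    sha256: str             # digest of the normalized bytes


def record_normalization(
    source: str, raw: bytes, detected: str
) -> tuple[str, EncodingProvenance]:
    """Decode with the detected encoding, normalize to UTF-8,
    and record the decision alongside a content digest."""
    text = raw.decode(detected)
    normalized = text.encode("utf-8")
    return text, EncodingProvenance(
        source=source,
        detected_encoding=detected,
        normalized_to="utf-8",
        sha256=hashlib.sha256(normalized).hexdigest(),
    )


def revalidate(raw: bytes, prov: EncodingProvenance) -> None:
    """Re-check the contract at a boundary (e.g. charter load):
    bytes must match the recorded digest and still decode cleanly."""
    if hashlib.sha256(raw).hexdigest() != prov.sha256:
        raise ValueError(
            f"{prov.source}: content changed since normalization"
        )
    raw.decode(prov.normalized_to)  # raises if no longer valid UTF-8
```

With a record like this persisted next to the artifact, a user debugging garbled output can see which encoding decision was made and where, instead of reverse-engineering it from the bytes.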

Important scope note

This is not just a charter bug.

Charter and mission-begin are important checkpoints because they are high-leverage lifecycle boundaries, but the underlying issue is more general: encoding assumptions are distributed across the system and need a canonical policy plus provenance.

Acceptance criteria

  • Windows-originated cp1252 content is either safely normalized to UTF-8 or rejected with a precise diagnostic before corruption spreads.
  • The system records what encoding contract or normalization decision it relied on for critical lifecycle inputs.
  • Mission start and charter-related lifecycle steps do not silently consume mixed-encoding text.
  • Users can inspect the recorded encoding decision later when debugging provenance.
  • Broad UTF-8 assumptions are reduced behind explicit detection/validation chokepoints rather than remaining ad hoc.
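For the "precise diagnostic" criterion, a possible shape is a per-line scan that reports where content stops being valid UTF-8 and suggests a repair, rather than failing opaquely. This is a sketch under the assumption that artifacts are line-oriented text; the function name and message format are invented for illustration:

```python
def diagnose_mixed_utf8(raw: bytes) -> list[str]:
    """Report each line that is not valid UTF-8, with the offending
    byte, surrounding context, and a repair hint."""
    problems = []
    for lineno, line in enumerate(raw.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError as exc:
            snippet = line[max(exc.start - 5, 0):exc.start + 5]
            problems.append(
                f"line {lineno}, byte {exc.start}: not UTF-8 "
                f"(0x{line[exc.start]:02x} near {snippet!r}); "
                "likely cp1252; re-save the file as UTF-8"
            )
    return problems
```

A diagnostic at this granularity tells the user what was detected, where, and how to repair it, which is exactly what a mixed-encoding failure needs.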

Notes

Observed in a Windows + Gemini workflow, but the issue should be treated as a general encoding-mixup problem across Spec Kitty rather than a one-off Windows artifact or a single charter-path bug.

Labels: bug