fix(channel,tool): use floor_char_boundary for UTF-8 safe string truncation #5458

Open
WanZheng wants to merge 4 commits into zeroclaw-labs:master from WanZheng:fix/utf8-safe-string-truncation
Conversation

WanZheng commented Apr 7, 2026

Summary

  • Base branch target (master for all contributions): master
  • Problem: Direct byte-index slicing (e.g. &text[..200]) panics on multibyte UTF-8 characters (CJK, emoji, box-drawing) when the byte index falls inside a multi-byte code point.
  • Why it matters: Any user sending non-ASCII content through the Bluesky, Lark, Slack, or Twitter channels or the LinkedIn tool could trigger a runtime panic, crashing the agent loop.
  • What changed: Replaced all bare byte-index truncation sites with str::floor_char_boundary() to round down to the nearest valid char boundary before slicing. Strengthened the Slack split_text_into_chunks regression test with mixed ASCII/multibyte input, a long line ending near the chunk limit with a trailing multibyte char, and per-chunk is_char_boundary checks; rustfmt-only cleanup on that test call.
  • What did not change (scope boundary): No logic changes to message routing, API calls, or channel/tool behavior beyond truncation safety. No new dependencies.
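
The failure mode in the summary can be observed without panicking by using the checked `str::get`, which returns `None` at exactly the byte indices where `&text[..n]` would panic. The following is a minimal sketch; the names `checked_prefix` and `safe_cut_points` are illustrative assumptions and do not appear in the PR.

```rust
/// Returns the first `max_bytes` bytes of `text`, or `None` when that byte
/// index falls inside a multi-byte code point (the case in which the
/// unchecked slice `&text[..max_bytes]` panics).
/// Hypothetical helper for illustration only.
fn checked_prefix(text: &str, max_bytes: usize) -> Option<&str> {
    // A limit at or past the end needs no truncation.
    if max_bytes >= text.len() {
        return Some(text);
    }
    text.get(..max_bytes)
}

/// Lists every byte index in `0..=text.len()` at which slicing is safe.
fn safe_cut_points(text: &str) -> Vec<usize> {
    (0..=text.len())
        .filter(|&i| text.is_char_boundary(i))
        .collect()
}
```

For a string such as `"a中😀"` (1 + 3 + 4 bytes), only indices 0, 1, 4, and 8 are safe cut points, so a fixed limit like 2 or 200 is safe only by coincidence once multi-byte text is involved.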

Label Snapshot (required)

  • Risk label (risk: low|medium|high): risk: low
  • Size label (size: XS|S|M|L|XL, auto-managed/read-only): size: S
  • Scope labels: channel, tool
  • Module labels: channel: bluesky, channel: lark, channel: slack, channel: twitter, tool: linkedin
  • Contributor tier label: N/A
  • If any auto-label is incorrect, note requested correction: N/A

Change Metadata

  • Change type (bug|feature|refactor|docs|security|chore): bug
  • Primary scope (runtime|provider|channel|memory|security|ci|docs|multi): multi (channel + tool)

Linked Issue

  • Closes #
  • Related #
  • Depends on #
  • Supersedes #

Supersede Attribution (required when Supersedes # is used)

N/A

Validation Evidence (required)

Commands and result summary:

cargo fmt --all -- --check   # pass (prior validation)
cargo clippy --all-targets -- -D warnings   # pass (prior validation)
cargo test -p zeroclawlabs channels::slack   # pass (90 tests × lib + bin); includes split_text_into_chunks_safe_on_multibyte_utf8
  • Evidence provided: Slack regression test exercises chunk boundaries with mixed scripts (CJK, Cyrillic, emoji, musical symbol, box-drawing) and asserts each chunk ends on a UTF-8 char boundary; full channels::slack unit suite re-run after the test hardening commits.
  • If any command is intentionally skipped, explain why: Full-repo cargo test not re-run in this update session; Slack-scoped suite covers the changed test and module.
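
The invariants the regression test is described as checking (per-chunk maximum length, char-boundary safety, and lossless reassembly) can be sketched against a simplified splitter. This is an assumption-laden illustration: `split_into_chunks` is a hypothetical stand-in, not the project's `split_text_into_chunks`, whose actual signature and line-handling rules are not shown in this PR.

```rust
/// Splits `text` into chunks of at most `max_bytes` bytes each, cutting only
/// on UTF-8 char boundaries. Hypothetical stand-in for the Slack splitter.
fn split_into_chunks(text: &str, max_bytes: usize) -> Vec<&str> {
    // Every chunk must be able to hold one full char (at most 4 bytes in
    // UTF-8); otherwise the loop below could fail to make progress.
    assert!(max_bytes >= 4, "max_bytes must be at least 4");
    let mut chunks = Vec::new();
    let mut rest = text;
    while !rest.is_empty() {
        // Start at the byte limit and walk back to the nearest boundary.
        let mut end = rest.len().min(max_bytes);
        while !rest.is_char_boundary(end) {
            end -= 1;
        }
        let (head, tail) = rest.split_at(end);
        chunks.push(head);
        rest = tail;
    }
    chunks
}
```

Because each chunk is a `&str` produced by `split_at`, a cut that lands on an invalid boundary would panic inside the splitter itself, so reaching the assertions at all is already evidence of boundary safety.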

Security Impact (required)

  • New permissions/capabilities? No
  • New external network calls? No
  • Secrets/tokens handling changed? No
  • File system access scope changed? No
  • If any Yes, describe risk and mitigation: N/A

Privacy and Data Hygiene (required)

  • Data-hygiene status: pass
  • Redaction/anonymization notes: Test fixture uses neutral multilingual/symbol content, no personal data.
  • Neutral wording confirmation: Yes — all test data uses project-neutral wording.

Compatibility / Migration

  • Backward compatible? Yes
  • Config/env changes? No
  • Migration needed? No
  • If yes, exact upgrade steps: N/A

i18n Follow-Through (required when docs or user-facing wording changes)

  • i18n follow-through triggered? No (code-only change, no user-facing wording changes)

Human Verification (required)

What was personally validated beyond CI:

  • Verified scenarios: Slack message splitting with multibyte UTF-8 at chunk edges; reassembled prefix invariant.
  • Edge cases checked: Per-chunk max length, char-boundary safety, mixed single/multi-byte runs.
  • What was not verified: Live channel integration tests (Bluesky, Twitter, Lark) — covered by unit tests and identical truncation pattern elsewhere.

Side Effects / Blast Radius (required)

  • Affected subsystems/workflows: Bluesky, Lark, Slack, Twitter channels; LinkedIn tool and LinkedIn client image card generation.
  • Potential unintended effects: Truncated strings may be up to 3 bytes shorter than the requested limit (when the limit falls inside a 4-byte character); this is cosmetic and is the correct behavior.
  • Guardrails/monitoring for early detection: Existing channel tests + Slack UTF-8 chunking regression test. Runtime panics would show in agent logs.
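
The 3-byte bound on the shortening follows from UTF-8 encoding each code point in at most 4 bytes. A small sketch that measures the shortfall directly; `truncation_shortfall` is a hypothetical name introduced for this illustration.

```rust
/// Number of bytes by which a boundary-safe cut falls short of `limit`.
/// Returns 0 when `limit` already lies on a char boundary or past the end.
/// Hypothetical helper for illustration only.
fn truncation_shortfall(s: &str, limit: usize) -> usize {
    if limit >= s.len() {
        return 0;
    }
    // Search downward from `limit` for the nearest char boundary.
    // Index 0 is always a boundary, so the search always succeeds.
    let cut = (0..=limit)
        .rev()
        .find(|&i| s.is_char_boundary(i))
        .unwrap_or(0);
    limit - cut
}
```

With `"ab😀"`, limits 3, 4, and 5 fall inside the 4-byte emoji and are shortened by 1, 2, and 3 bytes respectively.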

Agent Collaboration Notes (recommended)

  • Agent tools used: Cursor agent / local validation
  • Workflow/plan summary: UTF-8-safe truncation fix; follow-up test hardening and channels::slack test run; PR synced via push to fork.
  • Verification focus: Slack chunking regression and full slack unit filter.
  • Confirmation: naming + architecture boundaries followed (AGENTS.md + CONTRIBUTING.md): Yes

Rollback Plan (required)

  • Fast rollback command/path: git revert <merge-commit> — single logical change set, no config/state changes.
  • Feature flags or config toggles (if any): None needed.
  • Observable failure symptoms: If rollback is needed, multibyte truncation panics would return. Monitor agent logs for panic traces in channel send paths.

Risks and Mitigations

  • Risk: floor_char_boundary requires a sufficiently new stable Rust.
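    • Mitigation: Truncation goes through crate::util::floor_char_boundary rather than str::floor_char_boundary, which is stable only since Rust 1.91 while Cargo.toml declares rust-version = "1.87"; CI enforces this through clippy::incompatible_msrv.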

Made with Cursor

Direct byte-index slicing could panic on multibyte UTF-8 characters
(e.g. CJK, emoji). Replace all truncation sites with
floor_char_boundary() to round down to the nearest char boundary.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
github-actions bot added the labels channel, tool, channel:slack, channel:lark, channel:bluesky, and channel:twitter on Apr 7, 2026
WanZheng added 3 commits April 7, 2026 19:33
Replace str::floor_char_boundary (stable 1.91) with crate::util::floor_char_boundary so clippy::incompatible_msrv passes under rust-version 1.87. Deduplicate agent/history and composio helpers.

Made-with: Cursor
github-actions bot added the labels core, agent, and tool:composio on Apr 7, 2026
WanZheng (Author) commented Apr 7, 2026

Pushed follow-up commit fix: centralize UTF-8 floor_char_boundary for MSRV 1.87.

Why: CI Lint runs clippy -D warnings. str::floor_char_boundary is stable only since Rust 1.91, while Cargo.toml declares rust-version = "1.87", so clippy::incompatible_msrv failed. The new crate::util::floor_char_boundary keeps the same UTF-8 behavior without raising MSRV (bumping MSRV would also surface many collapsible_if lints).
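
An MSRV-compatible helper of this kind can be built on `str::is_char_boundary`, which has been stable far longer than `str::floor_char_boundary`. The sketch below is an assumption about what such a helper might look like; the body of the real `crate::util::floor_char_boundary` is not shown in this PR, and `truncate_utf8` is a hypothetical wrapper added for the usage example.

```rust
/// Largest byte index `<= index` that lies on a UTF-8 char boundary of `s`.
/// Indices at or past the end clamp to `s.len()`.
/// Sketch only; not the project's actual implementation.
pub fn floor_char_boundary(s: &str, index: usize) -> usize {
    if index >= s.len() {
        return s.len();
    }
    let mut i = index;
    // Index 0 is always a char boundary, so this loop terminates.
    while !s.is_char_boundary(i) {
        i -= 1;
    }
    i
}

/// Truncates `s` to at most `max_bytes` bytes without splitting a char.
/// Hypothetical wrapper showing how a call site might use the helper.
pub fn truncate_utf8(s: &str, max_bytes: usize) -> &str {
    &s[..floor_char_boundary(s, max_bytes)]
}
```

A call site that previously wrote `&text[..200]` would, under these assumptions, become `truncate_utf8(text, 200)` or slice with the helper's result directly.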

singlerider requested a review from Audacity88 on April 29, 2026 08:18
singlerider added the labels bug, risk: medium, size: S, and needs-author-action on Apr 29, 2026
Audacity88 (Collaborator) commented:

I checked the current queue state for this PR. It is still DIRTY against master and has needs-author-action, so I’m going to hold off on a full review until it is rebased.

The MSRV follow-up noted in the comment makes sense conceptually: centralizing UTF-8 floor-boundary logic is preferable to relying on str::floor_char_boundary while the project declares Rust 1.87. Once the branch is current, this should be a relatively focused review of the helper and the touched truncation call sites.

Audacity88 (Collaborator) commented:

Hi @WanZheng, just checking in on this one. The PR is still conflicting with master and still has needs-author-action, and my May 1 note was waiting for a rebase before doing the full review.

Are you still planning to update this branch? The UTF-8 truncation fix looks useful, so if you want to keep it moving, the next step is rebasing onto current master and letting CI rerun. If you are no longer planning to work on it, that is fine too; just let us know so we can decide whether to close this PR and recover the fix separately later.


Labels

  • agent (Auto scope: src/agent/** changed.)
  • bug (Something isn't working)
  • channel:bluesky
  • channel:lark (Auto module: channel/lark changed.)
  • channel:slack (Auto module: channel/slack changed.)
  • channel:twitter
  • channel (Auto scope: src/channels/** changed.)
  • core (Auto scope: root src/*.rs files changed.)
  • needs-author-action (Author action required before merge)
  • risk: medium (Auto risk: src/** or dependency/config changes.)
  • size: S (Auto size: 81-250 non-doc changed lines.)
  • tool:composio (Auto module: tool/composio changed.)
  • tool (Auto scope: src/tools/** changed.)

Projects

Status: Backlog

3 participants