Replace io::Write with fmt::Write to avoid revalidating UTF-8. #601

kivikakk · 2025-09-23T07:31:00Z

Warning

This breaks many external APIs!

This PR is an attempt at doing something I've wanted to do basically forever — remove a bunch of std::str::from_utf8 and String::from_utf8 calls by not forcing strings through byte-oriented interfaces.

Motivation

Currently, formatters use std::io::Write at the boundary, and internally deal with that and a lot of &[u8]s and some Vec<u8>s. There are many calls to str::as_bytes and String::as_bytes, essentially throwing away the information that this source data was valid UTF-8, and then oftentimes places where we later have to use from_utf8 variants on buffers to reconstruct a &str or String from them.

One of the main ways of working with the library is to create a Vec<u8>, format into it, and then use a from_utf8 variant on it. This re-validates the buffer contents as UTF-8, even though there should be no way of non-UTF-8 data getting into that buffer.

The library also uses unsafe with from_utf8_unchecked variants at times, including in formatters, for the same reason.

Proposal

We replace almost every use of std::io::Write with std::fmt::Write. This means you can hand a &mut String to comrak::format_html_with_plugins.

Users no longer need to revalidate the buffer as UTF-8 to get a &str or String, and many internal interfaces now don't need to assert the same thing — that information is preserved from the input document all the way to the output.

Downsides

You can't hand an std::io::Write directly to these functions any more, such as an std::fs::File or an std::io::BufWriter. We add the fmt2io dependency for the binary (only) to do this, and end-users can do similar if they want. The implementation if you want to DIY is about 10 lines: forward std::fmt::Write::write_str to std::io::Write::write_all.

Alternatives considered

Do nothing. We keep revalidating (or requiring users to).
Use unsafe + from_utf8_unchecked variants more widely to avoid such revalidation, and hide it from the calling code.

Still to do

comrak::cm internally still writes out to a Vec<u8> and then revalidates it. I just haven't gotten to it yet.
We still do some string slicing — before we had to assert the sliced buffer was UTF-8, now the SliceIndex<str> impl only needs to check that the slice starts and ends at a character boundary. There are some places where we could possibly (safely) obviate the need to do that.

I'm interested in thoughts from users. I know it's annoying to keep up-to-date with the APIs of your dependencies changing, but there's bits and pieces like this I'd like to get cleaned up — maybe Comrak 1.0 can be a thing sometime soon?

github-actions · 2025-09-23T07:35:22Z

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`./bench.sh ./comrak-8c6e8de`	311.4 ± 4.9	309.0	334.1	2.94 ± 0.05
`./bench.sh ./comrak-main`	315.6 ± 2.2	313.1	324.4	2.98 ± 0.03
`./bench.sh ./pulldown-cmark`	105.8 ± 0.6	104.8	107.8	1.00
`./bench.sh ./cmark-gfm`	116.3 ± 1.2	114.3	119.0	1.10 ± 0.01
`./bench.sh ./markdown-it`	475.5 ± 2.1	471.8	480.8	4.49 ± 0.03

Run on Tue Sep 23 07:35:21 UTC 2025

kivikakk · 2025-09-23T07:46:27Z

This branch's cargo bench:

test bench_progit ... bench:   1,974,452.07 ns/iter (+/- 59,249.69)

main's:

test bench_progit ... bench:   2,093,428.12 ns/iter (+/- 56,840.20)

5.7% runtime decrease!

(The benchmarks reported in #601 (comment) go straight to stdout and so don't do the largest amount of revalidation anyway.)

gjtorikian · 2025-09-24T04:19:11Z

I'm interested in thoughts from users. I know it's annoying to keep up-to-date with the APIs of your dependencies changing

As a downstream API consumer, I would much rather have dependencies changed—and presumably, improved—than the opposite.

liamwhite · 2025-09-24T12:28:15Z

As an API consumer, definitely would like to see this breaking change land

digitalmoksha · 2025-09-24T17:24:12Z

This looks like it would be a great change. I say go for it.

BufWriter gets us nothing when writing to a Vec; see https://doc.rust-lang.org/std/io/struct.BufWriter.html: BufWriter<W> can improve the speed of programs that make small and repeated write calls to the same file or network socket. It does not help when writing very large amounts at once, or writing just one or a few times. It also provides no advantage when writing to a destination that is in memory, like a Vec<u8>.

kivikakk · 2025-09-26T04:30:19Z

Thanks all! Let's do it.

Turbo87 · 2025-10-14T05:52:25Z

I might be reading this wrong, but I think this change broke the "Usage" snippet in the Readme: https://github.com/kivikakk/comrak?tab=readme-ov-file#usage 🤔

kivikakk · 2025-10-14T06:24:30Z

!! Thanks for mentioning this, you're reading it exactly right! I need to sync that in an automated fashion or something; those examples match examples/sample.rs, but I didn't copy them back to the README 🤦‍♀️

kivikakk added the api breaking change label Sep 23, 2025

kivikakk force-pushed the push-lsmztupslltn branch from 04bf03b to 119986c Compare September 24, 2025 01:10

kivikakk force-pushed the push-lsmztupslltn branch 2 times, most recently from 60171b8 to 438e2b9 Compare September 24, 2025 07:49

kivikakk added 5 commits September 26, 2025 14:27

s/io::Write/fmt::Write/ in external interfaces.

0f9c69b

Makefile: add bench-comrak-vs-main.

99fa272

.gitignore: add bench-output.md.

802d154

changelog: add entry.

0f030ac

kivikakk force-pushed the push-lsmztupslltn branch from 438e2b9 to 0f030ac Compare September 26, 2025 04:27

kivikakk enabled auto-merge September 26, 2025 04:30

kivikakk merged commit 451fe12 into main Sep 26, 2025
44 checks passed

kivikakk deleted the push-lsmztupslltn branch September 26, 2025 04:31

BrewTestBot mentioned this pull request Sep 29, 2025

comrak 0.43.0 Homebrew/homebrew-core#246142

Merged

kivikakk mentioned this pull request Oct 14, 2025

README: sync to sample (and improve sample quality). #621

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Replace io::Write with fmt::Write to avoid revalidating UTF-8. #601

Replace io::Write with fmt::Write to avoid revalidating UTF-8. #601

Uh oh!

kivikakk commented Sep 23, 2025 •

edited

Loading

Uh oh!

github-actions bot commented Sep 23, 2025

Uh oh!

kivikakk commented Sep 23, 2025 •

edited

Loading

Uh oh!

gjtorikian commented Sep 24, 2025

Uh oh!

liamwhite commented Sep 24, 2025

Uh oh!

digitalmoksha commented Sep 24, 2025

Uh oh!

kivikakk commented Sep 26, 2025

Uh oh!

Uh oh!

Turbo87 commented Oct 14, 2025

Uh oh!

kivikakk commented Oct 14, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Uh oh!

Replace io::Write with fmt::Write to avoid revalidating UTF-8. #601

Replace io::Write with fmt::Write to avoid revalidating UTF-8. #601

Uh oh!

Conversation

kivikakk commented Sep 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Proposal

Downsides

Alternatives considered

Still to do

Uh oh!

github-actions bot commented Sep 23, 2025

Uh oh!

kivikakk commented Sep 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gjtorikian commented Sep 24, 2025

Uh oh!

liamwhite commented Sep 24, 2025

Uh oh!

digitalmoksha commented Sep 24, 2025

Uh oh!

kivikakk commented Sep 26, 2025

Uh oh!

Uh oh!

Turbo87 commented Oct 14, 2025

Uh oh!

kivikakk commented Oct 14, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

kivikakk commented Sep 23, 2025 •

edited

Loading

kivikakk commented Sep 23, 2025 •

edited

Loading