Skip to content

Conversation

@kivikakk
Copy link
Owner

@kivikakk kivikakk commented Sep 23, 2025

Warning

This breaks many external APIs!

This PR is an attempt at doing something I've wanted to do basically forever — remove a bunch of std::str::from_utf8 and String::from_utf8 calls by not forcing strings through byte-oriented interfaces.

Motivation

Currently, formatters use std::io::Write at the boundary, and internally deal with that and a lot of &[u8]s and some Vec<u8>s. There are many calls to str::as_bytes and String::as_bytes, essentially throwing away the information that this source data was valid UTF-8, and then oftentimes places where we later have to use from_utf8 variants on buffers to reconstruct a &str or String from them.

One of the main ways of working with the library is to create a Vec<u8>, format into it, and then use a from_utf8 variant on it. This re-validates the buffer contents as UTF-8, even though there should be no way of non-UTF-8 data getting into that buffer.

The library also uses unsafe with from_utf8_unchecked variants at times, including in formatters, for the same reason.

Proposal

We replace almost every use of std::io::Write with std::fmt::Write. This means you can hand a &mut String to comrak::format_html_with_plugins.

Users no longer need to revalidate the buffer as UTF-8 to get a &str or String, and many internal interfaces now don't need to assert the same thing — that information is preserved from the input document all the way to the output.

Downsides

You can't hand an std::io::Write directly to these functions any more, such as an std::fs::File or an std::io::BufWriter. We add the fmt2io dependency for the binary (only) to do this, and end-users can do similar if they want. The implementation if you want to DIY is about 10 lines: forward std::fmt::Write::write_str to std::io::Write::write_all.

Alternatives considered

  1. Do nothing. We keep revalidating (or requiring users to).
  2. Use unsafe + from_utf8_unchecked variants more widely to avoid such revalidation, and hide it from the calling code.

Still to do

  • comrak::cm internally still writes out to a Vec<u8> and then revalidates it. I just haven't gotten to it yet.
  • We still do some string slicing — before we had to assert the sliced buffer was UTF-8, now the SliceIndex<str> impl only needs to check that the slice starts and ends at a character boundary. There are some places where we could possibly (safely) obviate the need to do that.

I'm interested in thoughts from users. I know it's annoying to keep up-to-date with the APIs of your dependencies changing, but there's bits and pieces like this I'd like to get cleaned up — maybe Comrak 1.0 can be a thing sometime soon?

@github-actions
Copy link
Contributor

Command Mean [ms] Min [ms] Max [ms] Relative
./bench.sh ./comrak-8c6e8de 311.4 ± 4.9 309.0 334.1 2.94 ± 0.05
./bench.sh ./comrak-main 315.6 ± 2.2 313.1 324.4 2.98 ± 0.03
./bench.sh ./pulldown-cmark 105.8 ± 0.6 104.8 107.8 1.00
./bench.sh ./cmark-gfm 116.3 ± 1.2 114.3 119.0 1.10 ± 0.01
./bench.sh ./markdown-it 475.5 ± 2.1 471.8 480.8 4.49 ± 0.03

Run on Tue Sep 23 07:35:21 UTC 2025

@kivikakk
Copy link
Owner Author

kivikakk commented Sep 23, 2025

This branch's cargo bench:

test bench_progit ... bench:   1,974,452.07 ns/iter (+/- 59,249.69)

main's:

test bench_progit ... bench:   2,093,428.12 ns/iter (+/- 56,840.20)

5.7% runtime decrease!

(The benchmarks reported in #601 (comment) go straight to stdout and so don't do the largest amount of revalidation anyway.)

@gjtorikian
Copy link
Collaborator

I'm interested in thoughts from users. I know it's annoying to keep up-to-date with the APIs of your dependencies changing

As a downstream API consumer, I would much rather have dependencies changed—and presumably, improved—than the opposite.

@kivikakk kivikakk force-pushed the push-lsmztupslltn branch 2 times, most recently from 60171b8 to 438e2b9 Compare September 24, 2025 07:49
@liamwhite
Copy link
Contributor

As an API consumer, definitely would like to see this breaking change land

@digitalmoksha
Copy link
Collaborator

This looks like it would be a great change. I say go for it.

BufWriter gets us nothing when writing to a Vec; see
https://doc.rust-lang.org/std/io/struct.BufWriter.html:

    BufWriter<W> can improve the speed of programs that make small and
    repeated write calls to the same file or network socket. It does
    not help when writing very large amounts at once, or writing just
    one or a few times. It also provides no advantage when writing to a
    destination that is in memory, like a Vec<u8>.
@kivikakk
Copy link
Owner Author

Thanks all! Let's do it.

@kivikakk kivikakk enabled auto-merge September 26, 2025 04:30
@kivikakk kivikakk merged commit 451fe12 into main Sep 26, 2025
44 checks passed
@kivikakk kivikakk deleted the push-lsmztupslltn branch September 26, 2025 04:31
@Turbo87
Copy link
Contributor

Turbo87 commented Oct 14, 2025

I might be reading this wrong, but I think this change broke the "Usage" snippet in the Readme: https://github.com/kivikakk/comrak?tab=readme-ov-file#usage 🤔

@kivikakk
Copy link
Owner Author

!! Thanks for mentioning this, you're reading it exactly right! I need to sync that in an automated fashion or something; those examples match examples/sample.rs, but I didn't copy them back to the README 🤦‍♀️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants