Replace io::Write with fmt::Write to avoid revalidating UTF-8. #601
Run on Tue Sep 23 07:35:21 UTC 2025
This branch vs. main: 5.7% runtime decrease! (The benchmarks reported in #601 (comment) go straight to stdout and so don't do the largest amount of revalidation anyway.)
As a downstream API consumer, I would much rather have dependencies changed, and presumably improved, than the opposite.
As an API consumer, definitely would like to see this breaking change land
This looks like it would be a great change. I say go for it.
BufWriter gets us nothing when writing to a Vec; see https://doc.rust-lang.org/std/io/struct.BufWriter.html: BufWriter<W> can improve the speed of programs that make small and repeated write calls to the same file or network socket. It does not help when writing very large amounts at once, or writing just one or a few times. It also provides no advantage when writing to a destination that is in memory, like a Vec<u8>.
Thanks all! Let's do it.
I might be reading this wrong, but I think this change broke the "Usage" snippet in the Readme: https://github.com/kivikakk/comrak?tab=readme-ov-file#usage 🤔
!! Thanks for mentioning this, you're reading it exactly right! I need to sync that in an automated fashion or something; those examples match |
Warning
This breaks many external APIs!
This PR is an attempt at doing something I've wanted to do basically forever: remove a bunch of std::str::from_utf8 and String::from_utf8 calls by not forcing strings through byte-oriented interfaces.
Motivation
Currently, formatters use std::io::Write at the boundary, and internally deal with that and a lot of &[u8]s and some Vec<u8>s. There are many calls to str::as_bytes and String::as_bytes, essentially throwing away the information that this source data was valid UTF-8, and then oftentimes places where we later have to use from_utf8 variants on buffers to reconstruct a &str or String from them.
One of the main ways of working with the library is to create a Vec<u8>, format into it, and then use a from_utf8 variant on it. This re-validates the buffer contents as UTF-8, even though there should be no way of non-UTF-8 data getting into that buffer.
The library also uses unsafe with from_utf8_unchecked variants at times, including in formatters, for the same reason.
Proposal
We replace almost every use of std::io::Write with std::fmt::Write. This means you can hand a &mut String to comrak::format_html_with_plugins.
Users no longer need to revalidate the buffer as UTF-8 to get a &str or String, and many internal interfaces now don't need to assert the same thing: that information is preserved from the input document all the way to the output.
Downsides
You can't hand an std::io::Write directly to these functions any more, such as an std::fs::File or an std::io::BufWriter. We add the fmt2io dependency for the binary (only) to do this, and end-users can do similar if they want. The implementation if you want to DIY is about 10 lines: forward std::fmt::Write::write_str to std::io::Write::write_all.
Alternatives considered
Use unsafe + from_utf8_unchecked variants more widely to avoid such revalidation, and hide it from the calling code.
Still to do
comrak::cm internally still writes out to a Vec<u8> and then revalidates it. I just haven't gotten to it yet.
The SliceIndex<str> impl only needs to check that the slice starts and ends at a character boundary. There are some places where we could possibly (safely) obviate the need to do that.
I'm interested in thoughts from users. I know it's annoying to keep up-to-date with the APIs of your dependencies changing, but there's bits and pieces like this I'd like to get cleaned up. Maybe Comrak 1.0 can be a thing sometime soon?
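For anyone who wants the DIY route mentioned under Downsides, here is a rough sketch of what such an adapter could look like. It is not the fmt2io implementation and not Comrak code: IoAdapter, render_greeting, render_to_string and render_to_bytes are hypothetical names made up for illustration, and render_greeting merely stands in for a formatter that accepts a std::fmt::Write. The adapter forwards write_str to write_all and stashes the I/O error, since fmt::Error cannot carry one.

```rust
use std::fmt::{self, Write as _};
use std::io::{self, Write as _};

/// Hypothetical adapter (not fmt2io's or Comrak's actual code): lets any
/// `io::Write` be passed where a `fmt::Write` is expected.
pub struct IoAdapter<W: io::Write> {
    inner: W,
    /// First I/O error seen; `fmt::Error` itself carries no detail.
    error: Option<io::Error>,
}

impl<W: io::Write> IoAdapter<W> {
    pub fn new(inner: W) -> Self {
        IoAdapter { inner, error: None }
    }

    /// Returns the wrapped writer, or the I/O error that interrupted writing.
    pub fn finish(self) -> io::Result<W> {
        match self.error {
            Some(e) => Err(e),
            None => Ok(self.inner),
        }
    }
}

impl<W: io::Write> fmt::Write for IoAdapter<W> {
    fn write_str(&mut self, s: &str) -> fmt::Result {
        // Going from &str to bytes is free; no UTF-8 validation is involved.
        match self.inner.write_all(s.as_bytes()) {
            Ok(()) => Ok(()),
            Err(e) => {
                self.error = Some(e);
                Err(fmt::Error)
            }
        }
    }
}

/// Stand-in for a formatter entry point that takes a `fmt::Write`.
/// The real Comrak signatures are not reproduced here.
pub fn render_greeting<W: fmt::Write>(out: &mut W, name: &str) -> fmt::Result {
    writeln!(out, "<p>Hello, {}!</p>", name)
}

/// Format straight into a `String`: no `from_utf8` call afterwards.
pub fn render_to_string(name: &str) -> String {
    let mut out = String::new();
    render_greeting(&mut out, name).expect("writing to a String does not fail");
    out
}

/// Byte sinks (a file, a socket, here a `Vec<u8>`) go through the adapter.
pub fn render_to_bytes(name: &str) -> io::Result<Vec<u8>> {
    let mut adapter = IoAdapter::new(Vec::new());
    // Any failure here comes from the underlying writer, so the fmt::Error is
    // dropped and the stored io::Error is surfaced by finish() instead.
    let _ = render_greeting(&mut adapter, name);
    adapter.finish()
}
```

The same IoAdapter would wrap an std::fs::File or an std::io::BufWriter in place of the Vec<u8>; keeping the io::Error around is a design choice of this sketch, not something the PR prescribes.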