Releases: kivikakk/comrak
v0.48.0
The breaking changes are listed right at the top! Please note that AST content now represents NUL bytes (codepoint number zero) as they were in the input; these used to be translated to the lovely � character at the very beginning of the input process, presumably so the rest of the reference C parser didn't have to deal with the possibility of strings containing NUL bytes. We can do better, though, so let's! The � character is now emitted by our formatters in place of NUL, but if you use custom or manual formatters and emit any part of the AST content directly (without using comrak::html::escape, context::html::escape_href, or the same-named functions on Context), you may need to do the same translation yourself.
We also no longer append a newline to the end of the file where there wasn't one originally, which meant a lot of places in the parser had to adapt to their strings not necessarily containing a newline before they ended. Careful review and extensive fuzzing should have squeaked out any unexpected overruns, but consider my eyes peeled for reports regarding this. (Ew.) We've cleaned up some sourcepos calculation which depended on this behaviour in odd ways, but there may yet be more to discover which our test suite didn't catch.
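If you emit AST text directly from a custom formatter, the NUL translation described above can be done with a plain string replacement. This is a minimal sketch using only the standard library; `replace_nul` is a hypothetical helper name, not something Comrak exports.

```rust
use std::borrow::Cow;

/// Hypothetical helper (not part of Comrak): swap NUL for U+FFFD before
/// emitting AST text from a custom formatter.
fn replace_nul(text: &str) -> Cow<'_, str> {
    if text.contains('\0') {
        // Allocate only when a NUL is actually present.
        Cow::Owned(text.replace('\0', "\u{FFFD}"))
    } else {
        Cow::Borrowed(text)
    }
}

fn main() {
    let from_ast = "a\0b";
    println!("{}", replace_nul(from_ast));
}
```

Returning a `Cow` keeps the common case (no NUL in the text) free of allocation.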
Did you know November is Trans Month? I didn't! I'm guessing it's because Trans Awareness Week falls within it, and we've been having a pretty bad time of it rights-wise around the world lately!
Happy Trans Month, and if you happen to typo it as Trans Moth, we can be happy about that too! 🏳️⚧️ ᖭི༏ᖫྀ
Parser changes:
- No longer translate `NUL` bytes into `U+FFFD REPLACEMENT CHARACTER` in the parse stage; do it in formatters instead. (by @kivikakk in #681)
  - This means the AST now contains `NUL` bytes where they were present in input, preserving the difference between `NUL` and literally-entered `�` characters.
- No longer append a virtual newline at the end of the file where missing. (by @kivikakk in #682)
- The spec allows a line to end with either a newline or EOF; the reference parser would assume any given input string will always have a terminating linefeed and forced that to be the case, and so Comrak used to. Comrak no longer does.
- We also now handle line feed, carriage return, and carriage return plus line feed (as allow'd by the spec) without pretending they're all just a line feed, meaning e.g. sourcepos for softbreaks now correctly spans two bytes when it was produced by a carriage return plus line feed.
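To illustrate the three line endings plus EOF mentioned above, here is a small sketch of how the byte length of a line's terminator could be measured. `line_ending_len` is a hypothetical helper for illustration only and is not how Comrak's parser is structured.

```rust
/// Hypothetical helper: byte length of the line ending that terminates `line`.
fn line_ending_len(line: &str) -> usize {
    if line.ends_with("\r\n") {
        2 // carriage return plus line feed
    } else if line.ends_with('\n') || line.ends_with('\r') {
        1 // lone line feed or lone carriage return
    } else {
        0 // final line of the input, terminated by EOF
    }
}

fn main() {
    for line in ["a\r\n", "b\n", "c\r", "d"] {
        println!("{:?} -> {}", line, line_ending_len(line));
    }
}
```

A softbreak produced by the first case would span two bytes, which is the sourcepos difference the notes describe.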
Changed APIs:
- Remove mandatory space before fenced codeblock info string in CommonMark output. (by @kivikakk in #686)
- Write out `%25` in hrefs where not part of a percent-encode sequence. (by @kivikakk in #687)
  - We used to leave any `%` character alone, such that `[link](%%20)` would roundtrip without change. It now roundtrips to `[link](%25%20)`.
- Relaxed tasklist matching now supports a full Unicode scalar for the character between the `[…]`, and no longer turns single-byte UTF-8 characters into the Unicode codepoint numbered at the UTF-8 byte (!). (by @kivikakk in #689)
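The `%25` rule can be sketched as follows: a `%` is left alone only when two hex digits follow it. This is an illustration under that assumption, not Comrak's actual implementation, and `escape_stray_percent` is a made-up name.

```rust
/// Hypothetical helper, not Comrak's code: escape any `%` that does not
/// start a `%XX` percent-encode sequence.
fn escape_stray_percent(href: &str) -> String {
    let bytes = href.as_bytes();
    let mut out = String::with_capacity(href.len());
    for (i, ch) in href.char_indices() {
        if ch == '%' {
            // A valid sequence needs two ASCII hex digits right after the `%`.
            let is_sequence = bytes.len() >= i + 3
                && bytes[i + 1].is_ascii_hexdigit()
                && bytes[i + 2].is_ascii_hexdigit();
            if !is_sequence {
                out.push_str("%25");
                continue;
            }
        }
        out.push(ch);
    }
    out
}

fn main() {
    println!("{}", escape_stray_percent("%%20"));
}
```

With the input from the note above, `%%20` comes out as `%25%20`: the first `%` is stray, the second begins a valid sequence.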
New APIs:
- Add `highlight` extension, `==for highlights==`! These render with `<mark>` in the HTML formatter. (by @pferreir in #672)
- Add `comrak::Node<'a>` as an alias for `comrak::nodes::Node<'a>`. (by @kivikakk in #673)
- Add `comrak::Arena<'a>` as an alias for `typed_arena::Arena<comrak::nodes::AstNode<'a>>`. (by @kivikakk in #675)
- Add `From<(LineColumn, LineColumn)>` impl for `Sourcepos`. (by @kivikakk in #675)
- Make `comrak::nodes::NodeValue::xml_node_name` public, when you want a handy-to-access name for a node type. (by @kivikakk in #673)
- Add `options.parse.leave_footnote_definitions`; this option causes footnote definitions to not be relocated to the bottom of the document, and unused references not to be garbage collected, for use with custom formatters. (by @kivikakk in #673)
Bug fixes:
- Fix relaxed autolink email in footnote edge case/panic. (by @kivikakk in #677)
- Prevent unexpected post-processing, such as `\[x]` still being eligible for tasklist inclusion despite the escaped `[`. (by @kivikakk in #679)
Performance:
- Simplify internal feed function, no longer requiring any allocation before the block parser. (by @kivikakk in #679, #681)
- Don't buffer CommonMark output unless necessary. (by @kivikakk in #684)
- The full output was always buffered in a string before being written to the destination, which in many cases is going to be another string. Buffering is now done only to the extent required by output options, which often will be "not at all."
- Use SIMD for core line feed process. (by @kivikakk in #688)
Build changes:
- We now build Linux release binaries against musl, making them actually useful for anyone not running my exact Nix build :') (by @kivikakk in #671)
- The benchmark CI job no longer causes the whole PR to fail checks if it can't post its comment. (by @kivikakk in #674)
Behind the scenes:
- Factor out `inlines::Scanner`, reducing some needless allocations. (by @kivikakk in #675)
- The `all_options` fuzzer now fuzzes across all options, and not just with most of them switched on. (by @kivikakk in #678)
New Contributors
Diff: v0.47.0...v0.48.0
v0.47.0
Martin Chrástek has fixed all known sourcepos issues in Comrak, while closing a number of other bugs at the same time! I'm so happy.
New APIs:
- `NodeCodeBlock` now has a `closed` property. (by @Martin005 in #661)
- `NodeHeading` now has a `closed` property, for closed ATX-style headings. (by @Martin005 in #665)
Bug fixes:
- Source position information for lists and their children is fixed. (by @Martin005 in #666)
- Source position information for unclosed fenced code blocks is fixed. (by @Martin005 in #661)
- `Escaped` and `EscapedTag` no longer fail AST validation when formatting as CommonMark with debug assertions. (by @kivikakk in #662, #664)
Build changes:
Diff: v0.46.0...v0.47.0
v0.46.0
Please note the MSRV has been bumped from 1.65 to 1.70; see the pull request for more details. It's a kind of sticky and awkward situation — thanks to the inevitability of Progress — with no particularly clean solution. (wherein telling GCC 15 users "sorry it just won't build from source for you without messing with dependencies" is not a solution.)
Security:
- Footnote resolution no longer recurses over the document tree; on documents with deeply nested elements, this could cause a stack overflow, with resultant denial of service. (by @kivikakk in #659)
- Inline footnotes are restricted to a depth of 5 for similar reasons. An iterative rewrite here to avoid a limit is possible, but for now I'm hoping we can all pretend to be responsible adult human beings and limit our recursive inline footnote usage accordingly. (PRs welcome tho, non-human users are very welcome!) (by @kivikakk in #659)
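For readers wondering what "no longer recurses" looks like in practice, the general technique is to walk the tree with an explicit stack on the heap, so nesting depth is bounded by memory rather than by the call stack. The sketch below uses a stand-in index-based tree (`Tree`, `max_depth` and `deep_chain` are all hypothetical names) and is not Comrak's footnote code.

```rust
/// Stand-in tree: `children[i]` lists the child indices of node `i`.
struct Tree {
    children: Vec<Vec<usize>>,
}

/// Depth of the deepest node, computed without recursion.
fn max_depth(tree: &Tree, root: usize) -> usize {
    // Each stack entry is (node index, depth of that node).
    let mut stack = vec![(root, 1usize)];
    let mut deepest = 0;
    while let Some((node, depth)) = stack.pop() {
        deepest = deepest.max(depth);
        for &child in &tree.children[node] {
            stack.push((child, depth + 1));
        }
    }
    deepest
}

/// Build a chain where node `i` has the single child `i + 1`.
fn deep_chain(n: usize) -> Tree {
    let children = (0..n)
        .map(|i| if i + 1 < n { vec![i + 1] } else { Vec::new() })
        .collect();
    Tree { children }
}

fn main() {
    // A recursive walk over nesting this deep could exhaust the call stack.
    let tree = deep_chain(200_000);
    println!("{}", max_depth(&tree, 0));
}
```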
Parser changes:
- U+2069 POP DIRECTIONAL ISOLATE will be treated as terminating an autolink, rather than included as part of the link, making autolinks much easier to use correctly in RTL text. (by @SethFalco in #654)
- HTML start condition 4 is correctly detected when non-capital letters follow "<!". (by @kivikakk in #658)
New APIs:
Bug fixes:
- Source position information is corrected for description lists, HTML blocks, multiline block quotes, links with newlines following the destination, tables with leading indentation, and escaped character spans. (by @Martin005 in #646, #651, #652, #653, #656, #657)
- `escaped_char_span` users can now successfully format to CommonMark with debug assertions enabled. These ASTs previously did not pass validation, which is currently enabled experimentally only for CommonMark output in debug builds. (by @kivikakk in #659)
Build changes:
New Contributors
- @Martin005 made their first contribution in #646
- @Kuuuube made their first contribution in #648
- @SethFalco made their first contribution in #654
Diff: v0.45.0...v0.46.0
v0.45.0
Welcome to v0.45.0! This is a big update, much of it carried over from rc.1 last week. There's more context on the size of the update in the changelog there.
The biggest user-facing library changes are ergonomic: Node<'a> instead of &'a AstNode<'a> is nice, and likewise node.data() instead of node.data.borrow(). They're small, but I appreciate them a lot in my own work.
You'll also notice more bovine creatures in the Comrak pasture: there's a few Cow<str> instead of String, such as in NodeValue::Text. At most an extra .into() will be required; take note if you use any 'static str, as they'll no longer need to be heap-allocated. Some Boxes have been added, too, to reduce the size of every NodeValue. Let the types guide you.
Other than this, the options have been put in their own module (comrak::options), and a lot of things generally cleaned up. Read below for all the deets! Here's the final performance comparison to v0.44.0 on aarch64:
Benchmark 1: ./bench.sh ./comrak-0.44.0
Time (mean ± σ): 88.1 ms ± 1.9 ms [User: 71.2 ms, System: 17.8 ms]
Range (min … max): 86.2 ms … 93.2 ms 31 runs
Benchmark 2: ./bench.sh ./comrak-0.45.0
Time (mean ± σ): 67.0 ms ± 1.2 ms [User: 51.2 ms, System: 17.0 ms]
Range (min … max): 65.2 ms … 70.0 ms 42 runs
Summary
./bench.sh ./comrak-0.45.0 ran
1.32 ± 0.04 times faster than ./bench.sh ./comrak-0.44.0
Be well!
Parser changes:
- Runs of more than two `~` are no longer recognised as valid delimiters, meaning they will not prevent strikethrough recognition when they occur within correct delimiters. See the PR for discussion. (by @miketheman in #635)
  - This does not impact spec compatibility, matches `cmark-gfm`, and follows the intent of the original implementation and implementor (hi!).
Changed APIs:
- `r#unsafe` is used instead of `unsafe_`. (by @kivikakk in #640)
- `--gemojis` is renamed to `--gemoji`. (by @kivikakk in #641)
- `NodeValue::Text` now contains a `Cow<'static, str>` instead of a `String`. This is a pretty major change, but means we can now create text nodes with static content without duplicating the string on the heap. This particularly benefits smart quotes and HTML entity resolution. (by @kivikakk in #627)
  - Adapting to this change usually means nothing on the read-only side (you can use it as a `&str` without issues); to write in-place, use `.to_mut()` on the `Cow` to get a `&mut String`. To assign, use `.into()` on a `&str` or `String`, like `NodeValue::Text("moo".into())`.
  - `NodeValue::text()` now returns a `&str`. It used to return a `&String` (!).
  - `NodeValue::text_mut()` now returns a `&mut Cow<'static, str>`, instead of a `&mut String`. This permits writing a borrowed reference.
  - I am experimenting with parameterising the lifetime on the `Cow`; it'd be amazing to refer continuously to the input where possible.
- `NodeValue`'s `CodeBlock`, `Table`, `Link`, `Image`, `ShortCode` and `Alert` variants' payloads are now boxed. (by @kivikakk in #632)
  - Adapting to this change usually means adding a `Box::new` call when constructing these nodes, and on matches, pulling the box out and then just dereferencing it directly (e.g. `NodeValue::Table(nt) => &nt.alignments` instead of `NodeValue::Table(NodeTable { ref alignments })`).
  - These payloads were larger than average, increasing the size of every node considerably. The changes reduce an `Ast` to 128 bytes, and a full `AstNode<'_>` to 176 bytes.
  - This produces a performance sweet spot: boxing the whole `NodeValue` results in worse performance than doing nothing at all. This change appreciably improves matters.
  - We now assert the size of a node during build to ensure future payload changes don't increase the total size of an `Ast`.
- Options now live in `comrak::options`. Structs have been renamed to remove `Options` from their name: `comrak::RenderOptions` is now `comrak::options::Render`, etc. The old names are marked deprecated. (by @kivikakk in #636)
  - Traits cannot be aliased yet :( `URLRewriter` and `BrokenLinkCallback` have been moved, without a deprecation period.
- `SyntaxHighlighterAdapter`'s `attributes` arguments now take `HashMap<&'static str, Cow<'s, str>>`; they used to take `HashMap<String, String>`. (by @kivikakk in #633)
  - `html::write_opening_tag` can now take different `AsRef<str>` types for the attribute key and value.
- `parse_document_with_broken_link_callback` has been removed! This entrypoint has been deprecated since 0.25.0. (by @kivikakk in #623)
- `options.render.ignore_setext` was moved to `options.parse.ignore_setext`, as its effect takes place only in the parse stage. (by @kivikakk in #623)
- `nodes::can_contain_type` is now `Node::can_contain_type`. (by @kivikakk in #625)
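The `Cow<'static, str>` adaptation steps above can be tried out without Comrak at all. The sketch below uses a one-variant stand-in enum (the real `NodeValue` has many more variants), and the helper names `shout` and `text_len` are hypothetical.

```rust
use std::borrow::Cow;

/// Stand-in for the `Text` variant only; not Comrak's real `NodeValue`.
enum NodeValue {
    Text(Cow<'static, str>),
}

/// Write in place: `.to_mut()` yields a `&mut String`, cloning only if the
/// content was borrowed.
fn shout(value: &mut NodeValue) {
    match value {
        NodeValue::Text(text) => text.to_mut().push('!'),
    }
}

/// Read-only use: the `Cow` derefs to `str`, so `&str` methods just work.
fn text_len(value: &NodeValue) -> usize {
    match value {
        NodeValue::Text(text) => text.len(),
    }
}

fn main() {
    // Static content is borrowed; nothing is copied to the heap here.
    let mut value = NodeValue::Text("moo".into());
    println!("{}", text_len(&value));

    // The first mutation converts the borrowed text into an owned `String`.
    shout(&mut value);
    println!("{}", text_len(&value));

    // Assigning from an owned `String` also goes through `.into()`.
    let owned = NodeValue::Text(String::from("baa").into());
    println!("{}", text_len(&owned));
}
```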
New APIs:
- `node.data()` and `node.data_mut()` are added as short-hand for `node.data.borrow()` and `node.data.borrow_mut()` respectively. (by @kivikakk in #643)
- `comrak::nodes::Node<'a>` is introduced as an alias for `&'a comrak::nodes::AstNode<'a>`. (by @kivikakk in #627)
- `options.parse.tasklist_in_table` added: parse a tasklist item if it's the only content of a table cell. (by @kivikakk in #622)
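The `data()` and `data_mut()` shorthands are thin wrappers over a `RefCell`. Here is roughly what that pattern looks like on stand-in types; `Ast` and `AstNode` below are simplified placeholders and do not match Comrak's real definitions.

```rust
use std::cell::{Ref, RefCell, RefMut};

// Simplified placeholders, not Comrak's real types.
struct Ast {
    value: String,
}

struct AstNode {
    data: RefCell<Ast>,
}

impl AstNode {
    /// Short-hand for `self.data.borrow()`.
    fn data(&self) -> Ref<'_, Ast> {
        self.data.borrow()
    }

    /// Short-hand for `self.data.borrow_mut()`.
    fn data_mut(&self) -> RefMut<'_, Ast> {
        self.data.borrow_mut()
    }
}

fn main() {
    let node = AstNode {
        data: RefCell::new(Ast { value: String::from("a") }),
    };
    // The mutable borrow ends at the end of this statement.
    node.data_mut().value.push('b');
    println!("{}", node.data().value);
}
```

The usual `RefCell` rules still apply: holding the result of `data_mut()` while calling `data()` on the same node panics at runtime.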
Performance:
- Inline content is transferred to Text nodes without copying where possible. (by @kivikakk in #642).
- Have you looked at your 7 year old code lately? A detail in the C-to-Rust translation meant essentially every line of input was being copied completely unnecessarily at the very beginning of the line processing stage. This no longer happens. We regret the error. (by @kivikakk in #629)
- Preprocess entity data at build-time so we don't spend time doing a linear search over an unsorted array, some of which we will never match. (by @kivikakk in #631)
- Inline content is consumed by the inline processor, instead of being borrowed by it and retained in memory indefinitely. (by @kivikakk in #631)
- Don't try to do better than the stdlib at guessing buffer sizes; it's very good at it. (by @kivikakk in #626)
- Use `str` internally in block and inline processing, eliminating many UTF-8 rechecks. The `strings` module actually operates on strings now. (by @kivikakk in #626)
- Many, many needless clones have been removed in almost every subsystem.
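The notes above don't say which data structure replaced the linear entity search. As one possibility, a table sorted ahead of time allows a binary search; the sketch below assumes that approach, with a tiny made-up table and a hypothetical `lookup` function.

```rust
/// Tiny stand-in table, kept sorted by name. The real entity list is far
/// larger, and Comrak may well store it differently.
const ENTITIES: &[(&str, &str)] = &[
    ("amp", "&"),
    ("copy", "©"),
    ("gt", ">"),
    ("lt", "<"),
    ("quot", "\""),
];

/// Binary search over the sorted table instead of scanning every entry.
fn lookup(name: &str) -> Option<&'static str> {
    ENTITIES
        .binary_search_by(|(n, _)| (*n).cmp(name))
        .ok()
        .map(|i| ENTITIES[i].1)
}

fn main() {
    println!("{:?}", lookup("copy"));
    println!("{:?}", lookup("nope"));
}
```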
Dependency updates:
- `memchr` removed from `Cargo.toml`; it wasn't used directly, though it still is included unconditionally due to `caseless`. (by @kivikakk in #630)
- `slug` is moved to a development-only dependency; it's only used in an example. (by @kivikakk in #630)
- `jetscii` is added for faster string searching, including SIMD on x86_64. (by @kivikakk in #630)
  - I'm experimenting with aarch64 SIMD.
Documentation:
- The CLI help text has been copy-edited to a consistent style. (by @kivikakk in #641)
- The `README` example code is updated to build with recent API changes. (by @kivikakk in #621)
Build changes:
- `shortcodes` is enabled by default (but still optional) for CLI builds. (by @kivikakk in #641)
- `syntect` is now optional (but still default) in CLI builds. (by @kivikakk in #624)
Behind the scenes:
- Much of the block parser code has been re-organised, and many C-isms from the original port have been refactored into readable Rust. (by @kivikakk in #627)
- Likewise the inline parser has been re-organised. (by @kivikakk in #644)
- All `unsafe` blocks now have a `SAFETY` comment describing why their actions are safe.
New Contributors
- @miketheman made their first contribution in #635
Diff: v0.44.0...v0.45.0
0.45.0-rc.1
This is a release candidate for v0.45.0. I've never made a release candidate for Comrak before, but then I've probably never made a release of this size before either.
Why the big changes? Quite simply, for the first time in over five years I'm once again working on CommonMark in my day job, and for the first time ever using Comrak in it too, and so I find myself thinking about it more, cutting myself on the sharp edges, wishing it were easier to maintain, and wishing it were more efficient. For a little while now, too, I've been speculating about calling version 1.0.
So let's get there. This weekend I found myself profiling and reworking Obviously 2018 Code, and low-hanging fruit, oh my, they are aplenty.
It's awkward to do an apples-to-apples comparison of speed between 0.44.0 and 0.45.0-rc.1, because 0.44.0's Comrak benchmark neglected to turn off syntax highlighting, while the benchmark input has something like 14,000 code blocks in it. We were kiiiinda benchmarking syntect. The benchmark also started the target process 33 times per run, which adds a lot of undesirable overhead.
Anyway, I compared the pair using the new benchmark strategy of running the process just once per run, syntax highlighting disabled. Here's what we get on aarch64; on x86_64 the improvement is slightly greater thanks to SIMD:
Benchmark 1: ./bench.sh ./comrak-0.44.0
Time (mean ± σ): 90.0 ms ± 1.0 ms [User: 71.9 ms, System: 18.9 ms]
Range (min … max): 88.3 ms … 92.9 ms 31 runs
Benchmark 2: ./bench.sh ./comrak-0.45.0-rc.1
Time (mean ± σ): 70.4 ms ± 0.9 ms [User: 53.5 ms, System: 17.9 ms]
Range (min … max): 69.1 ms … 73.7 ms 40 runs
Summary
./bench.sh ./comrak-0.45.0-rc.1 ran
1.28 ± 0.02 times faster than ./bench.sh ./comrak-0.44.0
LGTM!
Changed APIs:
- `NodeValue::Text` now contains a `Cow<'static, str>` instead of a `String`. This is a pretty major change, but means we can now create text nodes with static content without duplicating the string on the heap. This particularly benefits smart quotes and HTML entity resolution. (by @kivikakk in #627)
  - Adapting to this change usually means nothing on the read-only side (you can use it as a `&str` without issues); to write in-place, use `.to_mut()` on the `Cow` to get a `&mut String`. To assign, use `.into()` on a `&str` or `String`, like `NodeValue::Text("moo".into())`.
  - `NodeValue::text()` now returns a `&str`. It used to return a `&String` (!).
  - `NodeValue::text_mut()` now returns a `&mut Cow<'static, str>`, instead of a `&mut String`. This permits writing a borrowed reference.
  - I am experimenting with parameterising the lifetime on the `Cow`; it'd be amazing to refer continuously to the input where possible.
- `NodeValue`'s `CodeBlock`, `Table`, `Link`, `Image`, `ShortCode` and `Alert` variants' payloads are now boxed. (by @kivikakk in #632)
  - Adapting to this change usually means adding a `Box::new` call when constructing these nodes, and on matches, pulling the box out and then just dereferencing it directly (e.g. `NodeValue::Table(nt) => &nt.alignments` instead of `NodeValue::Table(NodeTable { ref alignments })`).
  - These payloads were larger than average, increasing the size of every node considerably. The changes reduce an `Ast` to 128 bytes, and a full `AstNode<'_>` to 176 bytes.
  - This produces a performance sweet spot: boxing the whole `NodeValue` results in worse performance than doing nothing at all. This change appreciably improves matters.
  - We now assert the size of a node during build to ensure future payload changes don't increase the total size of an `Ast`.
- Options now live in `comrak::options`. Structs have been renamed to remove `Options` from their name: `comrak::RenderOptions` is now `comrak::options::Render`, etc. The old names are marked deprecated. (by @kivikakk in #636)
  - Traits cannot be aliased yet :( `URLRewriter` and `BrokenLinkCallback` have been moved, without a deprecation period.
- `SyntaxHighlighterAdapter`'s `attributes` arguments now take `HashMap<&'static str, Cow<'s, str>>`; they used to take `HashMap<String, String>`. (by @kivikakk in #633)
  - `html::write_opening_tag` can now take different `AsRef<str>` types for the attribute key and value.
- `parse_document_with_broken_link_callback` has been removed! This entrypoint has been deprecated since 0.25.0. (by @kivikakk in #623)
- `options.render.ignore_setext` was moved to `options.parse.ignore_setext`, as its effect takes place only in the parse stage. (by @kivikakk in #623)
- `nodes::can_contain_type` is now `Node::can_contain_type`. (by @kivikakk in #625)
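To see why boxing the larger payloads shrinks every node, compare an enum that stores a big payload inline with one that stores it behind a `Box`. Everything below is a stand-in: the fields of `NodeTable` are placeholders (alignments are plain `u8`s here), and `Unboxed`/`Boxed` are hypothetical two-variant enums rather than Comrak's `NodeValue`.

```rust
use std::mem::size_of;

// Stand-in payload; Comrak's real `NodeTable` has different fields.
#[allow(dead_code)]
struct NodeTable {
    alignments: Vec<u8>, // placeholder for the real alignment type
    num_columns: usize,
    num_rows: usize,
}

// Payload stored inline: every value is at least as large as the biggest payload.
#[allow(dead_code)]
enum Unboxed {
    Text(String),
    Table(NodeTable),
}

// Payload behind a pointer: the `Table` variant now costs only a pointer.
#[allow(dead_code)]
enum Boxed {
    Text(String),
    Table(Box<NodeTable>),
}

fn alignments(value: &Boxed) -> Option<&[u8]> {
    match value {
        // `nt` is a `&Box<NodeTable>`; field access dereferences through the box.
        Boxed::Table(nt) => Some(nt.alignments.as_slice()),
        _ => None,
    }
}

fn main() {
    let table = Boxed::Table(Box::new(NodeTable {
        alignments: vec![0, 1],
        num_columns: 2,
        num_rows: 1,
    }));
    println!(
        "inline: {} bytes, boxed: {} bytes",
        size_of::<Unboxed>(),
        size_of::<Boxed>()
    );
    println!("{:?}", alignments(&table));
}
```

The trade-off is an extra allocation and pointer hop for the boxed variants, which fits the note's observation that boxing everything performed worse than boxing only the large payloads.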
New APIs:
- `comrak::nodes::Node<'a>` is introduced as an alias for `&'a comrak::nodes::AstNode<'a>`. (by @kivikakk in #627)
- `options.parse.tasklist_in_table` added: parse a tasklist item if it's the only content of a table cell. (by @kivikakk in #622)
Performance:
- Have you looked at your 7 year old code lately? A detail in the C-to-Rust translation meant essentially every line of input was being copied completely unnecessarily at the very beginning of the line processing stage. This no longer happens. We regret the error. (by @kivikakk in #629)
- Preprocess entity data at build-time so we don't spend time doing a linear search over an unsorted array, some of which we will never match. (by @kivikakk in #631)
- Inline content is consumed by the inline processor, instead of being borrowed by it and retained in memory indefinitely. (by @kivikakk in #631)
- Don't try to do better than the stdlib at guessing buffer sizes; it's very good at it. (by @kivikakk in #626)
- Use `str` internally in block and inline processing, eliminating many UTF-8 rechecks. The `strings` module actually operates on strings now. (by @kivikakk in #626)
- Many, many needless clones have been removed in almost every subsystem.
Dependency updates:
- `memchr` removed from `Cargo.toml`; it wasn't used directly, though it still is included unconditionally due to `caseless`. (by @kivikakk in #630)
- `slug` is moved to a development-only dependency; it's only used in an example. (by @kivikakk in #630)
- `jetscii` is added for faster string searching, including SIMD on x86_64. (by @kivikakk in #630)
  - I'm experimenting with aarch64 SIMD.
Documentation:
Build changes:
Behind the scenes:
- Much of the block parser code has been re-organised, and many C-isms from the original port have been refactored into readable Rust. (by @kivikakk in #627)
- All `unsafe` blocks now have a `SAFETY` comment describing why their actions are safe.
Diff: v0.44.0...v0.45.0-rc.1
v0.44.0
Parser changes:
- Autolink validation is now stricter in the default mode, to maintain conformance with the GitHub Flavored Markdown autolinks extension spec. Those parses which previously worked but no longer do, such as `http://localhost` (!), `www.com` (!?), or `https://` (!?!), are now part of the `relaxed_autolinks` option. See more discussion in the PR. (by @chamlis in #618)
New APIs:
- You can write footnotes with their body inline by enabling the `inline_footnotes` extension and using the syntax `^[footnote content]`. (by @sheremetyev in #619)
New Contributors
- @sheremetyev made their first contribution in #619
- @chamlis made their first contribution in #618
Diff: v0.43.0...v0.44.0
v0.43.0
Parser changes:
- `superscript` or `subscript` extensions only: punctuation following a superscript or subscript delimiter no longer disqualifies the delimiter from being considered left-flanking, such that `e^-i^` and `n~-i~` now parse as superscript or subscript respectively. (by @kivikakk in #593)
Changed APIs:
- `html::format_document`, `xml::format_document`, `cm::format_document` and friends now take an `std::fmt::Write` as their `output` argument, instead of an `std::io::Write`, to avoid revalidating UTF-8. (by @kivikakk in #601)
- bin: allow `--header-ids ''` for prefix-less headers. (by @kivikakk in #610)
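The practical upshot of taking `std::fmt::Write` is that a `String` can be the output directly, with no byte buffer to convert and re-check as UTF-8 afterwards. The sketch below shows the shape of such a function; `format_heading` is a hypothetical stand-in, not one of Comrak's `format_document` functions.

```rust
use std::fmt::{self, Write};

/// Hypothetical formatter standing in for a `format_document`-style function.
fn format_heading(text: &str, output: &mut dyn Write) -> fmt::Result {
    write!(output, "<h1>{}</h1>", text)?;
    output.write_char('\n')
}

fn main() {
    // `String` implements `std::fmt::Write`, so it can be the destination as-is.
    let mut html = String::new();
    format_heading("Hello", &mut html).unwrap();
    print!("{}", html);
}
```

If you still need to write to a file or socket, formatting into a `String` first and then writing its bytes is one straightforward way to bridge to `std::io::Write`.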
New APIs:
Documentation updates:
Diff: v0.42.0...v0.43.0
v0.42.0
New APIs:
- `cm::escape_inline` (aliased at crate level as `escape_commonmark_inline`) is added; escapes input text suitable for inclusion in a CommonMark document where regular inline processing takes place. (by @kivikakk in #602)
- `cm::escape_link_destination` (aliased at crate level as `escape_commonmark_link_destination`) is added; escapes input URL suitable for use as a link destination in a CommonMark document. (by @kivikakk in #605)
Changed APIs:
- `html::collect_text` now returns a `String`. `html::collect_text_append` is added if you still want to start with your own (`String`) buffer. (by @kivikakk in #600)
  - There was no particular reason for this populating a `Vec<u8>` instead of a `String`; it was just old.
- `Anchorizer::anchorizer` now takes `&str` instead of a `String`. (by @kivikakk in #603)
  - As above.
Updates:
Behind the scenes:
Diff: v0.41.1...v0.42.0
v0.41.1
Bug fixes:
Stability:
Behind the scenes:
- Cleanup fuzzers, add unresolved `relaxed_autolink_email_in_footnote` test by @Mrmaxmeier in #594
- build(deps): bump actions/checkout from 4 to 5 by @dependabot[bot] in #590
New Contributors
- @Mrmaxmeier made their first contribution in #594
Diff: v0.41.0...v0.41.1
v0.41.0
New features:
Build changes:
- Use syntect's default-fancy feature for ios by @tvanderstad in #585
New Contributors
- @tats-u made their first contribution in #582
- @tvanderstad made their first contribution in #585
Diff: v0.40.0...v0.41.0