Releases: kivikakk/comrak

v0.48.0

13 Nov 07:43
6337099

The breaking changes are listed right at the top! Please note that AST content now represents NUL bytes (codepoint number zero) as they were in the input; these used to be translated to the lovely � character at the very beginning of the input process, presumably so the rest of the reference C parser didn't have to deal with the possibility of strings containing NUL bytes. We can do better, though, so let's! The � character is now emitted by our formatters in place of NUL, but if you use custom or manual formatters and emit any part of the AST content directly (without using comrak::html::escape, comrak::html::escape_href, or the same-named functions on Context), you may need to do the same translation yourself.
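
If you do emit AST text yourself, the translation can be small. Here is a minimal sketch using only the standard library; replace_nul is a hypothetical helper name for illustration, not a Comrak API:

```rust
use std::borrow::Cow;

/// Replace each NUL (U+0000) with U+FFFD REPLACEMENT CHARACTER.
/// Hypothetical helper for custom formatters; not part of Comrak.
/// The input is borrowed unchanged when it contains no NUL.
fn replace_nul(text: &str) -> Cow<'_, str> {
    if text.contains('\0') {
        Cow::Owned(text.replace('\0', "\u{FFFD}"))
    } else {
        Cow::Borrowed(text)
    }
}
```

Calling replace_nul on a text node's content just before writing it out only allocates when a NUL is actually present.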

We also no longer append a newline to the end of the file where there wasn't one originally, which meant a lot of places in the parser had to adapt to their strings not necessarily containing a newline before they ended. Careful review and extensive fuzzing should have squeaked out any unexpected overruns, but consider my eyes peeled for reports regarding this. (Ew.) We've cleaned up some sourcepos calculation which depended on this behaviour in odd ways, but there may yet be more to discover which our test suite didn't catch.

Did you know November is Trans Month? I didn't! I'm guessing it's because Trans Awareness Week falls within it, and we've been having a pretty bad time of it rights-wise around the world lately!

Happy Trans Month, and if you happen to typo it as Trans Moth, we can be happy about that too! 🏳️‍⚧️ ᖭི༏ᖫྀ

Parser changes:

  • No longer translate NUL bytes into U+FFFD REPLACEMENT CHARACTER in the parse stage; do it in formatters instead. (by @kivikakk in #681)
    • This means the AST now contains NUL bytes where they were present in input, preserving the difference between NUL and literally-entered characters.
  • No longer append a virtual newline at the end of the file where missing. (by @kivikakk in #682)
    • The spec allows a line to end with either a newline or EOF; the reference parser assumed every input string had a terminating linefeed and forced that to be the case, and Comrak used to do the same. Comrak no longer does.
    • We also now handle line feed, carriage return, and carriage return plus line feed (as allowed by the spec) without pretending they're all just a line feed, meaning e.g. sourcepos for softbreaks now correctly spans two bytes when it was produced by a carriage return plus line feed.
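
To illustrate the line-ending rules above, here is a rough sketch of splitting input on LF, CR, or CRLF while letting the last line end at EOF. It is not Comrak's internal feed function; Line and split_lines are made-up names for this example:

```rust
/// One input line: its content plus the byte length of its terminator
/// (0 at end of input, 1 for LF or CR, 2 for CRLF).
/// Hypothetical type for illustration only.
#[derive(Debug, PartialEq)]
struct Line<'a> {
    content: &'a str,
    ending_len: usize,
}

/// Split `input` into lines without inventing a trailing newline.
fn split_lines(input: &str) -> Vec<Line<'_>> {
    let bytes = input.as_bytes();
    let mut lines = Vec::new();
    let mut start = 0;
    let mut i = 0;
    while i < bytes.len() {
        let ending_len = match bytes[i] {
            b'\n' => 1,
            // CR immediately followed by LF is one two-byte ending.
            b'\r' if bytes.get(i + 1) == Some(&b'\n') => 2,
            b'\r' => 1,
            _ => {
                i += 1;
                continue;
            }
        };
        lines.push(Line { content: &input[start..i], ending_len });
        i += ending_len;
        start = i;
    }
    // A final line may end at EOF; it gets no terminator.
    if start < bytes.len() {
        lines.push(Line { content: &input[start..], ending_len: 0 });
    }
    lines
}
```

Keeping the terminator length next to each line is what lets a position calculation span two bytes for CRLF instead of assuming one.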

Changed APIs:

  • Remove mandatory space before fenced codeblock info string in CommonMark output. (by @kivikakk in #686)
  • Write out %25 in hrefs where not part of a percent-encode sequence. (by @kivikakk in #687)
    • We used to leave any % character alone, such that [link](%%20) would roundtrip without change. It now roundtrips to [link](%25%20).
  • Relaxed tasklist matching now supports a full Unicode scalar for the character between the […], and no longer turns single-byte UTF-8 characters into the Unicode codepoint numbered at the UTF-8 byte (!). (by @kivikakk in #689)
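
The href change above can be pictured with a small sketch. It assumes a percent-encode sequence means a % followed by exactly two ASCII hex digits; escape_stray_percent is a hypothetical name, and Comrak's real escaping handles other characters as well:

```rust
/// Write "%25" for each '%' that does not start a percent-encode
/// sequence ('%' followed by two ASCII hex digits).
/// Illustrative only; not Comrak's actual href escaping.
fn escape_stray_percent(href: &str) -> String {
    let bytes = href.as_bytes();
    let is_hex = |idx: usize| bytes.get(idx).map_or(false, |b| b.is_ascii_hexdigit());
    let mut out = String::with_capacity(href.len());
    for (i, ch) in href.char_indices() {
        if ch == '%' && !(is_hex(i + 1) && is_hex(i + 2)) {
            out.push_str("%25");
        } else {
            out.push(ch);
        }
    }
    out
}
```

With this rule, %%20 becomes %25%20, matching the roundtrip described above, while an already-encoded a%2Fb passes through unchanged.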

New APIs:

  • Add highlight extension, ==for highlights==! These render with <mark> in the HTML formatter. (by @pferreir in #672)
  • Add comrak::Node<'a> as an alias for comrak::nodes::Node<'a>. (by @kivikakk in #673)
  • Add comrak::Arena<'a> as an alias for typed_arena::Arena<comrak::nodes::AstNode<'a>>. (by @kivikakk in #675)
  • Add From<(LineColumn, LineColumn)> impl for Sourcepos. (by @kivikakk in #675)
  • Make comrak::nodes::NodeValue::xml_node_name public, when you want a handy-to-access name for a node type. (by @kivikakk in #673)
  • Add options.parse.leave_footnote_definitions; this option causes footnote definitions to not be relocated to the bottom of the document, and unused references not to be garbage collected, for use with custom formatters. (by @kivikakk in #673)

Bug fixes:

  • Fix relaxed autolink email in footnote edge case/panic. (by @kivikakk in #677)
  • Prevent unexpected post-processing, such as \[x] still being eligible for tasklist inclusion despite the escaped [. (by @kivikakk in #679)

Performance:

  • Simplify internal feed function, no longer requiring any allocation before the block parser. (by @kivikakk in #679, #681)
    • This is possible due to not translating NUL and not appending a virtual EOF newline; compare before and after.
  • Don't buffer CommonMark output unless necessary. (by @kivikakk in #684)
    • The full output was always buffered in a string before being written to the destination, which in many cases is going to be another string. Buffering is now done only to the extent required by output options, which often will be "not at all."
  • Use SIMD for core line feed process. (by @kivikakk in #688)

Build changes:

  • We now build Linux release binaries against musl, making them actually useful for anyone not running my exact Nix build :') (by @kivikakk in #671)
  • The benchmark CI job no longer causes the whole PR to fail checks if it can't post its comment. (by @kivikakk in #674)

Behind the scenes:

  • Factor out inlines::Scanner, reducing some needless allocations. (by @kivikakk in #675)
  • The all_options fuzzer now fuzzes across all options, and not just with most of them switched on. (by @kivikakk in #678)

Diff: v0.47.0...v0.48.0

v0.47.0

30 Oct 04:37

Martin Chrástek has fixed all known sourcepos issues in Comrak, while closing a number of other bugs at the same time! I'm so happy.

New APIs:

  • NodeCodeBlock now has a closed property. (by @Martin005 in #661)
  • NodeHeading now has a closed property, for closed ATX-style headings. (by @Martin005 in #665)

Bug fixes:

  • Source position information for lists and their children is fixed. (by @Martin005 in #666)
  • Source position information for unclosed fenced code blocks is fixed. (by @Martin005 in #661)
  • Escaped and EscapedTag no longer fail AST validation when formatting as CommonMark with debug assertions. (by @kivikakk in #662, #664)

Build changes:

  • The fuzzer now also runs on CommonMark and XML output formats. (by @kivikakk in #663)

Diff: v0.46.0...v0.47.0

v0.46.0

28 Oct 06:33
7b1dcd7

Please note the MSRV has been bumped from 1.65 to 1.70; see the pull request for more details. It's a kind of sticky and awkward situation — thanks to the inevitability of Progress — with no particularly clean solution. (wherein telling GCC 15 users "sorry it just won't build from source for you without messing with dependencies" is not a solution.)

Security:

  • Footnote resolution no longer recurses over the document tree; on documents with deeply nested elements, this could cause a stack overflow, with resultant denial of service. (by @kivikakk in #659)
  • Inline footnotes are restricted to a depth of 5 for similar reasons. An iterative rewrite here to avoid a limit is possible, but for now I'm hoping we can all pretend to be responsible adult human beings and limit our recursive inline footnote usage accordingly. (PRs welcome tho, non-human users are very welcome!) (by @kivikakk in #659)
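
For anyone curious what "no longer recurses" looks like in general, here is a minimal sketch of walking a tree with an explicit stack instead of the call stack. Node here is a toy stand-in, not Comrak's AstNode, and the function names are invented for the example:

```rust
/// A toy tree node; a hypothetical stand-in, not Comrak's AstNode.
struct Node {
    is_footnote_ref: bool,
    children: Vec<Node>,
}

/// Count footnote references iteratively. Pending nodes live in a
/// heap-allocated Vec, so nesting depth does not grow the call stack.
fn count_footnote_refs(root: &Node) -> usize {
    let mut stack = vec![root];
    let mut count = 0;
    while let Some(node) = stack.pop() {
        if node.is_footnote_ref {
            count += 1;
        }
        stack.extend(node.children.iter());
    }
    count
}

/// Build a chain `depth` levels deep that ends in a footnote reference.
fn deep_chain(depth: usize) -> Node {
    let mut node = Node { is_footnote_ref: true, children: Vec::new() };
    for _ in 0..depth {
        node = Node { is_footnote_ref: false, children: vec![node] };
    }
    node
}
```

The traversal cost moves from stack frames to a Vec that grows on the heap, which is the property that matters for deeply nested documents.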

Parser changes:

  • U+2069 POP DIRECTIONAL ISOLATE will be treated as terminating an autolink, rather than included as part of the link, making autolinks much easier to use correctly in RTL text. (by @SethFalco in #654)
  • HTML start condition 4 is correctly detected when non-capital letters follow "<!". (by @kivikakk in #658)

New APIs:

  • Discord-style subtext support is added as the subtext extension. (by @Kuuuube in #648, #650)

Bug fixes:

  • Source position information is corrected for description lists, HTML blocks, multiline block quotes, links with newlines following the destination, tables with leading indentation, and escaped character spans. (by @Martin005 in #646, #651, #652, #653, #656, #657)
  • escaped_char_span users can now successfully format to CommonMark with debug assertions enabled. These ASTs previously failed validation, which is currently enabled, experimentally, only for CommonMark output in debug builds. (by @kivikakk in #659)

Build changes:

  • Comrak's MSRV is bumped from 1.65 to 1.70. (by @kivikakk in #649)

Diff: v0.45.0...v0.46.0

v0.45.0

23 Oct 01:55
fdb17fc

Welcome to v0.45.0! This is a big update, much of it carried over from rc.1 last week. There's more context on the size of the update in the changelog there.

The biggest library user-facing changes are ergonomic: Node<'a> instead of &'a AstNode<'a> is nice, and likewise node.data() instead of node.data.borrow(). They're small, but I appreciate them a lot in my own work.

You'll also notice more bovine creatures in the Comrak pasture: there's a few Cow<str> instead of String, such as in NodeValue::Text. At most an extra .into() will be required; take note if you use any &'static str, as they'll no longer need to be heap-allocated. Some Boxes have been added, too, to reduce the size of every NodeValue. Let the types guide you.

Other than this, the options have been put in their own module (comrak::options), and a lot of things generally cleaned up. Read below for all the deets! Here's the final performance comparison to v0.44.0 on aarch64:

Benchmark 1: ./bench.sh ./comrak-0.44.0
  Time (mean ± σ):      88.1 ms ±   1.9 ms    [User: 71.2 ms, System: 17.8 ms]
  Range (min … max):    86.2 ms …  93.2 ms    31 runs

Benchmark 2: ./bench.sh ./comrak-0.45.0
  Time (mean ± σ):      67.0 ms ±   1.2 ms    [User: 51.2 ms, System: 17.0 ms]
  Range (min … max):    65.2 ms …  70.0 ms    42 runs

Summary
  ./bench.sh ./comrak-0.45.0 ran
    1.32 ± 0.04 times faster than ./bench.sh ./comrak-0.44.0

Be well!

Parser changes:

  • Runs of more than two ~ are no longer recognised as valid delimiters, meaning they will not prevent strikethrough recognition when they occur within correct delimiters. See the PR for discussion. (by @miketheman in #635)
    • This does not impact spec compatibility, matches cmark-gfm, and follows the intent of the original implementation and implementor (hi!).

Changed APIs:

  • r#unsafe is used instead of unsafe_. (by @kivikakk in #640)
  • --gemojis is renamed to --gemoji. (by @kivikakk in #641)
  • NodeValue::Text now contains a Cow<'static, str> instead of a String. This is a pretty major change, but means we can now create text nodes with static content without duplicating the string on the heap. This particularly benefits smart quotes and HTML entity resolution. (by @kivikakk in #627)
    • Adapting to this change usually means nothing on the read-only side (you can use it as a &str without issues); to write in-place, use .to_mut() on the Cow to get a &mut String. To assign, use .into() on a &str or String, like NodeValue::Text("moo".into()).
    • NodeValue::text() now returns a &str. It used to return a &String (!).
    • NodeValue::text_mut() now returns a &mut Cow<'static, str>, instead of a &mut String. This permits writing a borrowed reference.
    • I am experimenting with parameterising the lifetime on the Cow; it'd be amazing to refer continuously to the input where possible.
  • NodeValue's CodeBlock, Table, Link, Image, ShortCode and Alert variants' payloads are now boxed. (by @kivikakk in #632)
    • Adapting to this change usually means adding a Box::new call when constructing these nodes, and on matches, pulling the box out and then just dereferencing it directly (e.g. NodeValue::Table(nt) => &nt.alignments instead of NodeValue::Table(NodeTable { ref alignments })).
    • These payloads were larger than average, increasing the size of every node considerably. The changes reduce an Ast to 128 bytes, and a full AstNode<'_> to 176 bytes.
    • This produces a performance sweet spot: boxing the whole NodeValue results in worse performance than doing nothing at all. This change appreciably improves matters.
    • We now assert the size of a node during build to ensure future payload changes don't increase the total size of an Ast.
  • Options now live in comrak::options. Structs have been renamed to remove Options from their name: comrak::RenderOptions is now comrak::options::Render, etc. The old names are marked deprecated. (by @kivikakk in #636)
    • Traits cannot be aliased yet :( URLRewriter and BrokenLinkCallback have been moved, without a deprecation period.
  • SyntaxHighlighterAdapter's attributes arguments now take HashMap<&'static str, Cow<'s, str>>; they used to take HashMap<String, String>. (by @kivikakk in #633)
  • html::write_opening_tag can now take different AsRef<str> types for the attribute key and value.
  • parse_document_with_broken_link_callback has been removed! This entrypoint has been deprecated since 0.25.0. (by @kivikakk in #623)
  • options.render.ignore_setext was moved to options.parse.ignore_setext, as its effect takes place only in the parse stage. (by @kivikakk in #623)
  • nodes::can_contain_type is now Node::can_contain_type. (by @kivikakk in #625)
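
The Cow and Box adaptations above are easier to see in code. This sketch uses local stand-in types that only mirror the shapes described in this list; the NodeValue and NodeTable below are assumptions for illustration, not Comrak's own definitions, and describe and shout are made-up helpers:

```rust
use std::borrow::Cow;

/// Stand-in payload; the field set is hypothetical.
struct NodeTable {
    alignments: Vec<char>,
    num_columns: usize,
}

/// Stand-in enum with one Cow variant and one boxed variant.
enum NodeValue {
    Text(Cow<'static, str>),
    Table(Box<NodeTable>),
}

/// Reading: the Cow is usable as a &str, and the boxed payload is
/// reached by dereferencing through the binding.
fn describe(value: &NodeValue) -> String {
    match value {
        NodeValue::Text(text) => format!("text: {}", text),
        NodeValue::Table(nt) => format!(
            "table: {} columns, {} alignments",
            nt.num_columns,
            nt.alignments.len()
        ),
    }
}

/// Writing in place: to_mut() yields a &mut String, cloning borrowed
/// content into an owned String on first write.
fn shout(value: &mut NodeValue) {
    if let NodeValue::Text(text) = value {
        text.to_mut().push('!');
    }
}
```

Constructing looks like NodeValue::Text("moo".into()) for a static string with no heap copy, and NodeValue::Table(Box::new(NodeTable { .. })) with the fields filled in for a boxed payload.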

New APIs:

  • node.data() and node.data_mut() are added as short-hand for node.data.borrow() and node.data.borrow_mut() respectively. (by @kivikakk in #643)
  • comrak::nodes::Node<'a> is introduced as an alias for &'a comrak::nodes::AstNode<'a>. (by @kivikakk in #627)
  • options.parse.tasklist_in_table added: parse a tasklist item if it's the only content of a table cell. (by @kivikakk in #622)

Performance:

  • Inline content is transferred to Text nodes without copying where possible. (by @kivikakk in #642)
  • Have you looked at your 7 year old code lately? A detail in the C-to-Rust translation meant essentially every line of input was being copied completely unnecessarily at the very beginning of the line processing stage. This no longer happens. We regret the error. (by @kivikakk in #629)
  • Preprocess entity data at build-time so we don't spend time doing a linear search over an unsorted array, some of which we will never match. (by @kivikakk in #631)
  • Inline content is consumed by the inline processor, instead of being borrowed by it and retained in memory indefinitely. (by @kivikakk in #631)
  • Don't try to do better than the stdlib at guessing buffer sizes; it's very good at it. (by @kivikakk in #626)
  • Use str internally in block and inline processing, eliminating many UTF-8 rechecks. The strings module actually operates on strings now. (by @kivikakk in #626)
  • Many, many needless clones have been removed in almost every subsystem.
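
As a rough picture of why preprocessing the entity data helps: once a table is sorted ahead of time, a lookup can binary search rather than scan. The sketch below uses a tiny hand-sorted sample; ENTITIES and lookup_entity are hypothetical names, and Comrak's generated data and lookup code may look quite different:

```rust
/// A tiny sample of named entities, kept sorted by name so that
/// lookups can binary search. Illustrative data only.
const ENTITIES: &[(&str, &str)] = &[
    ("amp", "&"),
    ("gt", ">"),
    ("lt", "<"),
    ("nbsp", "\u{A0}"),
    ("quot", "\""),
];

/// Look up an entity by name in O(log n) comparisons.
/// Correct only while ENTITIES stays sorted by name.
fn lookup_entity(name: &str) -> Option<&'static str> {
    ENTITIES
        .binary_search_by_key(&name, |&(key, _)| key)
        .ok()
        .map(|index| ENTITIES[index].1)
}
```

Doing the sort once at build time means the runtime never pays for ordering the data, only for the search itself.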

Dependency updates:

  • memchr removed from Cargo.toml; it wasn't used directly, though it still is included unconditionally due to caseless. (by @kivikakk in #630)
  • slug is moved to a development-only dependency; it's only used in an example. (by @kivikakk in #630)
  • jetscii is added for faster string searching, including SIMD on x86_64. (by @kivikakk in #630)

Documentation:

  • The CLI help text has been copy-edited to a consistent style. (by @kivikakk in #641)
  • The README example code is updated to build with recent API changes. (by @kivikakk in #621)

Build changes:

  • shortcodes is enabled by default (but still optional) for CLI builds. (by @kivikakk in #641)
  • syntect is now optional (but still default) in CLI builds. (by @kivikakk in #624)

Behind the scenes:

  • Much of the block parser code has been re-organised, and many C-isms from the original port have been refactored into readable Rust. (by @kivikakk in #627)
  • Likewise the inline parser has been re-organised. (by @kivikakk in #644)
  • All unsafe blocks now have a SAFETY comment describing why their actions are safe.

Diff: v0.44.0...v0.45.0

0.45.0-rc.1

20 Oct 11:26
6145d8e

Pre-release

This is a release candidate for v0.45.0. I've never made a release candidate for Comrak before, but then I've probably never made a release of this size before either.

Why the big changes? Quite simply, for the first time in over five years I'm once again working on CommonMark in my day job, and for the first time ever using Comrak in it too, and so I find myself thinking about it more, cutting myself on the sharp edges, wishing it were easier to maintain, and wishing it were more efficient. For a little while now, too, I've been speculating about calling version 1.0.

So let's get there. This weekend I found myself profiling and reworking Obviously 2018 Code, and low-hanging fruit, oh my, they are aplenty.

It's awkward to do an apples-to-apples comparison of speed between 0.44.0 and 0.45.0-rc.1, because 0.44.0's Comrak benchmark neglected to turn off syntax highlighting, while the benchmark input has something like 14,000 code blocks in it. We were kiiiinda benchmarking syntect. The benchmark also started the target process 33 times per run, which adds a lot of undesirable noise.

Anyway, I compared the pair using the new benchmark strategy of running the process just once per run, syntax highlighting disabled. Here's what we get on aarch64; on x86_64 the improvement is slightly greater thanks to SIMD:

Benchmark 1: ./bench.sh ./comrak-0.44.0
  Time (mean ± σ):      90.0 ms ±   1.0 ms    [User: 71.9 ms, System: 18.9 ms]
  Range (min … max):    88.3 ms …  92.9 ms    31 runs

Benchmark 2: ./bench.sh ./comrak-0.45.0-rc.1
  Time (mean ± σ):      70.4 ms ±   0.9 ms    [User: 53.5 ms, System: 17.9 ms]
  Range (min … max):    69.1 ms …  73.7 ms    40 runs

Summary
  ./bench.sh ./comrak-0.45.0-rc.1 ran
    1.28 ± 0.02 times faster than ./bench.sh ./comrak-0.44.0

LGTM!

Changed APIs:

  • NodeValue::Text now contains a Cow<'static, str> instead of a String. This is a pretty major change, but means we can now create text nodes with static content without duplicating the string on the heap. This particularly benefits smart quotes and HTML entity resolution. (by @kivikakk in #627)
    • Adapting to this change usually means nothing on the read-only side (you can use it as a &str without issues); to write in-place, use .to_mut() on the Cow to get a &mut String. To assign, use .into() on a &str or String, like NodeValue::Text("moo".into()).
    • NodeValue::text() now returns a &str. It used to return a &String (!).
    • NodeValue::text_mut() now returns a &mut Cow<'static, str>, instead of a &mut String. This permits writing a borrowed reference.
    • I am experimenting with parameterising the lifetime on the Cow; it'd be amazing to refer continuously to the input where possible.
  • NodeValue's CodeBlock, Table, Link, Image, ShortCode and Alert variants' payloads are now boxed. (by @kivikakk in #632)
    • Adapting to this change usually means adding a Box::new call when constructing these nodes, and on matches, pulling the box out and then just dereferencing it directly (e.g. NodeValue::Table(nt) => &nt.alignments instead of NodeValue::Table(NodeTable { ref alignments })).
    • These payloads were larger than average, increasing the size of every node considerably. The changes reduce an Ast to 128 bytes, and a full AstNode<'_> to 176 bytes.
    • This produces a performance sweet spot: boxing the whole NodeValue results in worse performance than doing nothing at all. This change appreciably improves matters.
    • We now assert the size of a node during build to ensure future payload changes don't increase the total size of an Ast.
  • Options now live in comrak::options. Structs have been renamed to remove Options from their name: comrak::RenderOptions is now comrak::options::Render, etc. The old names are marked deprecated. (by @kivikakk in #636)
    • Traits cannot be aliased yet :( URLRewriter and BrokenLinkCallback have been moved, without a deprecation period.
  • SyntaxHighlighterAdapter's attributes arguments now take HashMap<&'static str, Cow<'s, str>>; they used to take HashMap<String, String>. (by @kivikakk in #633)
  • html::write_opening_tag can now take different AsRef<str> types for the attribute key and value.
  • parse_document_with_broken_link_callback has been removed! This entrypoint has been deprecated since 0.25.0. (by @kivikakk in #623)
  • options.render.ignore_setext was moved to options.parse.ignore_setext, as its effect takes place only in the parse stage. (by @kivikakk in #623)
  • nodes::can_contain_type is now Node::can_contain_type. (by @kivikakk in #625)

New APIs:

  • comrak::nodes::Node<'a> is introduced as an alias for &'a comrak::nodes::AstNode<'a>. (by @kivikakk in #627)
  • options.parse.tasklist_in_table added: parse a tasklist item if it's the only content of a table cell. (by @kivikakk in #622)

Performance:

  • Have you looked at your 7 year old code lately? A detail in the C-to-Rust translation meant essentially every line of input was being copied completely unnecessarily at the very beginning of the line processing stage. This no longer happens. We regret the error. (by @kivikakk in #629)
  • Preprocess entity data at build-time so we don't spend time doing a linear search over an unsorted array, some of which we will never match. (by @kivikakk in #631)
  • Inline content is consumed by the inline processor, instead of being borrowed by it and retained in memory indefinitely. (by @kivikakk in #631)
  • Don't try to do better than the stdlib at guessing buffer sizes; it's very good at it. (by @kivikakk in #626)
  • Use str internally in block and inline processing, eliminating many UTF-8 rechecks. The strings module actually operates on strings now. (by @kivikakk in #626)
  • Many, many needless clones have been removed in almost every subsystem.

Dependency updates:

  • memchr removed from Cargo.toml; it wasn't used directly, though it still is included unconditionally due to caseless. (by @kivikakk in #630)
  • slug is moved to a development-only dependency; it's only used in an example. (by @kivikakk in #630)
  • jetscii is added for faster string searching, including SIMD on x86_64. (by @kivikakk in #630)

Documentation:

  • The README example code is updated to build with recent API changes. (by @kivikakk in #621)

Build changes:

  • syntect is now optional (but still default) in CLI builds. (by @kivikakk in #624)

Behind the scenes:

  • Much of the block parser code has been re-organised, and many C-isms from the original port have been refactored into readable Rust. (by @kivikakk in #627)
  • All unsafe blocks now have a SAFETY comment describing why their actions are safe.

Diff: v0.44.0...v0.45.0-rc.1

v0.44.0

14 Oct 05:26
98cc53c

Parser changes:

  • Autolink validation is now stricter in the default mode, to maintain conformance with the GitHub Flavored Markdown autolinks extension spec. Those parses which previously worked but no longer do --- such as http://localhost (!), www.com (!?), or https:// (!?!) --- are now part of the relaxed_autolinks option. See more discussion in the PR. (by @chamlis in #618)

New APIs:

  • You can write footnotes with their body inline by enabling the inline_footnotes extension and using the syntax ^[footnote content] (by @sheremetyev in #619)

Diff: v0.43.0...v0.44.0

v0.43.0

29 Sep 02:32
e626b7c

Parser changes:

  • superscript or subscript extensions only: punctuation following a superscript or subscript delimiter no longer disqualifies the delimiter from being considered left-flanking, such that e^-i^ and n~-i~ now parse as superscript or subscript respectively (by @kivikakk in #593)

Changed APIs:

  • html::format_document, xml::format_document, cm::format_document and friends now take an std::fmt::Write as their output argument, instead of an std::io::Write, to avoid revalidating UTF-8 (by @kivikakk in #601)
  • bin: allow --header-ids '' for prefix-less headers (by @kivikakk in #610)
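
To show what taking an std::fmt::Write buys, here is a small sketch of a writer function generic over that trait. write_paragraph is a hypothetical example, not one of Comrak's format_document functions, and its escaping is deliberately minimal:

```rust
use std::fmt::{self, Write};

/// Write an escaped paragraph to any fmt::Write sink.
/// Hypothetical helper for illustration; not a Comrak API.
fn write_paragraph<W: Write>(out: &mut W, text: &str) -> fmt::Result {
    out.write_str("<p>")?;
    for ch in text.chars() {
        match ch {
            '&' => out.write_str("&amp;")?,
            '<' => out.write_str("&lt;")?,
            '>' => out.write_str("&gt;")?,
            _ => out.write_char(ch)?,
        }
    }
    out.write_str("</p>\n")
}
```

Since String implements fmt::Write, output lands directly in a String as str data, with no byte buffer to convert and recheck as UTF-8 afterwards.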

New APIs:

  • Add CJK Friendly Emphasis to CLI option (by @tats-u in #607)

Diff: v0.42.0...v0.43.0

v0.42.0

24 Sep 07:14
8919b6d

New APIs:

  • cm::escape_inline (aliased at crate level as escape_commonmark_inline) is added; escapes input text suitable for inclusion in a CommonMark document where regular inline processing takes place. (by @kivikakk in #602)
  • cm::escape_link_destination (aliased at crate level as escape_commonmark_link_destination) is added; escapes input URL suitable for use as a link destination in a CommonMark document. (by @kivikakk in #605)

Changed APIs:

  • html::collect_text now returns a String. html::collect_text_append is added if you still want to start with your own (String) buffer. (by @kivikakk in #600)
    • There was no particular reason for this populating a Vec<u8> instead of a String; it was just old.
  • Anchorizer::anchorize now takes &str instead of a String. (by @kivikakk in #603)
    • As above.

Updates:

  • Update is_cjk in CJK Friendly Emphasis to Unicode 17. (by @tats-u in #598)

Diff: v0.41.1...v0.42.0

v0.41.1

14 Sep 06:38
ef2f268

Bug fixes:

  • Fix the range of non-emoji general purpose variation selector by @tats-u in #596

Stability:

  • html: remove some panics on unusual ASTs, and document others. by @kivikakk in #589

Behind the scenes:

  • Cleanup fuzzers, add unresolved relaxed_autolink_email_in_footnote test by @Mrmaxmeier in #594
  • build(deps): bump actions/checkout from 4 to 5 by @dependabot[bot] in #590

Diff: v0.41.0...v0.41.1

v0.41.0

09 Aug 03:28
8f76b4d

New features:

  • Add CJK friendly emphasis extension by @tats-u in #582
    • Add CJK friendly emphasis to README by @tats-u in #583

Diff: v0.40.0...v0.41.0