Skip to content

Conversation

@digitalmoksha
Copy link
Collaborator

Most of the changes from gfm.6 to gfm.7 have been previously incorporated. This adds one minor thing that was missing, and fixes footnote handling to properly pass the regressions.txt tests, which should be passing at this version.

Most of the changes from .6 to .7 were
already incorporated.
@digitalmoksha
Copy link
Collaborator Author

@kivikakk you're right, you already added just about everything from .6 to .7 💯 . I found one minor missing change (the re-addition of source in blocktagname).

And I fixed some failures that should be working for this version.

Add anything that I may have missed for this version upgrade. But we might be able to call this particular upgrade done? 🤔 🤞

@kivikakk kivikakk merged commit 8eac2c3 into kivikakk:main Jun 14, 2023
@kivikakk
Copy link
Owner

Thanks so much for this as always, @digitalmoksha!

But we might be able to call this particular upgrade done? 🤔 🤞

Having had a look through the diffs, I think so! I'm now looking forward from here:

  • 7–8 is adding the quadratic fuzzer (which we already have) and internal changes (which we can ignore).
  • 8—9 adds sourcepos for autolinked URLs (we currently lack), footnote rendering changes (we have!), more internal changes.
  • 9–10 does some counting and bookkeeping which fix a quadratic perf issue or two, and also makes it so directly nested emphasis doesn't render. This is the currently outstanding spec failure.
  • 10—11 appears to back out a fair bit of the counting from 9—10 and does it a different way, so we should be careful not to spend time and effort on 9—10 and instead consider github/cmark-gfm@0.29.0.gfm.9...0.29.0.gfm.11 in one go. There's some additions to the quadratic fuzzer we should incorporate too.

Everything remaining is pretty minor, and I'm happy to divide it up in whatever way you like.

@digitalmoksha
Copy link
Collaborator Author

Yeah I was noticing that some of the changes added to 9-10 were removed in 10-11.

Would you be up for tackling 8-9, since you recently added the sourcepos code and know auto link. I can tackle 9-11. I have to admit I'm a little disappointed in the nested emphasis code - that directly breaks running against the CommonMark specs. But I haven't yet looked at why they did that.

@kivikakk
Copy link
Owner

Would you be up for tackling 8-9, since you recently added the sourcepos code and know auto link.

I'd be happy to! Sometime in the next day or two.

I have to admit I'm a little disappointed in the nested emphasis code - that directly breaks running against the CommonMark specs. But I haven't yet looked at why they did that.

It's a bit of an interesting one. I don't really understand the motivation myself: github/cmark-gfm@5c75d23

There are some changes to how the tight/non-tight distinction is respected in CommonMark/plaintext output, which I would assume is actually responsible for the speed up mentioned in the commit message. The HTML output timing didn't change because there was no $\mathcal{O}(n^2)$ behaviour there to begin with.

We had to deal with this change in the cmark-gfm branch of Commonmarker, too, and the rationale is still not clear: gjtorikian/commonmarker#236

@kivikakk
Copy link
Owner

In this respect, it's kind of funny — cmark-gfm doesn't actually pass CommonMark any more, nor does it pass the GitHub Flavored Markdown Spec as last published, since the last version of that was published by me 🙃

The cmark-gfm test suite passes, but only because they've modified the spec file used by the suite in the repo: https://github.com/github/cmark-gfm/commits/master/test/spec.txt This file should be identical to the one used to produce the HTML spec above (given it's the GFM spec reference!), but isn't.

@kivikakk
Copy link
Owner

I just started looking into the autolink sourcepos stuff, and pretty quickly remembered why we don't do it. I had actually gone and mostly implemented it anyway, and then things started getting hairy when combining options. The autolink tests describe it:

// There's unsoundness in trying to maintain and adjust sourcepos
// when doing autolinks in the light of:
//
// a) Some source elements introducing a different number of characters
// to the content text than they take in source, i.e. smart
// punctuation.
//
// b) Text node consolidation happening before autolinking.
//
// (b) is obviously non-optional, but it means we end up with Text
// nodes with different byte counts than their sourcepos span lengths.
//
// One possible solution would be to actually accumulate multiple
// sourcepos spans per Text node, each also tracking the number of
// bytes of content text it's responsible for. This would work well
// enough as long as we never had to adjust a sourcepos into a spot
// within a sourcepos span that had a target text width where it
// wasn't equal. That probably wouldn't happen, though -- i.e. we're
// never autolinking into the middle of a rendered smart punctuation.
//
// For now the desired sourcepos is documented in comment. What we
// have currently (after backing out the adjustments, having hit the
// above case) matches cmark-gfm.

It's worth noting that cmark-gfm doesn't actually include sourcepos on links in its HTML output, even with --sourcepos. The data is stored in the AST, though, and XML mode shows it.

That said, its autolink sourcepos is incorrect:

$ echo 'Hi! www.github.com' | build/src/cmark-gfm -e autolink --sourcepos -t xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document SYSTEM "CommonMark.dtd">
<document sourcepos="1:1-1:18" xmlns="http://commonmark.org/xml/1.0">
  <paragraph sourcepos="1:1-1:18">
    <text sourcepos="1:1-1:4" xml:space="preserve">Hi! </text>
    <link sourcepos="1:4-1:18" destination="http://www.github.com" title="">
      <text sourcepos="1:4-1:18" xml:space="preserve">www.github.com</text>
    </link>
  </paragraph>
</document>

$ echo 'Hi! http://www.github.com' | build/src/cmark-gfm -e autolink --sourcepos -t xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE document SYSTEM "CommonMark.dtd">
<document sourcepos="1:1-1:25" xmlns="http://commonmark.org/xml/1.0">
  <paragraph sourcepos="1:1-1:25">
    <text sourcepos="1:1-1:8" xml:space="preserve">Hi! </text>
    <link sourcepos="1:4-1:25" destination="http://www.github.com" title="">
      <text sourcepos="1:4-1:25" xml:space="preserve">http://www.github.com</text>
    </link>
  </paragraph>
</document>

The Hi! node should be 1:1-1:4 in both cases, and the link nodes should start at 1:5. It might be possible to get more broken results by nesting inside different elements.

These issues in cmark-gfm look fairly fixable (looks like some rewinds not being catered for correctly), and it doesn't seem to have any issues with smart punctuation. Our fix will be a bit hairier. I'll think on it.

@digitalmoksha
Copy link
Collaborator Author

It's a bit of an interesting one. I don't really understand the motivation myself: github/cmark-gfm@5c75d23

Yeah it's bizarre, there doesn't seem to be a real reason to change the HTML renderer, and it breaks compatibility. Not cool. Same here, had to deal with it when upgrading commonmarker, https://gitlab.com/gitlab-org/gitlab/-/merge_requests/121602.

In this respect, it's kind of funny — cmark-gfm doesn't actually pass CommonMark any more, nor does it pass the GitHub Flavored Markdown Spec as last published, since the last version of that was published by me

Yeah, funny/not funny 😄. I'm against deviating from the CommonMark spec. It's bad enough we're not using 0.30 (I know, there aren't that many real changes, but it's been 2 years).

@digitalmoksha
Copy link
Collaborator Author

These issues in cmark-gfm look fairly fixable (looks like some rewinds not being catered for correctly), and it doesn't seem to have any issues with smart punctuation. Our fix will be a bit hairier. I'll think on it.

Oof - I see how that can get complicated real fast.

I've been able to put together the 9->11 PR (which I'll put up here in a bit). All the tests pass locally, including those in script/cibuild.

So we might be able to go ahead and say that we're synced up to 11, while the autolink is worked on. As you said, we've already got the footnote changes, so I don't really see anything else important in 8->9.

Interesting, I don't actually see any tests in cmark-gfm for sourcepos

@kivikakk
Copy link
Owner

I've been able to put together the 9->11 PR (which I'll put up here in a bit). All the tests pass locally, including those in script/cibuild.

So we might be able to go ahead and say that we're synced up to 11, while the autolink is worked on. As you said, we've already got the footnote changes, so I don't really see anything else important in 8->9.

Amazing, thank you so much! And yes, I think that's probably the way to go.

Interesting, I don't actually see any tests in cmark-gfm for sourcepos

There are some; they're kind of hidden since they're part of the API tests. See https://github.com/github/cmark-gfm/blob/2d65cd3c4bfbbdddc7accefc76392c16bb0cfb6d/api_test/main.c#L996-L1130.

@digitalmoksha digitalmoksha deleted the bw-sync-with-0.29.0.gfm.7 branch June 17, 2023 16:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants