Handling HTML character references in embedded JSON-LD #100

gkellogg opened this issue Dec 1, 2018 · 5 comments
Closed

@gkellogg
Member

gkellogg commented Dec 1, 2018

PR #97 and w3c/json-ld-api#51 resolved embedding JSON-LD in HTML and provide for HTML character references (entities) to be encoded within a data block so that critical elements are not mishandled by HTML parsers. As written, this requires authors to encode sequences such as <script></script> which may occur in JSON-LD strings, and requires these to be decoded when the data block is extracted by the API.

It seems that the structured-data testing tool does not do this, and it's not clear what provisions are made for potentially problematic content.

One way to resolve this would be to provide direction to content authors to avoid any JSON-LD content embedded in HTML which could cause such a problem, which seems to be current practice. If character references are encoded, decoding is required at some layer.
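As a rough illustration of what "avoid such content" could mean for an author, the following sketch scans a serialized JSON-LD document for the sequences discussed in this issue (script tags and comment openers). The function name `risky_sequences` and the exact pattern are assumptions made for this example; they are not part of any JSON-LD API.

```python
import json
import re

# Sequences that can disturb HTML parsing of a script data block.
# The pattern is an illustrative assumption, not a normative list.
_RISKY = re.compile(r"<!--|</?script", re.IGNORECASE)

def risky_sequences(document):
    """Return the risky sequences found in the serialized JSON-LD document."""
    text = json.dumps(document)
    return [match.group(0) for match in _RISKY.finditer(text)]

doc = {
    "http://schema.org/description":
        "You can't embed a <script></script> snippet in JSON-LD embedded in HTML"
}
found = risky_sequences(doc)
```

An author-side build step could refuse to embed a document whenever the returned list is non-empty.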

cc/ @danbri @BigBlueHat

@danbri

danbri commented Dec 2, 2018 via email

@gkellogg
Member Author

gkellogg commented Dec 2, 2018

For example, consider the following JSON-LD:

{
  "http://schema.org/description": "You can't embed a <script></script> snippet in JSON-LD embedded in HTML"
}

Embedding this in HTML creates errors due to illegal data-block content:

<script type="application/ld+json">
{
  "http://schema.org/description": "You can't embed a <script></script> snippet in JSON-LD embedded in HTML"
}
</script>
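To make the failure concrete, here is a minimal sketch that uses Python's standard `html.parser` as a stand-in for an HTML parser. The class `DataBlockExtractor` and the helper `extract_data_block` are hypothetical names for this example, and a browser or a JSON-LD processor may differ in detail.

```python
import json
from html.parser import HTMLParser

class DataBlockExtractor(HTMLParser):
    """Collects the text inside <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self.in_block = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self.in_block = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_block = False

    def handle_data(self, data):
        if self.in_block:
            self.chunks.append(data)

def extract_data_block(page):
    """Return the text that the parser attributes to the JSON-LD data block."""
    parser = DataBlockExtractor()
    parser.feed(page)
    parser.close()
    return "".join(parser.chunks)

PAGE = '''<script type="application/ld+json">
{
  "http://schema.org/description": "You can't embed a <script></script> snippet in JSON-LD embedded in HTML"
}
</script>'''

block = extract_data_block(PAGE)
try:
    json.loads(block)
    parsed_ok = True
except json.JSONDecodeError:
    # The data block was cut short at the embedded </script>.
    parsed_ok = False
```

The parser ends the data block at the first `</script>`, so the extracted text stops in the middle of the JSON string and no longer parses as JSON.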

As noted in Restrictions for contents of JSON-LD script elements, such data must have HTML Character references encoded, such as the following:

<script type="application/ld+json">
{
  "http://schema.org/description": "You can't embed a &lt;script&gt;&lt;/script&gt; snippet in JSON-LD embedded in HTML"
}
</script>

This feature is at risk, as it generates different results after encoding, unless the algorithm also decodes. It's apparent that the SDTT does not do such decoding.
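The difference can be sketched with Python's standard `html` and `json` modules. This is only an illustration of the round trip, under the assumption that `html.escape` with `quote=False` approximates what an author would do; it is not how any particular processor is implemented.

```python
import html
import json

original = json.dumps({
    "http://schema.org/description":
        "You can't embed a <script></script> snippet in JSON-LD embedded in HTML"
})

# Author-side encoding; quote=False keeps the JSON double quotes intact.
encoded = html.escape(original, quote=False)

as_authored = json.loads(original)
without_decoding = json.loads(encoded)               # consumer skips decoding
with_decoding = json.loads(html.unescape(encoded))   # consumer decodes first
```

Only the consumer that decodes character references recovers the authored string; the one that skips decoding sees the literal text `&lt;script&gt;` in the description value.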

@iherman
Member

iherman commented Dec 8, 2018

This issue was discussed in a meeting.

  • RESOLVED: close #100 with note about authoring concern related to HTML characters and </script> and remove existing text on JSON-LD processors handling character encodings
Transcript: 6.1. Handling HTML character references in embedded JSON-LD
Benjamin Young: #100
Benjamin Young: so far most of the work has been Gregg and Dan
Gregg Kellogg: this is a relatively narrow issue - we should solve by not resolving html entities in the API
… the narrow use case is script tags and comments
… so we should document by clarifying in spec language but not resolving the html entities
… if HTML embedded in JSON (which is inside a script tag) closes the script tag, then the rest of the JSON will be treated as body
… in theory you can end up embedding javascript
… I suggest we document and say don’t do this
… and leave browser to defend against this
Benjamin Young: agree this seems not our problem
… people who put end script tag in their json-ld will find that they’ve broken their html
… if people need to escape character entities, they’ll do so, but it will become an html issue not our problem
… while we don’t want to avoid edge cases, this still doesn’t seem to be our issue
… but would love to hear back from Dan on this
… just add note that you could disrupt html parsing by doing this and leave it at that
Adam Soroka: I don’t want to minimize the security concern, but since this is coming from server side, it seems not as threatening
… not as bad as accepting user input
Gregg Kellogg: I would propose that we close this issue and remove all except a non-normative caution note
… that data block should remain valid
… I don’t anticipate anything different on this from schema.org
… this removes some text from API doc and maybe some changes in syntax document where we currently talk about how to escape
Proposed resolution: close #100 with note about authoring concern related to HTML characters and </script> and remove existing text on JSON-LD processors handling character encodings (Benjamin Young)
Gregg Kellogg: +1
Tim Cole: +1
Benjamin Young: +1
Ivan Herman: +1
Adam Soroka: +1
David Newbury: +1
Pierre-Antoine Champin: +1
Resolution #3: close #100 with note about authoring concern related to HTML characters and </script> and remove existing text on JSON-LD processors handling character encodings

@iherman iherman closed this as completed Dec 8, 2018
@ghost ghost removed the needs discussion label Dec 8, 2018
@gkellogg
Member Author

gkellogg commented Dec 8, 2018

Reopening until edits are done.

@iherman
Member

iherman commented Dec 16, 2018

This issue was discussed in a meeting.

  • RESOLVED: close #100 with note about authoring concern related to HTML characters and </script> and remove existing text on JSON-LD processors handling character encodings
