Handling HTML character references in embedded JSON-LD #100
Comments
re Google SDTT, can you point me (well, [email protected]) at a testcase
and its expected vs actual outcome?
On Sat, 1 Dec 2018 at 14:36, Gregg Kellogg wrote:
PR #97 and w3c/json-ld-api#51 resolved embedding JSON-LD in HTML and provide for HTML character references (entities) to be encoded within a data block so that critical elements are not mishandled by HTML parsers. As written, this requires authors to encode sequences such as <script></script> which may occur in JSON-LD strings, and for these to be decoded when being extracted by the API.
It seems that the structured-data testing tool does not do this, and it's not clear what provisions are made for potentially problematic content.
One way to resolve this would be to direct content authors to avoid embedding in HTML any JSON-LD content that could cause such a problem, which seems to be current practice. If character references are encoded, decoding is required at some layer.
cc/ @danbri @BigBlueHat
For example, consider the following JSON-LD:

{
  "http://schema.org/description": "You can't embed a <script></script> snippet in JSON-LD embedded in HTML"
}

Embedding this in HTML creates errors due to illegal data-block content:

<script type="application/ld+json">
{
  "http://schema.org/description": "You can't embed a <script></script> snippet in JSON-LD embedded in HTML"
}
</script>

As noted in Restrictions for contents of JSON-LD script elements, such content needs to be encoded using HTML character references:

<script type="application/ld+json">
{
  "http://schema.org/description": "You can't embed a &lt;script&gt;&lt;/script&gt; snippet in JSON-LD embedded in HTML"
}
</script>

This feature is at risk, as it generates different results after encoding, unless the algorithm also decodes. It's apparent that the SDTT does not do such decoding.
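To make the "different results after encoding" point concrete, here is a minimal Python sketch. It assumes the author encodes the serialized JSON with the standard library's `html.escape` (only one possible way to produce character references) and that the consumer parses the data block with `json.loads`. The variable names are illustrative only.

```python
import html
import json

doc = {
    "http://schema.org/description":
        "You can't embed a <script></script> snippet in JSON-LD embedded in HTML"
}

# Author side: serialize, then encode &, < and > as character references.
# quote=False leaves the JSON string delimiters untouched.
serialized = json.dumps(doc)
encoded = html.escape(serialized, quote=False)

# Consumer that does NOT decode: the references end up inside the string value.
without_decoding = json.loads(encoded)

# Consumer that decodes before parsing: the original value is recovered.
with_decoding = json.loads(html.unescape(encoded))

print(without_decoding == doc)
print(with_decoding == doc)
```

Only the consumer that decodes gets back the original description, which is why encoding on the authoring side requires decoding at some layer.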
This issue was discussed in a meeting.
6.1. Handling HTML character references in embedded JSON-LD

Benjamin Young: #100
Benjamin Young: so far most of work has been Gregg and Dan
Gregg Kellogg: this is a relatively narrow issue - we should solve by not resolving html entities in the API
… the narrow use case is script tags and comments
… so we should document by clarifying in spec language but not resolving the html entities
… if an html embedded in json (which is inside a script tag) closes the script tag, then rest of json will be treated as body
… in theory you can end up embedding javascript
… I suggest we document and say don’t do this
… and leave browser to defend against this
Benjamin Young: agree this seems not our problem
… people who put end script tag in their json-ld will find that they’ve broken their html
… if people need to escape character entities, they’ll do, but it will become an html issue not our problem
… while we don’t want to avoid edge cases, this still doesn’t seem to be our issue
… but would love to hear back from Dan on this
… just add note that you could disrupt html parsing by doing this and leave it at that
Adam Soroka: I don’t want to minimize the security concern, but since this is coming from server side, it seems not as threatening
… not as bad from accepting user input
Gregg Kellogg: I would propose that we close this issue and remove all except a non-normative caution note
… that data block should remain valid
… I don’t anticipate anything different on this from schema.org
… this removes some text from API doc and maybe some changes in syntax document where we currently talk about how to escape
Proposed resolution: close #100 with note about authoring concern related to HTML characters and </script> and remove existing text on JSON-LD processors handling character encodings (Benjamin Young)
Gregg Kellogg: +1
Tim Cole: +1
Benjamin Young: +1
Ivan Herman: +1
Adam Soroka: +1
David Newbury: +1
Pierre-Antoine Champin: +1
Resolution #3: close #100 with note about authoring concern related to HTML characters and </script> and remove existing text on JSON-LD processors handling character encodings
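The authoring concern in the resolution could be checked before publishing. The following is a rough sketch, not anything specified by the group: `find_risky_sequences` is a hypothetical helper, and the pattern covers only the script-tag and comment cases mentioned in the discussion.

```python
import json
import re
from typing import List

# Sequences named in the discussion: script tags and comment openers.
# Case-insensitive because HTML tag names are matched that way.
RISKY = re.compile(r"<!--|</?script", re.IGNORECASE)


def find_risky_sequences(json_ld_text: str) -> List[str]:
    """Return every substring that could disrupt HTML parsing of a data block."""
    return [match.group(0) for match in RISKY.finditer(json_ld_text)]


risky_doc = json.dumps({
    "http://schema.org/description":
        "You can't embed a <script></script> snippet in JSON-LD embedded in HTML"
})
safe_doc = json.dumps({"http://schema.org/name": "A plain description"})

print(find_risky_sequences(risky_doc))
print(find_risky_sequences(safe_doc))
```

A non-empty result means the content should be reworded or kept out of an HTML data block, in line with the "don't do this" guidance.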
Reopening until edits are done.
Per https://www.w3.org/TR/json-ld11/#restrictions-for-contents-of-json-ld-script-elements See the parent issue in w3c/json-ld-syntax#100 where this was created. Fixes #9
PR #97 and w3c/json-ld-api#51 resolved embedding JSON-LD in HTML and provide for HTML character references (entities) to be encoded within a data block so that critical elements are not mishandled by HTML parsers. As written, this requires authors to encode sequences such as <script></script> which may occur in JSON-LD strings, and for these to be decoded when being extracted by the API.
It seems that the structured-data testing tool does not do this, and it's not clear what provisions are made for potentially problematic content.
One way to resolve this would be to provide direction to content authors to avoid any JSON-LD content embedded in HTML which could cause such a problem, which seems to be current practice. If character references are encoded, decoding is required at some layer.
cc/ @danbri @BigBlueHat
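As an illustration of why decoding would have to happen at some layer, the sketch below extracts a data block with Python's standard `html.parser`. Script content is handed over as raw text, so character references inside it arrive undecoded. The class name `JsonLdScriptExtractor` and the sample page are made up for this example, and other HTML parsers are assumed, not verified here, to behave similarly.

```python
import json
from html.parser import HTMLParser


class JsonLdScriptExtractor(HTMLParser):
    """Collect the raw text of <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = None  # list of text chunks while inside a JSON-LD script

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.blocks.append("".join(self._buffer))
            self._buffer = None


page = """<html><head>
<script type="application/ld+json">
{
  "http://schema.org/description": "You can't embed a &lt;script&gt;&lt;/script&gt; snippet in JSON-LD embedded in HTML"
}
</script>
</head></html>"""

extractor = JsonLdScriptExtractor()
extractor.feed(page)
extractor.close()

extracted = json.loads(extractor.blocks[0])
print(extracted["http://schema.org/description"])
```

The printed description still contains the character references, so a consumer would see different data than the author intended unless it decodes them itself.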