Make processing of embedded HTML normative #57

Closed
gkellogg opened this issue Aug 23, 2018 · 10 comments
Comments

@gkellogg
Member

Currently, Embedding JSON-LD in HTML Documents is entirely informative. We've discussed making this normative, requiring JSON-LD processors to be able to identify and extract JSON-LD from a script tag with type application/ld+json within the HTML document.

  • Given multiple such script tags, which one is used?
  • Should we define a parameterized content-type to allow the version to be specified (e.g., application/ld+json;version=1.1)?
  • Does the current document base affect the base for JSON-LD processing?
    • location of HTML document
    • html>head>base@href
    • xml:base of closest ancestor element
  • Does the document language affect the default language for JSON-LD processing?
    • HTTP header: Content-Language
    • @lang, @xml:lang
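
For illustration only, the extraction step under discussion could be sketched as follows in Python, using the standard library's html.parser. The names JsonLdScriptExtractor and extract_jsonld are hypothetical, and ignoring media type parameters such as ;version=1.1 when matching is an assumption, not something this issue has decided.

```python
import json
from html.parser import HTMLParser


class JsonLdScriptExtractor(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.scripts = []  # raw text of each matching script, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            type_attr = dict(attrs).get("type") or ""
            # Assumption: parameters after ";" are ignored when matching the type.
            media_type = type_attr.split(";")[0].strip().lower()
            if media_type == "application/ld+json":
                self._in_jsonld = True
                self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.scripts.append("".join(self._buffer))
            self._in_jsonld = False


def extract_jsonld(html):
    """Return the parsed JSON of every JSON-LD script element, in document order."""
    parser = JsonLdScriptExtractor()
    parser.feed(html)
    parser.close()
    return [json.loads(text) for text in parser.scripts]
```

Returning every matching script in document order leaves the "which one is used?" question to the caller.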
@iherman
Member

iherman commented Aug 24, 2018

I believe we must do that. Embedded JSON-LD is, currently, the only way schema.org data is used, and there are a number of applications (e.g., Web Publications) where this is the sensible way to go.

My take on the questions:

  • Given multiple such script tags, which one is used?

I see two options:

  1. Take the first script element in tree order
  2. Take all of them and merge the resulting graphs in the RDF sense

Both approaches provide a clear specification; I am more in favor of No. 1.

  • Should we define a parameterized content-type to allow the version to be specified (e.g., application/ld+json;version=1.1)?

If we define that for HTTP, then I guess it is necessary to follow that, yes.

  • Does the current document base affect the base for JSON-LD processing?
    • location of HTML document
    • html>head>base@href
    • xml:base of closest ancestor element

I do not think we should go there. Per the HTML spec, the <script> element's DOM Node has a baseURI property, whose exact definition is in the hands of the HTML spec. We ought just to take that one.
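
As a rough sketch of what "taking baseURI" would amount to in the common case, the Python below resolves a relative IRI against the document location, optionally overridden by a base href. The function names document_base and resolve_iri are hypothetical, and the full HTML rules for baseURI cover more cases than this simplification handles.

```python
from urllib.parse import urljoin


def document_base(document_url, base_href=None):
    """Approximate the document base: the <base href> value, if present,
    resolved against the document's own URL; otherwise the document URL."""
    if base_href:
        return urljoin(document_url, base_href)
    return document_url


def resolve_iri(relative_iri, document_url, base_href=None):
    """Resolve a relative IRI found in embedded JSON-LD against the document base."""
    return urljoin(document_base(document_url, base_href), relative_iri)
```

For example, resolve_iri("#me", "http://example.org/people/alice.html") yields "http://example.org/people/alice.html#me", while passing base_href="/data/" moves resolution to "http://example.org/data/".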

  • Does the document language affect the default language for JSON-LD processing?
    • HTTP header: Content-Language
    • @lang, @xml:lang

Yes. I believe that if, for example, somebody uses <script ... lang="fr"> (which is perfectly valid HTML), we ought to use that. So again, whatever is valid for the script element as a node in HTML should be valid for the content of the script element.

@ajs6f
Member

ajs6f commented Aug 24, 2018

@iherman I have to disagree a bit about which approach to take for multiple <script/>s. As a consumer of JSON-LD, I would find it surprising that I could "read" all these assertions in the document (every <script/>), but that only some would be read by machinery (those in the first wrt document order).

But I can also imagine situations in which (e.g. via CMS action) many <script/> elements wind up in a document with no real provenance, but I can clearly identify the one or few of interest to me.

So if we do make processing JSON-LD in HTML normative, do we need to offer a mechanism by which one or more (up to all) <script/>s can be selected from a document at processing time?
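
One possible selection mechanism, sketched here purely as an assumption, is to target a script element by its id through the URL fragment, falling back to all scripts when no fragment is given. The commit messages later in this thread mention "script element targeting using fragments", but the exact behavior shown here is a guess; select_scripts and the (id, document) pair representation are hypothetical.

```python
from urllib.parse import urldefrag


def select_scripts(scripts, url):
    """Select embedded JSON-LD documents for processing.

    scripts: list of (script_id_or_None, parsed_json) pairs in document order.
    url: the URL used to reach the HTML document, possibly with a fragment.
    """
    _, fragment = urldefrag(url)
    if not fragment:
        # Assumption: without a fragment, every script is a candidate.
        return [doc for _, doc in scripts]
    # With a fragment, keep only the script whose id matches it.
    return [doc for script_id, doc in scripts if script_id == fragment]
```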

@gkellogg
Member Author

I agree that if an HTML document has multiple script elements, they should all be considered and merged into a common dataset. My own RDFa processor looks for any script element with a type attribute associated with an RDF reader, along with Microdata and RDF/XML, and extracts triples from all of them.

The issue about choosing among script tags was surfaced for the use case where the context references an HTML document with embedded JSON-LD script(s). In this case, which one would be used as the context, or would they all be used?

@ajs6f
Member

ajs6f commented Aug 24, 2018

In this case, which one would be used as the context, or would they all be used?

Just off the top of my head, I would be a bit worried about a merge in that situation because at least one of those <script/>s might contain a context meant for use with metadata for the page itself (e.g. publishing info, etc.). Perhaps we can offer a syntactic form that prioritizes sources within some larger context?

@iherman
Member

iherman commented Aug 24, 2018

@ajs6f I do sympathize with accepting several scripts, but I am not sure we have a clear story on how we would merge several JSON-LD snippets into one; hence my original proposal of keeping it to one. Would they be like several top-level JSON-LD objects in an array? Is the JSON content simply concatenated as strings? What would the user expect?

I am fine accepting several scripts if we have a clear story on this.
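
To make the alternatives concrete, here is a small Python sketch of the candidates mentioned so far: taking only the first script, collecting all scripts as top-level items of one array, and plain string concatenation. The helper names are hypothetical. Note that the array form is not quite an RDF graph merge either, since blank node labels would be shared across what were separate scripts.

```python
import json


def first_script(docs):
    """Option 1: use only the first script element in tree order."""
    return docs[0] if docs else None


def combine_as_array(docs):
    """Treat each script's content as top-level items of a single array.
    A script that already holds an array contributes its items individually."""
    combined = []
    for doc in docs:
        if isinstance(doc, list):
            combined.extend(doc)
        else:
            combined.append(doc)
    return combined


def concatenation_is_valid_json(texts):
    """Check whether the raw script texts, joined as strings, parse as JSON."""
    try:
        json.loads("".join(texts))
    except json.JSONDecodeError:
        return False
    return True
```

Joining two or more JSON objects as strings does not produce valid JSON, so string concatenation would need some extra convention before it could work.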

@iherman
Member

iherman commented Aug 24, 2018

@gkellogg

I agree that if an HTML document has multiple script elements, they should all be considered and merged into a common dataset. My own RDFa processor looks for any script element with a type attribute associated with an RDF reader, along with Microdata and RDF/XML, and extracts triples from all of them.

I guess what you do is to merge these as RDF Graphs. This is also what I do in my RDFa+microdata processor. We can of course do that for several scripts, too, but I am a bit concerned whether this is something working with our user audience...

@ajs6f
Member

ajs6f commented Aug 24, 2018

@iherman You make a good point. For instance documents, we can indeed go to RDF merge, but contexts... have to think about that! 🤔

@gkellogg
Member Author

but I am a bit concerned whether this is something working with our user audience...

I've actually fielded Linter issues caused by the automatic creation of many (100s) of JSON-LD scripts in a document; I needed to encourage them to consolidate, but yes, it can happen for SEO.

@BigBlueHat
Member

I believe we must do that. Embedded JSON-LD is, currently, the only way schema.org data is used, and there are a number of applications (e.g., Web Publications) where this is the sensible way to go.

@iherman for the record, JSON-LD is the recommended way, but Google (at least) supports RDFa and Microdata for Schema.org extraction: https://developers.google.com/search/docs/guides/intro-structured-data#structured-data-format Additionally, Bing only recently (this past quarter) added JSON-LD support, but prior to that processed both RDFa and Microdata (afaik). Lastly, Open Graph Protocol is popular with sites targeting "social embedding" on Facebook, LinkedIn, etc (it's even in use on this page).

Consequently, I'd love to explore a WG Note (or some such) that helps resolve some of the vagueness around mixing these things together (which happens often).

@BigBlueHat
Member

The issue about choosing among script tags was surfaced for the use case where the context references an HTML document with embedded JSON-LD script(s). In this case, which one would be used as the context, or would they all be used?

Ignoring (for now) the inherent risks of depending on embedded JSON-LD for storing (and extracting) a context expression from within HTML, we could "upgrade" the https://www.w3.org/ns/json-ld#context string from being defined only as a link relationship (as it is currently) so that it can also be used as a profile (or other) media type parameter.

<script type="application/ld+json;profile=https://www.w3.org/ns/json-ld#context">
{"@context": {}}
</script>

That could have interesting potential use in future markup-based graph expressions also: one can imagine an RDFa 2.0 which could lean on JSON-LD based contexts so that any expressed graph content maps to the same names throughout the document. But now I'm probably daydreaming. 😁
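
A processor honoring such a parameter would need to split the type attribute into a media type and its parameters. A minimal Python sketch, assuming the unquoted profile form shown above; parse_script_type and is_context_script are hypothetical names, and the naive split on ";" would mishandle quoted values that themselves contain a semicolon.

```python
CONTEXT_PROFILE = "https://www.w3.org/ns/json-ld#context"


def parse_script_type(type_attr):
    """Split a type attribute into (media_type, parameters)."""
    parts = [part.strip() for part in type_attr.split(";")]
    media_type = parts[0].lower()
    params = {}
    for part in parts[1:]:
        name, sep, value = part.partition("=")
        if sep:
            # Parameter names are case-insensitive; strip optional quotes.
            params[name.strip().lower()] = value.strip().strip('"')
    return media_type, params


def is_context_script(type_attr):
    """True if the script is JSON-LD flagged with the proposed context profile."""
    media_type, params = parse_script_type(type_attr)
    return (
        media_type == "application/ld+json"
        and params.get("profile") == CONTEXT_PROFILE
    )
```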

gkellogg added commits that referenced this issue on Sep 23, Sep 24 (twice), Sep 25 (twice), Sep 26, Oct 3, Oct 17, Nov 5, and Nov 16, 2018, each with the same message:
…extraction, how to deal with multiple script elements and script element targeting using fragments.

Fixes #23 and fixes #57.
@azaroth42 azaroth42 added the satisfied Requirement Satisfied label Nov 18, 2019