Problematic/confusing algorithm text re: the handling of URL @context values #265

Closed
kasei opened this issue Dec 20, 2019 · 19 comments
@kasei
Contributor

kasei commented Dec 20, 2019

Context Processing step 5.2.1 says:

5.2.1) Initialize context to the result of resolving context against the base IRI of the document containing the local context which is established as specified in section 5.1 Establishing a Base URI of [RFC3986]. Only the basic algorithm in section 5.2 of [RFC3986] is used; neither Syntax-Based Normalization nor Scheme-Based Normalization are performed. Characters additionally allowed in IRI references are treated in the same way that unreserved characters are treated in URI references, per section 6.5 of [RFC3987].

This seems rather verbose, as "the document containing the local context" will already have a base IRI. However, there doesn't seem to be any mention of maintaining provenance of base IRIs of local contexts in this algorithm. I believe this is something that could be passed as an optional input variable (defaulting to the base IRI of the active context), but it would require cooperation at the call-sites of the context processing algorithm in the expansion algorithm and create term definition.

Steps 5.2.3 through 5.2.5 say:

5.2.3) If context was previously dereferenced, then the processor MUST NOT do a further dereference, and context is set to the previously established internal representation.
5.2.4) Otherwise, dereference context using the LoadDocumentCallback, passing context for url, and http://www.w3.org/ns/json-ld#context for profile and for requestProfile.
5.2.5) If context cannot be dereferenced, or cannot be transformed into the internal representation, a loading remote context failed error has been detected and processing is aborted. If the dereferenced document has no top-level map with an @context entry, an invalid remote context has been detected and processing is aborted; otherwise, set context to the value of that entry.

These steps are confusing for a few reasons:

  • 5.2.3 talks about a stored version of the "established internal representation" of the dereferenced context IRI, but this seems to be hidden state rather than being passed as an input variable
  • There doesn't seem to be a reason to store the entire representation of dereferenced context documents instead of just the @context entry
  • The error checking portion of 5.2.5 (which is the bulk of the text) seems to only be applicable if the "otherwise" branch of 5.2.4 was taken. Perhaps it should be 5.2.4.1, leaving the setting of context in 5.2.4 (with or without extracting the @context entry, depending on the previous bullet point)
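The caching behavior these steps describe can be sketched like this (a minimal illustration only, not spec text; `load_document` stands in for the LoadDocumentCallback, and the exception names are my own):

```python
class LoadingRemoteContextFailed(Exception):
    """The context could not be dereferenced or parsed (step 5.2.5)."""

class InvalidRemoteContext(Exception):
    """The dereferenced document has no top-level @context entry (step 5.2.5)."""

def fetch_context(url, cache, load_document):
    """Dereference a remote context at most once, per steps 5.2.3-5.2.5.

    `cache` maps URL -> previously established internal representation;
    `load_document` is a stand-in for the LoadDocumentCallback.
    """
    # 5.2.3: a previously dereferenced context MUST NOT be fetched again.
    if url in cache:
        return cache[url]
    # 5.2.4: otherwise dereference, requesting the json-ld#context profile.
    try:
        document = load_document(url, profile="http://www.w3.org/ns/json-ld#context")
    except Exception as exc:
        # 5.2.5: dereference failure aborts processing.
        raise LoadingRemoteContextFailed(url) from exc
    # 5.2.5: the document must be a map with a top-level @context entry.
    if not isinstance(document, dict) or "@context" not in document:
        raise InvalidRemoteContext(url)
    # Retain only the @context entry, per the bullet above, rather than
    # the entire dereferenced document.
    cache[url] = document["@context"]
    return cache[url]
```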
@gkellogg
Member

Context Processing step 5.2.1 says:

5.2.1) Initialize context to the result of resolving context against the base IRI of the document containing the local context which is established as specified in section 5.1 Establishing a Base URI of [RFC3986]. Only the basic algorithm in section 5.2 of [RFC3986] is used; neither Syntax-Based Normalization nor Scheme-Based Normalization are performed. Characters additionally allowed in IRI references are treated in the same way that unreserved characters are treated in URI references, per section 6.5 of [RFC3987].

This seems rather verbose, as "the document containing the local context" will already have a base IRI. ...

The wording, while verbose, is consistent with how we have described IRI resolution since 1.0. What's changed is to make it clear that it is the base IRI of the document, rather than any @base, that is used to do the resolution.

... However, there doesn't seem to be any mention of maintaining provenance of base IRIs of local contexts in this algorithm. I believe this is something that could be passed as an optional input variable (defaulting to the base IRI of the active context), but it would require cooperation at the call-sites of the context processing algorithm in the expansion algorithm and create term definition.

I don't think that it's necessary to burden the algorithm with more options for describing the location of the document being processed, as it's available given that it was dereferenced. Implementations may, of course, use any internal mechanisms they need to remember this.

Steps 5.2.3 through 5.2.5 say:

5.2.3) If context was previously dereferenced, then the processor MUST NOT do a further dereference, and context is set to the previously established internal representation.
5.2.4) Otherwise, dereference context using the LoadDocumentCallback, passing context for url, and http://www.w3.org/ns/json-ld#context for profile and for requestProfile.
5.2.5) If context cannot be dereferenced, or cannot be transformed into the internal representation, a loading remote context failed error has been detected and processing is aborted. If the dereferenced document has no top-level map with an @context entry, an invalid remote context has been detected and processing is aborted; otherwise, set context to the value of that entry.

These steps are confusing for a few reasons:

  • 5.2.3 talks about a stored version of the "established internal representation" of the dereferenced context IRI, but this seems to be hidden state rather than being passed as an input variable

It is implicit that implementations maintain such a memory, without being overly explicit on how they do it.

  • There doesn't seem to be a reason to store the entire representation of dereferenced context documents instead of just the @context entry

Sounds like a good optimization would be to only retain the @context part, do we need to be explicit about this?

  • The error checking portion of 5.2.5 (which is the bulk of the text) seems to only be applicable if the "otherwise" branch of 5.2.4 was taken. Perhaps it should be 5.2.4.1, leaving the setting of context in 5.2.4 (with or without extracting the @context entry, depending on the previous bullet point)

It could have been 5.2.4.1, but it doesn't change the control flow.

IMO, the text is clear as written and implementable without being overly prescriptive.

@kasei
Contributor Author

kasei commented Dec 20, 2019

I don't think that it's necessary to burden the algorithm with more options for describing the location of the document being processed, as it's available given that it was dereferenced. Implementations may, of course, use any internal mechanisms they need to remember this.

But the URL in question isn't the one that was just dereferenced, it's the one that may have been dereferenced to pass the local contexts into the context processing algorithm. That may end up being a dereferenced URL, but not in all cases I think, and may pass through several layers of indirection between the dereference and its eventual use in context processing.

Or have I misunderstood how it's being used? I thought I had it sorted as my current understanding led me to an implementation that does pass the test...

@kasei
Contributor Author

kasei commented Dec 20, 2019

Sounds like a good optimization would be to only retain the @context part, do we need to be explicit about this?

It's not always clear in the text where data is going to be used (either elsewhere in the same algorithm or in other algorithms). This is especially true for implicit state data like this (and somewhat related to the discussion in #260, where at least you can visualize repeated uses of explicit variables in the EDs). Without explicit text, I'm not sure that, as an implementor, I'd be confident based on just the text in 5.2 that I could store only the @context value and throw away the rest of the document.

@kasei
Contributor Author

kasei commented Dec 20, 2019

It could have been 5.2.4.1, but it doesn't change the control flow.

My issue with 5.2.5 is that what seems like the primary action of the step (setting context) is hidden behind a large amount of unrelated error-handling text that is only relevant if the control flow happened to pass through 5.2.4. I think it would be clearer to have context be set in its own stage, and keep error handling related to 5.2.4 code only within 5.2.4.

I can live with this as it is if there's resistance to changing it, but I reported an issue because the text as written tripped me up and only walking through tests and stepping through the algorithms with @azaroth42 got me to a point of understanding (a benefit independent implementors who come along after REC probably won't have!).

@dlongley
Contributor

@kasei,

Could you perhaps recommend some specific text that could go into an informative note such that it wouldn't cause a normative change to the spec but would offer the clarifications you're seeking? For example:

"Note: In order to prevent being overly prescriptive, this algorithm makes no statement about how internal state is kept for dereferenced context documents and broadly applies error handling checks. Provided that the output of the algorithm is unchanged, implementations may employ optimizations such as retaining only the @context part of the dereferenced document or restructuring steps to avoid error handling code that is superfluous under certain conditions."

@kasei
Contributor Author

kasei commented Dec 24, 2019

@dlongley

I think your "retaining only the @context part " language might be reasonable.

I do not think the "broadly [applying] error handling checks" language is appropriate. The thing that needs to be checked in 5.2.5 does not even happen in some invocations of the algorithm. The previous invocation which passed through 5.2.4 carried out the error checking, but the current invocation which passes through 5.2.3 does not dereference or transform anything into an internal representation. I really do think this error checking language should be in 5.2.4, but given how recent updates to my other issues have been handled, I don't see why this would be considered a normative change.

Regarding the IRI resolution using an out-of-band base IRI, however, I do think this needs more bookkeeping steps (as new input variables to Context Processing and Create Term Definition, and in the related invocations in Context Processing steps 5.2.6 and 5.13, and in Create Term Definition 23.3). I would have thought that these new variables/steps would represent a normative change, but since it seems the consensus is that changes that reinforce the expected output of the test suite are not normative, perhaps this is merely editorial.

@gkellogg
Member

@kasei I added a note, based on @dlongley's suggested text, and a couple of others to provide some non-prescriptive guidance to implementers. It's true that this could have been handled by adding parameters to the algorithm, but that would have had a broad impact, and is not the only way that it can be implemented, given that it's contained in the RemoteDocument which is being processed as a context.

Please let us know if this works, or suggest changes that would make it more acceptable.

@kasei
Contributor Author

kasei commented Dec 30, 2019

The note about retaining @context, and the move of error handling in 5.2.4.1 look good. Thank you.

It's true that this could have been handled by adding parameters to the algorithm, but that would have had a broad impact, and is not the only way that it can be implemented, given that it's contained in the RemoteDocument which is being processed as a context.

I'm still not convinced about this provenance tracking issue. Agreed that it's "not the only way that it can be implemented," but surely that's true of the entire document? My concern is that (to the best of my understanding) the provenance tracking here is required for code paths that handle both dereferenced content and non-dereferenced content. In some cases it's important to keep track of the provenance, and in others it's not. I don't think that's a good situation to be in while assuming that it's clear how such tracking can/should be done.

My belief is that these changes (some of these were mentioned above) are sufficient to perform the tracking:

  • addition of a key to track a base IRI in an expanded term definition
  • new input variable to Context Processing
  • pass the tracked base IRI to the recursive call in Context Processing 5.2.5
  • pass the tracked base IRI to Create Term Definition in Context Processing 5.13
  • new input variable to Create Term Definition
  • pass the tracked base IRI in the call to Context Processing in Create Term Definition 23.3
  • set the tracked base IRI during term definition initialization in Create Term Definition 10
  • pass the tracked base IRI found in the term definition (if present) to Context Processing in Expansion 8

While these changes are certainly not trivial, I think they are relatively contained, and (assuming they are correct) provide explicit information on where provenance tracking must occur and where the resulting data is used.
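A rough sketch of how that tracked base IRI might thread through the algorithms (illustrative only, not spec text; the function shape, the `loader` callback, and the way the base IRI is recorded on a term definition are all my own assumptions):

```python
from urllib.parse import urljoin

def process_context(active_context, local_context, base_iri, loader):
    """Sketch of the bookkeeping listed above: `base_iri` is the proposed
    new input variable (the location of the document that supplied
    `local_context`), and each term definition records the base IRI in
    effect when it was created, for later use during Expansion."""
    result = dict(active_context)
    contexts = local_context if isinstance(local_context, list) else [local_context]
    for context in contexts:
        if isinstance(context, str):
            # 5.2.1: resolve against the base of the *containing* document.
            context_url = urljoin(base_iri, context)
            remote = loader(context_url)  # returns the remote @context value
            # Recurse, passing the remote context's own URL as the new base,
            # so that contexts it references resolve against its location.
            result = process_context(result, remote, context_url, loader)
        elif isinstance(context, dict):
            for term, definition in context.items():
                if not term.startswith("@"):
                    # Create Term Definition would record the tracked base IRI
                    # on the expanded term definition (first bullet above).
                    result[term] = {"mapping": definition, "base_iri": base_iri}
    return result
```

For example, a context reference "ctx1" in a document at http://ex.org/dir/doc.jsonld resolves to http://ex.org/dir/ctx1, and any relative reference inside that remote context then resolves against http://ex.org/dir/ctx1, not against the original document.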

@gkellogg
Member

We'll have to take it up on the call; I don't believe such changes are necessary.

@iherman
Member

iherman commented Jan 10, 2020

This issue was discussed in a meeting.

  • RESOLVED: Reject #265 as overreaching at the current stage of the process; changing algorithm signatures has too far-reaching consequences for a clarity change
View the transcript: Handling @context IRI wording
Rob Sanderson: link: #265
Rob Sanderson: the reason we’ve called this one out in particular
… is that the two Greg(g)s have different views on what the exact wording should be
Gregg Kellogg: yes. we went back and forth a bit
… and what Greg was pushing for seemed like a very large rewrite
… and dlongley chimed in a bit also
… and that size change didn’t seem warranted
… the core question is how to resolve a relative context URL
… that text is certainly more vague
… the base of the document is used to resolve that
… but that may not be explicitly passed in
… so dlongley suggested we add a note that we were avoiding being overly prescriptive
… and a few more notes to clarify the intent and use throughout that section
… and without changing the API signature
… given where we are in the publishing process, I don’t think a heavier change is warranted
Rob Sanderson: it wouldn’t be a normative change would it be?
Gregg Kellogg: it depends on what you believe the boundary of the change is
… if we’re changing API signatures and data that’s being passed around
… then that would be a normative change
Pierre-Antoine Champin: and that we might introduce new bugs…
Gregg Kellogg: there are ways implementations can work around this without our changing the API
Rob Sanderson: any questions?
… k. I would suggest that we reject this issue on the grounds that we believe this would be a breaking change
… while it might be clearer to do it the way Greg suggests, the current way works without needing to change API signatures
Gregg Kellogg: no, it doesn’t change the API signature, but the algorithm signature
… but nonetheless it would have far-reaching API ramifications
Rob Sanderson: pchampin what are your thoughts on this one?
Pierre-Antoine Champin: I’m with Greg on this one
… with gkellogg on this one
… the potential for new bugs is quite high
Proposed resolution: Reject #265 as overreaching at the current stage of the process; changing algorithm signatures has too far-reaching consequences for a clarity change (Rob Sanderson)
Ivan Herman: +1
Gregg Kellogg: +1
Rob Sanderson: +1
Pierre-Antoine Champin: +1
Tim Cole: +1
Benjamin Young: +1
Jeff Mixter: +1
Resolution #2: Reject #265 as overreaching at the current stage of the process; changing algorithm signatures has too far-reaching consequences for a clarity change
Gregg Kellogg: I did mark the issue as rejected
… but I didn’t know if there was more to do to close this up
Rob Sanderson: we should call it out as an issue that was not accepted
… and point to this resolution as the rationale for the rejection
… we’ve discussed it, considered it, and rejected the full extent of the change
… but we plan to add notes to clarify things to avoid future confusion
Ivan Herman: I checked with Ralph about what triggers a full CR or not
… and it’s largely under the control of the WG
… there’s no algorithm for when it should go back to the Director
… this one does seem to be on the borderline
… and it would restructure the algorithm heavily
… and all the other implementers would have to cross check things against the new changes
… if it’s a bug, then it would be justifiable
Rob Sanderson: it’s not wrong to implement it the way gkellogg has described it
Ivan Herman: no. of course not
… no one said they have to follow these algorithm lines specifically
… they just have to pass the tests
Gregg Kellogg: yes. we say that explicitly actually
Ivan Herman: and if they pass it without following all the algorithm steps, that’s fine
Rob Sanderson: any more on this one?

@kasei
Contributor Author

kasei commented Jan 10, 2020

@gkellogg @iherman I have some questions regarding how this was handled. From today's minutes:

@gkellogg : there are ways implementations can work around this without our changing the API

Can you please explain how other ways of implementing this could happen without the approach I've described? To keep track of the provenance of downloaded contexts, mustn't the current algorithm implicitly be passing around this information as I've explicitly described?

As this has been rejected by the WG, can you at least confirm for me that the approach I've described will handle the tracking of context provenance as intended? Given that this happens implicitly in recursive code paths that only sometimes require this tracking, I am not entirely confident that what I've described will cover the intention of the spec. If the spec text can't be changed, I'd be much more comfortable if somebody from the WG could at least confirm that I've understood the intention correctly.

@azaroth42
Contributor

For the second part, we did discuss that your approach is not only correct but even a better way to do it ... but that the extent of the change was too high, given that the algorithm is not wrong, just not as clear or transparent.

Explaining the current text I leave to @gkellogg :)

@gkellogg
Member

@kasei It's entirely reasonable for you to use the approach you've outlined. It's not necessary for implementations to precisely follow the algorithms, and this is a case where developer initiative is necessary to meet the intention of the algorithm. There may be other state or options that an implementation may find useful for implementing the algorithms which aren't strictly spelled out.

My own implementation keeps a context_base accessor as part of a Context object, which is set to the location the context was loaded from and is used to resolve subsequent loads. The original call to create a Context uses a base option, set to the location of the document which is loading the context.

The algorithm text makes it clear that relative IRIs for contexts are loaded relative to the location of the document containing the reference, so these mechanisms help to maintain that. Another implementation may do it differently.
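Roughly, the pattern looks like this (a sketch with made-up names, not the real code; the `loader` callback and method shapes are assumptions):

```python
from urllib.parse import urljoin

class Context:
    """Illustrative sketch of keeping a context_base on the context object."""

    def __init__(self, base=None):
        # `base` option: location of the document that is loading the context.
        self.base = base
        self.context_base = None  # set once a remote context is loaded
        self.definitions = {}

    def resolution_base(self):
        # Relative context URLs resolve against the location the current
        # context was loaded from, falling back to the document's base.
        return self.context_base or self.base

    def load_remote(self, reference, loader):
        url = urljoin(self.resolution_base(), reference)
        child = Context(base=self.base)
        child.context_base = url  # remember where this context came from
        child.definitions = loader(url)  # hypothetical document loader
        return child
```

With this in place, a context loaded from http://ex.org/dir/ctx1 that itself references "sub/ctx2" resolves that reference against its own location, as the algorithm text requires.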

It was felt that the changes to the algorithm signature, and all the places it is invoked could have a destabilizing effect, and weren't strictly necessary for implementations to be able to interpret the spec and pass tests.

@azaroth42
Contributor

Re-opening this issue on the basis that we have now committed to re-entering CR, and thus such a change should be reconsidered in this new context.

@azaroth42 azaroth42 reopened this Jan 17, 2020
@iherman
Member

iherman commented Jan 24, 2020

This issue was discussed in a meeting.

  • RESOLVED: Add text to api to clarify the need for tracking of base IRIs during context processing
View the transcript: Confusing context URL handling
link: #265
Rob Sanderson: We concluded that the change would be good, but that it was too big.
… Now that we dropped out of CR, we should discuss whether the change is useful.
… Should we do that now?
… gkellogg, will this be an editorial change that doesn’t need a resolution?
Gregg Kellogg: It will change the algorithm signature.
… We need a way to keep track of the resolved context URL to use when resolving future context URLs.
… It impacts the places that call it to pass a value around.
Ivan Herman: If you change the parameters, how many implementations will have to change?
Gregg Kellogg: We don’t change tests. Implementations do not have to follow the algorithm exactly.
Ivan Herman: Does anyone besides kasei follow the spec exactly?
… Do we know if we are affecting anyone else?
Gregg Kellogg: This was behaviour that was expected in 1.0.
Ivan Herman: If you do this change, do the implementations of Ruben and Dave have to change?
Ruben Taelman: I don’t follow the algorithm.
Dave Longley: Agree with Gregg, I don’t think we need to change anything in JS
Proposed resolution: Add text to api to clarify the need for tracking of base IRIs during context processing (Rob Sanderson)
Ivan Herman: +1
Gregg Kellogg: +1
Rob Sanderson: +1
Ruben Taelman: +1
Dave Longley: +1
Benjamin Young: +1
Harold Solbrig: +1
Adam Soroka: +1
Resolution #5: Add text to api to clarify the need for tracking of base IRIs during context processing
Pierre-Antoine Champin: +1

@gkellogg
Member

@kasei I mostly followed the steps you indicated, but there were many ripple effects. Please give it a look over and see if it seems right.

@gkellogg
Member

@kasei still waiting to see if this solves the issue for you.

@kasei
Contributor Author

kasei commented Feb 10, 2020

@gkellogg Apologies. #356 seems to handle this differently in some places than the way I implemented it, so it's taking me some time to sort out the effects. I'll try to get you an answer ASAP.

@kasei
Contributor Author

kasei commented Feb 14, 2020

@gkellogg I am also going to create new issues representing problems in #356. Beyond these, the PR has broken some tests for me and I'm not yet clear if they represent actual problems; I'll keep you updated as I try to get back to passing the tests.
