Rethinking datasets and graphs? #30

iherman · 2018-07-02T15:06:04Z

I decided to start with some controversy:-)

In short, I have always been confused by the ways Datasets are treated in JSON-LD, and I propose to re-open that can of worms. I have jotted down my idea in a separate wiki page (it would have been too long for an issue).

TL;DR: My proposal is to start from scratch, i.e., deprecating @graph and replacing the functionalities with something cleaner. See the wiki page...

iherman · 2018-07-02T15:09:16Z

I realize we have an issue with backward compatibility. I would therefore propose that we declare @graph as deprecated, but not removed. I.e., the old features remain valid, and we use a new keyword instead (@dataset). This means that, alas!, 1.1 implementations should implement @graph, too, though we should not include, imho, the graph container feature (we will have to see what to replace it with if necessary).

gkellogg · 2018-07-02T21:25:19Z

I updated the wiki with my thoughts. I think we should continue to use @graph, and we can adapt for some of the issues you mention.

The graph container feature is required for Verifiable Credentials, and I suspect that @msporny and @dlongley would object to its being removed.

Creating a top-level dataset using a map structure could be accomplished by leveraging the existing container semantics as reproduced below:

{
  "@context": {
    ...
    "dataset": {"@id": "@graph", "@container": ["@graph", "@id"]}
  },
  "dataset" : {
    "URL1" : {
        Some RDF statements here
    },
    "URL2" : [
        {
            We could also define a bush just like above
        },
        {

        }
    ],
    "@none" : [{
        Default graph statements here
    }]
  }
}

This just says that the use of the "dataset" term treats it like @graph, except to use the "@container": ["@graph", "@id"] mechanism to define a graph map. This avoids needing to introduce values in the default graph that reference the named graph identifiers.
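A minimal sketch of how such a graph map could be read, assuming that "@none" stands for the default graph as proposed here (a given processor may behave differently). The function name `graphs_from_map` and the example.org IRIs are hypothetical and only serve the illustration:

```python
import json


def graphs_from_map(dataset_map):
    """Yield (graph_name, node) pairs from a graph map.

    Assumption: the "@none" key denotes the default graph, reported
    here as None. Each value may be a single node or a list of nodes.
    """
    for name, value in dataset_map.items():
        graph_name = None if name == "@none" else name
        nodes = value if isinstance(value, list) else [value]
        for node in nodes:
            yield graph_name, node


# Hypothetical document shaped like the example above.
doc = json.loads("""
{
  "dataset": {
    "http://example.org/g1": {"@id": "http://example.org/a"},
    "http://example.org/g2": [
      {"@id": "http://example.org/b"},
      {"@id": "http://example.org/c"}
    ],
    "@none": [{"@id": "http://example.org/d"}]
  }
}
""")

pairs = [(g, n["@id"]) for g, n in graphs_from_map(doc["dataset"])]
for graph_name, node_id in pairs:
    print(graph_name or "(default graph)", node_id)
```

The point of the sketch is only that each key of the map names a graph directly, so no statement in the default graph has to mention the graph identifiers.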

msporny · 2018-07-02T21:38:54Z

TL;DR: My proposal is to start from scratch, i.e., deprecating @graph and replacing the functionalities with something cleaner.

What is the problem or documented author issue we are attempting to solve here? (You will see that I will repeat this question for every new feature/deprecation proposed for JSON-LD 1.1). :)

@graph is something that was designed to be used in @contexts... now, some may be using it in JSON-LD markup, which is okay... but I hesitate to say that it's a best practice. We had originally designed @graph to hold data that is digitally signed and needed to exist in a separate graph from the signature. So, @graph was primarily designed so we can digitally sign information (and the most natural way to do that is to use datasets). @graph wasn't something that most developers/authors would be exposed to.

I get the conceptual purity argument, but I haven't seen folks complaining about @graph. I fully admit that we may not have been exposed to those authors/developers... but again, I'd like to see them writing about this issue rather than deprecating a JSON-LD feature before seeing that data.

also,

I decided to start with some controversy:-)

😆 -- nice to see that your sense of humor hasn't changed, @iherman.

msporny · 2018-07-02T21:44:58Z

in a format [JSON-LD] that is, at the end of the day, the serialization of RDF

I know I'm sounding like a broken record at this point, but JSON-LD is not primarily a serialization of RDF. It's a graph-based syntax that just so happens to losslessly convert to and from RDF. I think people think I'm kidding when I say this, I'm only half-kidding... JSON-LD started off by attempting to create a new graph model syntax that Web developers would use... RDF compatibility was not as important to our organization as it was to the existing RDF community and it continues to not be a primary goal.

iherman · 2018-07-03T04:53:12Z

@gkellogg not 100%...

I tested in the dev. version of playground:

{
  "@context": {
    "@version": 1.1,
    "dataset": {"@id": "@graph", "@container": ["@graph", "@id"]}
  },
  "dataset" : {
    "http://www.ex.org/1" : {
        "@id" : "http://www.ivan-herman.net",
        "http://a.b.c/1" : "Ivan Herman"
    },
    "http://www.ex.org/2" : [
        {
         "@id" : "http://p.q.r",
         "http://a.b.c/2" : "Somebody else"
        },
        {
         "@id" : "http://x.w.z",
         "http://a.b.c/3" : "And somebody else again"
        }
    ],
    "@none" : [{
        "@id" : "http://www.w3.org",
        "http://a.b.c/4" : "Nobody"
    }]
  }
}

and what I got was:

<http://p.q.r> <http://a.b.c/2> "Somebody else" <http://www.ex.org/2> .
<http://www.ivan-herman.net> <http://a.b.c/1> "Ivan Herman" <http://www.ex.org/1> .
<http://x.w.z> <http://a.b.c/3> "And somebody else again" <http://www.ex.org/2> .
<http://www.w3.org> <http://a.b.c/4> "Nobody" _:b0 .

Note the last line: I did not get statements in the default graph, but in yet another graph with a blank node as an id...

But yes, it is pretty close.
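To make the difference easy to check, here is a rough sketch that counts the statements per graph in the output above. The regular expression is a deliberately simplified N-Quads matcher (an assumption of this sketch: no escaped quotes, language tags, or datatypes), and `count_by_graph` is a hypothetical helper name:

```python
import re
from collections import Counter

# Simplified N-Quads line: subject, predicate, object, optional graph label.
NQUAD = re.compile(r'^(\S+) (\S+) (".*?"|\S+)(?: (\S+))? \.$')


def count_by_graph(nquads):
    """Count statements per graph label; None stands for the default graph."""
    counts = Counter()
    for line in nquads.strip().splitlines():
        match = NQUAD.match(line.strip())
        if match is None:
            raise ValueError(f"unrecognised line: {line!r}")
        counts[match.group(4)] += 1
    return counts


output = """
<http://p.q.r> <http://a.b.c/2> "Somebody else" <http://www.ex.org/2> .
<http://www.ivan-herman.net> <http://a.b.c/1> "Ivan Herman" <http://www.ex.org/1> .
<http://x.w.z> <http://a.b.c/3> "And somebody else again" <http://www.ex.org/2> .
<http://www.w3.org> <http://a.b.c/4> "Nobody" _:b0 .
"""

by_graph = count_by_graph(output)
for graph, n in by_graph.items():
    print(graph, n)
```

No statement ends up under the `None` key, i.e., nothing lands in the default graph.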

iherman · 2018-07-03T05:06:04Z

@msporny

What is the problem or documented author issue we are attempting to solve here?

If I am the only one who is constantly confused about how to use the @graph term then I will of course shut up, and assign it to my own deficiencies. I would have no problem accepting that. But I do believe that the usage of @graph is confusing. As I tried to show, it confuses terms and imposes restrictions (like the usage of blank nodes in graph containers, which generates RDF that would be unusable in SPARQL). Its current usage for representing bushes in JSON-LD is confusing, and it is not obvious how to represent elementary datasets (it is possible only via a conceptually complex trick). If the reader cannot gain a clear mental model of what is happening, then the only way of encoding data would be to copy and paste from the examples without really understanding them, which is a problem (in my view).

About the role of JSON-LD: history is what it is, but that is now bygone. JSON-LD has been "marketed", and I daresay extremely successfully so, as an RDF serialization format, too. This is what it has become today, and it is used as such by various communities. We have to take this connection seriously and try to improve the purity of the relationship. I.e., if our syntax leads to a confusion of RDF Graphs and RDF Datasets I do see that as a problem.

gkellogg · 2018-07-03T05:06:51Z

No, it’s not implemented in the spec just now, but would be a logical thing to do. Similar to adding @container on @type, which is also being considered elsewhere.

iherman · 2018-07-03T05:07:48Z

@gkellogg

The graph container feature is required for Verifiable Credentials, and I suspect that @msporny and @dlongley would object to its being removed.

As I said, I have not thought through how to include the graph container feature; I was not saying that the feature itself should be removed. Just its current syntax.

iherman · 2018-07-03T05:09:03Z

@gkellogg

What about the separate proposal to represent bushes via a simple cross-reference to contexts?

ericprud · 2018-07-03T08:14:37Z

I understand Manu's (provocative) point about JSON-LD being a graph language first and RDF-compatible second. I believe that the @container: @graph construct doesn't behave as one would expect in a graph language, i.e. that a property points at a non-rooted graph. For instance:

{ "@context": {
    "@version": 1.1,
    "p1": {
      "@id": "http://vocab.ex/p1",
      "@container": "@graph"
    },
    "p2": { "@id": "http://vocab.ex/p2" },
    "p3": { "@id": "http://vocab.ex/p3" },
    "p4": { "@id": "http://vocab.ex/p4" }
  },
  "p1": {
    "p2": {
      "p3": "v3",
      "p4": "v4"
    }
  } }

emits the dataset:

_:b0 ex:p1 _:b1 .
GRAPH _:b1 {
  _:b2 ex:p2 _:b3  .
  _:b3 ex:p3 "v3"  .
  _:b3 ex:p4 "v4"  .
}

Navigating in JSON-land, <p1> is strongly-connected to the object with a <p2> property (_:b2, in RDF land). If I'm navigating this as a graph (RDF graph, property graph, Spark's variant of Cypher, etc), <p1> connects to a bag of triples. The application has to be working with a known schema and valid data to discover _:b2.

A solution that will be irksome to some and blindingly obvious to others is to give the subjects of the nested properties (i.e. <p2>) the same identity as that of the graph. In such a schema,

  "p1": {
    "p2": {
      "p3": "v3"
    },
    "p5": "v5"
  }

would look like:

_:b0 ex:p1 _:b1 .
GRAPH _:b1 {
  _:b1 ex:p2 _:b2  .
  _:b2 ex:p3 "v3"  .
  _:b1 ex:p5 "v5"  .
}

This would eliminate a lot of fuzzy heuristics from query/update/validation.
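The two behaviours can be compared with a small sketch. This is not the JSON-LD to RDF algorithm, only a toy serializer for a value of a graph-container property named p1; the function `to_quads`, its `reuse_graph_name` flag, and the blank node numbering are all assumptions made for the illustration:

```python
from itertools import count


def to_quads(value, reuse_graph_name):
    """Serialize {"p1": value}, where p1 is a graph container.

    Returns (subject, predicate, object, graph) tuples; graph is None
    for the default graph. With reuse_graph_name=True the top node of
    the nested value takes the graph name as its identifier, which is
    the change proposed above.
    """
    ids = count()

    def fresh():
        return f"_:b{next(ids)}"

    quads = []

    def emit_node(node, subject, graph):
        for prop, val in node.items():
            if isinstance(val, dict):
                obj = fresh()
                quads.append((subject, prop, obj, graph))
                emit_node(val, obj, graph)
            else:
                quads.append((subject, prop, f'"{val}"', graph))

    top = fresh()    # node that carries p1
    graph = fresh()  # name of the graph that p1 points at
    quads.append((top, "p1", graph, None))
    root = graph if reuse_graph_name else fresh()
    emit_node(value, root, graph)
    return quads


current = to_quads({"p2": {"p3": "v3", "p4": "v4"}}, reuse_graph_name=False)
proposed = to_quads({"p2": {"p3": "v3"}, "p5": "v5"}, reuse_graph_name=True)
for quad in proposed:
    print(quad)
```

In the `proposed` result the object of p1 is itself the subject of the p2 and p5 statements, so a consumer can step from p1 into the graph without guessing where its root is.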

BigBlueHat · 2018-07-03T14:50:18Z

@ericprud looks like some wee typos in your last examples (i.e. what happened to p5 and v5? and where did p4 come from?). Could you fix those? Thanks!

BigBlueHat · 2018-07-03T14:54:02Z

We have to take this connection seriously and try to improve the purity of the relationship.

While I do agree with @iherman about the importance of JSON-LD as a serialization of RDF, I'll also 👍 @msporny's statements that (even regardless of history), JSON-LD's appeal reaches farther than just "RDF-land."

It may be a tricky balance to strike when addressing situations like this, but it's clear that "confusion" is subjective.

I've no clear technical suggestions at this point, other than that we at least have more to clarify and exemplify (i.e. improve our examples) and would prefer we start there...and see what's still missing.

gkellogg · 2018-07-03T15:12:06Z

@iherman while adding referenceable contexts is feasible, I think it’s a big step, and I don’t think it’s necessary.

@ericprud your thought about preserving the graph name as the implicit subject of triples in the referenced graph has merit, and does solve the nasty rooting problem.

@msporny we settled on using the RDF model as the basis for JSON-LD not just to appease the RDF 1.1 WG, but because it didn’t make sense to introduce yet another model. I think it’s important that the JSON-LD surface syntax remain usable by developers that don’t care, but we need to make sure the underpinnings have a good basis in theory. Perhaps we can use JSON-LD to push forward on some emerging areas of interest, such as property graph alignment via RDF*.

iherman · 2018-07-03T15:15:52Z

@iherman while adding referenceable contexts is feasible, I think it’s a big step, and I don’t think it’s necessary.

Because? I believe a proper and clean representation of bushes is very important and we do not have that (I do not consider the usage of @graph as "clean"...)

msporny · 2018-07-03T15:52:10Z

@msporny we settled on using the RDF model as the basis for JSON-LD not just to appease the RDF 1.1 WG, but because it didn’t make sense to introduce yet another model.

We did introduce another model:

https://json-ld.org/spec/latest/json-ld/#data-model

Yes, it is compatible but the JSON-LD data model is a superset of the RDF data model. You can express things in JSON-LD that you cannot in RDF, that was a very intentional strategy. I suggest that we keep it that way to continue to push RDF 1.1 into the modern world. Native support for RDF lists, anyone? :)

That strategy pushed the RDF 1.1 WG to add a few important features (named graphs, some would argue dataset support). I'll note that the JSON-LD data model is an extension of the RDF data model because RDF 1.1 didn't adopt all of the JSON-LD data model features.

I think it’s important that the JSON-LD surface syntax remain usable by developers that don’t care, but we need to make sure the underpinnings have a good basis in theory.

+1, as long as the realignment to theory doesn't deprecate features that are working just fine for the rest of us. It feels like this discussion is trying to fix a non-issue for JSON-LD authors. Yes, I readily admit that maybe the theoretical underpinnings aren't clean, but if you want something that's clean -- use TRiG. There are other languages that will give that to you.

JSON-LD is meant to be an everyday developer/author tool... we don't need to expose every thing in RDF to those folks (and I'd argue that if we do, JSON-LD will eventually fail).

The primary design criterion for JSON-LD is to help developers build better systems... being theoretically clean is very far down the list of priorities... and I'm very concerned that if we focus too much on that, we will turn JSON-LD into something that has so many bells and whistles attached to it that it loses its value and we'll be forced to do a JSON-LD Lite just like we were forced to do that for RDFa.

Perhaps we can use JSON-LD to push forward on some emerging areas of interest, such as property graph alignment via RDF*.

Do we have a list of features for JSON-LD 1.1 with priority based on group interest? Can we do some ranked choice voting on that so we don't spend a lot of time discussing changes to JSON-LD that are low priority?

msporny · 2018-07-03T15:56:09Z

Because? I believe a proper and clean representation of bushes is very important and we do not have that (I do not consider the usage of @graph as "clean"...)

What use case is not possible because of this missing feature?

iherman · 2018-07-03T16:01:14Z

Wow, this discussion is getting a little bit out of hand. It was not my intention to start an RDF vs. non-RDF controversy. Can we avoid getting into this discussion?

@msporny

What use case is not possible because of this missing feature?

The question is not whether something is not possible. Yes, it is possible to express a bush with @graph, and I did not say otherwise. My claim is that it is complicated, way more complicated and counter-intuitive than necessary. The whole document is geared towards a special subset of graphs that are all rooted, but that is not the only use case out there.

My goal is to make JSON-LD easy to understand and use. In my experience, in the area of datasets and bushes, it is not. Obviously, you do not feel there is a problem with this. Let us try to pause here a bit, because whether it is easy or not is obviously a subjective statement; I would like to see the reactions of others in the group, that would begin to give a good sample.

msporny · 2018-07-03T16:48:03Z

Wow, this discussion is getting a little bit out of hand. It was not my intention to start an RDF vs. non-RDF controversy.

Hey man, you're the one that wanted to be controversial. 😜

BigBlueHat · 2018-07-03T20:35:54Z

@iherman if you just want a bush, isn't this sufficient:

[
  {"@context": "http://schema.org/", "name": "Ivan"},
  {"@context": "http://example.com/schema", "name": "Pluto"}
]

...or if you want a default @context value, you'd reshape it this way (as I'm sure you know):

{
  "@context": "http://schema.org/",
  "@graph": [
    {"@context": "http://schema.org/", "name": "Ivan"},
    {"@context": "http://example.com/schema", "name": "Pluto"}
  ]
}

I've not been tripped up by that, conceptually. And even giving that last example an @id and/or other top-level properties has all made sense to me (for what little that's worth). I do, however, get a bit tangled up by the new "@container": "@graph" + @none (for statements about the "named graph" itself). That's likely a separate issue?

Overall, maybe there's a way we can narrow in on the exact concerns? The wiki page was great for an overview, but I guess I didn't find the solution(s) any clearer than the present approach.

gkellogg · 2018-07-03T22:47:48Z

@iherman Looking at your "addressable context" mechanism from the wiki:

[
   {
       "@context" : {
           "@id" : "_:a"
           ...
       }
   },
   {
     "@context" : "_:a",
     "@id" : "http://www.example.org/1",
     "http://a.b.c" : "something"
   },{
     "@context" : "_:a",
     "@id" : "http://www.example.org/2",
     "http://d.e.f" : "something"
   }
]

My concern here is that this implies that a context with "@id": "http://example/ctx" would be the same as a context loaded from "http://example/ctx". This may be the case, but nowhere else in JSON-LD is the assumption that loading a document from a location implies that the document has an @id that's the same as that location, unless it has "@id": "". This would seem to be creating a precedent for contexts.

Moreover, what if you had a remote context at "http://example.org/foo", which looked like the following:

{
  "@context": {
    "@id": "http://example.org/bar",
     ...
  }
}

What is the address of this context, "http://example.org/foo", or "http://example.org/bar"? Right now, if I use "@context": "http://schema.org", I either load it, or use the version already loaded from that address.

In short, I think that this raises some issues that may muddy the waters, all to "clean up" the use of @graph for describing a bush, which is well established practice by now.

iherman · 2018-07-04T05:16:59Z

@gkellogg yes, I find your argument compelling indeed. It may require a specific addressing mechanism, orthogonal to @id which, I admit, is not nice either.

iherman · 2018-07-04T05:21:32Z

@BigBlueHat and others: it seems that I am in the minority with my uneasiness. I find the fact that the same keyword (@graph) is used both for a bush and for datasets extremely confusing and "dirty", but I obviously won't lie down in the road if I am the only one.

iherman · 2018-07-04T05:24:00Z

One positive thing that may have come out of this discussion: #30 (comment) shows a way to produce a dataset very cleanly, provided that @none works. Personally, I would prefer to have @dataset as a standard, but the approach may be considered a standard idiom by users...

ericprud · 2018-07-04T09:25:28Z

I believe that @iherman's proposal to distinguish datasets with @dataset has low cost and good value:

The most compelling JSON-LD use cases demand dual access (JSON tree and RDF graph navigation). I'd estimate this at 90% of JSON-LD's value, though in my experience it's been closer to 100%.
Some folks exchange non-framed JSON-LD but their use of it is as a commodity serialization. The group most affected is a smallish set of engineers writing serializers and parsers.

For the most part, non-expert human eyes rarely fall on non-framed JSON-LD with keywords like @graph and @dataset. For those folks (let's call them experts-to-be), a clear model which distinguishes graphs from datasets is of greater value than the adoption of the legacy keyword @graph to mean a dataset.

BigBlueHat · 2018-07-06T13:44:33Z

I'm not sure the introduction of another keyword makes any of this any clearer...and doing so would certainly raise the "expert" bar a bit by requiring an understanding of the differences between a Named Graph and a Dataset--which seems to be unclear to (or at least debated by) the people defining the terminology--see https://www.w3.org/TR/rdf11-datasets/ linked earlier.

From that Note:

Defining the semantics of RDF datasets requires an understanding of the two following issues:

what the graph names (IRI or blank node) denote, or what are the constraints on what the names can possibly denote;

how the triples in the named graph influence the meaning of the dataset.

...
Depending on the assumptions taken with respect to these two issues, the formalization of the semantics of RDF datasets can vary very much.

Perhaps it would be helpful if someone (who cares deeply about this issue) were to go through the list of interpretations represented in that Note and present the various JSON-LD expressions for each +/- any confusion they think is represented by the current expression options and/or proposals to fix them.

That would help me at least, and perhaps at least narrow the discussions here a bit more.

gkellogg · 2018-08-02T16:51:55Z

So as not to lose @ericprud's comment about making a change to "@container": "@graph" to align the blank node table used to identify the graph with the implicit subject of the node contained within the graph, please create another issue for this to be considered (action on @ericprud).

I believe this directly relates to the ability to validate Verifiable Claim named graphs from a data-model perspective, rather than just a JSON Schema perspective. Potentially, the contents of such a named graph could have a very large number of statements, which makes it computationally impractical to find the "root" of the graph by searching for statements with a subject (@id) which is not the value of some other node (object of a statement). We might go so far as to describe a subset of named graphs where the graph name is the same as the primary subject of the graph.
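A minimal sketch of the root search described above, assuming the graph is available as plain (subject, predicate, object) tuples; `find_roots` is a hypothetical helper, not part of any JSON-LD API:

```python
def find_roots(triples):
    """Return subjects that never appear as the object of a statement.

    Every statement has to be visited to build both sets, which is the
    cost that becomes a concern for very large named graphs.
    """
    subjects = {s for s, _, _ in triples}
    objects = {o for _, _, o in triples}
    return subjects - objects


# The contents of the named graph from the earlier example in this thread.
triples = [
    ("_:b2", "ex:p2", "_:b3"),
    ("_:b3", "ex:p3", '"v3"'),
    ("_:b3", "ex:p4", '"v4"'),
]

roots = find_roots(triples)
print(roots)
```

If the graph name were the same as the primary subject, this scan would not be needed: the root would be known as soon as the graph name is known.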

iherman · 2019-02-09T15:06:09Z

This issue was discussed in a meeting.

No actions or resolutions

View the transcript

datasets and graphs
Rob Sanderson: ref: #30
Ivan Herman: I don’t like the way that this is done, but it turned into a philosophical argument, and I can just close it.
Ivan Herman: to clarify, I want to close it because it’s way too late.

iherman added spec:substantive spec:enhancement labels Jul 2, 2018

This was referenced Aug 23, 2018

@type as @container:@set? #34

Closed

Ensure that blank node identifiers for anonymous graphs are reused w3c/json-ld-api#26

Closed

iherman mentioned this issue Feb 6, 2019

TriG graphs in JSON-LD #128

Closed

azaroth42 added the propose closing label Feb 8, 2019

iherman closed this as completed Feb 8, 2019
