-
Notifications
You must be signed in to change notification settings - Fork 176
feat: add anchor for non-ordinal plan reference #726
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This is technically true, but on the other hand we are already doing the same thing with function extensions when the number of functions used in a typical plan most likely will be a lot bigger than the number of subtrees. I don't think it makes sense to keep the existing behavior only due to performance considerations.
I think leaving both options especially w/o oneof will lead to a lot more confusion and possible divergence in implementations. My choice would be to deprecate
Pretty sure putting an existing field in a oneof is a backwards-compatible change. |
I should have time to finish this up in the next few days |
proto/substrait/algebra.proto
Outdated
// in case we refer to a RelRoot field names will be ignored | ||
message ReferenceRel { | ||
// points to a PlanRel in Plan.relations (ordinal reference) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's put these in on a one of (subtree_reference and subtree_ordinal) and deprecate subtree ordinal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I named the oneof ref_type
since that seemed descriptive enough without being too verbose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jacques-n is there a reason to both deprecate the existing field AND put both of the deprecaetd and new field in a oneof. It feels noisy to create a oneof if the eventual plan is to remove one of the fields, and hence the need for the oneof.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Talking through this in the Substrait sync with @jacques-n, we agreed that keeping this out of a oneof
make sense.
For migration purposes, its actually useful to allow both to be set for a while. Effectively, we can add support for this in producers and consumers in a decoupled fashion, and once everything supports both, remove the old field.
As part of this, we would need to document that the new relation anchor should be preferred when both are set.
Created a oneof field, "ref_type," (should expand to `ref_type_case` for generated code) that captures both types of references. The old type, subtree_ordinal, has also been deprecated. This addresses feedback in the PR tagged below. PR: substrait-io#726
Created a oneof field, "ref_type," (should expand to `ref_type_case` for generated code) that captures both types of references. The old type, subtree_ordinal, has also been deprecated. This addresses feedback in the PR tagged below. PR: substrait-io#726
1cd1c49
to
bf76bb9
Compare
rebased on main |
hm, according to a stackoverflow post, moving an optional field into a |
By making the anchor on the plan object does that mean you can no longer use a ReferenceRel to point to a subtree of the existing plan? For instance, TPC-H 15 processes the same subtree twice and often ends up with different results (when using float). By using a reference you could process the subtree once and point to it avoiding differences. |
I think you may be misunderstanding where the anchor is? I don't think of the PlanRel as the Plan object
The change is essentially to change a relative ID (ordinal) to an absolute ID (anchor). I think in either case you need to know what you're pointing to and you can point to a subtree of the existing plan. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me see if I understand the intent of this change.
Function / Type References
For things like functions, we define Plan-level anchors
substrait/proto/substrait/plan.proto
Lines 34 to 35 in d430e52
// a list of extensions this plan may depend on | |
repeated substrait.extensions.SimpleExtensionDeclaration extensions = 2; |
substrait/proto/substrait/extensions/extensions.proto
Lines 57 to 59 in d430e52
message ExtensionFunction { | |
// references the extension_uri_anchor defined for a specific extension URI. | |
uint32 extension_uri_reference = 1; |
which are then used within the plan
substrait/proto/substrait/algebra.proto
Lines 1178 to 1183 in d430e52
// A scalar function call. | |
message ScalarFunction { | |
// Points to a function_anchor defined in this plan, which must refer | |
// to a scalar function in the associated YAML file. Required; avoid | |
// using anchor/reference zero. | |
uint32 function_reference = 1; |
This mechanism is also used for types.
Rel References
For RelReference we don't currently use an anchor, but instead an ordinal reference
substrait/proto/substrait/algebra.proto
Lines 1756 to 1760 in d430e52
// This rel is used to create references, | |
// in case we refer to a RelRoot field names will be ignored | |
message ReferenceRel { | |
int32 subtree_ordinal = 1; | |
} |
into an list of relations in the Plan
substrait/proto/substrait/plan.proto
Lines 37 to 38 in d430e52
// one or more relation trees that are associated with this plan. | |
repeated PlanRel relations = 3; |
Your Proposal
Bring references to relations into line with functions and types by attaching an anchor to the PlanRel message.
Notes
Bringing this inline with how we do references generally make sense to me. That being said:
The new anchor attribute is an improvement for cases where multiple plans are merged and at least one already contains a ReferenceRel.
How does this make merging plans easier. I assume that as with ordinal-based references you would still have to rewrite anchors when they collide between plans?
It is expected that only one of subtree_ordinal or subtree_reference will be used, however I don't see a good reason to enforce the use of only one
If we're going to do this, I think it would be better to have only 1 way to do it and we should plan to get rid of the ordinal-based resolution.
// in case we refer to a RelRoot field names will be ignored | ||
message ReferenceRel { | ||
int32 subtree_ordinal = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're going to do this, we should deprecate the old field, with the eventual plan of removing subtree_ordinal
entirely.
And if we're going to do that, I don't know how much sense it makes to put them both in a oneof. With the way you've worded the anchor on the PlanRel, 0 isn't a valid rel anchor, so while these field coexist, if you have a ReferenceRel with subtree_reference
set to 0 you know you have to use the subtree_ordinal
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I brought this PR up in the Substrait sync this week. I've updated some comments with what we talked about.
proto/substrait/algebra.proto
Outdated
// in case we refer to a RelRoot field names will be ignored | ||
message ReferenceRel { | ||
// points to a PlanRel in Plan.relations (ordinal reference) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Talking through this in the Substrait sync with @jacques-n, we agreed that keeping this out of a oneof
make sense.
For migration purposes, its actually useful to allow both to be set for a while. Effectively, we can add support for this in producers and consumers in a decoupled fashion, and once everything supports both, remove the old field.
As part of this, we would need to document that the new relation anchor should be preferred when both are set.
@vbarua I'm not opposed to |
I think more context about the interaction with I commented on #794 (moved my comment from here). |
This adds an "anchor" to `PlanRel` that can be referenced from a `ReferenceRel`. This anchor and reference relationship provides a non-ordinal method for identifying and accessing a "subtree" or sub-graph. This commit leaves in the original `subtree_ordinal` attribute since it seems a (mildly) more performant method for referencing a subtree, but also since it is still relevant in the typical case. The new anchor improves cases where multiple plans are merged and at least one already contains a `ReferenceRel`. It is expected that only one of `subtree_ordinal` or `subtree_reference` will be used, however I don't see a good reason to enforce the use of only one, so I did not group the attributes in a `oneof` constraint. Issue: substrait-io#725
Created a oneof field, "ref_type," (should expand to `ref_type_case` for generated code) that captures both types of references. The old type, subtree_ordinal, has also been deprecated. This addresses feedback in the PR tagged below. PR: substrait-io#726
Added comment to clarify that implicit presence is preferred for `subtree_anchor`. Explicit presence would only add the capability of using 0 as an anchor, which does not seem to be worthwhile.
This accommodates 2 primary changes: 1. change the name of the reference and anchor fields 2. simplify the nesting of the old and new fields into a `oneof` to a simple addition of the anchor field The first change is to be more descriptive: `planrel_reference` makes it clear what the anchor points to, `reference_anchor` can be a reusable field name that specifies it is an anchor designed to be used by a `ReferenceRel` ("referencerel_anchor" seems unnecessarily annoying to read). The second change is to make migration more natural by deprecating the existing field and adding the new field in a way that is backwards compatible. PR: 726
I accommodated feedback from @vbarua and rebased on main |
@tokoko @vbarua as long as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change looks reasonable to me as is.
We still need reviews from other PMCs for this: @jacques-n @westonpace @EpsilonPrime @cpcloud
When we merge this, we should make sure the commit message is accurate. This is no longer a breaking change.
This adds an "anchor" to
PlanRel
that can be referenced from aReferenceRel
. This anchor and reference relationship provides a non-ordinal method for identifying and accessing a "subtree" or sub-graph. To maintain consistency withtype_anchor
andfunction_anchor
, the newsubtree_anchor
attribute isuint32
.This PR leaves in the original
subtree_ordinal
attribute since it seems a (mildly) more performant method for referencing a subtree, but also since it is still relevant in the typical case. The new anchor attribute is an improvement for cases where multiple plans are merged and at least one already contains aReferenceRel
.It is expected that only one of
subtree_ordinal
orsubtree_reference
will be used, however I don't see a good reason to enforce the use of only one, so I did not group the attributes in aoneof
constraint.If either of the above points ((1) keeping
subtree_ordinal
and (2) not enforcing aoneof
constraint) are changed, then this PR would likely become a breaking change.Also, this PR addresses #725 , but uses the name
subtree_reference
andsubtree_anchor
(instead ofplan_anchor
) to be consistent with the pre-existingsubtree_ordinal
attribute. I don't know that relations in a substrait plan are strictly trees, so I think this could be a good time to change "subtree" to "planrel" or "subgraph" or something similar.BREAKING CHANGE: The field
subtree_ordinal
is made optional with explicit presence due to being moved into a newoneof
field, "ref_type".