Skip to content

feat: add anchor for non-ordinal plan reference #726

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

drin
Copy link
Member

@drin drin commented Oct 25, 2024

This adds an "anchor" to PlanRel that can be referenced from a ReferenceRel. This anchor and reference relationship provides a non-ordinal method for identifying and accessing a "subtree" or sub-graph. To maintain consistency with type_anchor and function_anchor, the new subtree_anchor attribute is uint32.

This PR leaves in the original subtree_ordinal attribute since it seems a (mildly) more performant method for referencing a subtree, but also since it is still relevant in the typical case. The new anchor attribute is an improvement for cases where multiple plans are merged and at least one already contains a ReferenceRel.

It is expected that only one of subtree_ordinal or subtree_reference will be used, however I don't see a good reason to enforce the use of only one, so I did not group the attributes in a oneof constraint.

If either of the above points ((1) keeping subtree_ordinal and (2) not enforcing a oneof constraint) are changed, then this PR would likely become a breaking change.

Also, this PR addresses #725 , but uses the name subtree_reference and subtree_anchor (instead of plan_anchor) to be consistent with the pre-existing subtree_ordinal attribute. I don't know that relations in a substrait plan are strictly trees, so I think this could be a good time to change "subtree" to "planrel" or "subgraph" or something similar.

BREAKING CHANGE: The field subtree_ordinal is made optional with explicit presence due to being moved into a new oneof field, "ref_type".

@CLAassistant
Copy link

CLAassistant commented Oct 25, 2024

CLA assistant check
All committers have signed the CLA.

@tokoko
Copy link
Contributor

tokoko commented Oct 28, 2024

This PR leaves in the original subtree_ordinal attribute since it seems a (mildly) more performant method for referencing a subtree

This is technically true, but on the other hand we are already doing the same thing with function extensions when the number of functions used in a typical plan most likely will be a lot bigger than the number of subtrees. I don't think it makes sense to keep the existing behavior only due to performance considerations.

It is expected that only one of subtree_ordinal or subtree_reference will be used, however I don't see a good reason to enforce the use of only one, so I did not group the attributes in a oneof constraint.

I think leaving both options especially w/o oneof will lead to a lot more confusion and possible divergence in implementations. My choice would be to deprecate subtree_ordinal and eventually get rid of it, but oneof is also a viable option.

If either of the above points ((1) keeping subtree_ordinal and (2) not enforcing a oneof constraint) are changed, then this PR would likely become a breaking change.

Pretty sure putting an existing field in a oneof is a backwards-compatible change.

@drin
Copy link
Member Author

drin commented Nov 20, 2024

I should have time to finish this up in the next few days

// in case we refer to a RelRoot field names will be ignored
message ReferenceRel {
// points to a PlanRel in Plan.relations (ordinal reference)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's put these in on a one of (subtree_reference and subtree_ordinal) and deprecate subtree ordinal.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I named the oneof ref_type since that seemed descriptive enough without being too verbose.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jacques-n is there a reason to both deprecate the existing field AND put both of the deprecaetd and new field in a oneof. It feels noisy to create a oneof if the eventual plan is to remove one of the fields, and hence the need for the oneof.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talking through this in the Substrait sync with @jacques-n, we agreed that keeping this out of a oneof make sense.

For migration purposes, its actually useful to allow both to be set for a while. Effectively, we can add support for this in producers and consumers in a decoupled fashion, and once everything supports both, remove the old field.

As part of this, we would need to document that the new relation anchor should be preferred when both are set.

drin added a commit to drin/substrait that referenced this pull request Feb 26, 2025
Created a oneof field, "ref_type," (should expand to `ref_type_case` for
generated code) that captures both types of references. The old type,
subtree_ordinal, has also been deprecated.

This addresses feedback in the PR tagged below.

PR: substrait-io#726
drin added a commit to drin/substrait that referenced this pull request Feb 26, 2025
Created a oneof field, "ref_type," (should expand to `ref_type_case` for
generated code) that captures both types of references. The old type,
subtree_ordinal, has also been deprecated.

This addresses feedback in the PR tagged below.

PR: substrait-io#726
@drin drin force-pushed the feat-refrel-anchor branch from 1cd1c49 to bf76bb9 Compare February 26, 2025 17:28
@drin
Copy link
Member Author

drin commented Feb 26, 2025

rebased on main

@drin
Copy link
Member Author

drin commented Feb 26, 2025

hm, according to a stackoverflow post, moving an optional field into a oneof is not breaking. This PR essentially makes subtree_ordinal an optional field instead of an implicit field by moving it into a oneof, but that seems backwards-compatible. However, the check for breaking changes still flags it so I decided to add a BREAKING CHANGES: footer to the PR description.

@drin drin marked this pull request as ready for review February 26, 2025 18:02
@EpsilonPrime
Copy link
Member

By making the anchor on the plan object does that mean you can no longer use a ReferenceRel to point to a subtree of the existing plan? For instance, TPC-H 15 processes the same subtree twice and often ends up with different results (when using float). By using a reference you could process the subtree once and point to it avoiding differences.

@drin
Copy link
Member Author

drin commented Jun 9, 2025

By making the anchor on the plan object...

I think you may be misunderstanding where the anchor is? I don't think of the PlanRel as the Plan object

...does that mean you can no longer use a ReferenceRel to point to a subtree of the existing plan?

The change is essentially to change a relative ID (ordinal) to an absolute ID (anchor). I think in either case you need to know what you're pointing to and you can point to a subtree of the existing plan.

Copy link
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me see if I understand the intent of this change.

Function / Type References

For things like functions, we define Plan-level anchors

// a list of extensions this plan may depend on
repeated substrait.extensions.SimpleExtensionDeclaration extensions = 2;

message ExtensionFunction {
// references the extension_uri_anchor defined for a specific extension URI.
uint32 extension_uri_reference = 1;

which are then used within the plan

// A scalar function call.
message ScalarFunction {
// Points to a function_anchor defined in this plan, which must refer
// to a scalar function in the associated YAML file. Required; avoid
// using anchor/reference zero.
uint32 function_reference = 1;

This mechanism is also used for types.

Rel References

For RelReference we don't currently use an anchor, but instead an ordinal reference

// This rel is used to create references,
// in case we refer to a RelRoot field names will be ignored
message ReferenceRel {
int32 subtree_ordinal = 1;
}

into an list of relations in the Plan
// one or more relation trees that are associated with this plan.
repeated PlanRel relations = 3;

Your Proposal

Bring references to relations into line with functions and types by attaching an anchor to the PlanRel message.

Notes

Bringing this inline with how we do references generally make sense to me. That being said:

The new anchor attribute is an improvement for cases where multiple plans are merged and at least one already contains a ReferenceRel.

How does this make merging plans easier. I assume that as with ordinal-based references you would still have to rewrite anchors when they collide between plans?

It is expected that only one of subtree_ordinal or subtree_reference will be used, however I don't see a good reason to enforce the use of only one

If we're going to do this, I think it would be better to have only 1 way to do it and we should plan to get rid of the ordinal-based resolution.

// in case we refer to a RelRoot field names will be ignored
message ReferenceRel {
int32 subtree_ordinal = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to do this, we should deprecate the old field, with the eventual plan of removing subtree_ordinal entirely.

And if we're going to do that, I don't know how much sense it makes to put them both in a oneof. With the way you've worded the anchor on the PlanRel, 0 isn't a valid rel anchor, so while these field coexist, if you have a ReferenceRel with subtree_reference set to 0 you know you have to use the subtree_ordinal.

@vbarua vbarua requested a review from jacques-n June 16, 2025 20:10
Copy link
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I brought this PR up in the Substrait sync this week. I've updated some comments with what we talked about.

// in case we refer to a RelRoot field names will be ignored
message ReferenceRel {
// points to a PlanRel in Plan.relations (ordinal reference)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talking through this in the Substrait sync with @jacques-n, we agreed that keeping this out of a oneof make sense.

For migration purposes, its actually useful to allow both to be set for a while. Effectively, we can add support for this in producers and consumers in a decoupled fashion, and once everything supports both, remove the old field.

As part of this, we would need to document that the new relation anchor should be preferred when both are set.

@tokoko
Copy link
Contributor

tokoko commented Jun 18, 2025

@vbarua I'm not opposed to planrel_reference, but one slight caveat regarding naming: this field will most likely act as a reference to the list of relations in ExtendedExpression message as well (see #794). In my PR I decided to introduce new ExtendedExpressionRel message to hold Rels. If we go that route, calling the field planrel_reference doesn't make as much sense anymore. Alternatively we can reuse PlanRel message for ExtendedExpression relations, which is doable but slightly awkward, because relations in PlanRel for extended expressions will only ever be of type Rel, never RelRoot.

@drin
Copy link
Member Author

drin commented Jun 18, 2025

I think more context about the interaction with ExtendedExpressionRel would be helpful.

I commented on #794 (moved my comment from here).

@vbarua
Copy link
Member

vbarua commented Jun 18, 2025

@tokoko left a comment on your proposal in #794 with regards to including relations in ExtendedExpressions.

drin added 4 commits June 25, 2025 10:58
This adds an "anchor" to `PlanRel` that can be referenced from a
`ReferenceRel`. This anchor and reference relationship provides a
non-ordinal method for identifying and accessing a "subtree" or
sub-graph.

This commit leaves in the original `subtree_ordinal` attribute since it
seems a (mildly) more performant method for referencing a subtree, but
also since it is still relevant in the typical case. The new anchor
improves cases where multiple plans are merged and at least one
already contains a `ReferenceRel`.

It is expected that only one of `subtree_ordinal` or `subtree_reference`
will be used, however I don't see a good reason to enforce the use of
only one, so I did not group the attributes in a `oneof` constraint.

Issue: substrait-io#725
Created a oneof field, "ref_type," (should expand to `ref_type_case` for
generated code) that captures both types of references. The old type,
subtree_ordinal, has also been deprecated.

This addresses feedback in the PR tagged below.

PR: substrait-io#726
Added comment to clarify that implicit presence is preferred for
`subtree_anchor`. Explicit presence would only add the capability of
using 0 as an anchor, which does not seem to be worthwhile.
This accommodates 2 primary changes:
  1. change the name of the reference and anchor fields
  2. simplify the nesting of the old and new fields into a `oneof` to a
     simple addition of the anchor field

The first change is to be more descriptive: `planrel_reference` makes it
clear what the anchor points to, `reference_anchor` can be a reusable
field name that specifies it is an anchor designed to be used by a
`ReferenceRel` ("referencerel_anchor" seems unnecessarily annoying to
read).

The second change is to make migration more natural by deprecating the
existing field and adding the new field in a way that is backwards
compatible.

PR: 726
@drin drin force-pushed the feat-refrel-anchor branch from 5a5941a to 815fda4 Compare June 25, 2025 18:37
@drin
Copy link
Member Author

drin commented Jun 25, 2025

I accommodated feedback from @vbarua and rebased on main

@drin
Copy link
Member Author

drin commented Jun 25, 2025

@tokoko @vbarua as long as planrel_reference is a new field in ReferenceRel anyways, what if we actually nest it in a oneof, that way we can be explicit about other things the reference may point to (e.g. extendedexpr_reference) in a way that is a bit future proof (although a bit more verbose). Or, we can change it to be anchor_reference and have an enum that specifies what the expected type of message being pointed to is?

@vbarua vbarua self-requested a review July 11, 2025 00:19
Copy link
Member

@vbarua vbarua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change looks reasonable to me as is.

We still need reviews from other PMCs for this: @jacques-n @westonpace @EpsilonPrime @cpcloud

When we merge this, we should make sure the commit message is accurate. This is no longer a breaking change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants