Skip to content

feat: restrict the usage of any in return types #798

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

vbarua
Copy link
Member

@vbarua vbarua commented Mar 27, 2025

any must not be used as the return type of a function

@vbarua
Copy link
Member Author

vbarua commented Mar 27, 2025

Was having a discussion with some folks about any type restrictions, and we realized that we probably shouldn't allow function signatures like foo() -> any or bar(i64) -> any because it's impossible to bind the return any to a concrete type.

@Blizzara
Copy link
Contributor

I think any return type might be needed for things like (make_)struct or get_struct_field? Unless there's a way to write those through the type math.. get_struct_field could also be rewritten as a selector maybe.

These are cases we have internally (actually the first one isn't correct since I guess it means "a struct of any one child field" while in reality there can be 1 or more, so maybe it should just be return: any..)

  - name: struct
    impls:
      - args:
          - value: any
        variadic:
          min: 1
        return: struct<any>

  - name: get_struct_field
    impls:
      - args:
          - name: x
            value: any
          - name: ordinal
            value: i32
        return: any

@EpsilonPrime
Copy link
Member

We do want the type to be known so that optimization can be performed on the plan (and that the final types can be computed) so this makes sense. I suppose if you want to use functionality like get_struct_field you either need some better computation from the type system (being able to extract or construct types from the input types) or have a variant type that is handled externally (like the geo opaque type or variants in Snowflake).

@vbarua
Copy link
Member Author

vbarua commented Mar 28, 2025

For get_struct_field, I think the way to encode that today is using ReferenceSegments as part of a FieldReference.

For make_struct, that might actually be beyond the type system at this point. As you point out

  - name: struct
    impls:
      - args:
          - value: any
        variadic:
          min: 1
        return: struct<any>

technically returns a struct with a single any field.

We don't have a way to write a function that takes X arguments, and outputs a struct of those X arguments. We could possibly finagle the type-system to figure out how to encode that, but that seems quite painful. Another approach would be to be to declare a new Expression that captures "make_struct".

@vbarua
Copy link
Member Author

vbarua commented Apr 7, 2025

I think any return type might be needed for things like (make_)struct

It turns there's already an expression construct for this in the protobuf spec.

// Expression to dynamically construct nested types.
message Nested {
// Whether the returned nested type is nullable.
bool nullable = 1;
// Optionally points to a type_variation_anchor defined in this plan for
// the returned nested type.
uint32 type_variation_reference = 2;
oneof nested_type {
Struct struct = 3;
List list = 4;
Map map = 5;
}
message Map {
message KeyValue {
// Mandatory key/value expressions.
Expression key = 1;
Expression value = 2;
}
// One or more key-value pairs. To specify an empty map, use
// Literal.empty_map (otherwise type information would be missing).
repeated KeyValue key_values = 1;
}
message Struct {
// Zero or more possibly heterogeneously-typed list of expressions that
// form the struct fields.
repeated Expression fields = 1;
}
message List {
// A homogeneously-typed list of one or more expressions that form the
// list entries. To specify an empty list, use Literal.empty_list
// (otherwise type information would be missing).
repeated Expression values = 1;
}
}

any must not be used as the return type of a function
@vbarua vbarua force-pushed the vbarua/restrict-any-return-type branch from 7ebcb1c to 31a9a37 Compare May 7, 2025 07:34
@Blizzara
Copy link
Contributor

Blizzara commented May 7, 2025

I think any return type might be needed for things like (make_)struct

It turns there's already an expression construct for this in the protobuf spec.

// Expression to dynamically construct nested types.
message Nested {
// Whether the returned nested type is nullable.
bool nullable = 1;
// Optionally points to a type_variation_anchor defined in this plan for
// the returned nested type.
uint32 type_variation_reference = 2;
oneof nested_type {
Struct struct = 3;
List list = 4;
Map map = 5;
}
message Map {
message KeyValue {
// Mandatory key/value expressions.
Expression key = 1;
Expression value = 2;
}
// One or more key-value pairs. To specify an empty map, use
// Literal.empty_map (otherwise type information would be missing).
repeated KeyValue key_values = 1;
}
message Struct {
// Zero or more possibly heterogeneously-typed list of expressions that
// form the struct fields.
repeated Expression fields = 1;
}
message List {
// A homogeneously-typed list of one or more expressions that form the
// list entries. To specify an empty list, use Literal.empty_list
// (otherwise type information would be missing).
repeated Expression values = 1;
}
}

Interesting, I don't think either substrait-spark or DF uses that, but maybe someday they will 😅

Another case where we found ourselves using any return type is things like from_avro. I guess one option there is to have an extra any1 arg that contains just the output type, and then use that as the return type, however I'm not sure what benefit that gives over any return type?

@vbarua
Copy link
Member Author

vbarua commented May 19, 2025

Another case where we found ourselves using any return type is things like from_avro.

Interesting. Is the function defined with an any return type, and then a specific invocation uses the

jsonFormatSchema: str
    the avro schema in JSON string format.

parameter, as seen in as seen in from_avro, to set the concrete return type in the invocation?

@Blizzara
Copy link
Contributor

Yes:

  - name: from_avro
    description: FromAvro
    impls:
      - args:
          - value: binary?
            name: input
          - value: string
            name: avro_schema
        return: any?

@Blizzara
Copy link
Contributor

I did notice there's an output_type on at least ScalarFunction:

Type output_type = 3;

Which means that on any actual Substrait plan, the output type of the function is explicitly and uniquely defined. What'd then be the benefit of restricting the function's return type? I think that means things like:

We do want the type to be known so that optimization can be performed on the plan (and that the final types can be computed) so this makes sense.

are already doable.

@vbarua
Copy link
Member Author

vbarua commented Jun 18, 2025

I brought this up in the sync today and filed #820 related to it.

The general consensus was that we should avoid unbound any outputs, but that functions like from_avro are use cases that we should support as well.

Which means that on any actual Substrait plan, the output type of the function is explicitly and uniquely defined. What'd then be the benefit of restricting the function's return type?

Part of this is that we want the function definitions themselves to be self-contained outside of a plan context. any as a return type is weird at like type-system, and functions like foo() -> any are somewhat nonsense when defined and impossible to implement. You can't really reason about what a function like foo() -> any does, because it can literally return anything.

from_avro isn't quite this, because you are actually constraining the output type via the avro_schema string, we just don't have a way to represent that easily. To your point about the output type in the invocation, the producer is already able to derive the type from the given schema to include it there.

guess one option there is to have an extra any1 arg that contains just the output type, and then use that as the return type,

That would be one way to approach the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants