-
Notifications
You must be signed in to change notification settings - Fork 280
feat: re-export name mapping #1116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: re-export name mapping #1116
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jdockerty for this pr, generally LGTM! Left some comments to improve.
a9ff2b9
to
85b024e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jdockerty for this pr, generally LGTM! Left some comments to improve.
/// Utility mapping which contains field names to IDs and | ||
/// field IDs to the underlying [`MappedField`]. | ||
#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone)] | ||
pub struct MappedFields { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given we are going to add a lot of visitors for NameMapping
, how about we create a NameMapping
module, and puts everything related there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not quite sure what you mean, there is already a name_mapping.rs
as a separate module. Or do you mean include everything in this file instead and use 👇 ?
mod name_mapping {
// contents of name_mapping.rs here
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes mapped field should also be included in name_mapping.rs. You can move everything from mapped_fields
to name_mapping
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't understand why we need to have the MappedFields
field. In the pyiceberg implementation it doesnt use it (the initial implementation/review is here apache/iceberg-python#212). There isn't a usecase (unless there is) where we just index the first layer for MappedFields
without having to later create another index based on a full traversal. cc @liurenjie1024 @Fokko
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH I haven't come up with a case where Vec<MappedField>
doesn't work well. I'm trying to keep things as consistent as possible with java implementation, as it's currently the refrence implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not quite sure what you mean, there is already a name_mapping.rs as a separate module. Or do you mean include everything in this file instead and use 👇 ?
Oh sorry for misclarification. I mean we could create following file structures:
name_mapping/
mod.rs
visitors.rs
In rust, name_maping.rs
is a module called name_mapping
, and a dir name_maping
+ mod.rs
is also a module called name_maping
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In PyIceberg we've have a visitor to traverse the name-mapping and schema at the same time: https://github.com/apache/iceberg-python/blob/main/pyiceberg/table/name_mapping.py#L251-L372
This is because when looking up by column name, the .
could make it ambiguous to determine if it is a nested field, or if the field has a dot in it:
table {
foo: struct<bar: str>
foo.bar: str
}
/// Iceberg fallback field name to ID mapping. | ||
#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone)] | ||
#[serde(transparent)] | ||
pub struct NameMapping { | ||
pub root: Vec<MappedField>, | ||
root: Vec<MappedField>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should contains a MappedFields
#[serde(default)] | ||
#[serde(skip_serializing_if = "Vec::is_empty")] | ||
#[serde_as(deserialize_as = "DefaultOnNull")] | ||
pub fields: Vec<MappedField>, | ||
fields: Vec<MappedField>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be a MappedFields
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By changing this one to MappedFields
, this alters all of the expected output JSON too.
I assume that is expected for now and I'll update the other tests 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't change json output since it's iceberg spec. I think we could use #serde[transparent]
to do that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liurenjie1024 I've been taking another look at this again, although including #[serde(transparent)]
onto the structure doesn't seem to produce the right value as it isn't a struct that contains only an internal Vec
, since MappedFields
contains other fields/state too.
Do you mind elaborating more on this? As I don't quite understand how to stop the field being serialised incorrectly when this is altered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems #[serde(transparent)]
doesn't work here. I think then maybe we should remove MappedFields
struct and put everything in NamedMapping
, just like what we do in pyiceberg. This maybe be harmful as this interface is not critical. Another approach is to have a standalone module for ser/de, just like what we did in other core data structures like TableMetadata But this may not deserve such complexity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify here, do you mean pull the information that used to be in the MappedFields
struct, into the NameMapping
instead?
E.g. the id_to_field
etc?
It looks like that exists in pyiceberg via the _IndexByName
class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify here, do you mean pull the information that used to be in the MappedFields struct, into the NameMapping instead?
Exactly.
@jdockerty @liurenjie1024 I believe #1072 contains a lot of the functionality in this pr, this got split into #1082 being the first part of it. |
Hi, @jonathanc-n I think #1072 added extra functionality into like visitor, indexing into |
#[serde(default)] | ||
#[serde(skip_serializing_if = "Vec::is_empty")] | ||
#[serde_as(deserialize_as = "DefaultOnNull")] | ||
pub fields: Vec<MappedField>, | ||
fields: Vec<MappedField>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't change json output since it's iceberg spec. I think we could use #serde[transparent]
to do that?
/// Utility mapping which contains field names to IDs and | ||
/// field IDs to the underlying [`MappedField`]. | ||
#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone)] | ||
pub struct MappedFields { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not quite sure what you mean, there is already a name_mapping.rs as a separate module. Or do you mean include everything in this file instead and use 👇 ?
Oh sorry for misclarification. I mean we could create following file structures:
name_mapping/
mod.rs
visitors.rs
In rust, name_maping.rs
is a module called name_mapping
, and a dir name_maping
+ mod.rs
is also a module called name_maping
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we put this in the name_mapping dir?🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I believe this was mentioned above as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems this should be removed, see #1116 (comment)
cc @jdockerty Are you still working on this? This is a blocker of 0.5.0 release |
@liurenjie1024 I am! Sorry I should've mentioned, I've been finalising a work project but I can come back to this likely Tuesday morning. Does that work or is there extra priority required as a release blocker? |
Hi, @jdockerty Thanks for contributing. No worry, we can wait for a while. |
@liurenjie1024 Can you take another look at this please? I've removed the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jdockerty for this pr, left some suggestions to improve.
#[serde(skip)] | ||
name_to_id: HashMap<String, i32>, | ||
#[serde(skip)] | ||
id_to_field: HashMap<i32, MappedField>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
id_to_field: HashMap<i32, MappedField>, | |
id_to_field: HashMap<i32, Arc<MappedField>>, |
MappedField
maybe deeply nested and expensive to copy.
let mut name_to_id = HashMap::new(); | ||
let mut id_to_field = HashMap::new(); | ||
|
||
for field in &fields { | ||
if let Some(id) = field.field_id() { | ||
if id_to_field.contains_key(&id) { | ||
return Err(Error::new( | ||
crate::ErrorKind::DataInvalid, | ||
format!("duplicate id '{id}' is not allowed"), | ||
)); | ||
} | ||
|
||
id_to_field.insert(id, field.clone()); | ||
for name in field.names() { | ||
if name_to_id.contains_key(name) { | ||
return Err(Error::new( | ||
crate::ErrorKind::DataInvalid, | ||
format!("duplicate name '{name}' is not allowed"), | ||
)); | ||
} | ||
name_to_id.insert(name.to_string(), id); | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This implementation is incorrect, id
and name
in top level should be global, and we should build it using a visitor. I would suggest to remove these two fields in this pr, and add them back later using visitor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove these two fields in this pr
By removing these fields, doesn't that also mean the Arc
suggestion above is not to be implemented as well?
id_to_field: HashMap<i32, Arc<MappedField>> // <-- no need to Arc here, this field is to be removed now.
These fields could be deeply nested, so cloning them through an Arc makes this far cheaper
This should be replaced by a visitor pattern in a separate PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jdockerty for this pr, LGTM!
Which issue does this PR close?
Likely helps towards #919 and this was also discussed in Slack.
What changes are included in this PR?
This publicly re-exports the
name_mapping
module toiceberg::spec
. Prior to this, it is private and inaccessible outside of this crate.Are these changes tested?
The main changes here are not functional changes except to visibility.
The additional fields to the
NameMapping
structure have test coverage and are covered by prior tests, it is expected that these internal fields do not show up in the serialisedNameMapping
output, hence they are marked with theserde(skip)
attribute - this is upheld with prior tests.