Skip to content

feat: re-export name mapping #1116

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Apr 18, 2025

Conversation

jdockerty
Copy link
Contributor

@jdockerty jdockerty commented Mar 19, 2025

Which issue does this PR close?

Likely helps towards #919 and this was also discussed in Slack.

What changes are included in this PR?

This publicly re-exports the name_mapping module to iceberg::spec. Prior to this, it is private and inaccessible outside of this crate.

Are these changes tested?

The main changes here are not functional changes except to visibility.

The additional fields to the NameMapping structure have test coverage and are covered by prior tests, it is expected that these internal fields do not show up in the serialised NameMapping output, hence they are marked with the serde(skip) attribute - this is upheld with prior tests.

@jdockerty jdockerty marked this pull request as ready for review March 19, 2025 13:34
@jdockerty jdockerty changed the title chore: re-export name mapping feat: re-export name mapping Mar 19, 2025
Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jdockerty for this pr, generally LGTM! Left some comments to improve.

@jdockerty jdockerty force-pushed the chore/expose-name-mapping branch from a9ff2b9 to 85b024e Compare March 20, 2025 12:24
Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jdockerty for this pr, generally LGTM! Left some comments to improve.

/// Utility mapping which contains field names to IDs and
/// field IDs to the underlying [`MappedField`].
#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone)]
pub struct MappedFields {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given we are going to add a lot of visitors for NameMapping, how about we create a NameMapping module, and puts everything related there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite sure what you mean, there is already a name_mapping.rs as a separate module. Or do you mean include everything in this file instead and use 👇 ?

mod name_mapping {
 // contents of name_mapping.rs here
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes mapped field should also be included in name_mapping.rs. You can move everything from mapped_fields to name_mapping

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't understand why we need to have the MappedFields field. In the pyiceberg implementation it doesnt use it (the initial implementation/review is here apache/iceberg-python#212). There isn't a usecase (unless there is) where we just index the first layer for MappedFields without having to later create another index based on a full traversal. cc @liurenjie1024 @Fokko

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH I haven't come up with a case where Vec<MappedField> doesn't work well. I'm trying to keep things as consistent as possible with java implementation, as it's currently the refrence implementation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite sure what you mean, there is already a name_mapping.rs as a separate module. Or do you mean include everything in this file instead and use 👇 ?

Oh sorry for misclarification. I mean we could create following file structures:

name_mapping/
   mod.rs
   visitors.rs

In rust, name_maping.rs is a module called name_mapping, and a dir name_maping + mod.rs is also a module called name_maping.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In PyIceberg we've have a visitor to traverse the name-mapping and schema at the same time: https://github.com/apache/iceberg-python/blob/main/pyiceberg/table/name_mapping.py#L251-L372

This is because when looking up by column name, the . could make it ambiguous to determine if it is a nested field, or if the field has a dot in it:

table {
  foo: struct<bar: str>
  foo.bar: str
}

/// Iceberg fallback field name to ID mapping.
#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone)]
#[serde(transparent)]
pub struct NameMapping {
pub root: Vec<MappedField>,
root: Vec<MappedField>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should contains a MappedFields

#[serde(default)]
#[serde(skip_serializing_if = "Vec::is_empty")]
#[serde_as(deserialize_as = "DefaultOnNull")]
pub fields: Vec<MappedField>,
fields: Vec<MappedField>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a MappedFields.

Copy link
Contributor Author

@jdockerty jdockerty Mar 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By changing this one to MappedFields, this alters all of the expected output JSON too.

I assume that is expected for now and I'll update the other tests 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't change json output since it's iceberg spec. I think we could use #serde[transparent] to do that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liurenjie1024 I've been taking another look at this again, although including #[serde(transparent)] onto the structure doesn't seem to produce the right value as it isn't a struct that contains only an internal Vec, since MappedFields contains other fields/state too.

Do you mind elaborating more on this? As I don't quite understand how to stop the field being serialised incorrectly when this is altered.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems #[serde(transparent)] doesn't work here. I think then maybe we should remove MappedFields struct and put everything in NamedMapping, just like what we do in pyiceberg. This maybe be harmful as this interface is not critical. Another approach is to have a standalone module for ser/de, just like what we did in other core data structures like TableMetadata But this may not deserve such complexity.

Copy link
Contributor Author

@jdockerty jdockerty Apr 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify here, do you mean pull the information that used to be in the MappedFields struct, into the NameMapping instead?

E.g. the id_to_field etc?

It looks like that exists in pyiceberg via the _IndexByName class

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify here, do you mean pull the information that used to be in the MappedFields struct, into the NameMapping instead?

Exactly.

@jonathanc-n
Copy link
Contributor

jonathanc-n commented Mar 21, 2025

@jdockerty @liurenjie1024 I believe #1072 contains a lot of the functionality in this pr, this got split into #1082 being the first part of it.

@liurenjie1024
Copy link
Contributor

@jdockerty @liurenjie1024 I believe #1072 contains a lot of the functionality in this pr, this got split into #1082 being the first part of it.

Hi, @jonathanc-n I think #1072 added extra functionality into like visitor, indexing into NameMapping. I think this pr should happen before #1072 since it's a refactoring before exposing it to public.

#[serde(default)]
#[serde(skip_serializing_if = "Vec::is_empty")]
#[serde_as(deserialize_as = "DefaultOnNull")]
pub fields: Vec<MappedField>,
fields: Vec<MappedField>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't change json output since it's iceberg spec. I think we could use #serde[transparent] to do that?

/// Utility mapping which contains field names to IDs and
/// field IDs to the underlying [`MappedField`].
#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone)]
pub struct MappedFields {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite sure what you mean, there is already a name_mapping.rs as a separate module. Or do you mean include everything in this file instead and use 👇 ?

Oh sorry for misclarification. I mean we could create following file structures:

name_mapping/
   mod.rs
   visitors.rs

In rust, name_maping.rs is a module called name_mapping, and a dir name_maping + mod.rs is also a module called name_maping.

@liurenjie1024 liurenjie1024 added this to the 0.5.0 Release milestone Apr 1, 2025
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we put this in the name_mapping dir?🤔

Copy link
Contributor

@jonathanc-n jonathanc-n Apr 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I believe this was mentioned above as well

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems this should be removed, see #1116 (comment)

@liurenjie1024
Copy link
Contributor

cc @jdockerty Are you still working on this? This is a blocker of 0.5.0 release

@jdockerty
Copy link
Contributor Author

@liurenjie1024 I am! Sorry I should've mentioned, I've been finalising a work project but I can come back to this likely Tuesday morning.

Does that work or is there extra priority required as a release blocker?

@liurenjie1024
Copy link
Contributor

@liurenjie1024 I am! Sorry I should've mentioned, I've been finalising a work project but I can come back to this likely Tuesday morning.

Does that work or is there extra priority required as a release blocker?

Hi, @jdockerty Thanks for contributing. No worry, we can wait for a while.

@jdockerty
Copy link
Contributor Author

@liurenjie1024 Can you take another look at this please?

I've removed the mapped_fields.rs in favour of introducing internal fields to NameMapping, similar to what was mentioned here and here 🙇

Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jdockerty for this pr, left some suggestions to improve.

#[serde(skip)]
name_to_id: HashMap<String, i32>,
#[serde(skip)]
id_to_field: HashMap<i32, MappedField>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
id_to_field: HashMap<i32, MappedField>,
id_to_field: HashMap<i32, Arc<MappedField>>,

MappedField maybe deeply nested and expensive to copy.

Comment on lines 44 to 67
let mut name_to_id = HashMap::new();
let mut id_to_field = HashMap::new();

for field in &fields {
if let Some(id) = field.field_id() {
if id_to_field.contains_key(&id) {
return Err(Error::new(
crate::ErrorKind::DataInvalid,
format!("duplicate id '{id}' is not allowed"),
));
}

id_to_field.insert(id, field.clone());
for name in field.names() {
if name_to_id.contains_key(name) {
return Err(Error::new(
crate::ErrorKind::DataInvalid,
format!("duplicate name '{name}' is not allowed"),
));
}
name_to_id.insert(name.to_string(), id);
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This implementation is incorrect, id and name in top level should be global, and we should build it using a visitor. I would suggest to remove these two fields in this pr, and add them back later using visitor.

Copy link
Contributor Author

@jdockerty jdockerty Apr 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove these two fields in this pr

By removing these fields, doesn't that also mean the Arc suggestion above is not to be implemented as well?

id_to_field: HashMap<i32, Arc<MappedField>> // <-- no need to Arc here, this field is to be removed now.

These fields could be deeply nested, so cloning them through an Arc makes this far cheaper
This should be replaced by a visitor pattern in a separate PR
Copy link
Contributor

@liurenjie1024 liurenjie1024 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jdockerty for this pr, LGTM!

@liurenjie1024 liurenjie1024 merged commit 609b792 into apache:main Apr 18, 2025
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants