Skip to content

feat: re-export name mapping #1116

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Apr 18, 2025
Merged
Show file tree
Hide file tree
Changes from 9 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
165 changes: 165 additions & 0 deletions crates/iceberg/src/spec/mapped_fields.rs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we put this in the name_mapping dir?🤔

Copy link
Contributor

@jonathanc-n jonathanc-n Apr 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I believe this was mentioned above as well

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems this should be removed, see #1116 (comment)

Original file line number Diff line number Diff line change
@@ -0,0 +1,165 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

//! Iceberg mapped fields.

use std::collections::HashMap;

use serde::{Deserialize, Serialize};

use crate::spec::MappedField;
use crate::Error;

/// Utility mapping which contains field names to IDs and
/// field IDs to the underlying [`MappedField`].
#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone)]
pub struct MappedFields {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given we are going to add a lot of visitors for NameMapping, how about we create a NameMapping module, and puts everything related there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite sure what you mean, there is already a name_mapping.rs as a separate module. Or do you mean include everything in this file instead and use 👇 ?

mod name_mapping {
 // contents of name_mapping.rs here
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes mapped field should also be included in name_mapping.rs. You can move everything from mapped_fields to name_mapping

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't understand why we need to have the MappedFields field. In the pyiceberg implementation it doesnt use it (the initial implementation/review is here apache/iceberg-python#212). There isn't a usecase (unless there is) where we just index the first layer for MappedFields without having to later create another index based on a full traversal. cc @liurenjie1024 @Fokko

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH I haven't come up with a case where Vec<MappedField> doesn't work well. I'm trying to keep things as consistent as possible with java implementation, as it's currently the refrence implementation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite sure what you mean, there is already a name_mapping.rs as a separate module. Or do you mean include everything in this file instead and use 👇 ?

Oh sorry for misclarification. I mean we could create following file structures:

name_mapping/
   mod.rs
   visitors.rs

In rust, name_maping.rs is a module called name_mapping, and a dir name_maping + mod.rs is also a module called name_maping.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In PyIceberg we've have a visitor to traverse the name-mapping and schema at the same time: https://github.com/apache/iceberg-python/blob/main/pyiceberg/table/name_mapping.py#L251-L372

This is because when looking up by column name, the . could make it ambiguous to determine if it is a nested field, or if the field has a dot in it:

table {
  foo: struct<bar: str>
  foo.bar: str
}

fields: Vec<MappedField>,
name_to_id: HashMap<String, i32>,
id_to_field: HashMap<i32, MappedField>,
}

impl MappedFields {
/// Create a new [`MappedFields`].
///
/// # Errors
///
/// This will produce an error if the input fields contain either duplicate
/// names or field IDs.
pub fn try_new(fields: Vec<MappedField>) -> Result<Self, Error> {
let mut name_to_id = HashMap::new();
let mut id_to_field = HashMap::new();

for field in &fields {
if let Some(id) = field.field_id() {
if id_to_field.contains_key(&id) {
return Err(Error::new(
crate::ErrorKind::DataInvalid,
format!("duplicate id '{id}' is not allowed"),
));
}

id_to_field.insert(id, field.clone());
for name in field.names() {
if name_to_id.contains_key(name) {
return Err(Error::new(
crate::ErrorKind::DataInvalid,
format!("duplicate name '{name}' is not allowed"),
));
}
name_to_id.insert(name.to_string(), id);
}
}
}

Ok(Self {
fields,
name_to_id,
id_to_field,
})
}

/// Get a reference to the underlying fields.
pub fn fields(&self) -> &[MappedField] {
&self.fields
}

/// Get a field, by name, returning its ID if it exists, otherwise `None`.
pub fn id(&self, field_name: String) -> Option<i32> {
self.name_to_id.get(&field_name).copied()
}

/// Get a field, by ID, returning the underlying [`MappedField`] if it exists,
/// otherwise `None`.
pub fn field(&self, id: i32) -> Option<&MappedField> {
self.id_to_field.get(&id)
}
}

#[cfg(test)]
mod test {

use pretty_assertions::assert_eq;

use super::*;
use crate::ErrorKind;

#[test]
fn mapped_fields() {
let my_fields = vec![
MappedField::new(Some(1), vec!["field_one".to_string()], Vec::new()),
MappedField::new(
Some(2),
vec!["field_two".to_string(), "field_foo".to_string()],
Vec::new(),
),
];

let mapped_fields = MappedFields::try_new(my_fields).expect("valid mapped fields for test");

assert!(
mapped_fields.field(1000).is_none(),
"Field with ID '1000' was not inserted to the collection"
);
assert_eq!(
*mapped_fields.field(1).unwrap(),
MappedField::new(Some(1), vec!["field_one".to_string()], Vec::new()),
);
assert_eq!(
*mapped_fields.field(2).unwrap(),
MappedField::new(
Some(2),
vec!["field_two".to_string(), "field_foo".to_string()],
Vec::new(),
)
);

assert!(
mapped_fields.id("not_exist".to_string()).is_none(),
"Field was not expected to exist in the collection"
);
assert_eq!(mapped_fields.id("field_two".to_string()), Some(2));
assert_eq!(
mapped_fields.id("field_foo".to_string()),
Some(2),
"Field with another name shares the same ID of '2'"
);
assert_eq!(mapped_fields.id("field_one".to_string()), Some(1));
}

#[test]
fn invalid_mapped_fields() {
let duplicate_ids = vec![
MappedField::new(Some(1), vec!["field_one".to_string()], Vec::new()),
MappedField::new(Some(1), vec!["field_two".to_string()], Vec::new()),
];

assert_eq!(
MappedFields::try_new(duplicate_ids).unwrap_err().kind(),
ErrorKind::DataInvalid
);

let duplicate_names = vec![
MappedField::new(Some(1), vec!["field_one".to_string()], Vec::new()),
MappedField::new(Some(2), vec!["field_one".to_string()], Vec::new()),
];
assert_eq!(
MappedFields::try_new(duplicate_names).unwrap_err().kind(),
ErrorKind::DataInvalid
);
}
}
3 changes: 3 additions & 0 deletions crates/iceberg/src/spec/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,7 @@
mod datatypes;
mod manifest;
mod manifest_list;
mod mapped_fields;
mod name_mapping;
mod partition;
mod schema;
Expand All @@ -37,6 +38,8 @@ mod view_version;
pub use datatypes::*;
pub use manifest::*;
pub use manifest_list::*;
pub use mapped_fields::*;
pub use name_mapping::*;
pub use partition::*;
pub use schema::*;
pub use snapshot::*;
Expand Down
49 changes: 45 additions & 4 deletions crates/iceberg/src/spec/name_mapping.rs
Original file line number Diff line number Diff line change
Expand Up @@ -20,11 +20,26 @@
use serde::{Deserialize, Serialize};
use serde_with::{serde_as, DefaultOnNull};

/// Property name for name mapping.
pub const DEFAULT_SCHEMA_NAME_MAPPING: &str = "schema.name-mapping.default";

/// Iceberg fallback field name to ID mapping.
#[derive(Debug, Serialize, Deserialize, PartialEq, Eq, Clone)]
#[serde(transparent)]
pub struct NameMapping {
pub root: Vec<MappedField>,
root: Vec<MappedField>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should contains a MappedFields

}

impl NameMapping {
/// Create a new [`NameMapping`] given a collection of mapped fields.
pub fn new(fields: Vec<MappedField>) -> NameMapping {
Self { root: fields }
}

/// Get a reference to fields which are to be mapped from name to field ID.
pub fn root(&self) -> &[MappedField] {
&self.root
}
}

/// Maps field names to IDs.
Expand All @@ -33,12 +48,38 @@ pub struct NameMapping {
#[serde(rename_all = "kebab-case")]
pub struct MappedField {
#[serde(skip_serializing_if = "Option::is_none")]
pub field_id: Option<i32>,
pub names: Vec<String>,
field_id: Option<i32>,
names: Vec<String>,
#[serde(default)]
#[serde(skip_serializing_if = "Vec::is_empty")]
#[serde_as(deserialize_as = "DefaultOnNull")]
pub fields: Vec<MappedField>,
fields: Vec<MappedField>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a MappedFields.

Copy link
Contributor Author

@jdockerty jdockerty Mar 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By changing this one to MappedFields, this alters all of the expected output JSON too.

I assume that is expected for now and I'll update the other tests 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't change json output since it's iceberg spec. I think we could use #serde[transparent] to do that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liurenjie1024 I've been taking another look at this again, although including #[serde(transparent)] onto the structure doesn't seem to produce the right value as it isn't a struct that contains only an internal Vec, since MappedFields contains other fields/state too.

Do you mind elaborating more on this? As I don't quite understand how to stop the field being serialised incorrectly when this is altered.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems #[serde(transparent)] doesn't work here. I think then maybe we should remove MappedFields struct and put everything in NamedMapping, just like what we do in pyiceberg. This maybe be harmful as this interface is not critical. Another approach is to have a standalone module for ser/de, just like what we did in other core data structures like TableMetadata But this may not deserve such complexity.

Copy link
Contributor Author

@jdockerty jdockerty Apr 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify here, do you mean pull the information that used to be in the MappedFields struct, into the NameMapping instead?

E.g. the id_to_field etc?

It looks like that exists in pyiceberg via the _IndexByName class

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify here, do you mean pull the information that used to be in the MappedFields struct, into the NameMapping instead?

Exactly.

}

impl MappedField {
/// Create a new [`MappedField`].
pub fn new(field_id: Option<i32>, names: Vec<String>, fields: Vec<MappedField>) -> Self {
Self {
field_id,
names,
fields,
}
}

/// Iceberg field ID when a field's name is present within `names`.
pub fn field_id(&self) -> Option<i32> {
self.field_id
}

/// Get a reference to names for a mapped field.
pub fn names(&self) -> &[String] {
&self.names
}

/// Get a reference to the field mapping for any child fields.
pub fn fields(&self) -> &[MappedField] {
&self.fields
}
}

#[cfg(test)]
Expand Down
Loading