Skip to content

Add name-mapping #212

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Dec 19, 2023
Merged

Add name-mapping #212

merged 6 commits into from
Dec 19, 2023

Conversation

Fokko
Copy link
Contributor

@Fokko Fokko commented Dec 13, 2023

All the things to (de)serialize the name-mapping, and all the neccessary visitors and such

All the things to (de)serialize the name-mapping, and
all the neccessary visitors and such
@sungwy sungwy mentioned this pull request Dec 17, 2023
3 tasks
try:
return self._field_by_id[field_id]
except KeyError as e:
raise ValueError(f"Could not find field-id: {field_id}") from e
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why raise an exception? In the Java ApplyNameMapping, fields are not updated with an ID if there is no matching mapping. It seems like returning Optional[MappedField] is the equivalent here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is considered more idiomatic (although it is still up for debate): https://devblogs.microsoft.com/python/idiomatic-python-eafp-versus-lbyl/

We have the same pattern in find_field:

def find_field(self, name_or_id: Union[str, int], case_sensitive: bool = True) -> NestedField:
"""Find a field using a field name or field ID.
Args:
name_or_id (Union[str, int]): Either a field name or a field ID.
case_sensitive (bool, optional): Whether to perform a case-sensitive lookup using a field name. Defaults to True.
Raises:
ValueError: When the value cannot be found.
Returns:
NestedField: The matched NestedField.
"""
if isinstance(name_or_id, int):
if name_or_id not in self._lazy_id_to_field:
raise ValueError(f"Could not find field with id: {name_or_id}")
return self._lazy_id_to_field[name_or_id]
if case_sensitive:
field_id = self._name_to_id.get(name_or_id)
else:
field_id = self._lazy_name_to_id_lower.get(name_or_id.lower())
if field_id is None:
raise ValueError(f"Could not find field with name {name_or_id}, case_sensitive={case_sensitive}")
return self._lazy_id_to_field[field_id]

This avoids having to do endless None-check in the code downstream. Often when you are sure that it can't be None, but you have to please the type checker.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't you just need to catch ValueError instead? It's normal for fields to not be mapped.

def _field_by_name(self) -> Dict[str, MappedField]:
return visit_name_mapping(self, _IndexByName())

def id(self, name: str) -> int:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this pulls from both NameMapping and MappedFields. The behavior is like NameMapping because it looks up fields but the method names are from MappedFields. I think this should use the find method names from NameMapping because the project typically uses find to signal that it isn't a direct lookup.

Also, there isn't a use of the lookup by ID method other than in tests, so I'd skip that. I think the only one that is used is the lookup(name: Tuple[str, ...]) version.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, missed that one since they are combined in Python!


def id(self, name: str) -> int:
try:
return self._field_by_name[name].field_id
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should behave like the methods elsewhere that accept Union[str, Tuple[str, ...]]. If the caller has a path of separate names to find, then this should join them with . before looking up the mapping.

Copy link
Contributor Author

@Fokko Fokko Dec 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of:

    def find(self, *names: str) -> MappedField:
        name = '.'.join(names)
        try:
            return self._field_by_name[name]
        except KeyError as e:
            raise ValueError(f"Could not find field with name: {name}") from e

I think this is more pythonic, it will work as find('a.b.c'), find('a', 'b', 'c') and find(*['a', 'b', 'c']).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great!

"""Visit a MappedField."""


class _IndexById(NameMappingVisitor[Dict[int, MappedField]]):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is actually needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright, I just copied what's on the Java side, but if you don't think it isn't needed, let's remove it for now.

return visitor.fields(fields, results)


def load_mapping_from_json(mapping: str) -> NameMapping:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about parse?


@visit_name_mapping.register(list)
def _(fields: List[MappedField], visitor: NameMappingVisitor[T]) -> T:
results = [visitor.field(field, visit_name_mapping(field.fields, visitor)) for field in fields]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if fields is None? Will that return None or raise an exception?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is hidden by returning [] instead of None in the _CreateMapping visitor.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is an interesting semantic discussion. When Pydantic deserializes, and it encounters that the fields field is missing, it will automatically inject an empty list:

fields: List[MappedField] = Field(default_factory=list)

When serializing to JSON, it will omit the fields key in the JSÓN object.

This way we can just always assume that there is a list as it is annotated, and this is also why the type-checker isn't complaining. So it boils down to the definition: Is there a difference in meaning when fields is set to None or []?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also added tests to illustrate the (de)serialization and also capture this behavior.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As long as it works, I'm good with this.

]

def primitive(self, primitive: PrimitiveType) -> List[MappedField]:
return []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be None?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then all the signatures should change into Optional[List[MappedField]] which isn't nice. I've commented above: #212 (comment)

fields: List[MappedField] = Field(default_factory=list)

@model_serializer
def ser_model(self) -> Dict[str, Any]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed if fields is Optional[List[MappedField]] instead? It seems more accurate to me to use None for nested mappings for primitive types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No this is the ser, it will leave out the fields entirely if there are no nested fields.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I meant was: is this method still needed if fields is None instead of an empty list?

Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great overall. I have a few minor comments about API and whether to use [] or None when there are no nested fields.

def _field_by_name(self) -> Dict[str, MappedField]:
return visit_name_mapping(self, _IndexByName())

def find(self, *names: str) -> MappedField:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: Do we need to check that names is not empty?

Copy link
Contributor

@rdblue rdblue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks correct to me. My only issue is that find will throw a ValueError when a name isn't found, which I think is a bit awkward compared to handling None. That's minor though and will get worked out when we go to write ApplyNameMapping code.

And it's arguably more pythonic to raise exceptions for control flow. 😄

@rdblue rdblue merged commit dcc3d9f into apache:main Dec 19, 2023
sungwy pushed a commit to sungwy/iceberg-python that referenced this pull request Dec 19, 2023
@Fokko Fokko deleted the fd-name-mapping branch December 19, 2023 08:32
sungwy pushed a commit to sungwy/iceberg-python that referenced this pull request Jan 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants