-
Notifications
You must be signed in to change notification settings - Fork 322
Add name-mapping #212
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add name-mapping #212
Conversation
15bf5f5
to
a8aed26
Compare
All the things to (de)serialize the name-mapping, and all the neccessary visitors and such
a8aed26
to
0a0e829
Compare
pyiceberg/table/name_mapping.py
Outdated
try: | ||
return self._field_by_id[field_id] | ||
except KeyError as e: | ||
raise ValueError(f"Could not find field-id: {field_id}") from e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why raise an exception? In the Java ApplyNameMapping
, fields are not updated with an ID if there is no matching mapping. It seems like returning Optional[MappedField]
is the equivalent here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is considered more idiomatic (although it is still up for debate): https://devblogs.microsoft.com/python/idiomatic-python-eafp-versus-lbyl/
We have the same pattern in find_field
:
iceberg-python/pyiceberg/schema.py
Lines 177 to 203 in ba8f9eb
def find_field(self, name_or_id: Union[str, int], case_sensitive: bool = True) -> NestedField: | |
"""Find a field using a field name or field ID. | |
Args: | |
name_or_id (Union[str, int]): Either a field name or a field ID. | |
case_sensitive (bool, optional): Whether to perform a case-sensitive lookup using a field name. Defaults to True. | |
Raises: | |
ValueError: When the value cannot be found. | |
Returns: | |
NestedField: The matched NestedField. | |
""" | |
if isinstance(name_or_id, int): | |
if name_or_id not in self._lazy_id_to_field: | |
raise ValueError(f"Could not find field with id: {name_or_id}") | |
return self._lazy_id_to_field[name_or_id] | |
if case_sensitive: | |
field_id = self._name_to_id.get(name_or_id) | |
else: | |
field_id = self._lazy_name_to_id_lower.get(name_or_id.lower()) | |
if field_id is None: | |
raise ValueError(f"Could not find field with name {name_or_id}, case_sensitive={case_sensitive}") | |
return self._lazy_id_to_field[field_id] |
This avoids having to do endless None
-check in the code downstream. Often when you are sure that it can't be None
, but you have to please the type checker.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't you just need to catch ValueError
instead? It's normal for fields to not be mapped.
pyiceberg/table/name_mapping.py
Outdated
def _field_by_name(self) -> Dict[str, MappedField]: | ||
return visit_name_mapping(self, _IndexByName()) | ||
|
||
def id(self, name: str) -> int: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this pulls from both NameMapping
and MappedFields
. The behavior is like NameMapping
because it looks up fields but the method names are from MappedFields
. I think this should use the find
method names from NameMapping
because the project typically uses find
to signal that it isn't a direct lookup.
Also, there isn't a use of the lookup by ID method other than in tests, so I'd skip that. I think the only one that is used is the lookup(name: Tuple[str, ...]) version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, missed that one since they are combined in Python!
pyiceberg/table/name_mapping.py
Outdated
|
||
def id(self, name: str) -> int: | ||
try: | ||
return self._field_by_name[name].field_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should behave like the methods elsewhere that accept Union[str, Tuple[str, ...]]
. If the caller has a path of separate names to find, then this should join them with .
before looking up the mapping.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think of:
def find(self, *names: str) -> MappedField:
name = '.'.join(names)
try:
return self._field_by_name[name]
except KeyError as e:
raise ValueError(f"Could not find field with name: {name}") from e
I think this is more pythonic, it will work as find('a.b.c')
, find('a', 'b', 'c')
and find(*['a', 'b', 'c'])
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
pyiceberg/table/name_mapping.py
Outdated
"""Visit a MappedField.""" | ||
|
||
|
||
class _IndexById(NameMappingVisitor[Dict[int, MappedField]]): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is actually needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, I just copied what's on the Java side, but if you don't think it isn't needed, let's remove it for now.
pyiceberg/table/name_mapping.py
Outdated
return visitor.fields(fields, results) | ||
|
||
|
||
def load_mapping_from_json(mapping: str) -> NameMapping: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about parse
?
|
||
@visit_name_mapping.register(list) | ||
def _(fields: List[MappedField], visitor: NameMappingVisitor[T]) -> T: | ||
results = [visitor.field(field, visit_name_mapping(field.fields, visitor)) for field in fields] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens if fields
is None
? Will that return None
or raise an exception?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is hidden by returning []
instead of None
in the _CreateMapping
visitor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is an interesting semantic discussion. When Pydantic deserializes, and it encounters that the fields
field is missing, it will automatically inject an empty list:
fields: List[MappedField] = Field(default_factory=list) |
When serializing to JSON, it will omit the fields
key in the JSÓN object.
This way we can just always assume that there is a list as it is annotated, and this is also why the type-checker isn't complaining. So it boils down to the definition: Is there a difference in meaning when fields
is set to None
or []
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also added tests to illustrate the (de)serialization and also capture this behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As long as it works, I'm good with this.
] | ||
|
||
def primitive(self, primitive: PrimitiveType) -> List[MappedField]: | ||
return [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be None
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then all the signatures should change into Optional[List[MappedField]]
which isn't nice. I've commented above: #212 (comment)
fields: List[MappedField] = Field(default_factory=list) | ||
|
||
@model_serializer | ||
def ser_model(self) -> Dict[str, Any]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed if fields
is Optional[List[MappedField]]
instead? It seems more accurate to me to use None
for nested mappings for primitive types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No this is the ser
, it will leave out the fields
entirely if there are no nested fields.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I meant was: is this method still needed if fields
is None
instead of an empty list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great overall. I have a few minor comments about API and whether to use []
or None
when there are no nested fields.
def _field_by_name(self) -> Dict[str, MappedField]: | ||
return visit_name_mapping(self, _IndexByName()) | ||
|
||
def find(self, *names: str) -> MappedField: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: Do we need to check that names
is not empty?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks correct to me. My only issue is that find
will throw a ValueError
when a name isn't found, which I think is a bit awkward compared to handling None
. That's minor though and will get worked out when we go to write ApplyNameMapping
code.
And it's arguably more pythonic to raise exceptions for control flow. 😄
All the things to (de)serialize the name-mapping, and all the neccessary visitors and such