@TomAugspurger

This updates to_geodataframe to optionally use pyarrow types, rather than NumPy. These types let us faithfully represent the actual nested types, rather than casting everything to object. I think this will be a good default in the future. For now, it's just optional.

This also changes some of the output values, specifically how optional fields are stored.

If the source STAC documents contain values like

            {
                "a": {
                    "href": "a.tif"
                },
                "b": {
                    "href": "b.tif",
                    "title": "B",
                }
            }

the new output will have a struct type with two fields, href and title. The value of a.title will be None, rather than the key being absent.

@kylebarron

Awesome! Excited to see this!

    for k, v in items2.items():
        if k in DATETIME_COLUMNS:
            items2[k] = pd.arrays.ArrowExtensionArray(
                pa.array(pd.to_datetime(v, format="ISO8601"))

I want the output here to be identical to what we're getting in #27.

Right now, the datetime columns from this PR end up with nanosecond precision, while Kyle's PR has microsecond precision. I'm not sure there's a single correct default, but the two should produce the same result.

Rather than guessing, I've made this a parameter for to_geodataframe. The default is ns, which is compatible with what pandas was doing previously for NumPy dtypes.

We're actually still relying on pandas' to_datetime for parsing strings into timestamps, before casting to Arrow. Apparently pyarrow's pc.strptime doesn't support fractional seconds yet: apache/arrow#20146
