Skip to content

API: Timestamp(pydatetime) microsecond reso#49034

Merged
mroeschke merged 41 commits into
pandas-dev:mainfrom
jbrockmendel:nano-tstamp-pydatetime
Nov 30, 2022
Merged

API: Timestamp(pydatetime) microsecond reso#49034
mroeschke merged 41 commits into
pandas-dev:mainfrom
jbrockmendel:nano-tstamp-pydatetime

Conversation

@jbrockmendel

Copy link
Copy Markdown
Member
  • closes #xxxx (Replace xxxx with the Github issue number)
  • Tests added and passed if fixing a bug or adding a new feature
  • All code checks passed.
  • Added type annotations to new arguments/methods/functions.
  • Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

I've got one json test still failing locally cc @WillAyd see the comment in the ujson file, suggestions?

@WillAyd

WillAyd commented Oct 11, 2022

Copy link
Copy Markdown
Member

I think the get_long_attr function currently in the JSON implicitly expects nanosecond resolution, which would be incorrect and mangles the subsequent functions that convert to the unit of the encoder. I think the easiest approach would be to update the get_long_attr function to read the unit of the object and convert to nanosecond as required

@WillAyd

WillAyd commented Oct 11, 2022

Copy link
Copy Markdown
Member

Also wanted to add...this is awesome. Very impressive work

@mroeschke mroeschke added Non-Nano datetime64/timedelta64 with non-nanosecond resolution Timestamp pd.Timestamp and associated methods labels Oct 11, 2022
@jbrockmendel

Copy link
Copy Markdown
Member Author

update the get_long_attr function

I tried adding

PyObject *nano_obj = PyObject_CallMethod(o, "_as_unit", "(s)", "ns");

and am getting a segfault. is there a better way to do this?

@WillAyd

WillAyd commented Oct 11, 2022

Copy link
Copy Markdown
Member

I think you want

PyObject *nano_obj = PyObject_CallMethod(o, "_as_unit", "s", "ns");

The (s) would mean you are building a tuple.

If you get stuck from there I would suggest running python with gdb attached to it via gdb python then doing something like run -m pytest --lf (or run whatever script gives you the segfault)

Assuming you built the extensions with debugging symbols enabled your debugger will drop you right at the point where the segfault occurs and let you inspect the state of things. If the above doesn't do it alone, my guess is there are other types of objects that we deal with than just timestamps in that function. Ideally we would do a type check...but we aren't at the moment, so dealing with anything but a Timestamp and trying to invoke that method could also cause a segfault

Lastly you might find this useful:

https://pandas.pydata.org/docs/dev/development/debugging_extensions.html#improve-debugger-printing

Good luck. If you are still having trouble ping me again and can look deeper

@jbrockmendel

Copy link
Copy Markdown
Member Author

I ended up effectively re-implementing _as_unit. No more segfault, but im just getting wrong answers.

static npy_int64 get_long_attr(PyObject *o, const char *attr) {
    // NB we are implicitly assuming that o is a Timedelta or Timestamp, or NaT

    npy_int64 long_val;
    PyObject *value = PyObject_GetAttrString(o, attr);
    long_val =
        (PyLong_Check(value) ? PyLong_AsLongLong(value) : PyLong_AsLong(value));

    Py_DECREF(value);

    if (long_val == NPY_MIN_INT64) {
        // i.e. o is NaT
        return long_val;
    }

    // ensure we are in nanoseconds, similar to Timestamp._as_reso or _as_unit
    NPY_DATETIMEUNIT reso = (NPY_DATETIMEUNIT)PyObject_GetAttrString(o, "_reso");

    if (reso == NPY_FR_us) {
        long_val = long_val * 1000L;
    } else if (reso == NPY_FR_ms) {
        long_val = long_val * 1000000L;
    } else if (reso == NPY_FR_s) {
        long_val = long_val * 1000000000L;
    }

    return long_val;
}

The test case thats failing should go through the NPY_FR_us branch, but apparently none of these conditions are evaluating as truthy.

@WillAyd

WillAyd commented Oct 11, 2022

Copy link
Copy Markdown
Member

Ah I gotcha. The problem there is you essentially are getting back a PyObject - casting it to an NPY_DATETIMEUNIT doesn't change anything about the object itself or the bytes that comprise it; in this case you would just errantly be suppressing warnings with the cast

If you want the primitive integer value for comparison against the NPY_DATETIMEUNIT enum you'd likely want to do something like:

if (!PyLong_Check(reso)) {
  // TODO: we should have error handling here, but one step at a time...
}

long cReso = PyLong_AsLong(reso);
if (cReso == -1 && PyErr_Occurred()) {
  // TODO: we should have error handling here, but one step at a time...
}

And then use cReso for your comparisons

Also be sure to Py_DECREF(reso) when you are done with it

@WillAyd

WillAyd commented Oct 12, 2022

Copy link
Copy Markdown
Member

@jbrockmendel just to drive home the issue with the cast, I think it is helpful to think about the bytes that are getting moved around. When comparing against the NPY_DATETIMEUNIT enum to see if something is in milliseconds, you are essentially comparisong against the integer 8. Assuming a 32 bit integer on a little endian platform, the binary structure of that is 00000000 00000000 00000000 00001000.

By contrast, a PyObject is a struct that will occupy more memory than 32 bits and have a totally different binary representation. It's not important what exactly that is for now, but let's assume that it's 00110011 00110011 00110011 00110011 00110011 00110011 00110011 00110011 and so on...

When you cast the PyObject to a NPY_DATETIMEUNIT you are basically just casting it to an integer and therefore only interpreting an integer's amount of bytes. I don't know what the exact rules for that are and if its well defined, but let's assume for simplicity it just takes the first 32 bytes of the PyObject. Your previous comparisons would then just be trying to compare against 00110011 00110011 00110011 00110011, which still doesn't equal the 8 you are looking for.

@mroeschke mroeschke added this to the 2.0 milestone Oct 18, 2022
Comment thread pandas/_libs/tslibs/conversion.pyx Outdated
# Keep the converter same as PyDateTime's
ts = datetime.combine(ts, time())
return convert_datetime_to_tsobject(ts, tz)
return convert_datetime_to_tsobject(ts, tz, nanos=0, reso=NPY_FR_us) # TODO: or lower?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO I would cast dates to the lowest resolution (and document it)

@WillAyd WillAyd left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a whatsnew for this too?

if isinstance(ts, ABCTimestamp):
reso = abbrev_to_npy_unit(ts.unit) # TODO: faster way to do this?
else:
# TODO: what if user explicitly passes nanos=0?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to hit this? Maybe we should raise instead?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think it could happen with pd.Timestamp(pydatetime_obj, nanosecond=0)

@WillAyd WillAyd left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think generally lgtm. A few comments that aren't blockers

raise ValueError("'value' should be a Timestamp.")
self._check_compatible_with(value)
return value.asm8
if value is NaT:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possible I've missed this conversation but do we need to give consideration to a generic NaT type that can hold different precisions? Or are we always going to use numpy's value?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to give consideration to a generic NaT type that can hold different precisions

The closest I've seen to this has been a discussion of having a separate NaT-like for timedelta. I'm not aware of any discussion of a resolution-specific NaT.

if fill_wrap is not None:
# FIXME: if we get here with dt64/td64 we need to be sure we have
# matching resos
if fill_value.dtype.kind == "m":

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth making a re-usable function for this? I could see this useful in other areas. Something like reso_for_type

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'd like to hold off on that as out of scope

Comment thread pandas/_libs/tslibs/conversion.pyx Outdated
return convert_datetime_to_tsobject(ts, tz, nanos, reso=reso)
elif PyDate_Check(ts):
# Keep the converter same as PyDateTime's
# For date object we give the lowest supporte resolution, ie. "s"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a test where we construct a Timestamp from a datetime.date?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dedicated test added + green

Comment thread pandas/_libs/tslibs/conversion.pyx Outdated

@mroeschke mroeschke left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM can merge on green

@mroeschke mroeschke merged commit 41db572 into pandas-dev:main Nov 30, 2022
@mroeschke

Copy link
Copy Markdown
Member

Thanks @jbrockmendel

@jbrockmendel jbrockmendel deleted the nano-tstamp-pydatetime branch November 30, 2022 23:37
@jbrockmendel

Copy link
Copy Markdown
Member Author

thanks for reviews @mroeschke and @WillAyd. we're getting close to having full non-nano support!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Non-Nano datetime64/timedelta64 with non-nanosecond resolution Timestamp pd.Timestamp and associated methods

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants