pd.DataFrame.describe percentile string precision #13104
Comments
@bla1089 can you add the example which raises. |
It seems that, by design, someone coded it to only show one decimal place. What would be an appropriate fix? Increase the number of decimal places when the input has more? First-timer here. Hope I'm not overstepping bounds... just wanted to see how I can start contributing to the project and learn about Python. |
You would have to determine the first significant decimal place that …
|
Makes sense. Thanks. |
In fact, it might be done by taking np.ediff1d(np.log10([100_percentiles])) …
|
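The ediff1d/log10 idea above can be sketched roughly as follows. This is only a minimal illustration of the suggestion, not the final pandas implementation; appending the boundary values 0 and 100 is an assumption borrowed from the later discussion, so that a value like 99.99 is not rounded into 100:

```python
import numpy as np

# Percentiles already scaled to 0-100.
pcts = np.array([0.1, 25.0, 50.0, 99.99])

# Gaps between adjacent sorted values (with 0 and 100 as extra boundaries)
# determine the coarsest rounding that still keeps every label distinct.
gaps = np.ediff1d(np.sort(np.unique(np.append(pcts, [0, 100]))))

# Position of the first significant decimal of the smallest gap:
# here the smallest gap is 100 - 99.99 = 0.01, so two decimals are needed.
prec = max(1, int(-np.floor(np.log10(gaps.min()))))
```

With the sample input above, `prec` comes out as 2, which is exactly the precision needed to keep `99.99%` distinct from `100%`.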
How about just returning a decimal string format based on the input?
|
Test it. Does it work for cases in which you have a different number of …
|
I cannot seem to reproduce this issue on my machine.
|
Srib, you are reproducing it in your output too. |
Same thing with 0.0005 and 0.001.
|
If I may add my 2 cents:
```python
import numpy as np

def prettify(percentiles):
    pcts = 100 * percentiles
    # Number of digits left of the decimal point (needed for padding):
    m = max(1, 1 + np.floor(np.log10(pcts.max())).astype(int))
    is_int_arr = (pcts.astype(int) == pcts)
    if np.all(is_int_arr):
        out = pcts.astype(int).astype(str)
        return [' ' * (m - len(i)) + i + '%' for i in out]
    # precision:
    prec = -np.floor(np.log10(              # position of the first digit of
        np.min(                             # the minimum of
            np.ediff1d(                     # differences of adjacent
                np.sort(np.unique(          # sorted unique entries of
                    np.append(pcts, [0, 100])   # pcts with extra boundary values
                ))
            )
        )
    )).astype(int)
    prec = max(1, prec)
    out = np.empty_like(pcts, dtype=object)
    out[is_int_arr] = pcts[is_int_arr].astype(int).astype(str)
    out[~is_int_arr] = pcts[~is_int_arr].round(prec).astype(str)
    return [' ' * (m - len(i)) + i + '%' if is_int
            else ' ' * (m - i.find('.')) + i + '%'
            for (is_int, i) in zip(is_int_arr, out)]
```

and it also does some padding:

```python
prettify(np.array([1e-5, 0.0001, 0.0005, 0.001, 0.1, 0.5, 0.999, 0.9995, 0.9999]))
Out[448]:
[' 0.001%',
 ' 0.01%',
 ' 0.05%',
 ' 0.1%',
 '10%',
 '50%',
 '99.9%',
 '99.95%',
 '99.99%']

prettify(np.array([0.0199, 0.03, 0.2]))
Out[449]: [' 2.0%', ' 3%', '20%']
```

I'm not sure if this isn't overkill. |
I'd better ask before I submit. Does anyone have an opinion on whether such formatting (padding with blanks to align decimal points) is acceptable?
|
aligning on the decimal point is fine: 0.01, 0.05, 0.10, 50.00, 99.90, 99.95, 99.99, as they line up better |
OK. Thanks for the comment. |
Personally, I would prefer 0.9 over 0.90 because it implies less visual …
|
maybe just leave the sig figs like u have them but align on decimals and also align the % - might be a nicer look |
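A rough sketch of that alignment idea: right-pad the integer parts so the decimal points line up, while leaving the significant figures alone. This is only an illustration of the suggestion, not pandas code; the sample labels are hypothetical:

```python
labels = ['0.1', '10', '50', '99.95']

# Width of the widest integer part; right-aligning to it lines up
# the decimal points without forcing extra trailing zeros.
int_width = max(len(s.partition('.')[0]) for s in labels)
aligned = [' ' * (int_width - len(s.partition('.')[0])) + s + '%' for s in labels]
# aligned == [' 0.1%', '10%', '50%', '99.95%']
```

Here `0.1` keeps one decimal and `99.95` keeps two, but the decimal points fall in the same column.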
All right. I'm posting it once more to give some visual comparison.
|
a small issue is that, to actually get at these fields, you have to use an exactly formatted label - if this is for display purposes only then that doesn't matter. s[' 0.1 %'] might be a bit awkward |
Yes. This was one of the reasons I preferred to ask first. |
I think that assuming this is for display purposes only might get us into …
|
True. I'm giving up on padding. Without blanks, it's both simpler to use and easier to code. Now, I'm a bit in favour of Version 2. (Version 1 automatically adds a decimal point to all entries if there is at least one non-integer.)
|
I think I like version 2 |
+1 for version 2, as readability seems more important than precision consistency. |
I was about to submit a pull request but discovered that the rounding is only a part of the problem here. The other part got slightly overlooked. Namely, users themselves can supply non-unique percentiles. It works fine with Series
but not with DataFrame (or other multidimensional objects):
So, what would be the most sensible action when a user supplies non-unique percentiles:
|
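For reference, a minimal reproduction of the duplicate-percentiles case. In pandas releases that include the fix eventually agreed on in this thread, duplicates raise ValueError for Series as well as DataFrame:

```python
import pandas as pd

s = pd.Series(range(10))
try:
    # Duplicate 0.25 entries: modern pandas rejects this up front.
    s.describe(percentiles=[0.25, 0.25, 0.75])
    raised = False
except ValueError:
    raised = True
```

At the time of this discussion, the Series path silently tolerated the duplicates while the DataFrame path failed deeper in the formatting code, which is the inconsistency being debated here.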
There's a third option: use the unique percentiles and issue a warning to … Personally, I'm in favor of raising a ValueError for all objects so that …
|
I think raising on duplicated percentiles is fine. |
I like @bla1089 's idea (unique percentiles plus a warning). So if it's also acceptable, I'd go for it. And another minor issue: I'd always sort percentiles on output. Now, it depends on whether you enter 0.5, so some weird reordering may happen:
|
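The sorting suggestion can be sketched as follows. This is only an illustration, not the pandas internals: the point is that inserting the median and sorting up front gives a stable display order whether or not the user supplied 0.5 themselves:

```python
import numpy as np

percentiles = [0.75, 0.1]

# Ensure the median is present, drop duplicates, and sort so the
# output order no longer depends on whether 0.5 was supplied.
pcts = np.sort(np.unique(np.append(percentiles, 0.5)))
# pcts -> array([0.1, 0.5, 0.75])
```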
sorting is fine. I am not really in favor of duplicates. yes they are handled generally, but not particularly useful IMHO. |
I'm actually not in favor of the warning because (1) it is only going to be … If you want the ability for duplicate percentiles, then how about a kwarg … I was only proposing the warning for completeness. That being said, the …
|
Yes, this sounds reasonable. Exceptions then. Slightly off-topic: if I keep finding other bugs and can fix them, should I open a new issue to discuss the code changes I propose? Or is it enough to discuss it in pull request comments? As an illustration, another bug:

```python
df = pd.DataFrame({'A': list("BCDE"), 0: [1, 2, 3, 4]})
df.describe()
# long traceback listing several internal functions
# ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'
```

I've found the source of the error and have a seemingly reasonable solution (it still needs testing, though). |
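A hedged illustration of that mixed column-index case: the labels mix a string and an integer, so the column index has object dtype. In pandas versions that include the fix, describe() simply falls back to the numeric column instead of raising:

```python
import pandas as pd

# Column labels 'A' (str) and 0 (int) give the column index object dtype.
df = pd.DataFrame({'A': list("BCDE"), 0: [1, 2, 3, 4]})

# In fixed versions, only the numeric column 0 is described by default.
desc = df.describe()
```

The `desc` name is just for illustration; the count row for column 0 should report all four values.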
This seems reasonably distinct and should probably be handled on its own.
|
Yes, it's distinct and I don't want to discuss it here. But I was going to fix it in the same commit because the code changes will probably be contained within the function …. My question was essentially: does any code change require opening a new issue? The wiki says not necessarily but mentions only insignificant changes. |
generally we like to have issues to go with PR's. |
Ok, thanks. I'll open a new one then. |
…s-dev#13288)

BUG pandas-dev#13104:
- Percentile identifiers are now rounded to the least precision that keeps them unique.
- Supplying duplicates in percentiles will raise ValueError.

BUG pandas-dev#13288:
- Fixed the column index of the output data frame. Previously, if a data frame had an object-dtype column index containing numeric values, the output column index could be corrupt, which led to a ValueError when the output was displayed.
- describe() will raise ValueError with an informative message on a DataFrame without columns.
When using pd.DataFrame.describe, if your percentiles differ only at the 4th decimal place, a ValueError is thrown because the percentiles that vary at the 4th decimal place become the same value.