
pd.DataFrame.describe percentile string precision #13104


Closed
blalterman opened this issue May 6, 2016 · 36 comments
Labels: Bug, Indexing (related to indexing on series/frames, not to indexes themselves), Numeric Operations (arithmetic, comparison, and logical operations), Output-Formatting (__repr__ of pandas objects, to_string)
Milestone: 0.18.2

blalterman commented May 6, 2016

When using pd.DataFrame.describe, if the supplied percentiles differ only at the 4th decimal place, a ValueError is thrown because percentiles that vary at the 4th decimal place are formatted to the same label.

In [1]: s = Series(np.random.randn(10))

In [2]: s.describe()
Out[2]: 
count    10.000000
mean      0.291571
std       1.057143
min      -1.453547
25%      -0.614614
50%       0.637435
75%       0.968905
max       1.823964
dtype: float64

In [3]: s.describe(percentiles=[0.0001, 0.0005, 0.001, 0.999, 0.9995, 0.9999])
Out[3]: 
count     10.000000
mean       0.291571
std        1.057143
min       -1.453547
0.0%      -1.453107
0.1%      -1.451348
0.1%      -1.449149
50%        0.637435
99.9%      1.817201
100.0%     1.820583
100.0%     1.823288
max        1.823964
dtype: float64
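For reference, the collision comes from describe()'s label formatter, which caps percentile labels at one decimal place. A minimal sketch of that behaviour (a simplified reconstruction of the pretty_name helper quoted later in this thread, not the exact pandas source):

```python
def pretty_name(x):
    # Format a fractional percentile as a percent label,
    # capped at one decimal place (the behaviour behind this bug).
    x *= 100
    if x == int(x):
        return '%.0f%%' % x
    return '%.1f%%' % x

# 0.05% and 0.1% collapse to the same label, as do 99.95% and 99.99%:
print(pretty_name(0.0005), pretty_name(0.001))    # both '0.1%'
print(pretty_name(0.9995), pretty_name(0.9999))   # both '100.0%'
```

With such colliding labels, Series output merely shows duplicate index entries (as above), but DataFrame.describe fails when concatenating on the duplicated index.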
jreback added the Bug, Indexing, Output-Formatting, Numeric Operations, and Difficulty Novice labels on May 6, 2016
jreback added this to the 0.18.2 milestone on May 6, 2016

jreback commented May 6, 2016

@bla1089 can you add the example which raises?


ghost commented May 10, 2016

It seems that, by design, the formatter only shows 1 decimal place:

def pretty_name(x):
    x *= 100
    if x == int(x):
        return '%.0f%%' % x
    else:
        return '%.1f%%' % x

What would be an appropriate fix? Show more decimal places when the input carries more precision?

First timer here. Hope I'm not overstepping bounds... just wanted to see how I can start contributing to the project and learn about Python.

@blalterman
Author

You would have to determine the first significant decimal place that provides unique string identifiers for each percentile. I'm not trained as a programmer, but I would be surprised if there isn't a standard way to do this.



ghost commented May 10, 2016

Makes sense. Thanks.

@blalterman
Author

In fact, it might be done by taking np.ediff1d(np.log10([100_percentiles])), or replacing 100_percentiles with the decimals that remain after taking 100*percentiles. Round that result down (to the left on the number line) and take the most negative of the numbers. That would be the decimal-place precision necessary to get a unique index. Though this could be way too complicated a formulation.
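That idea can be sketched roughly as follows (a hypothetical helper with names of my own choosing, not pandas code): take the gaps between adjacent sorted percentile values, floor the log10 of the smallest gap, and negate it to get the number of decimal places needed.

```python
import numpy as np

def min_unique_precision(percentiles):
    # Smallest number of decimal places that keeps the labels
    # for 100 * percentiles distinct (hypothetical helper).
    pcts = np.sort(np.unique(np.asarray(percentiles, dtype=float) * 100))
    gaps = np.ediff1d(pcts)
    if gaps.size == 0:
        return 0
    # floor(log10(min gap)) gives the position of the gap's first
    # significant digit; its negation is the precision needed to
    # tell adjacent values apart.
    prec = -int(np.floor(np.log10(gaps.min())))
    return max(prec, 0)

print(min_unique_precision([0.0001, 0.0005, 0.001]))  # 2
print(min_unique_precision([0.25, 0.5, 0.75]))        # 0
```

For the percentiles from the bug report, the smallest gap between 0.01%, 0.05%, and 0.1% is 0.04, so two decimal places are needed to keep the labels unique.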



ghost commented May 10, 2016

How about just deriving the number of decimal places from the input?

def pretty_name(x):
    decimal_place = str(x)[::-1].find('.') - 2
    x *= 100
    if x == int(x):
        return '%.0f%%' % x
    else:
        return '%.*f%%' % (decimal_place, x)

@blalterman
Author

Test it. Does it work for cases in which the percentiles input has a varying number of decimal places? How aesthetically clean are the results?



srib commented May 10, 2016

I cannot seem to reproduce this issue on my machine.

>>> s.describe( percentiles=[0.0001, 0.0005, 0.001, 0.999, 0.9995, 0.9999])
count     10.000000
mean      -0.225519
std        0.841984
min       -1.910093
0.0%      -1.909577
0.1%      -1.907514
0.1%      -1.904935
50%        0.018802
99.9%      0.752442
100.0%     0.753580
100.0%     0.754491
max        0.754719
dtype: float64
>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: 2e975b6f39e3ab3840097ec0aa07c44ac13c4467
python: 2.7.11.final.0
python-bits: 32
OS: Linux
OS-release: 3.13.0-71-generic
machine: i686
processor: i686
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.18.1+5.g2e975b6
nose: 1.3.7
pip: 8.1.1
setuptools: 20.7.0
Cython: 0.24
numpy: 1.11.0
scipy: None
statsmodels: None
xarray: None
IPython: 4.2.0
sphinx: 1.4.1
patsy: None
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.8
boto: None
pandas_datareader: None


ghost commented May 11, 2016

Srib, you are reproducing it in your output too.
Look at how 0.9995 and 0.9999 are both formatted as 100.0% in the function's output.

@blalterman
Author

Same thing with 0.0005 and 0.001.



pijucha commented May 17, 2016

If I may add my 2 cents:

  1. I have some doubts about whether a unique index is really the issue here. First, a user can very well call s.describe(percentiles = [0.1, 0.1, 0.5]). Second, describe is not a function people usually use to calculate percentiles, so a pretty output might be more important than an exact percentile identifier. (But it's only a humble opinion.)

  2. @ghost. Your function pretty_name has a minor issue:

    pretty_name(0.00001)
    Out[285]: '0%'
    # because:
    str(0.00001)
    Out[286]: '1e-05'
  3. I wrote another function which finds a cutoff precision for float percentiles (and rounds them):

def prettify(percentiles):
    pcts = 100 * percentiles
    # Number of digits left to decimal point (needed for padding):
    m = max(1, 1 + np.floor(np.log10(pcts.max())).astype(int))
    is_int_arr = (pcts.astype(int) == pcts)
    if np.all(is_int_arr):
        out = pcts.astype(int).astype(str)
        return [' ' * (m - len(i)) + i + '%' for i in out]
    # precision:
    prec = -np.floor(np.log10(                          # position of the first digit of
                np.min(                                 # the minimum of
                    np.ediff1d(                         # differences of adjacent
                        np.sort(np.unique(              # sorted unique entries of
                            np.append(pcts, [0, 100])   # pcts with extra boundary values 
                        ))        
                    )
                )
            )).astype(int)
    prec = max(1, prec)
    out = np.empty_like(pcts, dtype = object)
    out[is_int_arr] = pcts[is_int_arr].astype(int).astype(str)
    out[~is_int_arr] = pcts[~is_int_arr].round(prec).astype(str)
    return [' ' * (m - len(i)) + i + '%' if is_int
            else ' ' * (m-i.find('.')) + i + '%'
            for (is_int, i) in zip (is_int_arr, out)]

and also does some padding:

prettify(np.array([1e-5, 0.0001, 0.0005, 0.001, 0.1, 0.5, 0.999, 0.9995, 0.9999]))
Out[448]: 
[' 0.001%',
 ' 0.01%',
 ' 0.05%',
 ' 0.1%',
 '10%',
 '50%',
 '99.9%',
 '99.95%',
 '99.99%']

prettify(np.array([0.0199, 0.03, 0.2]))
Out[449]: [' 2.0%', ' 3%', '20%']

I'm not sure whether this is overkill.
But if this is the expected behaviour and nobody is working on this issue (ghost's pull request has been cancelled, I guess), I can submit it.


jreback commented May 17, 2016

@pijucha you can submit if you'd like, still some comments outstanding on #13132


pijucha commented May 19, 2016

I'd better ask before I submit. Does anyone have an opinion on whether such a formatting (padding with blanks to align decimal points) is acceptable?

# Possible output of s.describe(percentiles=[0.0001, 0.0005, 0.001, 0.999, 0.9995, 0.9999])
count     10.000000
mean       0.291571
std        1.057143
min       -1.453547
 0.01%    -1.453107
 0.05%    -1.451348
 0.1%     -1.449149
50%        0.637435
99.9%      1.817201
99.95%     1.820583
99.99%     1.823288
max        1.823964

@jreback

still some comments outstanding on #13132
The author's (ghost) GitHub account has been deleted. Does that affect the pull request?


jreback commented May 19, 2016

aligning on the decimal point is fine. I think you should also have the same number of digits to the right of the decimal point:

0.01,0.05,0.10,50.00,99.90,99.95,99.99

as they line up better


pijucha commented May 19, 2016

OK. Thanks for the comment.

@blalterman
Author

Personally, I would prefer 0.9 over 0.90 because it creates less visual clutter, and I don't think we're concerned about sig figs.



jreback commented May 19, 2016

maybe just leave the sig figs like you have them, but align on decimals and also align the % - might be a nicer look


pijucha commented May 19, 2016

All right. I'm posting it once more to give some visual comparison.

# Possible output of s.describe(percentiles=[0.0001, 0.0005, 0.001, 0.999, 0.9995, 0.9999])
count     10.000000
mean       0.291571
std        1.057143
min       -1.453547
 0.01%    -1.453107
 0.05%    -1.451348
 0.1 %    -1.449149
50   %     0.637435
99.9 %     1.817201
99.95%     1.820583
99.99%     1.823288
max        1.823964


jreback commented May 19, 2016

a small issue is that to actually get at these fields you have to use an exactly formatted label - if this is for display purposes only then that doesn't matter

s[' 0.1 %'] might be a bit awkward


pijucha commented May 19, 2016

Yes. This was one of the reasons I preferred to ask first.
But I don't think people really use it like this. (At least I can't think of a situation in which I'd need to use it.)

@blalterman
Author

I think that assuming this is for display purposes only might get us into
trouble later on. I'm sure that if it's there, some people use it for more
than display purposes.



pijucha commented May 19, 2016

True. I'm giving up on padding. Without blanks, it's both simpler to use and easier to code.

Now, I'm a bit in favour of Version 2. (Version 1 automatically adds a decimal point to all entries if there is at least one non-integer.)

Version 1                                  Version 2

count     10.000000                        count     10.000000
mean       0.291571                        mean       0.291571
std        1.057143                        std        1.057143
min       -1.453547                        min       -1.453547
0.01%     -1.453107                        0.01%     -1.453107
0.05%     -1.451348                        0.05%     -1.451348
0.1%      -1.449149                        0.1%      -1.449149
50.0%      0.637435                        50%        0.637435
66.66%     0.901020                        66.66%     0.901020
75.0%      1.010203                        75%        1.010203
99.0%      1.817201                        99%        1.817201
99.9%      1.820583                        99.9%      1.820583
99.99%     1.823288                        99.99%     1.823288
max        1.823964                        max        1.823964

count     10.000000                        count     10.000000
mean       0.291571                        mean       0.291571
std        1.057143                        std        1.057143
min       -1.453547                        min       -1.453547
0.01%     -1.453107                        0.01%     -1.453107
25.0%      0.637435                        25%        0.637435
50.0%      0.637435                        50%        0.637435
75.0%      1.010203                        75%        1.010203
max        1.823964                        max        1.823964

count     10.000000                        count     10.000000
mean       0.291571                        mean       0.291571
std        1.057143                        std        1.057143
min       -1.453547                        min       -1.453547
25%        0.637435                        25%        0.637435
50%        0.637435                        50%        0.637435
75%        1.010203                        75%        1.010203
max        1.823964                        max        1.823964


jreback commented May 19, 2016

I think I like version 2



sinhrks commented May 19, 2016

+1 for version 2, as readability seems more important than precision consistency.


pijucha commented May 24, 2016

I was about to submit a pull request but discovered that the rounding is only a part of the problem here. The other part got slightly overlooked. Namely, users themselves can supply non-unique percentiles. It works fine with Series

s = pd.Series(np.arange(11))
s.describe(percentiles = [0.1, 0.2, 0.2])
Out[52]: 
count    11.000000
mean      5.000000
std       3.316625
min       0.000000
10%       1.000000
20%       2.000000
20%       2.000000
50%       5.000000
max      10.000000

but not with DataFrame (or other multidimensional objects):

df = pd.DataFrame(np.arange(11))
df.describe(percentiles = [0.1, 0.2, 0.2])
# ... longer traceback ...
ValueError: cannot reindex from a duplicate axis

.describe() internally uses pd.concat() for DataFrames and it breaks with non-unique entries in indexes (here, percentiles identifiers).

So, what would be the most sensible action when a user supplies non-unique percentiles:

  1. raising a ValueError with an informative message (and if so, for all objects or excluding Series?)
  2. modifying code to allow non-unique percentiles?
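Option 1 could be sketched as a small up-front validation step (a hypothetical helper, not the actual pandas implementation), so that both Series.describe and DataFrame.describe fail early and consistently instead of letting pd.concat break on a duplicate index:

```python
import numpy as np

def validate_percentiles(percentiles):
    # Reject duplicates with an informative message before any
    # percentile labels are built (sketch of option 1 above).
    pcts = np.asarray(percentiles, dtype=float)
    if np.unique(pcts).size != pcts.size:
        raise ValueError("percentiles cannot contain duplicates")
    return pcts

validate_percentiles([0.1, 0.2])        # fine
# validate_percentiles([0.1, 0.2, 0.2])  # would raise ValueError
```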

@blalterman
Author

There's a third option. Use the unique percentiles and issue a warning to
the user that they provided non-unique percentiles.

Personally, I'm in favor of raising a ValueError for all objects so that
things are standardized and to ensure people are getting the output they
expect.



jreback commented May 24, 2016

I think raising on duplicated percentiles is fine.


pijucha commented May 24, 2016

I like @bla1089's idea (unique percentiles plus a warning). So if it's also acceptable, I'd go for it.

And another minor issue: I'd always sort percentiles on output. Currently, the output order depends on whether you enter 0.5, so some weird reordering may happen:

s.describe(percentiles = [0.3, 0.6, 0.2])
Out[72]: 
count    11.000000
mean      5.000000
std       3.316625
min       0.000000
30%       3.000000
20%       2.000000
50%       5.000000
60%       6.000000
max      10.000000
dtype: float64
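The reordering happens because the median is appended to the requested percentiles when 0.5 is missing; sorting the combined list would fix it. A sketch of the proposed behaviour (a hypothetical helper, not the pandas source):

```python
import numpy as np

def normalize_percentiles(percentiles):
    # Ensure the median is present, then sort so the describe()
    # index is monotonic regardless of input order.
    pcts = np.asarray(percentiles, dtype=float)
    if 0.5 not in pcts:
        pcts = np.append(pcts, 0.5)
    return np.sort(pcts)

print(normalize_percentiles([0.3, 0.6, 0.2]).tolist())  # [0.2, 0.3, 0.5, 0.6]
```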


jreback commented May 24, 2016

sorting is fine. I am not really in favor of duplicates; yes, they are handled generally, but they're not particularly useful IMHO.

@blalterman
Author

I'm actually not in favor of the warning because (1) it is only going to be raised once; (2) we don't know how the user will use the percentiles (e.g. do they iterate through them in the describe output?); and (3) it permits sloppy coding that could cause someone problems later.

If you want the ability for duplicate percentiles, then how about a kwarg
like warn_on_dupe so that the default is an error and someone has to
explicitly request duplicates?

I was only proposing the warning for completeness. That being said, the
percentiles should be sorted so that the index is monotonic.



pijucha commented May 25, 2016

Yes, this sounds reasonable. Exceptions then.

Slightly off-topic: if I keep finding other bugs and can fix them, should I open a new issue to discuss the code changes I propose? Or is it enough to discuss them in the pull request comments?

As an illustration, another bug with describe, this time unrelated to percentiles:

df = pd.DataFrame({'A': list("BCDE"), 0: [1,2,3,4]})
df.describe()
# long traceback listing several internal functions
ValueError: Buffer dtype mismatch, expected 'Python object' but got 'long'

I've found the source of the error and have a seemingly reasonable solution (it still needs testing, though).

@blalterman
Author

This seems reasonably distinct and should probably be handled on its own.



pijucha commented May 25, 2016

Yes it's distinct and I don't want to discuss it here. But I was going to fix it in the same commit because the code changes will probably be contained within the function describe.

My question was essentially: does any code change require opening a new issue? The wiki says not necessarily, but it mentions only insignificant changes.


jreback commented May 25, 2016

generally we like to have issues to go with PRs.


pijucha commented May 25, 2016

Ok, thanks. I'll open a new one then.

pijucha added a commit to pijucha/pandas that referenced this issue May 31, 2016
…s-dev#13288)

BUG pandas-dev#13104:
- Percentile identifiers are now rounded to the least precision
that keeps them unique.
- Supplying duplicates in percentiles will raise ValueError.

BUG pandas-dev#13288
- Fixed a column index of the output data frame.
Previously, if a data frame had a column index of object type and
the index contained numeric values, the output column index could
be corrupt. It led to ValueError if the output was displayed.

- describe() will raise ValueError with an informative message
on DataFrame without columns.