Skip to content

Numbers --> strings conversion - unexpected results (read_html) #18342

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Vladimirzelenkovskii opened this issue Nov 17, 2017 · 2 comments
Closed
Labels
Enhancement IO HTML read_html, to_html, Styler.apply, Styler.applymap

Comments

@Vladimirzelenkovskii
Copy link

Vladimirzelenkovskii commented Nov 17, 2017

Hi everyone,

It seems that pandas read_html doesn't process numeric values properly, the detailed issue with code examples on stackoverflow: https://stackoverflow.com/questions/47327966/pandas-converting-numbers-to-strings-unexpected-results

Source table:

<body>
    <table>
        <thead>
            <tr>
                <th class="tabHead" x:autofilter="all">Number</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td class="tDetail">1.320,00</td>
            </tr>
            <tr>
                <td class="tDetail">600,00</td>
            </tr>
        </tbody>
    </table>
</body>

Obviously, the expected output is:

     Number
0  1.320,00
1    600,00

(1) Straightforward reading of the file

Processing code:

import pandas

df = pandas.read_html('test_file.xls')
print(df[0])
print(df[0].dtypes)

Output:

     Number
0      1.32
1  60000.00

Number    float64
dtype: object

(2) Applying str function as a convertor for each dimension

Processing code:

converters = {column_name: str for column_name in df[0].dtypes.index}
df = pandas.read_html(f, converters = converters)
print(df[0])
print(df[0].dtypes)

Output:

    Number
0  1.32000
1    60000

Number    object
dtype: object

There could be cases when one file contains numbers typed in different formats (American / European / etc). This numbers differs with decimal mark, thousand mark, etc. So the logical way to handle such files will be to extract the data "as it is" in strings and perform parsing with regexps / other modules separately for each row. Is there a way how to do it in pandas? Thanks guys!

Notes:

  1. The input value of ,,,2,,,,5,,,,,5,,,,0,,,.,,,7,,,7,,, (mind the dote!) also converts to 2550.77
  2. Specification of "decimal" and "thousands" parameters for pandas.read_* doesn't look like a reliable solution because it is appled for all fields. Quick example: it can treat date fields in "02.2017" format as numbers and convert it to "022017" (even to "22017" without leading zero)
  3. The workaround if you have 100.000,00 formatted numerical values and 01.12.2017 dates is the following: using decimal = ',', thousands = '.' and passing the convertor dictionary that maps all columns to str: converters = {column_name: str for column_name in df[0].dtypes.index} in read_html call. So the numbers will be correct (according to this format) and dates won't be changed to something like 1122017 (remember that leading zero might be removed!)

Similar issues is #10534.
It is still opened, but here I also mention a direct unexpected behavior, like:

  1. "01.12.2017" -> 1122017
  2. ",,,,,42.......42,,.,.,.,.,.,.42........" -> 424242

I guess this issue is not about "how to convert numbers properly" but "how to get actual data from html table". It's clear that pandas provides an analytical way of data processing and management, but at the same time pandas.read_html is the only reliable way in Python of obtaining raw data from html tables without parsing tr, th, td, etc... So I think it's really important to think about just "conversion" behavior of pandas in these terms.

@chris-b1
Copy link
Contributor

To directly answer your question, if you want unparsed strings, pass thousands=None.

pd.read_html(data(), thousands=None, converters={'Number': lambda x: x})[0].values
Out[28]: 
array([['1.320,00'],
       ['600,00']], dtype=object)

The two issues I see

  1. read_html has thousands=',' as the default, while most (all?) other parsing returns have None

  2. In all parsing routines, thousands isn't used a true three-number sep, but but more a find/replace as you note

In [29]: pd.read_csv(StringIO("""a|b
    ...: 1,222,22|01.12.2017
    ...: 1,222,22|01.12.2017"""), sep='|', thousands='.')
Out[29]: 
          a        b
0  1,222,22  1122017
1  1,222,22  1122017

Would probably take a PR for either, 1. would require deprecation, 2. might be fairly involved

@chris-b1 chris-b1 added API Design IO Data IO issues that don't fit into a more specific label IO HTML read_html, to_html, Styler.apply, Styler.applymap labels Nov 17, 2017
@mroeschke mroeschke added Enhancement and removed IO Data IO issues that don't fit into a more specific label labels May 2, 2020
@mroeschke
Copy link
Member

Thanks for the request, but it appears this hasn't gain traction in a while so closing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement IO HTML read_html, to_html, Styler.apply, Styler.applymap
Projects
None yet
Development

No branches or pull requests

3 participants