Numbers --> strings conversion - unexpected results (read_html) #18342

Vladimirzelenkovskii · 2017-11-17T09:29:23Z

Hi everyone,

It seems that pandas read_html doesn't process numeric values properly, the detailed issue with code examples on stackoverflow: https://stackoverflow.com/questions/47327966/pandas-converting-numbers-to-strings-unexpected-results

Source table:

<body>
    <table>
        <thead>
            <tr>
                <th class="tabHead" x:autofilter="all">Number</th>
            </tr>
        </thead>
        <tbody>
            <tr>
                <td class="tDetail">1.320,00</td>
            </tr>
            <tr>
                <td class="tDetail">600,00</td>
            </tr>
        </tbody>
    </table>
</body>

Obviously, the expected output is:

     Number
0  1.320,00
1    600,00

(1) Straightforward reading of the file

Processing code:

import pandas

df = pandas.read_html('test_file.xls')
print(df[0])
print(df[0].dtypes)

Output:

     Number
0      1.32
1  60000.00

Number    float64
dtype: object

(2) Applying str function as a convertor for each dimension

Processing code:

converters = {column_name: str for column_name in df[0].dtypes.index}
df = pandas.read_html(f, converters = converters)
print(df[0])
print(df[0].dtypes)

Output:

    Number
0  1.32000
1    60000

Number    object
dtype: object

There could be cases when one file contains numbers typed in different formats (American / European / etc). This numbers differs with decimal mark, thousand mark, etc. So the logical way to handle such files will be to extract the data "as it is" in strings and perform parsing with regexps / other modules separately for each row. Is there a way how to do it in pandas? Thanks guys!

Notes:

The input value of ,,,2,,,,5,,,,,5,,,,0,,,.,,,7,,,7,,, (mind the dote!) also converts to 2550.77
Specification of "decimal" and "thousands" parameters for pandas.read_* doesn't look like a reliable solution because it is appled for all fields. Quick example: it can treat date fields in "02.2017" format as numbers and convert it to "022017" (even to "22017" without leading zero)
The workaround if you have 100.000,00 formatted numerical values and 01.12.2017 dates is the following: using decimal = ',', thousands = '.' and passing the convertor dictionary that maps all columns to str: converters = {column_name: str for column_name in df[0].dtypes.index} in read_html call. So the numbers will be correct (according to this format) and dates won't be changed to something like 1122017 (remember that leading zero might be removed!)

Similar issues is #10534.
It is still opened, but here I also mention a direct unexpected behavior, like:

"01.12.2017" -> 1122017
",,,,,42.......42,,.,.,.,.,.,.42........" -> 424242

I guess this issue is not about "how to convert numbers properly" but "how to get actual data from html table". It's clear that pandas provides an analytical way of data processing and management, but at the same time pandas.read_html is the only reliable way in Python of obtaining raw data from html tables without parsing tr, th, td, etc... So I think it's really important to think about just "conversion" behavior of pandas in these terms.

The text was updated successfully, but these errors were encountered:

chris-b1 · 2017-11-17T15:35:58Z

To directly answer your question, if you want unparsed strings, pass thousands=None.

pd.read_html(data(), thousands=None, converters={'Number': lambda x: x})[0].values
Out[28]: 
array([['1.320,00'],
       ['600,00']], dtype=object)

The two issues I see

read_html has thousands=',' as the default, while most (all?) other parsing returns have None
In all parsing routines, thousands isn't used a true three-number sep, but but more a find/replace as you note

In [29]: pd.read_csv(StringIO("""a|b
    ...: 1,222,22|01.12.2017
    ...: 1,222,22|01.12.2017"""), sep='|', thousands='.')
Out[29]: 
          a        b
0  1,222,22  1122017
1  1,222,22  1122017

Would probably take a PR for either, 1. would require deprecation, 2. might be fairly involved

mroeschke · 2024-08-24T19:13:39Z

Thanks for the request, but it appears this hasn't gain traction in a while so closing

chris-b1 added API Design IO Data IO issues that don't fit into a more specific label IO HTML read_html, to_html, Styler.apply, Styler.applymap labels Nov 17, 2017

mroeschke added Enhancement and removed IO Data IO issues that don't fit into a more specific label labels May 2, 2020

mroeschke removed the API Design label Jun 12, 2021

mroeschke closed this as completed Aug 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Numbers --> strings conversion - unexpected results (read_html) #18342

Numbers --> strings conversion - unexpected results (read_html) #18342

Vladimirzelenkovskii commented Nov 17, 2017 •

edited

Loading

chris-b1 commented Nov 17, 2017

mroeschke commented Aug 24, 2024

Numbers --> strings conversion - unexpected results (read_html) #18342

Numbers --> strings conversion - unexpected results (read_html) #18342

Comments

Vladimirzelenkovskii commented Nov 17, 2017 • edited Loading

(1) Straightforward reading of the file

(2) Applying str function as a convertor for each dimension

chris-b1 commented Nov 17, 2017

mroeschke commented Aug 24, 2024

Vladimirzelenkovskii commented Nov 17, 2017 •

edited

Loading