Number
0 1.32
1 60000.00
Number float64
dtype: object
(2) Applying the str function as a converter for each column
Processing code:
import pandas
# df is the list of tables returned by the plain read in (1); map every column to str
converters = {column_name: str for column_name in df[0].dtypes.index}
df = pandas.read_html(f, converters=converters)
print(df[0])
print(df[0].dtypes)
Output:
Number
0 1.32000
1 60000
Number object
dtype: object
A single file may contain numbers written in different formats (American, European, etc.), which differ in their decimal mark, thousands separator, and so on. The logical way to handle such files is to extract the data "as is" as strings and then parse each row separately with regexps or other modules. Is there a way to do this in pandas? Thanks guys!
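To make the "parse each row separately" idea concrete, here is a minimal sketch of what I mean, assuming the cells have already been obtained as raw strings. The helper name parse_number and its decimal_mark / thousands_mark parameters are made up for illustration and are not part of pandas; the caller has to know which format each row uses.

```python
import re
from decimal import Decimal


def parse_number(text, decimal_mark=".", thousands_mark=","):
    """Parse one raw cell string using explicitly given separators.

    Hypothetical helper: raises ValueError instead of guessing when the
    text does not strictly follow the given format.
    """
    dec = re.escape(decimal_mark)
    thou = re.escape(thousands_mark)
    # Digits grouped in threes, e.g. 60,000.00 or 100.000,00
    grouped = rf"[+-]?\d{{1,3}}(?:{thou}\d{{3}})+(?:{dec}\d+)?"
    # Plain digits with an optional fractional part, e.g. 1.32000
    plain = rf"[+-]?\d+(?:{dec}\d+)?"
    cleaned = text.strip()
    if not re.fullmatch(rf"{grouped}|{plain}", cleaned):
        raise ValueError(f"not a number in this format: {text!r}")
    # Drop the thousands mark first, then normalise the decimal mark to "."
    normalized = cleaned.replace(thousands_mark, "").replace(decimal_mark, ".")
    return Decimal(normalized)


# Each row carries its own (decimal mark, thousands mark) pair.
rows = [
    ("1.32000", ".", ","),
    ("60,000.00", ".", ","),
    ("100.000,00", ",", "."),
]
for raw, dec_mark, thou_mark in rows:
    print(raw, "->", parse_number(raw, dec_mark, thou_mark))
```

Decimal is used rather than float so that trailing zeros such as 1.32000 are kept. Strings like 01.12.2017 are rejected instead of being silently turned into a number.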
Notes:
The input value ,,,2,,,,5,,,,,5,,,,0,,,.,,,7,,,7,,, (mind the dot!) is also converted to 2550.77
Specifying the "decimal" and "thousands" parameters of pandas.read_* doesn't look like a reliable solution, because they are applied to all fields. Quick example: a date field in "02.2017" format can be treated as a number and converted to "022017" (or even to "22017", without the leading zero).
A workaround if you have numeric values formatted like 100.000,00 together with dates like 01.12.2017: use decimal = ',', thousands = '.' and pass a converters dictionary that maps all columns to str, converters = {column_name: str for column_name in df[0].dtypes.index}, in the read_html call. The numbers will then be correct (according to this format), and the dates won't be changed to something like 1122017 (remember that the leading zero might be removed!).
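Put together, the workaround from the last note could look roughly like the sketch below. The sample table and the function name read_tables_as_text are made up for illustration; decimal, thousands and converters are existing read_html parameters. read_html needs an HTML parser such as lxml installed, and I make no claim here about the exact values that come out, since that may depend on the pandas version.

```python
from io import StringIO

import pandas

# Hypothetical sample: a European-formatted amount next to a date.
HTML = """
<table>
  <tr><th>Amount</th><th>Date</th></tr>
  <tr><td>100.000,00</td><td>01.12.2017</td></tr>
</table>
"""


def read_tables_as_text(html):
    # First pass: only used to discover the column names.
    probe = pandas.read_html(StringIO(html))
    converters = {column_name: str for column_name in probe[0].columns}
    # Second pass: European separators plus a str converter for every column.
    return pandas.read_html(
        StringIO(html), decimal=",", thousands=".", converters=converters
    )


tables = read_tables_as_text(HTML)
print(tables[0])
print(tables[0].dtypes)
```

Reading the table twice is wasteful, but the column names have to be known before the converters dictionary can be built.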
A similar issue is #10534.
It is still open, but here I also point out some directly unexpected behavior, like:
I guess this issue is not about "how to convert numbers properly" but about "how to get the actual data from an HTML table". It's clear that pandas provides an analytical way of processing and managing data, but at the same time pandas.read_html is the only reliable way in Python to obtain raw data from HTML tables without parsing tr, th, td, etc. yourself. So I think it's really important to consider the conversion behavior of pandas in these terms.
Hi everyone,
It seems that pandas read_html doesn't process numeric values properly. The detailed issue, with code examples, is on Stack Overflow: https://stackoverflow.com/questions/47327966/pandas-converting-numbers-to-strings-unexpected-results
Source table:
Obviously, the expected output is:
(1) Straightforward reading of the file
Processing code:
Output: