read_html ignores commas #8200

foebu · 2014-09-06T22:13:38Z

Hello,

I'm trying to parse a file with the extension 'xls' that is not actually an Excel file: opened in a text editor, it is clearly HTML.

One of my columns uses the Italian convention of a comma instead of a dot as the decimal separator: 5,5 instead of 5.5. I was hoping to parse it at least as a string, replace the commas with dots, and convert the result to a float.

The problem is that the commas are dropped entirely: instead of getting 5,5 or 7,04, I get 55 and 704.

Is this a known issue? Any idea how to solve it?

cpcloud · 2014-09-06T22:20:10Z

Why aren't you using read_excel for this? I think xls is a special kind of Microsoft XML. Not all valid XML is valid HTML, but all valid HTML is valid XML, so your file is probably not HTML, it's some kind of XML.

cpcloud · 2014-09-06T22:21:24Z

Try it with read_excel, and if that doesn't do what you want (i.e., there's still a separator issue) I'll take a look.

foebu · 2014-09-06T22:25:07Z

Actually I'm using read_html because read_excel doesn't work. I get: XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'<table b'

cpcloud · 2014-09-06T22:56:44Z

This requires implementing the decimal option for the Python-based TextParser, which I personally don't have time to look into right now. However, I can help you get started on a pull request if you'd like to dive in.

foebu · 2014-09-06T22:58:48Z

I'm in. I can have a look at it.

cpcloud · 2014-09-06T23:41:44Z

Look in pandas/parser.pyx around lines 357 and 1439. This would have to go into pandas/io/parsers.py, somewhere in the TextParser class functionality.

hayd · 2014-09-09T17:56:52Z

This came up on the ML:

Does anyone have an idea how to read HTML tables from German sites that use German decimal notation, e.g. the tables at the bottom of http://www.finanzen.net/bilanz_guv/SAP?

Right now, I'm trying to read that with:
df = pd.read_html(url,infer_types=False,parse_dates=False,header=0,skiprows=0,thousands=".",match=pattern,index_col=0)
dfR = df[0].replace(",",".")
dfR = pd.DataFrame(dfR, dtype='float')
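One likely reason the snippet above does not convert anything: DataFrame.replace(",", ".") matches whole cell values by default, so a cell such as "5,5" is left untouched. Below is a minimal sketch of a string-based conversion, assuming the cells arrive as strings. The `raw` frame and the `to_float_column` helper are hypothetical stand-ins for illustration, not part of pandas or of the original message.

```python
import pandas as pd

# Hypothetical stand-in for a table returned by read_html with cells kept as strings.
raw = pd.DataFrame({
    "revenue": ["1.234,56", "7,04"],
    "margin": ["5,5", "0,25"],
})

# DataFrame.replace with plain strings only matches entire cell values,
# so this call leaves every cell as it was.
unchanged = raw.replace(",", ".")

def to_float_column(col):
    """Convert a column of German/Italian-style number strings to floats."""
    # Drop "." thousands separators first, then turn the decimal comma into a dot.
    cleaned = col.str.replace(".", "", regex=False).str.replace(",", ".", regex=False)
    return pd.to_numeric(cleaned)

# Apply the conversion column by column.
converted = raw.apply(to_float_column)
print(converted)
```

The order of the two replacements matters: removing the thousands dots before rewriting the decimal comma avoids deleting the newly created decimal point.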

cpcloud · 2014-09-09T18:03:21Z

Yep, this is actually nontrivial, as it requires adding some new parsing functionality. It can probably be mostly copied from the C parser, so it's not huge. Would be a nice first PR.

foebu · 2015-01-18T17:28:43Z

Actually, I noticed that passing thousands=None makes the commas parse correctly.
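This observation fits the symptom: read_html defaults to thousands=",", so the comma in 5,5 is stripped as a thousands separator before conversion. The pure-Python sketch below illustrates that behavior and what a decimal option would add. `parse_number` is a hypothetical helper written for this illustration, not the pandas implementation.

```python
def parse_number(text, thousands=",", decimal="."):
    """Convert a numeric string to float, honoring the given separators.

    Hypothetical helper: it only mimics the separator handling discussed
    in this thread and is not how pandas parses numbers internally.
    """
    cleaned = text.strip()
    if thousands:
        # Thousands separators carry no value, so they are simply removed.
        cleaned = cleaned.replace(thousands, "")
    if decimal != ".":
        # float() only understands "." as the decimal point.
        cleaned = cleaned.replace(decimal, ".")
    return float(cleaned)

# Default separators: the comma is treated as a thousands separator.
print(parse_number("5,5"))                                    # 55.0
# No thousands separator and a decimal comma: the value survives.
print(parse_number("5,5", thousands=None, decimal=","))       # 5.5
# German-style formatting with both separators.
print(parse_number("1.234,56", thousands=".", decimal=","))   # 1234.56
```

With thousands=None the commas are preserved, so the values can then be converted using a decimal-comma rule like the one sketched here.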

jreback · 2016-04-17T14:00:03Z

Closing as a duplicate of #12907.

cpcloud added this to the 0.15.0 milestone Sep 6, 2014

jreback modified the milestones: 0.15.1, 0.15.0 Sep 8, 2014

jreback added the Enhancement, IO HTML, and API Design labels Sep 8, 2014

jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015

jreback mentioned this issue Apr 17, 2016

read_html: support "decimal" argument for parsing numbers, like read_csv #12907

Closed

jreback added the Duplicate Report label Apr 17, 2016

jreback closed this as completed Apr 17, 2016
