
read_html ignores commas #8200


Closed
foebu opened this issue Sep 6, 2014 · 10 comments
Labels
API Design · Duplicate Report (Duplicate issue or pull request) · Enhancement · IO HTML (read_html, to_html, Styler.apply, Styler.applymap)

Comments


foebu commented Sep 6, 2014

Hello,

I'm trying to parse a file with the extension 'xls' that is not actually an Excel file: opened in a text editor, it is clearly HTML.

One of my columns follows the Italian convention of using a comma instead of a dot as the decimal separator: 5,5 instead of 5.5. I was hoping to parse it at least as a string, replace the commas with dots, and convert the string to a float.

The problem is that the commas are dropped entirely: instead of getting 5,5 or 7,04, I get 55 and 704.

Is this known? Any idea on how to solve it?
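
The conversion described above (read the cell as a string, swap the comma for a dot, convert to float) can be sketched in plain Python. The helper name `parse_decimal_comma` and its parameters are hypothetical, chosen for illustration; this is not pandas code.

```python
def parse_decimal_comma(text, decimal=",", thousands="."):
    """Convert a string such as '5,5' or '1.234,56' to a float.

    Hypothetical helper for illustration; not part of pandas.
    """
    cleaned = text.strip()
    if thousands:
        # Drop grouping marks first so they are not mistaken for decimals.
        cleaned = cleaned.replace(thousands, "")
    # Then normalise the decimal mark to the dot that float() expects.
    cleaned = cleaned.replace(decimal, ".")
    return float(cleaned)


values = [parse_decimal_comma(s) for s in ["5,5", "7,04", "1.234,56"]]
print(values)
```

This only works if the cells reach the helper as untouched strings, which is exactly what fails in the report above.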

Member

cpcloud commented Sep 6, 2014

Why aren't you using read_excel for this? A file with an xls extension that opens as markup may be Microsoft's XML spreadsheet format rather than HTML. HTML and XML look similar, but a document that is valid in one is not necessarily valid in the other, so your file is possibly not HTML but some kind of XML.

@cpcloud cpcloud added this to the 0.15.0 milestone Sep 6, 2014
Member

cpcloud commented Sep 6, 2014

Try it with read_excel, and if that doesn't do what you want (i.e., there's still a separator issue) I'll take a look.

Author

foebu commented Sep 6, 2014

Actually I'm using read_html because read_excel doesn't work. I get: XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'<table b'
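
The error above says the reader expected a binary BOF record but found `<table`, which suggests the file is markup despite its extension. A rough way to check what a file really contains is to look at its first bytes. The sketch below is an assumption-laden heuristic, not something pandas or xlrd provides; `sniff_table_format` is a hypothetical name.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # container used by legacy binary .xls
ZIP_MAGIC = b"PK\x03\x04"                        # .xlsx files are zip archives


def sniff_table_format(head: bytes) -> str:
    """Guess the real format from the first bytes of a file (hypothetical helper)."""
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        return "xlsx"
    # Markup checks are case-insensitive and ignore leading whitespace.
    stripped = head.lstrip().lower()
    if stripped.startswith(b"<?xml"):
        return "xml"
    if stripped.startswith((b"<table", b"<html", b"<!doctype html")):
        return "html"
    return "unknown"


print(sniff_table_format(b"<table border='1'><tr><td>5,5</td></tr></table>"))
```

In practice the argument would come from something like `open(path, "rb").read(64)`; a result of "html" would point to read_html rather than read_excel.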

Member

cpcloud commented Sep 6, 2014

This requires implementing the decimal option for the Python-based TextParser, which I personally don't have the time to look into right now. However, I can help you get started on a pull request if you'd like to dive in.

Author

foebu commented Sep 6, 2014

I'm in, I can have a look at it.

Member

cpcloud commented Sep 6, 2014

Look in pandas/parser.pyx around lines 357 and 1439. The change would have to go into pandas/io/parsers.py, somewhere in the TextParser class functionality.

@jreback jreback modified the milestones: 0.15.1, 0.15.0 Sep 8, 2014
@jreback jreback added the Enhancement, IO HTML, and API Design labels Sep 8, 2014
Contributor

hayd commented Sep 9, 2014

This came up on the ML:

Does anyone have an idea how to read HTML tables from German sites that use the German decimal notation,
e.g. the tables at the bottom of http://www.finanzen.net/bilanz_guv/SAP

Right now, I try to read that with:
df = pd.read_html(url,infer_types=False,parse_dates=False,header=0,skiprows=0,thousands=".",match=pattern,index_col=0)
dfR = df[0].replace(",",".")
dfR = pd.DataFrame(dfR, dtype='float')
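
One detail of the snippet above is worth checking: `DataFrame.replace(",", ".")` without `regex=True` only replaces cells whose whole value equals ",", so strings like "3325,5" pass through unchanged. The sketch below illustrates the difference on a small frame of strings. The column names and values are made up for illustration, and the frame stands in for what read_html might return once the dots have been stripped as thousands separators.

```python
import pandas as pd

# Made-up string cells standing in for a parsed table (assumption, not real data).
raw = pd.DataFrame({"a": ["16815,0", "3325,5"], "b": ["17560,0", "3280,2"]})

# Whole-value replacement: no cell is exactly ",", so nothing changes.
unchanged = raw.replace(",", ".")

# Substring replacement via regex, then conversion to float.
converted = raw.replace(",", ".", regex=True).astype(float)

print(unchanged.equals(raw))
print(converted.dtypes.tolist())
```

Using `regex=True` keeps the approach close to the original attempt; applying `str.replace` column by column would be an alternative.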

Member

cpcloud commented Sep 9, 2014

Yep, this is actually non-trivial, as it requires adding some new parsing functionality. Most of it can probably be copied from the C parser, so it's not a huge change. It would be a nice first PR.

Author

foebu commented Jan 18, 2015

Actually, I noticed that passing thousands=None lets the commas be parsed correctly.
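
This observation fits with read_html's default of `thousands=','`: a comma in a numeric-looking cell is treated as a grouping mark and removed, which would turn 5,5 into 55. The following is a simplified imitation of that behaviour in plain Python, written only to illustrate the effect. `strip_thousands` is a hypothetical function, not pandas source code.

```python
import re


def strip_thousands(cell, thousands=","):
    """Remove the thousands separator from numeric-looking cells.

    Simplified imitation for illustration only; not pandas source code.
    """
    if thousands is None:
        return cell
    candidate = cell.replace(thousands, "")
    # Only touch the cell if what remains looks like a plain number.
    if re.fullmatch(r"[+-]?\d+(\.\d+)?", candidate):
        return candidate
    return cell


for raw_cell in ["5,5", "7,04"]:
    print(raw_cell, "->", strip_thousands(raw_cell), "|", strip_thousands(raw_cell, thousands=None))
```

With `thousands=None` the cells stay as strings such as "5,5", so they still need a comma-to-dot conversion before they become floats. Current pandas documentation also lists a `decimal` parameter for read_html, which may make that extra step unnecessary.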

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@jreback jreback added the Duplicate Report label Apr 17, 2016
Contributor

jreback commented Apr 17, 2016

closing as dupe of #12907

@jreback jreback closed this as completed Apr 17, 2016