get_dummies chokes on unicode values #6885
Comments
Could you try:
import pandas as pd
from StringIO import StringIO
s = """å,b
œ,c
"""
df = pd.read_csv(StringIO(s), header=None)
pd.get_dummies(df)
This works fine on my machine. Otherwise, please share a link to the file that generated your data, along with the code, so we can try to reproduce the problem. |
Thanks for the assist!
import pandas as pd
from StringIO import StringIO
s = """letter,cat
å,b
œ,c
"""
df = pd.read_csv(StringIO(s))
pd.get_dummies(df['letter'], prefix=u'foo')
reproduces the bug. |
I think @hayd mentioned this |
That one was in relation to
|
that should be fine |
No. Issue is that it tries to create a column name for each of the different values in the columns. When it creates a column name for the dummy (and there's a prefix) it calls: dummy_cols = ['%s%s%s' % (prefix, prefix_sep, str(v)) for v in levels] (as per the stacktrace I pasted earlier) since v (the value of the element in the category) is non-ascii-able unicode, calling Instead of calling It may be sufficient to use |
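To make the failure mode above concrete, here is a minimal sketch written for Python 3. It assumes the Python 2 behaviour under discussion, where `str()` on a unicode object performs an implicit ASCII encode. The helpers `encode_ascii` and `old_style_dummy_cols` are hypothetical stand-ins for illustration, not pandas functions.

```python
# Python 3 sketch of the Python 2 failure discussed above.
# encode_ascii is a hypothetical stand-in for the implicit ASCII encode
# that Python 2's str() applied to unicode objects.
def encode_ascii(value):
    return value.encode("ascii")


def old_style_dummy_cols(levels, prefix, prefix_sep="_"):
    # Same shape as the quoted list comprehension, with str(v) replaced
    # by an explicit ASCII round trip (an assumption for illustration).
    return ["%s%s%s" % (prefix, prefix_sep, encode_ascii(v).decode("ascii"))
            for v in levels]


# Pure-ASCII levels pass through unchanged.
ascii_cols = old_style_dummy_cols(["b", "c"], "foo")

# Non-ASCII levels fail at the encode step, before any formatting happens.
try:
    old_style_dummy_cols(["å", "œ"], "foo")
    error_name = None
except UnicodeEncodeError as exc:
    error_name = type(exc).__name__

print(ascii_cols)
print(error_name)
```

The sketch only models the `str(v)` step; it does not reproduce the separate formatting failure mentioned in the next comment.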
It actually breaks in the format stage:
|
I assume changing it to u('%s%s%s') % (prefix, prefix_sep, str(v)) would be cheating? :) An alternative would be:
try:
    dummy_cols = ['%s%s%s' % (prefix, prefix_sep, str(v)) for v in levels]
except UnicodeDecodeError:
    dummy_cols = [u('%s%s%s') % (prefix, prefix_sep, str(v)) for v in levels]
|
@maxgrenderjones I don't think that's cheating, in fact always returning unicode is IMO correct (i.e. just your second line). fancy putting together a PR ? :) |
I think there's a bug in our test case. Changing the test to:
import pandas as pd
reload(pd.core)
reload(pd.core)
from StringIO import StringIO
s = u"""letter,cat
å,b
œ,c
""".encode('utf-8')
df = pd.read_csv(StringIO(s), encoding='utf-8')
print(df)
pd.get_dummies(df['letter'], prefix='foo')
(i.e. make sure that pandas knows it's reading unicode) and all that is needed to get correct output is to remove the call to
if prefix is not None:
    dummy_cols = ['%s%s%s' % (prefix, prefix_sep, v)
                  for v in levels]
Trivial change - if it's enough for a pull request, happy to create one. |
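As a rough illustration of the change proposed above, the sketch below builds the dummy column names by passing each level straight to `%s` formatting, with no intermediate `str()` call. The function name `make_dummy_cols` and its default separator are assumptions made for this example; this is not the actual pandas implementation.

```python
def make_dummy_cols(levels, prefix=None, prefix_sep="_"):
    """Hypothetical helper: build dummy column names from category levels."""
    if prefix is None:
        # Without a prefix, the levels themselves serve as column names.
        return list(levels)
    # Let %s format each level directly, so text levels stay text and
    # non-string levels (e.g. integers) are converted by the formatting.
    return ["%s%s%s" % (prefix, prefix_sep, v) for v in levels]


unicode_cols = make_dummy_cols(["å", "œ"], prefix="foo")
int_cols = make_dummy_cols([1, 2], prefix="n")
bare_cols = make_dummy_cols(["b", "c"])

print(unicode_cols)
print(int_cols)
print(bare_cols)
```

In Python 3 every `str` is already unicode, so this runs without error either way; the point is only that the formatting step needs no explicit conversion of `v`.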
Definitely sounds like enough / would be a good PR, with the tests :) |
closed by #6975 |
(Context: pandas version 0.13.1 running on Python 2.7.6 |Anaconda 1.9.1 (64-bit)| (default, Nov 11 2013, 10:49:15) [MSC v.1500 64 bit (AMD64)])
In my code I have a category containing lots of non-English names and want to create dummies out of it.
So I call:
and get:
Issue would appear to be the call to str(v): if v is a unicode string with non-ASCII characters, this is liable to explode.