Skip to content

Strip columns/column names in data frame of white spaces #14460

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
rahulporuri opened this issue Oct 20, 2016 · 9 comments
Open

Strip columns/column names in data frame of white spaces #14460

rahulporuri opened this issue Oct 20, 2016 · 9 comments
Labels
Bug IO CSV read_csv, to_csv

Comments

@rahulporuri
Copy link

rahulporuri commented Oct 20, 2016

This is more of a question than a bug.

A small, complete example of the issue

while opening a data file similar to

 a , b , c , d 
 1 , 2 , 3 , 4 
 5 , 6 , 7 , 8 

using

python -c "import pandas; df = pandas.read_table('unstripped_data.csv', sep=','); print df.columns"

Observed Output

Index([u' a ', u' b ', u' c ', u' d '], dtype='object')

Expected Output

Index([u'a', u'b', u'c', u'd'], dtype='object')

We expected that the column names/columns be stripped of white spaces.
Apologies for the noise if this has already been reported or is being addressed.

Output of pd.show_versions()

## INSTALLED VERSIONS

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 16.0.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.0
nose: 1.3.7
pip: 8.1.2
setuptools: 23.1.0
Cython: 0.24
numpy: 1.10.4
scipy: None
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: 1.4.1
patsy: None
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: 3.6.0
bs4: 4.4.1
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None

@dpinte
Copy link

dpinte commented Oct 20, 2016

The main problem is exacerbated when you have duplicated column names. With 0.19, mangle_dup_columns does not support being turned off. If you have two A columns, you end up with A .1 and not A.1. Adding a strip operations on the column names would nicely solve the issue.

The current pandas behaviour is hard to work with. Having to either write code to fix the mangling or write code to do our own header processing is far from optimal as it just duplicated what pandas does in a slightly different way.

@jorisvandenbossche
Copy link
Member

In general, you can use:

  • skipinitialspace=True, but that only strips the whitespace after the delimiter, not before (so in this case does not solve the issue
  • df.columns = df.columns.str.strip(), but that indeed does not work anymore once you have mangled duplicate column names

Regarding the mangle_dupe_cols, it is true that it cannot be turned off anymore, but note that before it just returned wrong results (that's the reason it was turned off). PRs to (correctly) enable it again are certainly very welcome, the issue for that is #13262

@jorisvandenbossche jorisvandenbossche added the IO CSV read_csv, to_csv label Oct 20, 2016
@dpinte
Copy link

dpinte commented Oct 20, 2016

@jorisvandenbossche do you have specific use case where stripping the column name before mangling would not be desired? That is a simple change and would provide what I think is the appropriate behaviour in the current implementation.

I'll see if we can look into opening a PR allowing to remove the mangling.

@jorisvandenbossche
Copy link
Member

@dpinte No, I cannot think of one (apart from the one where the spaces do mean something to you, but never dealt with such data). But stripping the whitespace when mangling, and not in other cases would also introduce an inconsistency.
So I think we should look at this irrespective of the mangling. And in general, I certainly would see value in an option to specify you want to strip all initial/trailing whitespace. We only already have so many options ...

@jorisvandenbossche
Copy link
Member

See also #14367 for a similar recent issue

@dpinte
Copy link

dpinte commented Oct 20, 2016

@jorisvandenbossche thanks for the feedback.

In the very short term, @rahulporuri can probably work on a PR to add stripping to column names for all the cases and exposing an option to turn it on/off.

@liangbright
Copy link

liangbright commented Jun 8, 2018

Over a year, it is still an issue for '0.23.0'. I can not believe it...

This is a solution
https://stackoverflow.com/questions/21606987/how-can-i-strip-the-whitespace-from-pandas-dataframe-headers

@jorisvandenbossche
Copy link
Member

@liangbright it's not a question of belief, it's a question of someone doing it. A pull request to add this option is always welcome.

@rmsmani
Copy link

rmsmani commented Aug 20, 2020

@jorisvandenbossche
Can I take this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO CSV read_csv, to_csv
Projects
None yet
Development

No branches or pull requests

6 participants