Strip columns/column names in data frame of white spaces #14460

rahulporuri · 2016-10-20T12:19:37Z

This is more of a question than a bug.

A small, complete example of the issue

while opening a data file similar to

 a , b , c , d 
 1 , 2 , 3 , 4 
 5 , 6 , 7 , 8

using

python -c "import pandas; df = pandas.read_table('unstripped_data.csv', sep=','); print df.columns"

Observed Output

Index([u' a ', u' b ', u' c ', u' d '], dtype='object')

Expected Output

Index([u'a', u'b', u'c', u'd'], dtype='object')

We expected that the column names/columns be stripped of white spaces.
Apologies for the noise if this has already been reported or is being addressed.

Output of `pd.show_versions()`

## INSTALLED VERSIONS

commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Darwin
OS-release: 16.0.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.19.0
nose: 1.3.7
pip: 8.1.2
setuptools: 23.1.0
Cython: 0.24
numpy: 1.10.4
scipy: None
statsmodels: None
xarray: None
IPython: 5.1.0
sphinx: 1.4.1
patsy: None
dateutil: 2.5.2
pytz: 2016.3
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: 2.4.0
xlrd: 1.0.0
xlwt: None
xlsxwriter: None
lxml: 3.6.0
bs4: 4.4.1
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None

The text was updated successfully, but these errors were encountered:

dpinte · 2016-10-20T12:36:12Z

The main problem is exacerbated when you have duplicated column names. With 0.19, mangle_dup_columns does not support being turned off. If you have two A columns, you end up with A .1 and not A.1. Adding a strip operations on the column names would nicely solve the issue.

The current pandas behaviour is hard to work with. Having to either write code to fix the mangling or write code to do our own header processing is far from optimal as it just duplicated what pandas does in a slightly different way.

jorisvandenbossche · 2016-10-20T13:07:52Z

In general, you can use:

skipinitialspace=True, but that only strips the whitespace after the delimiter, not before (so in this case does not solve the issue
df.columns = df.columns.str.strip(), but that indeed does not work anymore once you have mangled duplicate column names

Regarding the mangle_dupe_cols, it is true that it cannot be turned off anymore, but note that before it just returned wrong results (that's the reason it was turned off). PRs to (correctly) enable it again are certainly very welcome, the issue for that is #13262

dpinte · 2016-10-20T13:29:16Z

@jorisvandenbossche do you have specific use case where stripping the column name before mangling would not be desired? That is a simple change and would provide what I think is the appropriate behaviour in the current implementation.

I'll see if we can look into opening a PR allowing to remove the mangling.

jorisvandenbossche · 2016-10-20T13:38:39Z

@dpinte No, I cannot think of one (apart from the one where the spaces do mean something to you, but never dealt with such data). But stripping the whitespace when mangling, and not in other cases would also introduce an inconsistency.
So I think we should look at this irrespective of the mangling. And in general, I certainly would see value in an option to specify you want to strip all initial/trailing whitespace. We only already have so many options ...

jorisvandenbossche · 2016-10-20T13:40:29Z

See also #14367 for a similar recent issue

dpinte · 2016-10-20T13:49:29Z

@jorisvandenbossche thanks for the feedback.

In the very short term, @rahulporuri can probably work on a PR to add stripping to column names for all the cases and exposing an option to turn it on/off.

liangbright · 2018-06-08T20:08:41Z

Over a year, it is still an issue for '0.23.0'. I can not believe it...

This is a solution
https://stackoverflow.com/questions/21606987/how-can-i-strip-the-whitespace-from-pandas-dataframe-headers

jorisvandenbossche · 2018-06-08T21:27:52Z

@liangbright it's not a question of belief, it's a question of someone doing it. A pull request to add this option is always welcome.

rmsmani · 2020-08-20T23:36:51Z

@jorisvandenbossche
Can I take this?

jorisvandenbossche added the IO CSV read_csv, to_csv label Oct 20, 2016

rahulporuri mentioned this issue Oct 24, 2016

Strip whitespace from column names when usecols in read_csv #14480

Closed

mroeschke added the Bug label May 2, 2021

phofl mentioned this issue Nov 26, 2021

Strip header names? #9067

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Strip columns/column names in data frame of white spaces #14460

Strip columns/column names in data frame of white spaces #14460

rahulporuri commented Oct 20, 2016 •

edited

Loading

dpinte commented Oct 20, 2016

jorisvandenbossche commented Oct 20, 2016

dpinte commented Oct 20, 2016

jorisvandenbossche commented Oct 20, 2016

jorisvandenbossche commented Oct 20, 2016

dpinte commented Oct 20, 2016

liangbright commented Jun 8, 2018 •

edited

Loading

jorisvandenbossche commented Jun 8, 2018

rmsmani commented Aug 20, 2020

Strip columns/column names in data frame of white spaces #14460

Strip columns/column names in data frame of white spaces #14460

Comments

rahulporuri commented Oct 20, 2016 • edited Loading

A small, complete example of the issue

Observed Output

Expected Output

Output of pd.show_versions()

dpinte commented Oct 20, 2016

jorisvandenbossche commented Oct 20, 2016

dpinte commented Oct 20, 2016

jorisvandenbossche commented Oct 20, 2016

jorisvandenbossche commented Oct 20, 2016

dpinte commented Oct 20, 2016

liangbright commented Jun 8, 2018 • edited Loading

jorisvandenbossche commented Jun 8, 2018

rmsmani commented Aug 20, 2020

rahulporuri commented Oct 20, 2016 •

edited

Loading

Output of `pd.show_versions()`

liangbright commented Jun 8, 2018 •

edited

Loading