Skip to content

ENH: Raise ParserWarning when length of names does not match length of data #38587

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 27 commits into from
Jun 16, 2021

Conversation

phofl
Copy link
Member

@phofl phofl commented Dec 19, 2020

@gfyoung

Raising ParserWarning now. Could change to FutureWarning, if we would like to deprecate for 2.0

As long as we are only raising a ParserWarning I am inclined to raise for trailing commas too.

@phofl phofl added IO CSV read_csv, to_csv Warnings Warnings that appear or should be added to pandas labels Dec 19, 2020
@jreback jreback added this to the 1.3 milestone Dec 21, 2020
@jreback
Copy link
Contributor

jreback commented Dec 21, 2020

can you merge master.

I think this might be too noisy (in the real world) to raise on a trailing command as this is a common thing to write for csv formats.

@phofl
Copy link
Member Author

phofl commented Dec 23, 2020

Merged master, so we should avoid raising the warning when one set of trailing commas is given?

@jreback
Copy link
Contributor

jreback commented Dec 24, 2020

Merged master, so we should avoid raising the warning when one set of trailing commas is given?

yeah i think so

phofl added 2 commits January 3, 2021 20:06
� Conflicts:
�	doc/source/whatsnew/v1.3.0.rst
�	pandas/tests/io/parser/test_common.py
@phofl
Copy link
Member Author

phofl commented Jan 3, 2021

This should do the trick. One set of trailing commas is allowed

"""
if not self.index_col and len(columns) != len(data) and columns:
if len(columns) == len(data) - 1 and np.all(
(data[-1] == "") | isna(data[-1])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems a very specific condition. can you relax it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to allow more than one set of trailing commas? In this case we can remove the len check.
The array representing the last entries has either only nans or empty strings, this check is necessary

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't we want to warn if there is a matchmatch at all?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we wanted to warn if we have more data-columns than names/headers except if we have trailing commas?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have a test for that? I would warn regardless

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed this based on #38587 (comment)

test_no_header_two_extra_columns checks the warning and

def test_trailing_delimiters(all_parsers):
checks trailing commas which does not raise a Warning

@jreback
Copy link
Contributor

jreback commented Jan 4, 2021

can you merge master.

cc @gfyoung

@phofl
Copy link
Member Author

phofl commented Jan 4, 2021

Merged

data: list of array-likes containing the data column-wise

"""
if not self.index_col and len(columns) != len(data) and columns:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to check that data is actually null? IOW when would this situation happen when len(columns) > len(data) ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

len(columns) > len(data) is caught at another place I think.
We run in there when len(columns) < len(data). In case of one set of trailing commas we have len(columns) + 1 = len(data). To see if we really have trailing commas we have to check if array is empty. If array is not empty we do not have trailing commas but data which will be dropped.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have trailing commas but data which will be dropped.

ok ideally we should put these kinds of checks in the same place that is happening if possible.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bad wording, with caught I meant if we got more columns than len(data), these columns are inserted all nans.

@phofl phofl removed the Stale label Apr 20, 2021
@jreback
Copy link
Contributor

jreback commented Apr 21, 2021

this looks reasonable.

any comments @pandas-dev/pandas-core

@@ -757,6 +757,7 @@ the end of each data line, confusing the parser. To explicitly disable the
index column inference and discard the last column, pass ``index_col=False``:

.. ipython:: python
:okwarning:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is going to warn, should the docs here then have to be updated to reflect this change?

(but is this actually going to warn? Below I read "One set of trailing commas is allowed.", which is the case here?)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, this raised a Warning earlier before allowing one set of trailing commas

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good - minor requests. Also, can you add behavior of index_col=None to the docstring as mentioned at the top of #21768 (comment)

@jreback
Copy link
Contributor

jreback commented May 21, 2021

@phofl can you rebase and some questions above

@phofl
Copy link
Member Author

phofl commented May 23, 2021

I think I have adressed all comments

@simonjayhawkins simonjayhawkins removed this from the 1.3 milestone May 25, 2021
Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@simonjayhawkins
Copy link
Member

@phofl can you resolve conflicts.

@phofl
Copy link
Member Author

phofl commented Jun 12, 2021

resolved conflicts, @jreback ready to merge?

@jreback jreback added this to the 1.3 milestone Jun 16, 2021
@jreback jreback merged commit a6a8915 into pandas-dev:master Jun 16, 2021
@jreback
Copy link
Contributor

jreback commented Jun 16, 2021

thanks @phofl

@jreback
Copy link
Contributor

jreback commented Jun 16, 2021

@meeseeksdev backport 1.3.x

@lumberbot-app
Copy link

lumberbot-app bot commented Jun 16, 2021

Something went wrong ... Please have a look at my logs.

simonjayhawkins pushed a commit that referenced this pull request Jun 16, 2021
…s not match length of data (#42047)

Co-authored-by: Patrick Hoefler <[email protected]>
@phofl phofl deleted the 21768 branch June 16, 2021 08:45
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO CSV read_csv, to_csv Warnings Warnings that appear or should be added to pandas
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pd.read_table: Using space as delimiter on file with trailing space gives cryptic error
6 participants