ENH: Raise ParserWarning when length of names does not match length of data #38587
can you merge master. I think this might be too noisy (in the real world) to raise on a trailing comma, as this is a common thing to write for csv formats.
Conflicts: doc/source/whatsnew/v1.3.0.rst
Merged master. So we should avoid raising the warning when one set of trailing commas is given?
yeah i think so
Conflicts: doc/source/whatsnew/v1.3.0.rst pandas/tests/io/parser/test_common.py
This should do the trick. One set of trailing commas is allowed.
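To see why a trailing delimiter matters here: a delimiter at the end of every data line yields one extra, empty field per row. The snippet below illustrates this with Python's standard `csv` module rather than the pandas parsers; the sample text is made up for illustration.

```python
import csv
from io import StringIO

# Hypothetical sample: two header names, but every data row ends with a comma.
text = "a,b\n1,2,\n3,4,\n"

rows = list(csv.reader(StringIO(text)))
header, records = rows[0], rows[1:]

# Each data row has one more field than the header, and that field is empty.
extra_fields = [row[len(header):] for row in records]
print(header)        # ['a', 'b']
print(records)       # [['1', '2', ''], ['3', '4', '']]
print(extra_fields)  # [[''], ['']]
```

An extra column that is empty in every row is the "one set of trailing commas" case that the thread agrees should not trigger the warning.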
pandas/io/parsers.py
Outdated
        """
        if not self.index_col and len(columns) != len(data) and columns:
            if len(columns) == len(data) - 1 and np.all(
                (data[-1] == "") | isna(data[-1])
this seems a very specific condition. can you relax it?
Do we want to allow more than one set of trailing commas? In that case we can remove the len check.
The array representing the last entries contains only NaNs or empty strings, so this check is necessary.
don't we want to warn if there is a mismatch at all?
I thought we wanted to warn if we have more data-columns than names/headers except if we have trailing commas?
do we have a test for that? I would warn regardless
Changed this based on #38587 (comment)
test_no_header_two_extra_columns checks the warning and
pandas/pandas/tests/io/parser/test_common.py
Line 1066 in b5707d6
def test_trailing_delimiters(all_parsers):
can you merge master. cc @gfyoung
Conflicts: pandas/tests/io/parser/test_common.py
Merged
pandas/io/parsers.py
Outdated
            data: list of array-likes containing the data column-wise

        """
        if not self.index_col and len(columns) != len(data) and columns:
why do we need to check that data is actually null? IOW when would this situation happen when len(columns) > len(data) ?
len(columns) > len(data) is caught at another place I think.
We end up in there when len(columns) < len(data). In the case of one set of trailing commas we have len(columns) + 1 == len(data). To see if we really have trailing commas, we have to check whether the last array is empty. If the array is not empty, we do not have trailing commas but data which will be dropped.
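The check described in this comment can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the pandas implementation: `check_data_length` and `is_blank` are hypothetical names, missing values are modelled as `None` or float NaN, and a `UserWarning` stands in for `ParserWarning`.

```python
import math
import warnings


def is_blank(value):
    """Return True for an empty string, None, or a float NaN."""
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)


def check_data_length(columns, data, index_col=False):
    """Warn when the number of column-wise arrays differs from the names.

    Hypothetical stand-in for the check discussed in this thread. One set of
    trailing delimiters (a single extra, entirely blank column) is tolerated.
    Returns True if a warning was issued, False otherwise.
    """
    # Mirrors the condition quoted in the diff: nothing to do when an index
    # column is in use, when there are no names, or when the lengths match.
    if index_col or not columns or len(columns) == len(data):
        return False
    # Exactly one extra column that is blank in every row: trailing commas.
    if len(columns) == len(data) - 1 and all(is_blank(v) for v in data[-1]):
        return False
    # Anything else means real data would be dropped silently, so warn.
    warnings.warn(
        "Length of header or names does not match length of data.",
        UserWarning,
        stacklevel=2,
    )
    return True
```

With two names, `check_data_length(["a", "b"], [[1, 3], [2, 4], ["", ""]])` stays silent because the single extra column is blank in every row, while `[[1, 3], [2, 4], [9, ""]]` triggers the warning because the extra column carries a value that would be lost.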
have trailing commas but data which will be dropped.
ok, ideally we should put these kinds of checks in the same place where that is happening, if possible.
Bad wording. By "caught" I meant that if we have more columns than len(data), the extra columns are inserted as all NaNs.
this looks reasonable. any comments @pandas-dev/pandas-core
Conflicts: pandas/io/parsers/python_parser.py
doc/source/user_guide/io.rst
Outdated
@@ -757,6 +757,7 @@ the end of each data line, confusing the parser. To explicitly disable the
index column inference and discard the last column, pass ``index_col=False``:

.. ipython:: python
   :okwarning:
If this is going to warn, should the docs here then have to be updated to reflect this change?
(but is this actually going to warn? Below I read "One set of trailing commas is allowed.", which is the case here?)
Good point. This raised a warning earlier, before one set of trailing commas was allowed.
Looks good - minor requests. Also, can you add behavior of index_col=None to the docstring as mentioned at the top of #21768 (comment)
@phofl can you rebase? There are also some questions above.
I think I have addressed all comments
lgtm
@phofl can you resolve conflicts.
Conflicts: doc/source/whatsnew/v1.3.0.rst
resolved conflicts, @jreback ready to merge?
thanks @phofl
… names does not match length of data
@meeseeksdev backport 1.3.x
Something went wrong ... Please have a look at my logs.
…s not match length of data (#42047) Co-authored-by: Patrick Hoefler <[email protected]>
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
@gfyoung
Raising ParserWarning now. Could change to FutureWarning, if we would like to deprecate for 2.0
As long as we are only raising a ParserWarning I am inclined to raise for trailing commas too.