read_sas with chunksize/iterator raises ValueError #14734
Comments
@pijucha, thanks for the report. I've PR'd a possible fix. |
Is this issue solved? I just got this trying to iterate through a large sas7bdat file (using pandas 0.19.2 via conda):
Traceback (most recent call last):
  File "./extract_subset_of_columns.py", line 35, in <module>
    extract_columns_from_sas(lmed_file, columns=["lpnr", "KON", "atc", "EDATUM"], output_csv=lmed_file+".csv")
  File "./extract_subset_of_columns.py", line 25, in extract_columns_from_sas
    for count, chunk in enumerate(reader, start=1):
  File "/home/ctmr/anaconda3/lib/python3.6/site-packages/pandas/io/sas/sas7bdat.py", line 229, in __next__
    da = self.read(nrows=self.chunksize or 1)
  File "/home/ctmr/anaconda3/lib/python3.6/site-packages/pandas/io/sas/sas7bdat.py", line 614, in read
    rslt = self._chunk_to_dataframe()
  File "/home/ctmr/anaconda3/lib/python3.6/site-packages/pandas/io/sas/sas7bdat.py", line 663, in _chunk_to_dataframe
    rslt[name] = self._string_chunk[js, :]
  File "/home/ctmr/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2419, in __setitem__
    self._set_item(key, value)
  File "/home/ctmr/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2485, in _set_item
    value = self._sanitize_column(key, value)
  File "/home/ctmr/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2656, in _sanitize_column
    value = _sanitize_index(value, self.index, copy=False)
  File "/home/ctmr/anaconda3/lib/python3.6/site-packages/pandas/core/series.py", line 2800, in _sanitize_index
    raise ValueError('Length of values does not match length of ' 'index')
ValueError: Length of values does not match length of index
The file is 27GB, iterating with chunksize=100000. It failed approximately 70% through the file. column_count is 49, row_count is reported as 89065305. Is this related to the error referenced in this issue? |
Is the file compressed? Check the "compression" attribute of the iterator. |
@kshedden I get the following from the compression attribute of the iterator:
In [4]: iter.compression
Out[4]: b'SASYZCRL'
Which I interpret as some kind of compression. So, yes, I guess? It's interesting that the error first occurred somewhere after 70% into the file. All information up until that point was extracted without issue. Edit: I actually got another error for one of my other files. Maybe it's related? |
The SAS specification is not public and had to be reverse engineered through examples. The compression algorithm was particularly hard to reverse engineer (other people did most of the hard work on this; I only made a small contribution). I have been able to validate that our code successfully reads many compressed SAS files, but I'm pretty sure there are some compression codes that we do not know. We know the common codes, but it's more likely that we have missed a rare one, which is consistent with the fault occurring late in the file.
To verify that this is a compression issue, would it be possible for you to generate this file as an uncompressed SAS file and see if the issue still arises? |
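For reference, the signature bytes above can be decoded with a small helper. This is a hypothetical sketch (`describe_compression` and `KNOWN_SIGNATURES` are my own names, not a pandas API); the two signatures listed are the ones pandas' SAS reader recognizes, to the best of my knowledge:

```python
# Hypothetical helper (not a pandas API) that maps the signature bytes
# exposed by SAS7BDATReader's .compression attribute to a readable name.
# SASYZCRL is SAS's RLE ("char") scheme; SASYZCR2 is RDC ("binary").
KNOWN_SIGNATURES = {
    b"SASYZCRL": "RLE (COMPRESS=CHAR)",
    b"SASYZCR2": "RDC (COMPRESS=BINARY)",
}

def describe_compression(signature):
    """Return a readable description of a sas7bdat compression signature."""
    if not signature:
        return "uncompressed"
    return KNOWN_SIGNATURES.get(signature, "unknown signature: %r" % (signature,))

print(describe_compression(b"SASYZCRL"))  # RLE (COMPRESS=CHAR)
```

So `b'SASYZCRL'` means the file is RLE-compressed, which is why kshedden's question about compression is relevant here.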
I see. Really appreciate your effort! Unfortunately I don't think I can generate the file without compression, but I'll look into it (I don't have access to the source data). |
If you have SAS, you can convert the file from compressed to uncompressed (of course, in that case you could just use SAS to dump it to csv, but it would be helpful to us if this flags a problem that we can fix). I can give you SAS code to do the conversion if needed. |
read_sas doesn't work well with the chunksize or iterator parameters.
Code Sample and Problem Description
The following data test file in the repository has 32 lines.
When we carefully read the file with chunksize/iterator, all's well.
But if we don't know the length of the data, we'll easily stumble on an exception and won't read the whole data, which is painful with large files.
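The original code samples did not survive in this copy, but the failure mode described is an iterator that raises partway through instead of stopping cleanly. The defensive pattern can be sketched with a stand-in reader (`FlakyChunkReader` and `read_available_chunks` are hypothetical names, not pandas APIs; with the real reader you would iterate over `pd.read_sas(path, chunksize=...)` and apply the same guard):

```python
# Stand-in for a chunked reader that fails mid-file, mimicking the
# ValueError read_sas raised here. FlakyChunkReader is hypothetical:
# it yields chunks normally, then raises after `fail_after` chunks.
class FlakyChunkReader:
    def __init__(self, chunks, fail_after):
        self._chunks = chunks
        self._fail_after = fail_after
        self._i = 0

    def __iter__(self):
        return self

    def __next__(self):
        if self._i >= len(self._chunks):
            raise StopIteration
        if self._i >= self._fail_after:
            raise ValueError("Length of values does not match length of index")
        chunk = self._chunks[self._i]
        self._i += 1
        return chunk


def read_available_chunks(reader):
    """Collect chunks, keeping what was read if the reader dies mid-file."""
    out = []
    try:
        for chunk in reader:
            out.append(chunk)
    except ValueError as err:
        print("stopped early: %s" % err)
    return out


rows = read_available_chunks(FlakyChunkReader([[1, 2], [3, 4], [5, 6]], fail_after=2))
print(rows)  # [[1, 2], [3, 4]] -- the two chunks read before the failure
```

This guard only salvages the rows read so far; the underlying bug (the reader attempting to read past the data actually present) still needs fixing in pandas itself.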
Output of pd.show_versions():
INSTALLED VERSIONS
commit: 75b606a
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.1.20-1
machine: x86_64
processor: Intel(R)_Core(TM)i5-2520M_CPU@_2.50GHz
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.19.0+112.g75b606a
nose: 1.3.7
pip: 9.0.1
setuptools: 21.0.0
Cython: 0.24
numpy: 1.11.0