read_sas with chunksize/iterator raises ValueError #14734

pijucha · 2016-11-25T04:17:11Z

read_sas doesn't work well with chunksize or iterator parameters.

Code Sample and Problem Description

The following test data file in the repository has 32 rows.

sasfile = 'pandas/io/tests/sas/data/airline.sas7bdat'
pd.read_sas(sasfile).shape
Out[18]: (32, 6)

When we carefully read the file with chunksize/iterator, all's well:

reader = pd.read_sas(sasfile, chunksize=16)
df = reader.read()
df.shape
Out[31]: (16, 6)
df = reader.read()
df.shape
Out[33]: (16, 6)

or

reader = pd.read_sas(sasfile, iterator=True)
df = reader.read(30)
df.shape
Out[37]: (30, 6)
df = reader.read(2)
df.shape
Out[39]: (2, 6)
df = reader.read(2)
type(df)
Out[41]: NoneType

But if we don't know the number of rows in advance, we easily hit an exception and cannot read the whole file, which is painful with large files.

reader = pd.read_sas(sasfile, chunksize=20)
df = reader.read()
df.shape
Out[45]: (20, 6)
df = reader.read()
Traceback (most recent call last):
  File "/usr/local/lib64/python3.5/site-packages/IPython/core/interactiveshell.py", line 2885, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-46-c5d811b93ac1>", line 1, in <module>
    df = reader.read()
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/io/sas/sas7bdat.py", line 604, in read
    rslt = self._chunk_to_dataframe()
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/io/sas/sas7bdat.py", line 646, in _chunk_to_dataframe
    dtype=self.byte_order + 'd')
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/core/frame.py", line 2419, in __setitem__
    self._set_item(key, value)
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/core/frame.py", line 2485, in _set_item
    value = self._sanitize_column(key, value)
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/core/frame.py", line 2656, in _sanitize_column
    value = _sanitize_index(value, self.index, copy=False)
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/core/series.py", line 2793, in _sanitize_index
    raise ValueError('Length of values does not match length of ' 'index')
ValueError: Length of values does not match length of index

or

reader = pd.read_sas(sasfile, iterator=True)
reader.read(30).shape
Out[51]: (30, 6)
reader.read(30).shape
Traceback (most recent call last):
  File "/usr/local/lib64/python3.5/site-packages/IPython/core/interactiveshell.py", line 2885, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-52-5d757f713808>", line 1, in <module>
    reader.read(30).shape
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/io/sas/sas7bdat.py", line 604, in read
    rslt = self._chunk_to_dataframe()
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/io/sas/sas7bdat.py", line 646, in _chunk_to_dataframe
    dtype=self.byte_order + 'd')
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/core/frame.py", line 2419, in __setitem__
    self._set_item(key, value)
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/core/frame.py", line 2485, in _set_item
    value = self._sanitize_column(key, value)
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/core/frame.py", line 2656, in _sanitize_column
    value = _sanitize_index(value, self.index, copy=False)
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/core/series.py", line 2793, in _sanitize_index
    raise ValueError('Length of values does not match length of ' 'index')
ValueError: Length of values does not match length of index
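For reference, the chunk sizes one would expect from a well-behaved reader can be worked out with simple arithmetic. The sketch below is not pandas code; `expected_chunk_sizes` is a hypothetical helper that only illustrates the expectation implied by the examples above, namely that the final chunk is shorter when the row count is not a multiple of `chunksize`.

```python
def expected_chunk_sizes(total_rows, chunksize):
    """Row counts a chunked reader should yield for a file with total_rows rows.

    Hypothetical helper for illustration only; it does not touch pandas.
    """
    if chunksize <= 0:
        raise ValueError("chunksize must be positive")
    full_chunks, remainder = divmod(total_rows, chunksize)
    sizes = [chunksize] * full_chunks
    if remainder:
        # The last chunk holds whatever rows are left over.
        sizes.append(remainder)
    return sizes


print(expected_chunk_sizes(32, 16))  # [16, 16]
print(expected_chunk_sizes(32, 20))  # [20, 12]
```

With 32 rows and `chunksize=20`, the second read should therefore return 12 rows instead of raising.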

Output of `pd.show_versions()`

pd.show_versions()

INSTALLED VERSIONS

commit: 75b606a
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.1.20-1
machine: x86_64
processor: Intel(R)_Core(TM)i5-2520M_CPU@_2.50GHz
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.0+112.g75b606a
nose: 1.3.7
pip: 9.0.1
setuptools: 21.0.0
Cython: 0.24
numpy: 1.11.0


jorisvandenbossche · 2016-11-25T08:37:10Z

cc @kshedden @Winand

kshedden · 2016-11-25T15:14:48Z

@pijucha, thanks for the report. I've PR'd a possible fix.

pijucha · 2016-11-25T15:24:16Z

@kshedden Yes, this should be it. I see you probably also solved #13654. Very nice. Thanks.

boulund · 2017-05-02T06:46:52Z

Is this issue solved? I just got this trying to iterate through a large sas7bdat file (using pandas 0.19.2 via conda)

Traceback (most recent call last):
  File "./extract_subset_of_columns.py", line 35, in <module>
    extract_columns_from_sas(lmed_file, columns=["lpnr", "KON", "atc", "EDATUM"], output_csv=lmed_file+".csv")
  File "./extract_subset_of_columns.py", line 25, in extract_columns_from_sas
    for count, chunk in enumerate(reader, start=1):
  File "/home/ctmr/anaconda3/lib/python3.6/site-packages/pandas/io/sas/sas7bdat.py", line 229, in __next__
    da = self.read(nrows=self.chunksize or 1)
  File "/home/ctmr/anaconda3/lib/python3.6/site-packages/pandas/io/sas/sas7bdat.py", line 614, in read
    rslt = self._chunk_to_dataframe()
  File "/home/ctmr/anaconda3/lib/python3.6/site-packages/pandas/io/sas/sas7bdat.py", line 663, in _chunk_to_dataframe
    rslt[name] = self._string_chunk[js, :]
  File "/home/ctmr/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2419, in __setitem__
    self._set_item(key, value)
  File "/home/ctmr/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2485, in _set_item
    value = self._sanitize_column(key, value)
  File "/home/ctmr/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2656, in _sanitize_column
    value = _sanitize_index(value, self.index, copy=False)
  File "/home/ctmr/anaconda3/lib/python3.6/site-packages/pandas/core/series.py", line 2800, in _sanitize_index
    raise ValueError('Length of values does not match length of ' 'index')
ValueError: Length of values does not match length of index

The file is 27GB, iterating with chunksize=100000. It failed approximately 70% through the file.
column_count is 49, row_count is reported as 89065305.

Is this related to the error referenced in this issue?
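To narrow down where in a file a failure like this happens, it may help to count chunks and rows until the exception is raised. The following is a minimal sketch, not part of the script in the traceback: `consume_chunks` and `fake_reader` are hypothetical names, and the stand-in generator only mimics a reader that fails partway through. With pandas, `chunks` would be the object returned by `pd.read_sas(path, chunksize=...)`, and `len(chunk)` gives the number of rows in each DataFrame.

```python
def consume_chunks(chunks):
    """Iterate over chunks and report how far we got before a ValueError.

    Returns (chunks_read, rows_read, error), where error is None on success.
    """
    chunks_read = 0
    rows_read = 0
    try:
        for chunk in chunks:
            rows_read += len(chunk)
            chunks_read += 1
    except ValueError as exc:
        # Keep the position so it can be included in a bug report.
        return chunks_read, rows_read, exc
    return chunks_read, rows_read, None


def fake_reader():
    """Stand-in for a chunked reader that fails on its third chunk."""
    yield [0] * 3
    yield [0] * 3
    raise ValueError("Length of values does not match length of index")


chunks_read, rows_read, error = consume_chunks(fake_reader())
print(chunks_read, rows_read, error)
```

The row count at the point of failure gives a rough offset to look at when comparing against the reported `row_count`.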

kshedden · 2017-05-02T11:13:38Z

Is the file compressed? Check the "compression" attribute of the iterator.

boulund · 2017-05-02T11:46:10Z

@kshedden I get the following from the compression attribute of the iterator:

In [4]: iter.compression 
Out[4]: b'SASYZCRL'

Which I interpret as some kind of compression. So, yes, I guess?
It's interesting that the error first occurred somewhere past 70% of the file. All information up to that point was extracted without issue.

Edit: I actually got another error for one of my other files. Maybe it's related?

Traceback (most recent call last):
  File "pandas/io/sas/saslib.pyx", line 29, in pandas.io.sas.saslib.rle_decompress (pandas/io/sas/saslib.c:2540)
ValueError: Unexpected non-zero end_of_first_byte
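As a reading aid for the `compression` value shown above: the sketch below maps the signature bytes to a human-readable name. The mapping is an assumption based on my understanding of the constants the pandas SAS reader uses (`b'SASYZCRL'` for run-length encoding, `b'SASYZCR2'` for Ross data compression); `describe_compression` is a hypothetical helper, not a pandas function. The `rle_decompress` frame in the traceback above is at least consistent with the RLE reading.

```python
# Assumed signature-to-name mapping; verify against the pandas source
# for the version in use before relying on it.
KNOWN_COMPRESSION = {
    b"SASYZCRL": "RLE (run-length encoding)",
    b"SASYZCR2": "RDC (Ross data compression)",
}


def describe_compression(signature):
    """Return a readable name for a reader's compression signature."""
    if not signature:
        # None or an empty bytes value is taken to mean no compression.
        return "uncompressed"
    return KNOWN_COMPRESSION.get(signature, "unknown signature: %r" % (signature,))


print(describe_compression(b"SASYZCRL"))
print(describe_compression(None))
```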

kshedden · 2017-05-02T14:46:31Z

The SAS specification is not public and had to be reverse engineered through examples. The compression algorithm was particularly hard to reverse engineer (other people did most of the hard work on this, I only made a small contribution). I have been able to validate that our code successfully reads many compressed SAS files, but I'm pretty sure there are some compression codes that we do not know. We know the common codes, but it's more likely that we have missed a rare one, which is consistent with the fault occurring late in the file. To verify that this is a compression issue, would it be possible for you to generate this file as an uncompressed SAS file and see if the issue arises still?

boulund · 2017-05-02T14:49:47Z

I see. Really appreciate your effort!

Unfortunately I don't think I can generate the file without compression, but I'll look into it (I don't have access to the source data).

kshedden · 2017-05-02T14:52:13Z

If you have SAS, you can convert the file from compressed to uncompressed (of course in that case you could just use SAS to dump it to csv, but it would be helpful to us if this flags a problem that we can fix). I can give you SAS code to do the conversion if needed.

jorisvandenbossche added Bug IO SAS SAS: read_sas labels Nov 25, 2016

kshedden mentioned this issue Nov 25, 2016

SAS chunksize / iteration issues #14743

Merged

3 tasks

jreback added this to the 0.19.2 milestone Nov 25, 2016

jorisvandenbossche closed this as completed in #14743 Nov 28, 2016

jorisvandenbossche pushed a commit that referenced this issue Nov 28, 2016

BUG: SAS chunksize / iteration issues (#14743)

c5f219a

closes #14734 closes #13654

jorisvandenbossche pushed a commit that referenced this issue Dec 15, 2016

[Backport #14743] BUG: SAS chunksize / iteration issues (#14743)

6c688b9

closes #14734 closes #13654 (cherry picked from commit c5f219a)

jicky94 mentioned this issue Jan 28, 2020

pd.read_sas with chunksize option raises IndexError #31385

Closed
