read_sas with chunksize/iterator raises ValueError #14734

Closed
pijucha opened this issue Nov 25, 2016 · 9 comments · Fixed by #14743
Labels
Bug IO SAS SAS: read_sas

@pijucha
Contributor

pijucha commented Nov 25, 2016

read_sas doesn't work well with chunksize or iterator parameters.

Code Sample and Problem Description

The following test data file in the repository has 32 rows.

sasfile = 'pandas/io/tests/sas/data/airline.sas7bdat'
pd.read_sas(sasfile).shape
Out[18]: (32, 6)

When the requested row counts never exceed what is left in the file, reading with chunksize/iterator works:

reader = pd.read_sas(sasfile, chunksize=16)
df = reader.read()
df.shape
Out[31]: (16, 6)
df = reader.read()
df.shape
Out[33]: (16, 6)

or

reader = pd.read_sas(sasfile, iterator=True)
df = reader.read(30)
df.shape
Out[37]: (30, 6)
df = reader.read(2)
df.shape
Out[39]: (2, 6)
df = reader.read(2)
type(df)
Out[41]: NoneType

But if we don't know the number of rows in advance, a read that asks for more rows than remain raises an exception, and the rest of the file can't be read. This is painful with large files.

reader = pd.read_sas(sasfile, chunksize=20)
df = reader.read()
df.shape
Out[45]: (20, 6)
df = reader.read()
Traceback (most recent call last):
  File "/usr/local/lib64/python3.5/site-packages/IPython/core/interactiveshell.py", line 2885, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-46-c5d811b93ac1>", line 1, in <module>
    df = reader.read()
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/io/sas/sas7bdat.py", line 604, in read
    rslt = self._chunk_to_dataframe()
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/io/sas/sas7bdat.py", line 646, in _chunk_to_dataframe
    dtype=self.byte_order + 'd')
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/core/frame.py", line 2419, in __setitem__
    self._set_item(key, value)
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/core/frame.py", line 2485, in _set_item
    value = self._sanitize_column(key, value)
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/core/frame.py", line 2656, in _sanitize_column
    value = _sanitize_index(value, self.index, copy=False)
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/core/series.py", line 2793, in _sanitize_index
    raise ValueError('Length of values does not match length of ' 'index')
ValueError: Length of values does not match length of index

or

reader = pd.read_sas(sasfile, iterator=True)
reader.read(30).shape
Out[51]: (30, 6)
reader.read(30).shape
Traceback (most recent call last):
  File "/usr/local/lib64/python3.5/site-packages/IPython/core/interactiveshell.py", line 2885, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-52-5d757f713808>", line 1, in <module>
    reader.read(30).shape
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/io/sas/sas7bdat.py", line 604, in read
    rslt = self._chunk_to_dataframe()
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/io/sas/sas7bdat.py", line 646, in _chunk_to_dataframe
    dtype=self.byte_order + 'd')
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/core/frame.py", line 2419, in __setitem__
    self._set_item(key, value)
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/core/frame.py", line 2485, in _set_item
    value = self._sanitize_column(key, value)
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/core/frame.py", line 2656, in _sanitize_column
    value = _sanitize_index(value, self.index, copy=False)
  File "/home/users/piotr/workspace/pandas-pijucha/pandas/core/series.py", line 2793, in _sanitize_index
    raise ValueError('Length of values does not match length of ' 'index')
ValueError: Length of values does not match length of index
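Both failures have the same shape: the file has 32 rows, and the second read asks for more rows than remain (20 + 20 > 32 and 30 + 30 > 32). A minimal sketch of the row-count clamping that would avoid this follows. It is plain Python with a hypothetical `FakeReader` stand-in, not pandas' actual `sas7bdat.py` code, and it is not necessarily how the linked fix is implemented.

```python
class FakeReader:
    """Hypothetical stand-in for a chunked reader; not pandas code."""

    def __init__(self, row_count, chunksize=None):
        self.row_count = row_count
        self.chunksize = chunksize
        self._rows_read = 0

    def read(self, nrows=None):
        # Fall back to the configured chunk size, then to "everything".
        if nrows is None:
            nrows = self.chunksize if self.chunksize is not None else self.row_count
        remaining = self.row_count - self._rows_read
        if remaining <= 0:
            return None  # mirrors the NoneType result shown earlier
        # Clamp the request so a short final chunk is returned, not an error.
        nrows = min(nrows, remaining)
        start = self._rows_read
        self._rows_read += nrows
        return list(range(start, start + nrows))


def chunk_lengths(reader):
    """Read until the reader signals exhaustion; return each chunk's length."""
    lengths = []
    while True:
        chunk = reader.read()
        if chunk is None:
            break
        lengths.append(len(chunk))
    return lengths


lengths_20 = chunk_lengths(FakeReader(row_count=32, chunksize=20))
print(lengths_20)  # [20, 12]
```

With clamping in place, a caller can loop until `read()` returns `None` without knowing the row count up front.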

Output of pd.show_versions()

pd.show_versions()

INSTALLED VERSIONS

commit: 75b606a
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.1.20-1
machine: x86_64
processor: Intel(R)_Core(TM)i5-2520M_CPU@_2.50GHz
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.0+112.g75b606a
nose: 1.3.7
pip: 9.0.1
setuptools: 21.0.0
Cython: 0.24
numpy: 1.11.0

@jorisvandenbossche
Member

cc @kshedden @Winand

@kshedden
Contributor

@pijucha, thanks for the report. I've PR'd a possible fix.

@pijucha
Contributor Author

pijucha commented Nov 25, 2016

@kshedden Yes, this should be it. I see you probably also solved #13654. Very nice. Thanks.

@jreback jreback added this to the 0.19.2 milestone Nov 25, 2016
jorisvandenbossche pushed a commit that referenced this issue Dec 15, 2016
@boulund

boulund commented May 2, 2017

Is this issue solved? I just got this error while iterating through a large sas7bdat file (using pandas 0.19.2 via conda):

Traceback (most recent call last):
  File "./extract_subset_of_columns.py", line 35, in <module>
    extract_columns_from_sas(lmed_file, columns=["lpnr", "KON", "atc", "EDATUM"], output_csv=lmed_file+".csv")
  File "./extract_subset_of_columns.py", line 25, in extract_columns_from_sas
    for count, chunk in enumerate(reader, start=1):
  File "/home/ctmr/anaconda3/lib/python3.6/site-packages/pandas/io/sas/sas7bdat.py", line 229, in __next__
    da = self.read(nrows=self.chunksize or 1)
  File "/home/ctmr/anaconda3/lib/python3.6/site-packages/pandas/io/sas/sas7bdat.py", line 614, in read
    rslt = self._chunk_to_dataframe()
  File "/home/ctmr/anaconda3/lib/python3.6/site-packages/pandas/io/sas/sas7bdat.py", line 663, in _chunk_to_dataframe
    rslt[name] = self._string_chunk[js, :]
  File "/home/ctmr/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2419, in __setitem__
    self._set_item(key, value)
  File "/home/ctmr/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2485, in _set_item
    value = self._sanitize_column(key, value)
  File "/home/ctmr/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2656, in _sanitize_column
    value = _sanitize_index(value, self.index, copy=False)
  File "/home/ctmr/anaconda3/lib/python3.6/site-packages/pandas/core/series.py", line 2800, in _sanitize_index
    raise ValueError('Length of values does not match length of ' 'index')
ValueError: Length of values does not match length of index

The file is 27 GB, and I'm iterating with chunksize=100000. It failed approximately 70% of the way through the file.
column_count is 49, and row_count is reported as 89065305.

Is this related to the error referenced in this issue?
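For reference, the figures quoted above can be turned into chunk counts with a few lines of arithmetic. This is only a rough sketch: the 70% figure is approximate, and the variable names are illustrative rather than anything from pandas.

```python
row_count = 89065305  # row_count reported by the reader
chunksize = 100000    # chunk size used for iteration

# Number of full chunks and the size of the short final chunk.
full_chunks, last_chunk_rows = divmod(row_count, chunksize)
total_chunks = full_chunks + (1 if last_chunk_rows else 0)

# Rough index of the chunk at the ~70% mark (integer arithmetic).
approx_failing_chunk = total_chunks * 70 // 100

print(total_chunks, last_chunk_rows, approx_failing_chunk)  # 891 65305 623
```

If the failure really happened near chunk 623 of 891, it was well before the short final chunk of 65305 rows, which suggests (but does not prove) a cause other than the end-of-file row count described in the original report.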

@kshedden
Contributor

kshedden commented May 2, 2017 via email

@boulund

boulund commented May 2, 2017

@kshedden I get the following from the compression attribute of the iterator:

In [4]: iter.compression
Out[4]: b'SASYZCRL'

Which I interpret as some kind of compression. So, yes, I guess?
It's interesting that the error first occurred somewhere past 70% of the way into the file. Everything up to that point was extracted without issue.
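A small sketch of how the `compression` value could be interpreted is below. The two signatures in the table are, as far as I know, the ones pandas' SAS reader recognises (`SASYZCRL` for RLE and `SASYZCR2` for RDC), but treat that mapping as an assumption; `describe_compression` is a hypothetical helper, not a pandas function.

```python
# Signature-to-name table; assumed to match pandas' SAS constants.
COMPRESSION_NAMES = {
    b"SASYZCRL": "RLE (run-length encoding)",
    b"SASYZCR2": "RDC (Ross data compression)",
}


def describe_compression(signature):
    """Return a readable name for a sas7bdat compression signature."""
    # An empty or blank signature is treated here as "no compression".
    if not signature or not signature.strip(b"\x00 "):
        return "none"
    return COMPRESSION_NAMES.get(signature, "unknown: %r" % (signature,))


print(describe_compression(b"SASYZCRL"))  # RLE (run-length encoding)
```

Under that assumption, the value shown above would mean the file is RLE-compressed.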

Edit: I actually got another error for one of my other files. Maybe it's related?

Traceback (most recent call last):
  File "pandas/io/sas/saslib.pyx", line 29, in pandas.io.sas.saslib.rle_decompress (pandas/io/sas/saslib.c:2540)
ValueError: Unexpected non-zero end_of_first_byte

@kshedden
Contributor

kshedden commented May 2, 2017 via email

@boulund

boulund commented May 2, 2017

I see. Really appreciate your effort!

Unfortunately I don't think I can generate the file without compression, but I'll look into it (I don't have access to the source data).

@kshedden
Contributor

kshedden commented May 2, 2017 via email
