read_sas fails when passed a file object from GCSFS #33069

Closed
tswast opened this issue Mar 27, 2020 · 1 comment · Fixed by #33070
Labels
Bug IO SAS SAS: read_sas

@tswast
Contributor

tswast commented Mar 27, 2020

Code Sample, a copy-pastable example if possible

From https://stackoverflow.com/q/60848250/101923

export BUCKET_NAME=swast-scratch-us
curl -L https://wwwn.cdc.gov/Nchs/Nhanes/2017-2018/DEMO_J.XPT | gsutil cp - gs://${BUCKET_NAME}/sas_sample/Nchs/Nhanes/2017-2018/DEMO_J.XPT

import pandas as pd
import gcsfs


bucket_name = "swast-scratch-us"
project_id = "swast-scratch"

fs = gcsfs.GCSFileSystem(project=project_id)
with fs.open(
    "{}/sas_sample/Nchs/Nhanes/2017-2018/DEMO_J.XPT".format(bucket_name),
    "rb"
) as f:
    df = pd.read_sas(f, format="xport")
    print(df)

Problem description

This throws the following exception:

Traceback (most recent call last):
  File "after.py", line 15, in <module>
    df = pd.read_sas(f, format="xport")
  File "/Users/swast/miniconda3/envs/scratch/lib/python3.7/site-packages/pandas/io/sas/sasreader.py", line 70, in read_sas
    filepath_or_buffer, index=index, encoding=encoding, chunksize=chunksize
  File "/Users/swast/miniconda3/envs/scratch/lib/python3.7/site-packages/pandas/io/sas/sas_xport.py", line 280, in __init__
    contents = contents.encode(self._encoding)
AttributeError: 'bytes' object has no attribute 'encode'
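
The traceback points at the `contents.encode(self._encoding)` call in `sas_xport.py`. A file object opened in `"rb"` mode returns `bytes` from `read()`, and `bytes` has a `decode` method but no `encode`. Below is a minimal sketch of that failure, using `io.BytesIO` as a stand-in for the GCSFS file object. The `ensure_bytes` helper and the `ISO-8859-1` default are hypothetical, shown only to illustrate one way such a call could be guarded; they are not taken from pandas or from the linked fix.

```python
import io

# Stand-in for the GCSFS file object: any binary-mode file returns bytes.
remote_like = io.BytesIO(b"HEADER RECORD*******LIBRARY HEADER RECORD!!!!!!!")
contents = remote_like.read()

try:
    contents.encode("ISO-8859-1")  # same kind of call as in the traceback
except AttributeError as exc:
    error_message = str(exc)

print(error_message)


def ensure_bytes(data, encoding="ISO-8859-1"):
    """Hypothetical guard: encode only text, pass bytes through unchanged."""
    if isinstance(data, str):
        return data.encode(encoding)
    return data


# With the guard, bytes from a binary file object can be wrapped directly.
buffer = io.BytesIO(ensure_bytes(contents))
```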

Expected Output

          SEQN  SDDSRVYR  RIDSTATR  RIAGENDR  ...  SDMVSTRA  INDHHIN2  INDFMIN2  INDFMPIR
0      93703.0      10.0       2.0       2.0  ...     145.0      15.0      15.0      5.00
1      93704.0      10.0       2.0       1.0  ...     143.0      15.0      15.0      5.00
2      93705.0      10.0       2.0       2.0  ...     145.0       3.0       3.0      0.82
3      93706.0      10.0       2.0       1.0  ...     134.0       NaN       NaN       NaN
4      93707.0      10.0       2.0       1.0  ...     138.0      10.0      10.0      1.88
...        ...       ...       ...       ...  ...       ...       ...       ...       ...
9249  102952.0      10.0       2.0       2.0  ...     138.0       4.0       4.0      0.95
9250  102953.0      10.0       2.0       1.0  ...     137.0      12.0      12.0       NaN
9251  102954.0      10.0       2.0       2.0  ...     144.0      10.0      10.0      1.18
9252  102955.0      10.0       2.0       2.0  ...     136.0       9.0       9.0      2.24
9253  102956.0      10.0       2.0       1.0  ...     142.0       7.0       7.0      1.56

[9254 rows x 46 columns]

Note: the expected output is printed when a local file is read.
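
Since reading from a local file works, one possible workaround on affected versions is to copy the remote file object to a temporary local file and pass that path to `read_sas`. The sketch below is an assumption-laden illustration, not a recommended fix: `spool_to_local_file` is a hypothetical helper that is not part of pandas or gcsfs, and `io.BytesIO` stands in for the object returned by `fs.open(..., "rb")`.

```python
import io
import os
import shutil
import tempfile


def spool_to_local_file(fileobj, suffix=".xpt"):
    """Copy a binary file-like object to a named temporary file.

    Hypothetical helper. Returns the path of the temporary file; the caller
    is responsible for deleting it when done.
    """
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        shutil.copyfileobj(fileobj, tmp)
        return tmp.name


# Demo with an in-memory stand-in for the remote file object.
local_path = spool_to_local_file(io.BytesIO(b"example bytes"))
with open(local_path, "rb") as f:
    copied = f.read()
os.remove(local_path)

# With the real objects from the code sample, the usage would look like:
#     with fs.open(remote_path, "rb") as f:
#         local_path = spool_to_local_file(f)
#     df = pd.read_sas(local_path, format="xport")
#     os.remove(local_path)
```

This trades extra disk I/O for compatibility, so it only makes sense for files that fit comfortably on local storage.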

Output of pd.show_versions()

Python 3.7.3 | packaged by conda-forge | (default, Jul 1 2019, 14:38:56)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.

import pandas as pd
pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Darwin
OS-release : 19.4.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.1
numpy : 1.18.1
pytz : 2019.2
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.0.0.post20200311
Cython : None
pytest : 5.0.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.0
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.7.0
pandas_datareader: None
bs4 : 4.8.0
bottleneck : None
fastparquet : None
gcsfs : 0.6.0
lxml.etree : 4.5.0
matplotlib : 3.1.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : 0.11.0
pyarrow : 0.15.1
pytables : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
xarray : 0.12.3
xlrd : None
xlwt : None
xlsxwriter : None

@tswast
Contributor Author

tswast commented Mar 27, 2020

Note: the same error occurs when a gs:// URL is passed directly to read_sas instead of a file object.

Code sample

df = pd.read_sas(
    "gs://{}/sas_sample/Nchs/Nhanes/2017-2018/DEMO_J.XPT".format(bucket_name)
)
print(df)

Problem description

Traceback (most recent call last):
  File "after.py", line 19, in <module>
    "gs://{}/sas_sample/Nchs/Nhanes/2017-2018/DEMO_J.XPT".format(bucket_name)
  File "/Users/swast/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/pandas/io/sas/sasreader.py", line 70, in read_sas
    filepath_or_buffer, index=index, encoding=encoding, chunksize=chunksize
  File "/Users/swast/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/pandas/io/sas/sas_xport.py", line 280, in __init__
    contents = contents.encode(self._encoding)
AttributeError: 'bytes' object has no attribute 'encode'

@jreback jreback added the IO SAS SAS: read_sas label Mar 29, 2020
@jreback jreback added this to the 1.1 milestone Mar 29, 2020
@mroeschke mroeschke added the Bug label Apr 3, 2020