
Ability to chose return types of awswrangler.athena.read_sql_query #425


Closed
finete opened this issue Oct 13, 2020 · 6 comments
Labels
feature · wont implement (Will not be implemented)

Comments


finete commented Oct 13, 2020

Is your feature request related to a problem? Please describe.
I am using read_sql_query to get data from Athena; however, I don't use pandas downstream.
Because of this I have to convert the pd.DataFrame to a Python list and convert all the relevant pandas/NumPy types (pd.Timestamp, np.double, etc.).

Describe the solution you'd like
Add a return_type: str kwarg that accepts one of a list of possible values.
Perhaps one of the options could return an Iterable[Dict[str, Any]], where each dict corresponds to a row?
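A minimal sketch of what the requested kwarg could look like (assuming pandas; `convert` and the `return_type` values are illustrative, not part of awswrangler's API — `df` stands in for the DataFrame that `wr.athena.read_sql_query()` returns):

```python
from typing import Any, Dict, List, Union
import pandas as pd

# Hypothetical helper: dispatch on a `return_type` string, as proposed
# in this issue. "records" yields one dict per row.
def convert(df: pd.DataFrame,
            return_type: str = "dataframe") -> Union[pd.DataFrame, List[Dict[str, Any]]]:
    if return_type == "dataframe":
        return df
    if return_type == "records":
        # List[Dict[str, Any]]: each dict maps column name -> cell value
        return df.to_dict(orient="records")
    raise ValueError(f"unsupported return_type: {return_type!r}")
```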

finete added the feature label Oct 13, 2020
@maxispeicher
Contributor

@MosheVai Do you think it makes sense to include this functionality in the library? As far as I understand, pandas already covers this use case: even if you don't use pandas downstream, you can just call df.to_dict(orient='records') to get a list of dicts.

The only benefit I could see is that, instead of reading the complete data at once, it could stream the rows one by one. WDYT?
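The streaming idea above can be sketched as a generator over chunked DataFrames (assuming pandas; `iter_records` is illustrative — awswrangler's `read_sql_query(chunksize=...)` returns an iterator of DataFrames that could serve as the chunk source):

```python
from typing import Any, Dict, Iterator
import pandas as pd

# Yield row dicts lazily from an iterator of DataFrame chunks instead
# of materializing one big list of records in memory.
def iter_records(chunks: Iterator[pd.DataFrame]) -> Iterator[Dict[str, Any]]:
    for chunk in chunks:
        yield from chunk.to_dict(orient="records")
```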


finete commented Dec 31, 2020

It does seem like a logical solution to this issue.
However, converting a pandas DataFrame to native Python data structures is not that trivial.

Another reason to include this functionality is large datasets.
For example, fetching 1 million rows incurs a noticeable time cost from unnecessary construction:
both building the DataFrame and then converting it to native Python structures.

@maxispeicher

@maxispeicher
Contributor

Do you have any response type other than Iterable[Dict[str, Any]] in mind? Otherwise I would opt to include a new function like iter_sql_query() instead of adding a kwarg to read_sql_query(). However, if you look at each row independently, you could end up with different data types for the same column in different rows, which could also be suboptimal. Do you think this would be a problem @MosheVai?

@igorborgest
Contributor

We should not focus on fetching or delivering big result sets without pandas/PyArrow. AWS Data Wrangler clearly intends to rely on these projects to handle large amounts of data with reasonable performance.

A second common scenario where we could focus is tiny results, where we could just reach the Athena API through get_query_results(), skipping the result files on S3 and the whole pandas/pyarrow layer.
It would be far slower for big results, but it would be faster for tiny results and would deliver ordinary Python types directly from boto3. (All this in a new function.)

@MosheVai @maxispeicher what do you think?

@MosheVai does this second scenario help you?
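The "tiny results" path could look roughly like the following (a sketch only: `rows_from_result_set` is a hypothetical name, and the sample payload in the test is hand-written; a real one comes from `boto3.client("athena").get_query_results(QueryExecutionId=...)`):

```python
from typing import Any, Dict, List

# Turn a boto3 Athena get_query_results() payload into row dicts,
# skipping the S3 result files and the pandas/pyarrow layer entirely.
# Note: every value comes back as a string (or None for NULL).
def rows_from_result_set(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    result_set = response["ResultSet"]
    columns = [c["Name"] for c in result_set["ResultSetMetadata"]["ColumnInfo"]]
    rows = []
    # The first Row of a SELECT result repeats the column headers, so skip it.
    for row in result_set["Rows"][1:]:
        values = [cell.get("VarCharValue") for cell in row["Data"]]
        rows.append(dict(zip(columns, values)))
    return rows
```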


maxispeicher commented Jan 2, 2021

I think get_query_results() only returns strings, so AWS Data Wrangler would need to do a type conversion if the function should return native Python types. I don't know if this is the way to go.

Edit: But if this is included, I agree that it should probably go into a separate function.

@igorborgest
Contributor

You are right, we would need to cast everything based on the column types provided under ["ResultSetMetadata"]["ColumnInfo"]["Type"], and that is indeed not the best way to go...

I feel like this should be resolved inside the pandas project.
df.to_dict(orient='records') seems to be exactly what we need here, and if it has issues, I think it would be better to fix or improve it there than to create a workaround here.
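For reference, the casting layer being decided against here would amount to something like the following (a sketch only: the mapping covers a few common Athena/Presto type names and is illustrative, not exhaustive):

```python
from datetime import date, datetime
from decimal import Decimal
from typing import Any, Callable, Dict, Optional

# Map Athena column types (ResultSetMetadata -> ColumnInfo -> Type)
# to Python constructors. Unknown types fall back to str.
_CASTS: Dict[str, Callable[[str], Any]] = {
    "tinyint": int,
    "smallint": int,
    "integer": int,
    "bigint": int,
    "float": float,
    "double": float,
    "decimal": Decimal,
    "boolean": lambda s: s.lower() == "true",
    "date": date.fromisoformat,
    # Athena timestamps look like "2020-10-13 12:34:56.789"
    "timestamp": lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S.%f"),
    "varchar": str,
}

def cast_value(raw: Optional[str], athena_type: str) -> Any:
    """Cast one get_query_results() string into a native Python value."""
    if raw is None:
        return None  # NULL cells have no VarCharValue
    return _CASTS.get(athena_type, str)(raw)
```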

igorborgest added the wont implement (Will not be implemented) label Jan 10, 2021

3 participants