
Ability to chose return types of awswrangler.athena.read_sql_query #425


Closed
finete opened this issue Oct 13, 2020 · 6 comments
Labels
feature · wont implement (Will not be implemented)

Comments


finete commented Oct 13, 2020

Is your feature request related to a problem? Please describe.
I am using read_sql_query to get data from Athena; however, I don't use pandas downstream.
Because of this I have to convert the pd.DataFrame to a Python list and convert all the relevant pandas/NumPy types (pd.Timestamp, np.double, etc.).

Describe the solution you'd like
Add a return_type: str kwarg that accepts one of a list of possible values.
Perhaps one of the options could return an Iterable[Dict[str, Any]], where each dict corresponds to a row?
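A minimal sketch of what the requested kwarg could look like (assuming pandas; `convert` and the `return_type` values are illustrative, not part of awswrangler's API — `df` stands in for the DataFrame that `wr.athena.read_sql_query()` returns):

```python
from typing import Any, Dict, List, Union
import pandas as pd

# Hypothetical helper: dispatch on a `return_type` string, as proposed
# in this issue. "records" yields one dict per row.
def convert(df: pd.DataFrame,
            return_type: str = "dataframe") -> Union[pd.DataFrame, List[Dict[str, Any]]]:
    if return_type == "dataframe":
        return df
    if return_type == "records":
        # List[Dict[str, Any]]: each dict maps column name -> cell value
        return df.to_dict(orient="records")
    raise ValueError(f"unsupported return_type: {return_type!r}")
```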

finete added the feature label Oct 13, 2020
@maxispeicher
Contributor

@MosheVai Do you think it makes sense to include this functionality in the library? As far as I understand, pandas already covers this use case: even if you don't use pandas downstream, you can just call df.to_dict(orient='records') to get a list of dicts.

The only benefit I could see is that, instead of reading the complete data at once, it could stream the rows one by one. WDYT?
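The streaming idea above can be sketched as a generator over chunked DataFrames (assuming pandas; `iter_records` is illustrative — awswrangler's `read_sql_query(chunksize=...)` returns an iterator of DataFrames that could serve as the chunk source):

```python
from typing import Any, Dict, Iterator
import pandas as pd

# Yield row dicts lazily from an iterator of DataFrame chunks instead
# of materializing one big list of records in memory.
def iter_records(chunks: Iterator[pd.DataFrame]) -> Iterator[Dict[str, Any]]:
    for chunk in chunks:
        yield from chunk.to_dict(orient="records")
```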


finete commented Dec 31, 2020

It does seem like a logical solution to this issue.
However, converting a pandas DataFrame to native Python data structures is not that trivial.

Another reason to include this functionality is large datasets.
For example, fetching 1 million rows incurs a noticeable time cost from unnecessary construction:
both building the DataFrame and then converting it to native Python structures.

@maxispeicher

@maxispeicher
Contributor

Do you have any response type other than Iterable[Dict[str, Any]] in mind? Otherwise I would opt to include a new function like iter_sql_query() instead of adding a kwarg to read_sql_query(). However, if you look at each row independently, you could end up with different data types for the same column in different rows, which could also be suboptimal. Do you think this would be a problem @MosheVai?

@igorborgest
Contributor

We should not focus on fetching or delivering big result sets without pandas/PyArrow. AWS Data Wrangler clearly intends to rely on these projects to handle large amounts of data with reasonable performance.

A second common scenario where we could focus is tiny results, where we could just reach the Athena API through get_query_results(), skipping the result files on S3 and the whole pandas/pyarrow layer.
It would be far slower for big results, but it would be faster for tiny results and would deliver ordinary Python types directly from boto3. (All this in a new function.)

@MosheVai @maxispeicher what do you think?

@MosheVai does this second scenario help you?
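The "tiny results" path could look roughly like the following (a sketch only: `rows_from_result_set` is a hypothetical name, and the sample payload in the test is hand-written; a real one comes from `boto3.client("athena").get_query_results(QueryExecutionId=...)`):

```python
from typing import Any, Dict, List

# Turn a boto3 Athena get_query_results() payload into row dicts,
# skipping the S3 result files and the pandas/pyarrow layer entirely.
# Note: every value comes back as a string (or None for NULL).
def rows_from_result_set(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    result_set = response["ResultSet"]
    columns = [c["Name"] for c in result_set["ResultSetMetadata"]["ColumnInfo"]]
    rows = []
    # The first Row of a SELECT result repeats the column headers, so skip it.
    for row in result_set["Rows"][1:]:
        values = [cell.get("VarCharValue") for cell in row["Data"]]
        rows.append(dict(zip(columns, values)))
    return rows
```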


maxispeicher commented Jan 2, 2021

I think get_query_results() only returns strings, so AWS Data Wrangler would need to do a type conversion if the function should return native Python types. I don't know if this is the way to go.

Edit: But if this is included, I agree that it should probably go into a separate function.

@igorborgest
Contributor

You are right, we would need to cast everything based on the column types provided under ["ResultSetMetadata"]["ColumnInfo"]["Type"], and that is indeed not the best way to go...

I feel like this should be resolved inside the pandas project.
df.to_dict(orient='records') seems to be exactly what we need here, and if it has issues, I think it would be better to fix or improve it there than to create a workaround here.
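For reference, the casting layer being decided against here would amount to something like the following (a sketch only: the mapping covers a few common Athena/Presto type names and is illustrative, not exhaustive):

```python
from datetime import date, datetime
from decimal import Decimal
from typing import Any, Callable, Dict, Optional

# Map Athena column types (ResultSetMetadata -> ColumnInfo -> Type)
# to Python constructors. Unknown types fall back to str.
_CASTS: Dict[str, Callable[[str], Any]] = {
    "tinyint": int,
    "smallint": int,
    "integer": int,
    "bigint": int,
    "float": float,
    "double": float,
    "decimal": Decimal,
    "boolean": lambda s: s.lower() == "true",
    "date": date.fromisoformat,
    # Athena timestamps look like "2020-10-13 12:34:56.789"
    "timestamp": lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S.%f"),
    "varchar": str,
}

def cast_value(raw: Optional[str], athena_type: str) -> Any:
    """Cast one get_query_results() string into a native Python value."""
    if raw is None:
        return None  # NULL cells have no VarCharValue
    return _CASTS.get(athena_type, str)(raw)
```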

igorborgest added the wont implement (Will not be implemented) label Jan 10, 2021

3 participants