Q: Faster way to Filter DataView #6164

torronen opened this issue Apr 20, 2022 · 6 comments
Labels: enhancement (New feature or request), Microsoft.Data.Analysis (All DataFrame related issues and PRs)


@torronen
Contributor

torronen commented Apr 20, 2022

I filter data from a dataview to get all items within a specific time period.
It seems slow compared to filtering with LINQ from objects in memory. Is there a faster way to do it?

var boolFilter = df["timestamp"].ElementwiseGreaterThanOrEqual(unixStartTime);
var hourlydata = df.Filter(boolFilter);
var boolFilter2 = hourlydata["timestamp"].ElementwiseLessThan(unixEndTime);
hourlydata = hourlydata.Filter(boolFilter2);

In this example, I am creating predictions for a certain time period at a time.
In another example, I may need to filter by exact match. Normally, I might create a dictionary to help, but is there a way to support some type of "indices" for DataViews?
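For comparison, the two-step filter above builds one boolean mask per bound and an intermediate frame in between. A minimal sketch of the same half-open range test done in a single pass over plain in-memory values is shown here. `SelectRowsInRange` and the sample timestamps are hypothetical, not part of Microsoft.Data.Analysis, and whether this is actually faster than `Filter` would need to be measured.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not a Microsoft.Data.Analysis API): returns the row
// positions whose timestamp lies in the half-open range [start, end).
static List<long> SelectRowsInRange(IReadOnlyList<long> timestamps, long start, long end)
{
    var rows = new List<long>();
    for (int i = 0; i < timestamps.Count; i++)
    {
        long t = timestamps[i];
        // Both bounds are checked in the same pass, so no intermediate
        // collection is created between the two comparisons.
        if (t >= start && t < end)
        {
            rows.Add(i);
        }
    }
    return rows;
}

// Made-up sample data: Unix timestamps one hour apart.
long[] timestamps = { 1650412800, 1650416400, 1650420000, 1650423600 };
List<long> selected = SelectRowsInRange(timestamps, 1650416400, 1650423600);
Console.WriteLine(string.Join(", ", selected)); // prints "1, 2"
```

The returned positions could then be used to pick rows from whatever container holds the data. How best to do that with a DataFrame is left open here.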

@ghost added the untriaged label Apr 20, 2022
@luisquintanilla
Contributor

Hi @torronen

Is this for the DataView or DataFrame? Looks like DataFrame, but just wanted to confirm before tagging it.

@torronen
Contributor Author

@luisquintanilla yes, you are correct, it is DataFrame.

@luisquintanilla added the enhancement and Microsoft.Data.Analysis labels and removed the untriaged label Apr 25, 2022
@luisquintanilla
Contributor

Thanks for that clarification.

@michaelgsharp
Contributor

What do you mean by "support some type indices"? Also, do you have any numbers for speed between this and LINQ? It would be good to see how far behind we really are.

@michaelgsharp added this to the ML.NET Future milestone Apr 27, 2022
@torronen
Contributor Author

torronen commented Apr 27, 2022

@michaelgsharp I am thinking about something like a dictionary or hashset to select items quickly. For example, I might want to get metrics for observations from each city separately: one test set for Helsinki, a second for Seattle, and so on.
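To make the idea concrete, here is a minimal sketch of such an index built outside the DataFrame: a dictionary from each key value to the row positions where it occurs. `BuildIndex` and the city values are hypothetical and only illustrate the lookup structure. This is not an existing Microsoft.Data.Analysis feature.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: maps each distinct key to the list of row positions
// at which that key appears, in ascending order.
static Dictionary<string, List<long>> BuildIndex(IReadOnlyList<string> keys)
{
    var index = new Dictionary<string, List<long>>();
    for (int i = 0; i < keys.Count; i++)
    {
        // Create the position list the first time a key is seen.
        if (!index.TryGetValue(keys[i], out var rows))
        {
            rows = new List<long>();
            index[keys[i]] = rows;
        }
        rows.Add(i);
    }
    return index;
}

// Made-up sample column of city names.
string[] cities = { "Helsinki", "Seattle", "Helsinki", "Seattle", "Helsinki" };
Dictionary<string, List<long>> cityIndex = BuildIndex(cities);
Console.WriteLine(string.Join(", ", cityIndex["Helsinki"])); // prints "0, 2, 4"
```

Building the index costs one pass over the column. After that, each exact-match lookup returns its rows without rescanning, which is the trade-off I have in mind for repeated per-city test sets.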

Getting numbers is a good way to validate this. So far this issue is mostly based on my perception of slowness, and I do not yet have an exact comparison. I will run some measurements, but I might not be able to get them very quickly.

@asmirnov82
Contributor

Filtering performance should improve somewhat with #6869.
