Fixed Issue #115 #126
Conversation
Codecov Report
@@           Coverage Diff            @@
##           master     #126    +/-  ##
=======================================
  Coverage     100%     100%
=======================================
  Files          12       12
  Lines         804      853    +49
=======================================
+ Hits          804      853    +49
Continue to review full report at Codecov.
@mexxexx Thank you for this PR. It looks like all of the tests have passed, so I will now go over everything very carefully with a fine-toothed comb.

For posterity, this PR is a cleaned-up version of #122
@mexxexx while it is still fresh in your mind, would you mind compiling a list of things that still need to be done after this PR? A few items come to mind, like adding NaN/inf support for …

@seanlaw Of course, I will keep it in this message and update it if something else comes to mind.

I noticed that the new implementation of calculating the mean and standard deviation might lead to memory errors for very long time series and window lengths. The reason is that, apparently, NumPy's np.std() function makes a copy of the array that is being passed. So while the rolling_window function only creates a view, this gets converted to a huge array inside np.std. I will test whether this indeed leads to a problem, and if it does, we will probably have to change the way the std is computed again.
@mexxexx Sounds good! I hope that it isn't a problem. In case it matters, currently, our largest supported time series length is 100,000,000 (100 million) data points.
Everything looks pretty good. Just some minor things to check off the list before we merge. Thank you for taking the time to work on all of this! I really appreciate it.
core.check_dtype(T_A)
core.check_nan(T_A)
I might just create a check_array(dtype=np.floating, nan=True, inf=True) function in the future and then throw check_dtype, check_nan, and check_inf into the body of the function. This should help clean up the code.
Nothing for you to do here. Just commenting for future reference
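A minimal sketch of what such a consolidated check_array might look like (a hypothetical illustration of the idea above, not the actual stumpy implementation):

```python
import numpy as np

def check_array(a, dtype=np.floating, nan=True, inf=True):
    # Hypothetical consolidated validator sketched from the comment above.
    # Each branch mirrors one of the existing helpers (check_dtype, etc.).
    if not np.issubdtype(a.dtype, dtype):
        raise TypeError(f"Expected dtype {dtype}, found {a.dtype}")
    if nan and np.any(np.isnan(a)):
        raise ValueError("Input array contains NaN values")
    if inf and np.any(np.isinf(a)):
        raise ValueError("Input array contains infinite values")
```
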
The following code produces a MemoryError on my machine, because numpy.std internally creates a copy of the rolling window.
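The snippet itself did not survive in this thread; a minimal reconstruction of the pattern being described (assuming a stride-trick rolling_window similar to stumpy's) might look like:

```python
import numpy as np

def rolling_window(a, w):
    # A zero-copy (n - w + 1, w) view of all length-w windows of `a`.
    shape = a.shape[:-1] + (a.shape[-1] - w + 1, w)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

n, m = 10_000, 100
T = np.random.rand(n)

windows = rolling_window(T, m)   # a view: no extra memory allocated yet
assert np.shares_memory(windows, T)

# np.std(windows, axis=1) materializes temporaries proportional to
# windows.size == (n - m + 1) * m; for n = 10_000_000 and m = 100 that is
# on the order of 8 GB of float64 scratch space, which triggers the error.
Σ_T = np.std(windows, axis=1)
```
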
The easiest fix I see is using the pandas.rolling functionality. This way we get a numerically stable, fast, and optimized rolling standard deviation. The only problem is that pandas is not an installation requirement, and I understand if it should not become one.
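As a sketch of the pandas-based alternative being proposed (the exact parameter choices here are my assumptions):

```python
import numpy as np
import pandas as pd

T = np.random.rand(1_000)
m = 50

# pandas computes a numerically stable rolling stddev in C without
# materializing an (n - m + 1, m) window matrix. ddof=0 matches the
# population formula that np.std uses by default. The first m - 1
# entries are NaN (incomplete windows), so we slice them off.
Σ_T = pd.Series(T).rolling(m).std(ddof=0).to_numpy()[m - 1:]
```
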
There is now a further copy of T, and we will have to see if that is necessary. But since the calculation for large time series happens only once in the algorithms, this might be OK.
@mexxexx Thank you for digging into this! So, I tried this piece of code (10M) on my laptop (a 2017 MacBook Pro with 16GB of RAM) and it ran fine within seconds. The total memory used at the end was around 315MB (note that I store the mean and std in …).

May I ask what your memory (RAM) resources are? I'm surprised that you were able to hit a memory error with 10M data points, but maybe I'm only seeing the final memory that was consumed and not the intermediate/temporary memory. I feel comfortable just leaving it as you've implemented it (with NumPy only) and then dealing with it later if it really becomes a problem. For 10M data points, I think requiring 8-16GB of RAM is reasonable in terms of minimum requirements.
I have 8GB of RAM installed on my machine, and I got the message that (with the code I posted) a memory allocation of 5.9 GB failed. I agree that the MP computation should require more RAM anyway, but I still think that resources should not be wasted. Since the std calculations in all algorithms happen on one machine, there might still be trouble when trying to calculate the MP on, let's say, a cluster.
I agree with you, but I think we can approach this pragmatically. Fundamentally, the memory error for the stddev computation is separate from what this PR solves. I think we should go with the NumPy version since it is numerically more stable (at the cost of some memory inefficiency). Improvements to the memory usage for computing the stddev/mean should be submitted as a new issue and addressed in a separate PR. Let's add it to your list above and file an issue.
@mexxexx Hopefully, I understand the … Here is some example code:

In both cases, I get the same array of stddevs back. So, in your case, we'd do some sort of …
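The example code did not survive in this thread; consistent with the statement that both cases produce the same array of stddevs, a plausible sketch (my reconstruction, not the original snippet) compares the window-view approach with an O(n)-memory cumulative-sum approach:

```python
import numpy as np

T = np.random.rand(1_000)
m = 50

# Method 1: np.std over a strided window view (simple, but memory-hungry
# because np.std materializes temporaries the size of the window matrix).
windows = np.lib.stride_tricks.sliding_window_view(T, m)
std_view = np.std(windows, axis=1)

# Method 2: cumulative sums, O(n) extra memory: Var = E[x^2] - E[x]^2.
csum = np.concatenate(([0.0], np.cumsum(T)))
csum_sq = np.concatenate(([0.0], np.cumsum(T * T)))
mean = (csum[m:] - csum[:-m]) / m
mean_sq = (csum_sq[m:] - csum_sq[:-m]) / m
std_cumsum = np.sqrt(np.maximum(mean_sq - mean * mean, 0.0))
```

Both arrays have length n - m + 1 and agree to within floating-point tolerance, though the cumulative-sum variant is less numerically stable for long series with large values.
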
@seanlaw I like this idea! For the multidimensional case it will be a bit trickier to adapt this rolling window, I'm not sure if it will work out of the box. But then I think that the same approach should work. |
@mexxexx I played around with the multi-dimensional case and it appears that the …
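For reference, a sketch of how the stride trick extends to the multi-dimensional case (my reconstruction, not the exact code from this comment):

```python
import numpy as np

d, n, m = 3, 200, 10
T = np.random.rand(d, n)  # d-dimensional time series, one row per dimension

# Apply the stride trick along the last axis only: the result has shape
# (d, n - m + 1, m), i.e. per-dimension rolling windows.
shape = T.shape[:-1] + (T.shape[-1] - m + 1, m)
strides = T.strides + (T.strides[-1],)
windows = np.lib.stride_tricks.as_strided(T, shape=shape, strides=strides)

M_T = np.mean(windows, axis=-1)  # (d, n - m + 1) rolling means
Σ_T = np.std(windows, axis=-1)   # (d, n - m + 1) rolling stddevs
```
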
Very cool! That does look correct to me :) I'll give it a try to see if it fixes the problem I had.

@mexxexx To be absolutely clear (and to avoid any confusion), my last code/comment is only for the multi-dimensional case.

Thanks for the clarification, I saw that before. I slightly modified your implementation of the 1D stddev (…). I also commented on a few of the changes you requested to clarify things; I'm not sure if you have seen them.
@mexxexx I missed your comments. I will take a look this morning. Yes, let's add it to this PR.

@mexxexx I reviewed the comments for the requested changes, but I don't see any new comments from you (I had responded to most/all of them a few days ago). Can you take a look and point me to anything that I may have missed?

I think I see what you mean. I got a little bit too cute with it :)
@seanlaw No problem :)
Thank you :)
Yes, I was thinking that you could do something like (not tested):
So, fundamentally, the test itself still iterates through a …
I actually commented on this, but no worries if you missed it. I'll repeat it here: I see what you are doing now. I didn't realize that the substitution_locations list had changed from …
to …
Your code is fine. Can you explain why the list changed? I would've expected that you would use the same list for both sequences. In other words, I would've expected: …
But it looks like you are using different pairings for the locations. I guess that's fine. I just want to make sure there's nothing special going on that I'm missing. Keeping the list the same would make it easier to maintain, since I'm sure I'll assume it matches the substitution locations in the other files.
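To make the discussion concrete, a hypothetical sketch of the test pattern under debate (the locations, names, and clean-up step here are invented for illustration; they are not the PR's actual test code):

```python
import numpy as np

T_A = np.random.rand(64)
T_B = np.random.rand(64)
# Hypothetical pairs: position corrupted in T_A vs. position corrupted in T_B
substitution_locations = [(0, 5), (12, 13), (32, 63)]

for loc_A, loc_B in substitution_locations:
    for bad in (np.nan, np.inf):
        A, B = T_A.copy(), T_B.copy()
        A[loc_A] = bad
        B[loc_B] = bad
        assert not np.isfinite(A[loc_A]) and not np.isfinite(B[loc_B])
        # Illustrative pre-processing step: treat non-finite entries as
        # missing before computing any statistics.
        A_clean = np.where(np.isfinite(A), A, 0.0)
        assert np.all(np.isfinite(A_clean))
```

Using the same pairings for both sequences, as suggested above, would simply mean each tuple holds identical indices.
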
Also answered in the comments but will re-state here: I would do all eight. You could write a second test function and call it …

Are you able to see my comments here? I wonder if there is a button that I need to press in order to release my comments to be viewed by you. I'm still learning how to use these GitHub tools more efficiently. The workflow is still a bit clunky for me. :)

I can only see the first comments you made, but not the one on point two, for example…
This makes sense, I will implement it that way
I did this to reduce the number of tests. I was using what you suggested, and stumped was called 16 times more often that way. Somehow this caused Dask to crash when running the tests, so I reduced the number of test cases while maintaining all the important ones.

@mexxexx I hit a button that I had neglected. Are you able to see my comments now?

@seanlaw Yes, perfect :)
Just a few last things and I think we're ready to merge! Really great work and thank you for your persistence and perseverance through this PR. I really appreciate it.
stumpy/core.py (outdated)

    return M_T, Σ_T

    @njit(parallel=True, fastmath=True)
-   def calculate_distance_profile(m, QT, μ_Q, σ_Q, M_T, Σ_T):
+   def calculate_squared_distance_profile(m, QT, μ_Q, σ_Q, M_T, Σ_T):
Just adding a self reminder that everything else that calls this function needs to take the square root later
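For context, a sketch of what such a squared-distance function computes, based on the standard z-normalized distance formula (a hedged reconstruction, not necessarily the PR's exact code):

```python
import numpy as np

def calculate_squared_distance_profile(m, QT, μ_Q, σ_Q, M_T, Σ_T):
    # Squared z-normalized distance between a query Q and every window of T:
    #   D² = 2m · (1 - (QT - m·μ_Q·M_T) / (m·σ_Q·Σ_T))
    # where QT is the sliding dot product, μ_Q/σ_Q the query mean/stddev,
    # and M_T/Σ_T the rolling mean/stddev of T. Callers must take the
    # square root afterwards, per the reminder above.
    D_squared = 2 * m * (1.0 - (QT - m * μ_Q * M_T) / (m * σ_Q * Σ_T))
    return np.maximum(D_squared, 0.0)  # clamp tiny negative rounding errors
```
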
@mexxexx I have some questions for you:
I rarely get a chance to interact so much with my users/contributors, so I would like to get to know them when there is a chance. Please let me know if you are ever in the USA!
@seanlaw Thanks for the review! I decided to do the whole mean/std calculation chunked.
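A minimal sketch of a chunked rolling mean/std computation (my reconstruction of the idea, not the PR's exact code; the helper name and chunk size are assumptions):

```python
import numpy as np

def chunked_rolling_mean_std(T, m, chunk=4096):
    # Process the rolling windows one chunk at a time so the temporaries
    # created inside np.mean/np.std stay O(chunk * m) instead of
    # O((n - m + 1) * m), avoiding the MemoryError discussed above.
    k = T.shape[0] - m + 1
    M_T = np.empty(k)
    Σ_T = np.empty(k)
    for start in range(0, k, chunk):
        stop = min(start + chunk, k)
        # Zero-copy view of just this chunk's windows
        windows = np.lib.stride_tricks.sliding_window_view(
            T[start : stop + m - 1], m
        )
        M_T[start:stop] = np.mean(windows, axis=1)
        Σ_T[start:stop] = np.std(windows, axis=1)
    return M_T, Σ_T
```
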
Exactly, I'm working from Switzerland at the moment :)
I don't use Twitter, but thank you for your consideration!

I am working on an ECR ion source that is used to create lead ions. We create a stream of these ions, a current, and we want to keep it as intense and stable as possible because these ions are later used for physics experiments. You can imagine the machine as a very sensitive microwave oven that requires constant tuning by an operator. The goal is to support the operator in the tuning process because it is very complex and not well understood. We try to analyze the data stream from our diagnostic tools to search for recurring patterns from which we hope to deduce insights about the inner workings of our machine, and to see if we can use any signals to predict upcoming failures of the system.

Basically this. I explore various methods to reach this goal, and right now I am working intensely to see if the Matrix Profile can serve our needs.
@mexxexx Thanks again for supporting this PR and congratulations on the successful merge! Do you mind updating the above checklist (I think the last item is done, but feel free to add more), and then I can go ahead and create new issues for any outstanding items? I'll attempt to work on them in the coming months, but it would be really great if you had the bandwidth to help (absolutely no pressure) or at least assist me if/when I get stuck. Also, I'd appreciate any and all feedback and criticism as it relates to the PR process, the overall code base (i.e., ways that we can improve readability and maintainability, etc.), or anything you can think of to improve STUMPY and the community we are trying to foster. I'm all ears! Thanks in advance.

@seanlaw Finally 😄 I'll definitely assist with these points if I can. Depending on how promising the results I can achieve using the 1D algorithms are, I might want to do the same thing in a few more dimensions, so I might have a look at these points in the near future. Overall, I think that the code base is really clear. I didn't have a lot of problems navigating around, and I like the comments for all functions. I don't know a lot about software development and I'm sure that there are some ways to improve (e.g., merging to the master branch only when releasing), but I think it is fine, as stumpy is comparatively small.
This pull request fixes Issue #115, the handling of NaN and inf values in the 1D Matrix Profile Algorithms.