
Fixed Issue #115 (#126)


Merged: 6 commits, Feb 13, 2020

Conversation

@mihailescum (Contributor)

This pull request fixes Issue #115, the handling of NaN and inf values in the 1D Matrix Profile Algorithms.

@codecov-io commented Feb 4, 2020

Codecov Report

Merging #126 into master will not change coverage.
The diff coverage is 100%.

Impacted file tree graph

@@          Coverage Diff          @@
##           master   #126   +/-   ##
=====================================
  Coverage     100%   100%           
=====================================
  Files          12     12           
  Lines         804    853   +49     
=====================================
+ Hits          804    853   +49
Impacted Files Coverage Δ
stumpy/mstump.py 100% <100%> (ø) ⬆️
stumpy/stumped.py 100% <100%> (ø) ⬆️
stumpy/stomp.py 100% <100%> (ø) ⬆️
stumpy/stamp.py 100% <100%> (ø) ⬆️
stumpy/stump.py 100% <100%> (ø) ⬆️
stumpy/__init__.py 100% <100%> (ø) ⬆️
stumpy/core.py 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ea7c26a...0efc399. Read the comment docs.

@seanlaw (Contributor) commented Feb 4, 2020

@mexxexx Thank you for this PR. It looks like all of the tests have passed, so I will now go over everything very carefully with a fine-toothed comb.

@seanlaw (Contributor) commented Feb 4, 2020

For posterity, this PR is a cleaned-up version of #122

@seanlaw (Contributor) commented Feb 4, 2020

@mexxexx while it is still fresh in your mind, would you mind compiling a list of things that still need to be done after this PR? A few items come to mind like adding NaN/inf support for gpu_stump, changing check_nan to check_nan_inf and giving a warning if either are detected, additional tests that we may have missed etc. This list will inform the creation of new issues after the PR is merged. Thanks in advance!

@mihailescum (Contributor, Author) commented Feb 4, 2020

@seanlaw Of course, I will keep it in this message and update it if something else comes to mind.

  • Add nan/inf support to mstump + tests
  • Add nan/inf support to mstumped + tests
  • Add nan/inf support to gpu_stump + tests
  • Change check_nan to check_nan_inf and give a warning
  • Do this nan/inf check in the 1D algorithms
  • Implement the rolling chunked stddev for 1D
  • Implement the rolling chunked stddev for multi-dimensional time series

@mihailescum (Contributor, Author)

I noticed that the new implementation of calculating the mean and standard deviation might lead to memory errors for very long time series and window lengths. The reason is that NumPy's np.std() function apparently makes a copy of the array that is passed to it. So while the rolling_window function only creates a view, this view gets converted to a huge array inside np.std.

I will test whether this actually causes a problem, and if it does, we will probably have to change the way the std is computed again.
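For reference, the view-vs-copy behavior can be demonstrated with a small sketch (NumPy's sliding_window_view is used here as a stand-in for stumpy.core.rolling_window, and the sizes are only illustrative):

```python
import numpy as np

T = np.random.random(10_000)
m = 100

# The rolling window is only a strided *view* onto T -- no data is copied...
windows = np.lib.stride_tricks.sliding_window_view(T, m)
print(np.shares_memory(windows, T))  # True: the window shares T's buffer
print(T.nbytes)                      # 80000 bytes of actual data
print(windows.shape)                 # (9901, 100): "pretends" to be ~m times larger

# ...but a reduction over the window axis may materialize the full
# (n - m + 1, m) array internally, inflating peak memory by roughly m times.
std = np.std(windows, axis=1)
print(std.shape)                     # (9901,)
```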

@seanlaw (Contributor) commented Feb 4, 2020

I noticed that the new implementation of calculating the mean and standard deviation might lead to memory errors for very long time series and window lengths. The reason is that NumPy's np.std() function apparently makes a copy of the array that is passed to it. So while the rolling_window function only creates a view, this view gets converted to a huge array inside np.std.

I will test whether this actually causes a problem, and if it does, we will probably have to change the way the std is computed again.

@mexxexx Sounds good! I hope that it isn't a problem. In case it matters, currently, our largest supported time series length is 100,000,000 (100 million) data points.

@seanlaw (Contributor) left a comment

Everything looks pretty good. Just some minor things to check off the list before we merge. Thank you for taking the time to work on all of this! I really appreciate it.

core.check_dtype(T_A)
core.check_nan(T_A)
I might just create a check_array(dtype=np.floating, nan=True, inf=True) function in the future and then move the check_dtype and check_nan calls into the body of that function. This should help clean up the code.

Nothing for you to do here. Just commenting for future reference

@mihailescum (Contributor, Author)

The following code produces a MemoryError on my machine, because numpy.std internally creates a copy of the rolling window.

import stumpy
import numpy as np
T = np.random.random(size=10000000)
m = 1000
stumpy.core.compute_mean_std(T, m)

The easiest fix I see is using the pandas.rolling functionality. This way we get a numerically stable, fast, and optimized rolling standard deviation. The only problem is that pandas is not an installation requirement, and I understand if it should not become one.
The alternative is to adapt the old code, maybe in a fashion like below:

import numpy as np
import stumpy

def compute_mean_std_nansupport(T, m):
    n = len(T)
    
    # Calculate the mean using a rolling window
    mean = np.mean(stumpy.core.rolling_window(T, m), axis = 1)
    
    # Remove nan values from T so cumsum produces correct values
    T = T.copy()
    T[np.isnan(T)] = 0

    # Calculate the std and variance using the formula Var[X] = E[X^2]-E[X]^2
    cumsum_T_squared = np.empty(len(T) + 1)
    np.cumsum(np.square(T), out=cumsum_T_squared[1:])
    cumsum_T_squared[0] = 0

    subseq_sum_T_squared = cumsum_T_squared[m:] - cumsum_T_squared[: n - m + 1]
    std = np.abs((subseq_sum_T_squared / m) - np.square(mean))
    std = np.sqrt(std)
    
    # Remove nan values from std and set the means to inf
    mean[np.isnan(mean)] = np.inf
    std[np.isnan(std)] = 0
    return mean, std

There is now an additional copy of T, and we will have to see whether that is necessary. But since this calculation happens only once per time series in the algorithms, this might be OK.
One word on numerical stability: the approach from the original code calculates Var[X] = E[X^2] - E[X]^2, which can lead to precision errors (see https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance), and if you compare the results with the pandas version for a random series, they only agree to about three digits. I'm not sure if this might cause trouble for some users.
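To illustrate the cancellation issue with a toy example (not the PR's code): with a large mean offset, E[X^2] and E[X]^2 are both huge and nearly equal, so their difference loses most of its significant digits:

```python
import numpy as np

x = np.random.random(1000) + 1e8  # mean ~1e8, true variance ~1/12

naive_var = np.mean(x**2) - np.mean(x)**2  # Var[X] = E[X^2] - E[X]^2
stable_var = np.var(x)                     # computed from deviations about the mean

print(naive_var)   # wildly off -- can even come out as 0 or negative
print(stable_var)  # ~0.083, close to the variance of Uniform(0, 1)
```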

@seanlaw (Contributor) commented Feb 5, 2020

The following code produces a MemoryError on my machine, because numpy.std internally creates a copy of the rolling window.

import stumpy
import numpy as np
T = np.random.random(size=10000000)
m = 1000
a, b = stumpy.core.compute_mean_std(T, m)

@mexxexx Thank you for digging into this!

So, I tried this piece of code (10M) on my laptop (a 2017 MacBook Pro with 16GB of RAM) and it ran fine within seconds. The total memory used at the end was around 315MB (note that I store the mean and std in a and b). I then increased the size to 100,000,000 (100M) and it was also fine, but it did require close to 5GB of memory (in total). This also completed within seconds. Frankly, on a laptop, it would be impossible to generate the matrix profile at that size anyway, so the length supported on a laptop would probably be around 1,000,000 or so.

May I ask what your memory (RAM) resources are? I'm surprised that you were able to hit a memory error with 10M data points but maybe I'm only seeing the final memory that was consumed and not the intermediate/temporary memory.

I feel comfortable just leaving it as you've implemented it (with NumPy only) and then dealing with it later if it really becomes a problem. For 10M data points, I think requiring 8-16GB of RAM is reasonable in terms of minimum requirements.

@mihailescum (Contributor, Author)

I have 8GB of RAM installed on my machine, and got the message that (with the code I posted) a memory allocation of 5.9GB failed.

I agree that the MP computation itself requires more RAM anyway, but I still think that resources should not be wasted. Since the std calculations in all algorithms happen on one machine, there might still be trouble when trying to calculate the MP on, say, a cluster.

@seanlaw (Contributor) commented Feb 5, 2020

I agree that the MP computation itself requires more RAM anyway, but I still think that resources should not be wasted. Since the std calculations in all algorithms happen on one machine, there might still be trouble when trying to calculate the MP on, say, a cluster.

I agree with you, but I think we can approach this pragmatically. Fundamentally, the memory error for the stddev computation is separate from what this PR solves. I think we should go with the NumPy version since it is numerically more stable (at the cost of some memory inefficiency). Improvements to the memory usage for computing the stddev/mean should be submitted as a new issue and addressed in a separate PR. Let's add it to your list above and file an issue.

@seanlaw (Contributor) commented Feb 6, 2020

@mexxexx I think I now understand the np.nanstd problem a little better, and there's an easy fix (at least for 1D inputs). Since it's a rolling window calculation, there's no reason why we can't split large arrays into multiple parts, compute the rolling np.nanstd for each part, and then concatenate the parts back into one.

Here is some example code:

import numpy as np
import stumpy
import math

# Compute all stddevs at once
T = np.random.rand(23)
m = 5
std = np.nanstd(stumpy.core.rolling_window(T, m), axis=1)
print(std)

# Compute stddevs in three chunks
# Not tested for multi-dimensional arrays
chunks = 3
chunk_size = math.ceil((T.shape[0] + 1)/chunks)
std_chunks = []
for chunk in range(chunks):
    start = chunk * chunk_size
    stop = min(start + chunk_size + m - 1, T.shape[0])
    tmp_std = np.nanstd(stumpy.core.rolling_window(T[start:stop], m), axis=1)
    std_chunks.append(tmp_std)
print(np.concatenate(std_chunks))  # We need to take care for multi-dimensional arrays!

In both cases, I get the same array of stddevs back.

So, in your case, we'd do some sort of try/except where we could continually double the number of chunks until the MemoryError no longer occurs (i.e., when the chunks are small enough to process).

power = 0
while True:  # Keep dividing the array into smaller chunks until there is no MemoryError
    n_chunks = 2**power
    try:
        chunked_std(T, m, n_chunks)  # This function represents the code above
        break
    except MemoryError:
        power += 1  # Double the number of chunks and try again
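Putting the pieces together, a self-contained sketch of such a chunked_std helper (a hypothetical name, not the PR's actual implementation; sliding_window_view stands in for stumpy.core.rolling_window) might look like:

```python
import math
import numpy as np

def rolling_window(a, w):
    # Stand-in for stumpy.core.rolling_window: a strided view, no copy
    return np.lib.stride_tricks.sliding_window_view(a, w)

def chunked_std(T, m, n_chunks):
    # Compute the rolling nanstd in n_chunks pieces and stitch them together
    chunk_size = math.ceil((T.shape[0] + 1) / n_chunks)
    std_chunks = []
    for chunk in range(n_chunks):
        start = chunk * chunk_size
        stop = min(start + chunk_size + m - 1, T.shape[0])
        if stop - start < m:  # remaining piece is too short for a full window
            break
        std_chunks.append(np.nanstd(rolling_window(T[start:stop], m), axis=1))
    return np.concatenate(std_chunks)

T = np.random.rand(23)
m = 5
# Any chunk count yields the same (n - m + 1,) array of stddevs
np.testing.assert_almost_equal(chunked_std(T, m, 1), chunked_std(T, m, 3))
```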

@mihailescum (Contributor, Author)

@seanlaw I like this idea! For the multidimensional case it will be a bit trickier to adapt this rolling window; I'm not sure if it will work out of the box. But even then, I think the same approach should work.

@seanlaw (Contributor) commented Feb 6, 2020

For the multidimensional case it will be a bit trickier to adapt this rolling window, I'm not sure if it will work out of the box. But then I think that the same approach should work.

@mexxexx I played around with the multi-dimensional case and it appears that core.rolling_window works out of the box. I might be overlooking things, but I think the only changes needed are to specify the right axis for np.nanstd and to reference the correct shape. Technically, EVERYTHING here is 2D in shape (in NumPy-land, not matrix profile world), so we should be able to hardcode these in. Additionally, I switched to using np.hstack instead of np.concatenate. Please see my example below and let me know if I've made any errors.

import numpy as np
import stumpy
import math
import numpy.testing as npt

T = np.random.rand(3, 26)
m = 5
full_std = np.nanstd(stumpy.core.rolling_window(T, m), axis=2)  # Note axis=2

chunks = 3
chunk_size = math.ceil((T.shape[1] + 1)/chunks)  # Note T.shape[1] instead of T.shape[0]
stds = []
for chunk in range(chunks):
    start = chunk * chunk_size
    stop = min(start + chunk_size + m - 1, T.shape[1])  # Note T.shape[1] instead of T.shape[0]
    tmp_std = np.nanstd(stumpy.core.rolling_window(T[:, start:stop], m), axis=2)  # Note axis=2
    stds.append(tmp_std)
chunked_std = np.hstack(stds)  # Note the use of hstack instead of concatenate

npt.assert_almost_equal(full_std, chunked_std)

@mihailescum (Contributor, Author)

Very cool! That does look correct to me :) I'll give it a try to see if it fixes the problem I had.

@seanlaw (Contributor) commented Feb 6, 2020

import numpy as np
import stumpy
import math
import numpy.testing as npt

T = np.random.rand(3, 26)
m = 5
full_std = np.nanstd(stumpy.core.rolling_window(T, m), axis=2) # Note axis=2

chunks = 3
chunk_size = math.ceil((T.shape[1] + 1)/chunks) # Note T.shape[1] instead of T.shape[0]
stds = []
for chunk in range(chunks):
    start = chunk * chunk_size
    stop = min(start + chunk_size + m - 1, T.shape[1])  # Note T.shape[1] instead of T.shape[0]
    tmp_std = np.nanstd(stumpy.core.rolling_window(T[:, start:stop], m), axis=2)  # Note axis=2
    stds.append(tmp_std)
chunked_std = np.hstack(stds) # Note the use of hstack instead of concatenate

npt.assert_almost_equal(full_std, chunked_std)

@mexxexx To be absolutely clear (and to avoid any confusion), my last code/comment is only for the multi-dimensional case. MSTUMP/MSTUMPED actually have their own _multi_compute_mean_std (separate from compute_mean_std), which can be found in mstump.py. Does that make sense?

@mihailescum (Contributor, Author) commented Feb 7, 2020

Thanks for the clarification, I saw that before. I slightly modified your implementation of the 1D stddev (power shouldn't be starting at 0; it's better to start with n_chunks = 1 and then do n_chunks *= 2) and it is working without any issues now. Should we implement this directly in this PR?

I also commented on a few of the changes you requested to clarify things; I'm not sure if you have seen them.

@seanlaw (Contributor) commented Feb 7, 2020

@mexxexx I missed your comments. I will take a look this morning.

Yes, let’s add it to this PR


@seanlaw (Contributor) commented Feb 7, 2020

@mexxexx I reviewed the comments for the requested changes but I don't see any new comments from you (I had responded to most/all of them a few days ago). Can you take a look and point me to anything that I may have missed?

@seanlaw (Contributor) commented Feb 7, 2020

I slightly modified your implementation of the 1D stddev (power shouldn't be starting at 0, better to start with n_chunks=1 and then to n_chunks*=2) and it is working without any issues now

I think I see what you mean. I got a little bit too cute with it :)

@mihailescum (Contributor, Author)

@seanlaw No problem :)
There are a few small points that are still open for me, or where I don't completely understand your suggestion:

  1. You want to move the substitution_locations outside of the tests. How should I do it? Just move it outside and leave the array as a global variable?
  2. In test_stumped_two_subsequences_inf_A_B_join you commented that the syntax looks wrong. Could you explain what you mean exactly?
  3. In the case of test_stumped_two_subsequences_nan_inf_A_B_join you wanted me to add a new test that swaps the location of nan and inf. Should I do this for all substitution locations, creating eight new tests?

Thank you :)

@seanlaw (Contributor) commented Feb 7, 2020

  1. You want to move the substitution_locations outside of the tests. How should I do it? Just move it outside and leave the array as a global variable?

Yes, I was thinking that you could do something like (not tested):

substitution_locations = [(slice(0, 0), 0, -1, slice(1, 3), [0, 3])]  # Note the outer tuple parentheses

@pytest.mark.parametrize("substitution_locations", substitution_locations)
@pytest.mark.parametrize("T_A, T_B", test_data)
def some_test(T_A, T_B, substitution_locations):
    # Some setup code
    for substitution_location_B in substitution_locations:
        for substitution_location_A in substitution_locations:
            T_A_sub[:] = T_A[:]
            T_B_sub[:] = T_B[:]
            T_A_sub[substitution_location_A] = substitute_A
            T_B_sub[substitution_location_B] = substitute_B

So, fundamentally, the test itself still iterates through a substitution_locations for-loop but this information is passed into the test function as a tuple. The benefit now is that the substitution_locations is defined once at the top of the file and reused everywhere (in all other tests).


  1. In test_stumped_two_subsequences_inf_A_B_join you commented that the syntax looks wrong. Could you explain what you mean exactly?

I actually commented on this but no worries if you missed it. I'll repeat it here:

I see what you are doing now. I didn't realize that the substitution_locations list had changed from

substitution_locations = [0, -1, slice(1, 3), [0, 3]]

to

substitution_locations = [(0, 1), (-1, slice(1, 3)), (slice(1, 3), 1), ([0, 3], 1)]

Your code is fine. Can you explain why the list changed? I would've expected that you would use the same list for both sequences. In other words, I would've expected:

substitution_locations = [0, -1, slice(1, 3), [0, 3]]  # Note that these are not tuples
@pytest.mark.parametrize("substitution_location_A", substitution_locations)
@pytest.mark.parametrize("substitution_location_B", substitution_locations)

But it looks like you are using different pairings for the locations. I guess that's fine. I just want to make sure there's nothing special going on that I'm missing. Keeping the list the same would make it easier for maintenance since I'm sure that I'll assume that it's the same as other substitution locations in the other files.


  1. In the case of test_stumped_two_subsequences_nan_inf_A_B_join you wanted me to add a new test that swaps the location of nan and inf. Should I do this for all substitution locations creating eight new tests?

Also answered in the comments but will re-state here:

I would do all eight. You could write a second test function and call it test_stumped_two_subsequences_nan_inf_A_B_join_swap or something like that


Are you able to see my comments here? I wonder if there is a button that I need to press in order to release my comments so they are visible to you. I'm still learning how to use these GitHub tools more efficiently. The workflow is still a bit clunky for me. :)

@mihailescum (Contributor, Author)

Are you able to see my comments here? I wonder if there is a button that I need to press in order to release my comments so they are visible to you. I'm still learning how to use these GitHub tools more efficiently. The workflow is still a bit clunky for me. :)

I can only see the first comments you made, but not the one you made on point two, for example...

So, fundamentally, the test itself still iterates through a substitution_locations for-loop but this information is passed into the test function as a tuple. The benefit now is that the substitution_locations is defined once at the top of the file and reused everywhere (in all other tests).

This makes sense, I will implement it that way

Your code is fine. Can you explain why the list changed? I would've expected that you would use the same list for both sequences

I did this to reduce the number of tests. I was using what you suggested, and stumped was called 16 times more often that way. Somehow this caused Dask to crash when running the tests, so I reduced the number of test cases while maintaining all the important ones.

@seanlaw (Contributor) commented Feb 10, 2020

@mexxexx I hit a button that I had neglected. Are you able to see my comments now?

@mihailescum (Contributor, Author)

@seanlaw Yes, perfect :)
I'm working on resolving all the requested changes and will hopefully push something later today.

@seanlaw (Contributor) left a comment

Just a few last things and I think we're ready to merge! Really great work and thank you for your persistence and perseverance through this PR. I really appreciate it.

stumpy/core.py Outdated

return M_T, Σ_T


@njit(parallel=True, fastmath=True)
def calculate_distance_profile(m, QT, μ_Q, σ_Q, M_T, Σ_T):
def calculate_squared_distance_profile(m, QT, μ_Q, σ_Q, M_T, Σ_T):

Just adding a self reminder that everything else that calls this function needs to take the square root later
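As a toy illustration of that reminder (hypothetical values, not STUMPY's actual API): callers now receive *squared* distances and must apply np.sqrt themselves, which is cheap to defer because sqrt is monotonic:

```python
import numpy as np

squared_D = np.array([4.0, 0.25, 9.0, 1.0])  # hypothetical squared distance profile

nn_idx = np.argmin(squared_D)  # nearest neighbor: same index with or without sqrt
D = np.sqrt(squared_D)         # the caller-side square root after the rename

print(nn_idx)  # 1
print(D)       # the distances: 2.0, 0.5, 3.0, 1.0
```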

@seanlaw (Contributor) commented Feb 12, 2020

@mexxexx I have some questions for you:

  1. I was wondering where you are working from? I take it from the commit times that you are working somewhere in Europe?
  2. What applications do you have for computing the matrix profile?
  3. Do you have a Twitter handle? I'd like to thank you publicly for our next release!
  4. What do you do for your day job?

I rarely get a chance to interact so much with my users/contributors so I would like to get to know them when there is a chance. Please let me know if you are ever in the USA!

@mihailescum (Contributor, Author) commented Feb 13, 2020

@seanlaw Thanks for the review! I decided to do the whole mean/std calculation chunked.

I was wondering where you are working from? I take it from the commit times that you are working somewhere in Europe?

Exactly, I'm working from Switzerland at the moment :)

Do you have a Twitter handle? I'd like to thank you publicly for our next release!

I don't use Twitter, but thank you for your consideration!

What applications do you have for computing the matrix profile?

I am working on an ECR ion source that is used to create lead ions. We create a stream of these ions, a current, and we want to keep it as intense and stable as possible, because these ions are later used for physics experiments. You can imagine the machine as a very sensitive microwave that requires constant tuning by an operator. The goal is to support the operator in the tuning process, because it is very complex and not well understood. We try to analyze the data stream from our diagnostic tools to search for recurring patterns, from which we hope to deduce insights about the inner workings of our machine, and to see if we can use any signals to predict upcoming failures of the system.

What do you do for your day job?

Basically this. I explore various methods to reach this goal, and right now I am working intensely to see if the Matrix Profile can serve our needs.

@seanlaw seanlaw merged commit beee83f into stumpy-dev:master Feb 13, 2020
@mihailescum mihailescum deleted the feature_handle_nan_inf branch February 13, 2020 14:11
@seanlaw (Contributor) commented Feb 13, 2020

@seanlaw Of course, I will keep it in this message and update it if something else comes to mind.

  • Add nan/inf support to mstump + tests
  • Add nan/inf support to mstumped + tests
  • Add nan/inf support to gpu_stump + tests
  • Change check_nan to check_nan_inf and give a warning
  • Do this nan/inf check in the 1D algorithms
  • Implement the rolling chunked stddev for 1D
  • Implement the rolling chunked stddev for multi-dimensional time series

@mexxexx Thanks again for supporting this PR and congratulations on the successful merge! Do you mind updating the above checklist (I think the last item is done, but feel free to add more)? Then I can go ahead and create new issues for any outstanding items. I'll attempt to work on them in the coming months, but it would be really great if you had the bandwidth to help (absolutely no pressure) or at least assist me if/when I get stuck.

Also, I'd appreciate any and all feedback and criticism as it relates to the PR process, the overall code base (i.e., ways that we can improve readability and maintainability, etc.), or anything you can think of to improve STUMPY and the community we are trying to foster. I'm all ears! Thanks in advance.

@mihailescum (Contributor, Author)

@seanlaw Finally 😄
The rolling chunked std is still not implemented, even though the code you posted should work. I didn't want to overload this PR.

I'll definitely assist with these points if I can. Depending on how promising the results are that I can achieve using the 1D algorithms, I might want to do the same thing in a few more dimensions. So I might have a look at these points in the near future.

Overall, I think that the code base is really clear. I didn't have a lot of problems navigating around, and I like the comments on all the functions. I don't know a lot about software development and I'm sure there are some ways to improve (e.g., merging to the master branch only when releasing), but I think it is fine, as stumpy is comparatively small.
