Implement functions to drop and interpolate over low-confidence datapoints #116


Merged
26 commits, merged Mar 12, 2024

Conversation

@b-peri (Collaborator) commented Feb 12, 2024

Description

This PR introduces the filtering module and adds two major user-facing functions to movement. The first, filter_by_confidence(), replaces pose track datapoints with NaN wherever the associated confidence value falls below a user-defined threshold. The second, interpolate_over_time(), interpolates over such NaN values along the time axis using xarray's built-in interpolate_na().
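In xarray terms these two operations boil down to masking by confidence and interpolating the resulting NaNs along time. The same logic can be sketched with plain NumPy (a simplified stand-in for the xarray-based implementation; the arrays, function signatures, and threshold below are illustrative, not movement's actual API):

```python
import numpy as np

def filter_by_confidence(position, confidence, threshold=0.6):
    """Replace positions whose confidence falls below threshold with NaN."""
    filtered = position.astype(float).copy()
    filtered[confidence < threshold] = np.nan
    return filtered

def interpolate_over_time(values):
    """Linearly interpolate NaN values along the time axis (1D case)."""
    values = values.astype(float).copy()
    t = np.arange(len(values))
    nans = np.isnan(values)
    # Interpolate NaN positions from the surrounding valid samples
    values[nans] = np.interp(t[nans], t[~nans], values[~nans])
    return values

position = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
confidence = np.array([0.9, 0.2, 0.8, 0.1, 0.95])

filtered = filter_by_confidence(position, confidence, threshold=0.6)
cleaned = interpolate_over_time(filtered)  # low-confidence frames filled back in
```

In the actual module the masking step maps onto `ds.where(ds.confidence >= threshold)` and the interpolation onto `interpolate_na(dim="time")`.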

Edit 11/03/2024:
This PR also introduces two new convenience functions for the filtering module:

  1. The log_to_attrs() decorator adds a log of any operations performed (in the form of a dictionary) to the affected xarray.Dataset's attributes, under ["log"].
  2. The filter_diagnostics() function counts and logs the number of datapoints that have been filtered from every individual keypoint.
  • Bug fix
  • Addition of a new feature
  • Other
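The log_to_attrs() mechanism described above can be sketched as a decorator. This is a minimal sketch, not movement's actual implementation: a plain class with an attrs dict stands in for xarray.Dataset, and the entry format follows the dictionary layout discussed in this PR:

```python
from datetime import datetime
from functools import wraps

class FakeDataset:
    """Minimal stand-in for xarray.Dataset: just carries an attrs dict."""
    def __init__(self):
        self.attrs = {}

def log_to_attrs(func):
    """Append a record of the operation to the returned dataset's attrs["log"]."""
    @wraps(func)
    def wrapper(ds, *args, **kwargs):
        result = func(ds, *args, **kwargs)
        entry = {
            "operation": func.__name__,
            "kwargs": kwargs,
            "datetime": str(datetime.now()),
        }
        # Create the log list on first use, then append
        result.attrs.setdefault("log", []).append(entry)
        return result
    return wrapper

@log_to_attrs
def drop_below(ds, threshold=0.6):
    # Hypothetical no-op stand-in for a real filtering function
    return ds

ds = drop_below(FakeDataset(), threshold=0.5)
```

After the call, `ds.attrs["log"]` holds one entry recording the operation name, its keyword arguments, and a timestamp.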

References

Closes #97

Checklist:

  • The code has been tested locally
  • Tests have been added to cover all new functionality
  • The documentation has been updated to reflect any changes
  • The code has been formatted with pre-commit

@b-peri marked this pull request as draft February 12, 2024 15:49

codecov bot commented Feb 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.28%. Comparing base (aaba495) to head (78004d1).
Report is 1 commit behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #116      +/-   ##
==========================================
+ Coverage   99.21%   99.28%   +0.06%     
==========================================
  Files           8        9       +1     
  Lines         509      556      +47     
==========================================
+ Hits          505      552      +47     
  Misses          4        4              


@b-peri (Collaborator, Author) commented Feb 12, 2024

Some particular items I would like to receive some feedback on:

  • Operation logs are now stored as a list of dictionaries under the logs attribute, with each entry representing a past operation. Each entry follows roughly this format: {'operation': operation, 'parameter_1': argument1, 'parameter_2': argument2, ..., 'datetime': str(datetime.now())}. Very open to discussing alternative ways of doing this!
  • Both functions currently act on both the confidence and pose_tracks variables, but I'm unsure whether this makes sense (e.g. I don't think interpolating over confidence values is likely to be particularly informative). Shall I restrict the filter_confidence() function to act only on the pose_tracks variable?
  • filter_confidence() currently reports the proportion of frames/predictions that have been filtered (both as a percentage and as an absolute number of frames), but I was unsure whether it might be useful to include anything else. I was also thinking we could add some logic to let users (optionally) plot confidence distributions and the imposed thresholds when running this function, but maybe this is overkill. Any feedback on this would be appreciated!
  • Function names, particularly that of interp_pose. I think a more generic name (maybe even simply "interpolate") may be more informative, but given that the use case here is specifically pose tracks, I'm unsure how specific to go.
  • I've put these functions in a new module called filtering.py (under /movement/analysis/, following on from Compute locomotion features #106). If a different module name would be more appropriate, please let me know!

To-Do:

  • Add tests

@niksirbi (Member) left a comment

Thanks a lot @b-peri! You did a great job, and opening this draft PR was very effective at laying out the problems and the solutions. Here are my high-level thoughts (details can be found in the individual comments):

  • I like the logging, but I think we can generalise it to any filtering function via an appropriate decorator (see example decorator in the respective comment).
  • Filtering should only operate on pose_tracks, and in the future perhaps on velocity, acceleration, etc., but definitely not on confidence.
  • Accordingly, I propose renaming the functions to interpolate_over_time() and filter_by_confidence().
  • I think we should completely scrap the "inplace" idea (for reasons given in the comment). Make all functions return a new dataset.
  • I think it's enough to have a text report of dropped values; plots can come later. But the current implementation of this feature is incorrect and should be fixed.
  • For now, I would put the module in movement/filtering.py, as I'm not sure it counts as "analysis". When it grows, we can consider making it into a subfolder.
  • The interpolation makes use of the bottleneck library, which is not installed by default. The dependency needs to change to xarray[accel] instead of "vanilla" xarray (see explanation).

Let me know your thoughts on the above.
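For the dependency point above, the change might look roughly like this in a pyproject.toml (illustrative only; movement's actual dependency declaration and the exact packages pulled in by the extra may differ):

```toml
[project]
dependencies = [
    # "xarray",       # vanilla xarray: interpolate_na() can fail without bottleneck
    "xarray[accel]",  # extra pulls in acceleration dependencies such as bottleneck
]
```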

ds_thresholded = ds.where(ds.confidence >= threshold)

# Diagnostics
print("\nDatapoints Filtered:\n")
Member:

We should log this as "info" instead of printing it:

import logging

logger = logging.getLogger(__name__)
logger.info("text goes here")

@niksirbi mentioned this pull request Feb 22, 2024
@niksirbi (Member) commented:
@b-peri takeaways from today's chat on this:

  • do the filtering only on "pose_tracks", not on "confidence", and (in the future) also not on "velocity" or other derived variables. Basically, we want people to clean first, analyse later.
  • drop the inplace idea for good; return a new dataset.
  • the logging and reporting should probably have been separate PRs (this is mostly my fault for shoving these features in here). Feel free to merge just the filtering bit and deal with logging and/or reporting separately. Since you have already done most of the work for this PR, you are also free to merge it with these bits included; but if one of them becomes tedious, drop it, merge without it, and open a separate issue for the missing piece.

@b-peri (Collaborator, Author) left a comment

Many thanks for your extensive feedback @niksirbi! I've now implemented all of your suggestions and added testing for the new functions!

I just had a few minor points that I think would be useful to get some feedback on, which I'll put below.

@b-peri marked this pull request as ready for review March 11, 2024 09:22
lochhh and others added 11 commits March 11, 2024 11:42
* Check for expected `dims` and `data_vars` in dataset

* Fix `missing_dim_dataset` fixture

* Rename `poses` accessor to `move`

* Rename `PoseAccessor` class to `MoveAccessor`

* Rename `poses_accessor.py` to `move_accessor.py`

* Move `move_accessor.py` to the top level

* Fix accessor docstring formatting
* Draft compute velocity

* Add test for displacement

* Fix confidence values in `valid_pose_dataset` fixture

* Refactor kinematics test and functions

* Vectorise kinematic functions

* Refactor repeated calls to compute magnitude + direction

* Displacement to return 0 instead of NaN

* Return x y components in kinematic functions

* Refactor kinematics tests

* Remove unnecessary instantiations

* Improve time diff calculation

* Prefix kinematics methods with `compute_`

* Add kinematic properties to `PosesAccessor`

* Update `test_property` docstring

* Refactor `_vector` methods and kinematics tests

* Update `expected_dataset` docstring

* Rename `poses` to `move` in `PosesAccessor`

* Refactor properties in `PosesAccessor`

* Remove vector util functions and tests

* Update `not_a_dataset` fixture description

* Validate dataset upon accessor property access

* Update `poses_accessor` test description

* Validate input data in kinematic functions

* Remove unused fixture

* Parametrise kinematics tests

* Set `compute_derivative` as internal function

* Update `kinematics.py` docstrings

* Add new modules to API docs

* Update `move_accessor` docstrings

* Rename `test_move_accessor` filename
Bumps [actions/cache](https://github.com/actions/cache) from 3 to 4.
- [Release notes](https://github.com/actions/cache/releases)
- [Changelog](https://github.com/actions/cache/blob/main/RELEASES.md)
- [Commits](actions/cache@v3...v4)

---
updated-dependencies:
- dependency-name: actions/cache
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
updates:
- [github.com/astral-sh/ruff-pre-commit: v0.2.0 → v0.3.0](astral-sh/ruff-pre-commit@v0.2.0...v0.3.0)
- [github.com/psf/black: 24.1.1 → 24.2.0](psf/black@24.1.1...24.2.0)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Add codecov token to test_and_deploy.yml

* use test action from main branch

* switch back to using v2 of the test action

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

@niksirbi (Member) left a comment

Hey @b-peri, thanks a lot for implementing the suggestions!

I made some final tweaks:

  • I merged the main branch to resolve conflicts.
  • I renamed the filter_diagnostics function to report_nan_values, to more explicitly reflect its purpose. I also changed the way it's invoked by the filtering/interpolation functions, so that the user gets an overview of the change in NaNs made by the operation.
  • I switched to using xarray's built-in isnull() and copy() methods (instead of numpy.isnan and copy.copy).
  • I rephrased some of the docstrings.
  • I added fill_value="extrapolate" to the interpolate_na method, to also take care of NaN values at the start and end of the array (if the gaps are smaller than max_gap).
  • I added the new functions to the API index in the docs.
  • I wrote a new example for the docs, showcasing how the new functions can be used to clean data.
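The report_nan_values idea can be sketched in plain NumPy. This is a simplified illustration, not movement's actual implementation: it assumes a hypothetical frames-by-keypoints array layout and keypoint names:

```python
import numpy as np

def report_nan_values(data, keypoint_names):
    """Count NaN-containing frames per keypoint.

    data: array of shape (n_frames, n_keypoints) -- a hypothetical layout.
    Returns a dict mapping keypoint name to NaN count.
    """
    counts = {}
    for i, name in enumerate(keypoint_names):
        counts[name] = int(np.isnan(data[:, i]).sum())
    return counts

# Hypothetical data: 3 frames, 2 keypoints
data = np.array([
    [1.0, np.nan],
    [np.nan, 2.0],
    [3.0, np.nan],
])
report = report_nan_values(data, ["snout", "tail_base"])
# {'snout': 1, 'tail_base': 2}
```

Calling such a report before and after each filtering/interpolation step gives the "overview of the change in NaNs" described above.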

I will merge this as is now.

I think we should add some more extensive testing to cover some edge cases, but I will open a separate issue for that.

Development

Successfully merging this pull request may close these issues.

Provide function for dropping values with low confidence and interpolating over them
3 participants