Find and use custom collate functions defined on dataset classes #561

Merged
drewoldag merged 5 commits into main from issue/553/data-provider-custom-collate-map
Dec 9, 2025

@drewoldag drewoldag commented Dec 9, 2025

In this PR we update DataProvider to include a dictionary that maps "friendly_name" to a callable custom collate function if one has been defined on the dataset class associated with the friendly name.
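As a rough illustration of the mapping described above, the sketch below builds a dictionary from friendly name to a collate callable. The class names, the `find_custom_collate_functions` helper, and the friendly names are hypothetical and are not taken from the Hyrax source; only the idea of looking for a `collate` attribute on the dataset class comes from this PR.

```python
from typing import Any, Callable, Dict


class PlainDataset:
    """Hypothetical dataset with no custom collate function."""


class RaggedDataset:
    """Hypothetical dataset that defines its own collate function."""

    @staticmethod
    def collate(samples):
        return {"data": {"n_samples": len(samples)}}


def find_custom_collate_functions(datasets: Dict[str, Any]) -> Dict[str, Callable]:
    """Map each friendly name to its dataset's `collate`, when one is defined.

    Hypothetical helper; the real discovery logic lives in DataProvider.
    """
    custom = {}
    for friendly_name, dataset in datasets.items():
        # Look on the class, since the function is expected to be a staticmethod.
        collate_fn = getattr(type(dataset), "collate", None)
        if callable(collate_fn):
            custom[friendly_name] = collate_fn
    return custom


datasets = {"plain": PlainDataset(), "ragged": RaggedDataset()}
custom_collate_functions = find_custom_collate_functions(datasets)
print(sorted(custom_collate_functions))  # ['ragged']
```

Datasets without a `collate` attribute simply have no entry, so a later lookup by friendly name can fall back to default behavior.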

Additionally, in DataProvider.collate, as we collate a batch of data, if we discover that some of the data is from a dataset class that implements a custom collation function, we'll apply that function to the correct portion of data. If there is no custom collate function defined, then DataProvider will do the work of collating the data samples into single large numpy arrays.
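A minimal sketch of that routing, under stated assumptions: each sample is taken to be a dictionary keyed by friendly name, and the custom function receives and returns dictionaries keyed the same way. The `collate` function, `pad_collate`, and the friendly names `images` and `spectra` are hypothetical and are not the code in `data_provider.py`.

```python
import numpy as np


def pad_collate(samples):
    """Hypothetical custom collate: zero-pad variable-length 'flux' arrays."""
    arrays = [s["spectra"]["flux"] for s in samples]
    width = max(len(a) for a in arrays)
    padded = np.zeros((len(arrays), width))
    for row, a in enumerate(arrays):
        padded[row, : len(a)] = a
    return {"spectra": {"flux": padded}}


def collate(batch, custom_collate_functions):
    """Collate a list of samples, one friendly name at a time (illustrative only)."""
    collated = {}
    for friendly_name in batch[0]:
        if friendly_name in custom_collate_functions:
            # Hand only this dataset's portion of each sample to its own function.
            portion = [{friendly_name: sample[friendly_name]} for sample in batch]
            custom_fn = custom_collate_functions[friendly_name]
            collated[friendly_name] = custom_fn(portion)[friendly_name]
        else:
            # Default: stack each field across the batch into one numpy array.
            collated[friendly_name] = {
                field: np.stack([np.asarray(s[friendly_name][field]) for s in batch])
                for field in batch[0][friendly_name]
            }
    return collated


batch = [
    {"images": {"image": np.ones((2, 2))}, "spectra": {"flux": np.array([1.0, 2.0])}},
    {"images": {"image": np.zeros((2, 2))}, "spectra": {"flux": np.array([3.0])}},
]
result = collate(batch, {"spectra": pad_collate})
print(result["images"]["image"].shape)  # (2, 2, 2)
print(result["spectra"]["flux"].shape)  # (2, 2)
```

The variable-length `flux` arrays would fail a plain `np.stack`, which is the kind of case a dataset-specific collate function is meant to handle.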

Finally, a couple of unit tests were added as well as a test fixture that will monkey patch the HyraxRandomDataset to include a collate static method.
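For readers unfamiliar with the pattern, here is a sketch of temporarily attaching a `collate` static method to a class that lacks one. It deliberately swaps in the standard library's `unittest.mock.patch.object` rather than a pytest fixture so that it runs without pytest, and it uses a stand-in class instead of `HyraxRandomDataset`; `StandInDataset` and `fake_collate` are hypothetical names.

```python
from unittest.mock import patch


class StandInDataset:
    """Stand-in for a dataset class that has no `collate` of its own."""


def fake_collate(samples):
    """Hypothetical collate function used only for testing."""
    return {"data": {"count": len(samples)}}


had_collate_before = hasattr(StandInDataset, "collate")

# create=True lets patch.object add an attribute that does not exist yet.
with patch.object(StandInDataset, "collate", staticmethod(fake_collate), create=True):
    patched_result = StandInDataset.collate([{}, {}, {}])
    # A staticmethod is also reachable from instances, with no implicit `self`.
    instance_result = StandInDataset().collate([{}])

# Leaving the context removes the temporary attribute again.
has_collate_after = hasattr(StandInDataset, "collate")
```

Scoping the patch to a context (or fixture) keeps other tests that rely on the unpatched class from seeing the temporary method.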

There are a few assumptions being made in the implementation of this PR:

  • The dataset class that defines a custom collate function will name the function collate.
  • The custom function will be a @staticmethod in the dataset class.
  • The custom function will expect as input a list of dictionaries of the form: [{'data': {'field_1': <...>, ..., 'field_n': <...>}}, ...]
  • The custom function will return a dictionary of the form: {'data': {'field_1': [<...>, ..., <...>], ..., 'field_n': [<...>, ..., <...>]}, 'object_id': [...]}
  • Note that 'object_id' in the returned dictionary is not used and will be removed in a future PR.
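A function that satisfies the assumptions listed above might look like the following sketch. `ExampleDataset` and the field names `flux` and `label` are hypothetical; only the name `collate`, the `@staticmethod` decorator, and the input and output shapes come from the list.

```python
class ExampleDataset:
    """Hypothetical dataset class; only the `collate` contract matters here."""

    @staticmethod
    def collate(samples):
        """Turn a list of per-sample dicts into one dict of per-field lists."""
        collated = {"data": {}}
        for sample in samples:
            for field, value in sample["data"].items():
                # Gather each field's values in sample order.
                collated["data"].setdefault(field, []).append(value)
        # 'object_id' is part of the described return form but is not used;
        # samples that do not carry it contribute nothing.
        collated["object_id"] = [s["object_id"] for s in samples if "object_id" in s]
        return collated


samples = [
    {"data": {"flux": [1.0, 2.0], "label": 0}},
    {"data": {"flux": [3.0], "label": 1}},
]
batch = ExampleDataset.collate(samples)
print(batch)
# {'data': {'flux': [[1.0, 2.0], [3.0]], 'label': [0, 1]}, 'object_id': []}
```

Because the function is a static method, it can be called on the class without constructing a dataset instance.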

…n `DataProvider.collate`. Added and updated some unit tests.
@drewoldag drewoldag self-assigned this Dec 9, 2025
Copilot AI review requested due to automatic review settings December 9, 2025 00:10
@drewoldag drewoldag linked an issue Dec 9, 2025 that may be closed by this pull request
@drewoldag drewoldag requested review from a team, maxwest-uw and mtauraso and removed request for a team December 9, 2025 00:10

Copilot AI left a comment

Pull request overview

This PR adds support for custom collate functions in dataset classes. The DataProvider now discovers and applies custom collate methods defined on dataset classes as static methods, allowing datasets to implement their own data batching logic beyond the default numpy array stacking.

Key changes:

  • Added custom_collate_functions dictionary to track dataset-specific collate methods
  • Modified prepare_datasets() to detect and store custom collate functions from dataset instances
  • Updated collate() method to route data samples to custom collate functions when available

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| src/hyrax/data_sets/data_provider.py | Implements custom collate function discovery and application logic in DataProvider |
| tests/hyrax/conftest.py | Adds test fixture with monkey-patched custom collate function for HyraxRandomDataset |
| tests/hyrax/test_data_provider.py | Adds two new tests verifying custom collate function detection and application |

codecov bot commented Dec 9, 2025

Codecov Report

❌ Patch coverage is 75.00000% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.34%. Comparing base (78c207c) to head (bf9190d).
⚠️ Report is 129 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/hyrax/data_sets/data_provider.py | 75.00% | 5 Missing ⚠️ |

@@            Coverage Diff             @@
##             main     #561      +/-   ##
==========================================
+ Coverage   55.26%   55.34%   +0.07%     
==========================================
  Files          53       53              
  Lines        5155     5175      +20     
==========================================
+ Hits         2849     2864      +15     
- Misses       2306     2311       +5     

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Copilot AI commented Dec 9, 2025

@drewoldag I've opened a new pull request, #562, to work on those changes. Once the pull request is ready, I'll request review from you.


Copilot AI commented Dec 9, 2025

@drewoldag I've opened a new pull request, #563, to work on those changes. Once the pull request is ready, I'll request review from you.


github-actions bot commented Dec 9, 2025

| Before [78c207c] | After [fb32696] | Ratio | Benchmark (Parameter) |
| --- | --- | --- | --- |
| 8.33±0.2ms | 8.55±0.08ms | 1.03 | vector_db_benchmarks.VectorDBSearchBenchmarks.time_search_by_vector_many_shards(128, 'chromadb') |
| 234±0.8μs | 239±2μs | 1.02 | data_request_benchmarks.DatasetRequestBenchmarks.time_request_all_data |
| 1.76±0.01s | 1.78±0.02s | 1.01 | benchmarks.time_database_connection_help |
| 1.77±0.02s | 1.78±0.02s | 1.01 | benchmarks.time_infer_help |
| 1.76±0.01s | 1.77±0.02s | 1.01 | benchmarks.time_prepare_help |
| 33.5±0.04s | 33.8±0.06s | 1.01 | vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(16384, 'qdrant') |
| 504±4ms | 508±3ms | 1.01 | vector_db_benchmarks.VectorDBInsertBenchmarks.time_load_vector_db(256, 'chromadb') |
| 388±1ms | 391±4ms | 1.01 | vector_db_benchmarks.VectorDBSearchBenchmarks.time_search_by_vector_many_shards(64, 'qdrant') |
| 1.77±0.02s | 1.78±0.02s | 1 | benchmarks.time_help |
| 1.78±0.01s | 1.78±0.02s | 1 | benchmarks.time_lookup_help |

… functions. Removing the monkey patched collate function in unit tests.

@maxwest-uw maxwest-uw left a comment

lgtm!

@drewoldag drewoldag merged commit 441823f into main Dec 9, 2025
10 checks passed
@drewoldag drewoldag deleted the issue/553/data-provider-custom-collate-map branch December 9, 2025 18:53

Development

Successfully merging this pull request may close these issues.

Have DataProvider maintain a map of custom collate functions
