
Competition evaluation #103


Merged: 18 commits merged into feat/competitions on May 27, 2024

Conversation

kirahowe
Contributor

Changelogs

Adds support in the Polaris client for evaluating competitions. The client runs the same evaluation logic as it does for benchmarks. Competitions differ from regular benchmarks in that they have no knowledge of their own test sets (hidden targets or otherwise). Test data is downloaded by the Polaris Hub and passed to the competition evaluation.

API documentation will be done in a separate PR in the context of the rest of the competitions documentation.


Checklist:

  • Was this PR discussed in an issue? It is recommended to first discuss a new feature in a GitHub issue before opening a PR.
  • Add tests to cover the fixed bug(s) or the newly introduced feature(s) (if appropriate).
  • Update the API documentation if a new function is added, or an existing one is deleted.
  • Write concise and explanatory changelogs above.
  • If possible, assign one of the following labels to the PR: feature, fix or test (or ask a maintainer to do it for you).

Competitions use the exact same evaluation logic as regular benchmarks, with the difference that test labels are passed in separately (rather than accessed from the benchmark/competition dataset). In the case of competitions, the target labels literally do not exist on the client (as opposed to being hidden programmatically); they are downloaded on the Hub and then passed to the evaluation function.

This raises some questions about whether competitions are really as similar to regular benchmarks as we thought they would be, but for now this works and the competition/benchmark distinction can be discussed separately.

@kirahowe added the feature label (Annotates any PR that adds new features; used in the release process) on May 15, 2024
@kirahowe requested review from jstlaurent and Andrewq11 on May 15, 2024 at 14:45
@kirahowe requested a review from cwognum as a code owner on May 15, 2024 at 14:45
from polaris.utils.context import tmp_attribute_change
from polaris.evaluate import BenchmarkResults, ResultsType

def evaluate_benchmark(model, y_pred, test):
Contributor Author

This is copied, unmodified, from the implementation previously used in the BenchmarkSpecification evaluate method.

Contributor

Let's start a thread here to continue the conversations we've been having over Slack to decide if/how/when we should modify this utility to support both standard and competition benchmarks. The main issue is that we want to share all evaluation logic between the two types of benchmarks, but the amount of data available/needed in the competition evaluation service is much less than in the client. With the current utility function, we need to hack around this and create dummy data to bypass Pydantic schema checks and get this function to work correctly.

A couple options that come to mind:

  • Extend this function to accept both pydantic objects and dict/array arguments (for standard and competition benchmarks, respectively). We can have type checks at the start of the function which check the type of the argument passed and then extract the necessary data points according to the type (e.g. model.target_cols for a pydantic object and model["target_cols"] for a regular dict). This should sidestep the need to change surrounding code which uses this function.
  • Update this function to expect dict/array arguments only. This will require some changes to the existing client code that evaluates standard benchmarks. However, it will standardize what is expected during evaluation, which I like.

Personally, I think option 2 is the cleanest, and if the changes to existing client code aren't huge, I would prefer to do this as part of the competition feature development. I also am ok with option 1.
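
To illustrate, here's a minimal sketch of what option 2 could look like, assuming a plain-data signature (the argument names, the stand-in metric, and the example target column below are placeholders, not the final API):

import numpy as np

def mean_absolute_error(y_true, y_pred):
    # Stand-in metric; in practice the metrics would come from the benchmark spec.
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def evaluate_benchmark(y_pred: dict, y_true: dict, target_cols: list, metrics: dict) -> dict:
    # Data-in/data-out evaluation: plain dicts and arrays, no pydantic objects.
    scores = {}
    for target in target_cols:
        labels = np.asarray(y_true[target])
        preds = np.asarray(y_pred[target])
        scores[target] = {name: fn(labels, preds) for name, fn in metrics.items()}
    return scores

# Single-task example with a hypothetical target column:
scores = evaluate_benchmark(
    y_pred={"LOG_SOLUBILITY": [0.1, 0.5, 0.9]},
    y_true={"LOG_SOLUBILITY": [0.0, 0.6, 1.0]},
    target_cols=["LOG_SOLUBILITY"],
    metrics={"mean_absolute_error": mean_absolute_error},
)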

Curious to hear your thoughts on this, and any other options as well.

Contributor Author

I think you're right, and it wouldn't be very difficult to implement this in a more "functional" (data-in/data-out) way. It might make sense to just do it in this PR, assuming there are no other potentially conflicting changes to benchmark evaluation in progress upstream.

A `BenchmarkResults` object.
"""
return self._base_request_to_hub(
Contributor Author

This will call the corresponding hub endpoint (we should decide on what the URL should be), download the labels, then run the evaluation, returning a benchmark results object.
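
As a rough sketch only (the endpoint path, payload shape, and keyword arguments to _base_request_to_hub are assumptions here, not the final API), the client-side method could look something like:

def evaluate_competition(self, competition, y_pred) -> BenchmarkResults:
    # Hypothetical: POST the predictions to a hub endpoint that holds the target
    # labels, runs the shared evaluation logic, and returns the scores.
    response = self._base_request_to_hub(
        url=f"/competition/{competition.owner}/{competition.name}/evaluate",  # placeholder URL
        method="POST",
        json={"predictions": y_pred},
    )
    return BenchmarkResults(**response)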

Comment on lines 65 to 76
dataset = Dataset(
table=pd.DataFrame(test, columns=self.target_cols),
name=f'{self.name}_test_set',
description=f"Target labels for competition {self.name}. Used internally to evaluate competition predictions."
)
test_subset = Subset(
dataset=dataset,
indices=list(range(len(self.split[1]))),
input_cols=self.input_cols,
target_cols=self.target_cols,
hide_targets=False
)
Contributor

I don't think this is the end of the world if we decide to keep our evaluate utility expecting pydantic objects, but I think taking the extra couple of hours to avoid being hacky makes sense.

Basically, I +1 the thoughts @kiramclean has been sharing on this.

Comment on lines 854 to 876
exclude=["dataset",
"split"],
by_alias=True),
Contributor

Just revisiting our thoughts on what we need to send over the wire here. This approach works, but I'm more in favour of just sending the competition identifier (owner/name) along with the predictions.

Using the identifier, we can then call the Hub server side to get the necessary data we need into the evaluation service.

Once we decide on a server to server auth mechanism, we'll be able to implement that!
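
In other words (the field names below are illustrative, not an agreed wire format), the request body could shrink to something like:

# Slim payload: just the competition identifier plus the predictions.
payload = {
    "competition": f"{competition.owner}/{competition.name}",
    "predictions": y_pred,
}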

Contributor Author

This makes so much more sense than trying to reconstruct a valid CompetitionSpecification on the hub side. I was initially trying to avoid having to figure out how to handle auth and access the DB from the Python endpoints run as serverless functions, but you're right that we'll have to figure that out anyway, so we may as well do it here. It'll simplify a lot of things!

@Andrewq11
Contributor

Nice stuff! I left a few comments, and I hope we can use this PR as a thread for discussing what we will do about the subtle but clear differences between standard and competition benchmark evaluations.

multiple test set benchmark)."""
return not isinstance(vals, dict) or set(vals.keys()) == set(target_cols)

def evaluate_benchmark(y_pred: PredictionsType,
Contributor Author

This depends on fewer objects now; the evaluation logic no longer requires any unnecessary or wrapper-only data types. Thanks for being up for a bit of re-jiggering for the sake of simplicity.

@kirahowe requested a review from Andrewq11 on May 21, 2024 at 15:39
test = make_test_subset(test_split)

return test

def get_train_test_split(
Contributor

I think the cleanup you've done here is quite nice. In combination with pulling the evaluation logic out into a separate utility, it makes the base logic here much easier to understand.

Comment on lines +12 to +14
given data. If all keys in the given data match the target column names, we
assume they are target names (as opposed to test set names for a single-task,
multiple test set benchmark)."""
Contributor

This makes sense, but I can see this becoming a little messy in the future. Maybe this is something we can think about making more robust after we finish competitions.

Contributor Author

Yeah, I totally agree. This is at least isolated and explained here now, but we may want to rethink the data format we accept for predictions. There's a brief discussion about it here. I think it would be reasonable to always expect something like:

{"test_set_name": {"target_col_name": [1, 2, 3, 4, 5]}}

I get the desire to be as succinct as possible, though; you could imagine it being annoying to have to submit that ☝️ every time you just have a list of numbers to submit.
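
For example, a hypothetical normalization helper (the function name and the default test set name are made up here) could accept the shorthand forms and coerce them into the fully nested shape:

import numpy as np

def normalize_predictions(vals, target_cols, default_test_set="test"):
    # Coerce any accepted prediction shape into {test_set: {target_col: array}}.
    if not isinstance(vals, dict):
        # Bare list/array: single test set, single target column.
        return {default_test_set: {target_cols[0]: np.asarray(vals)}}
    if set(vals.keys()) == set(target_cols):
        # Keys match the target columns: single test set, multiple targets.
        return {default_test_set: {col: np.asarray(v) for col, v in vals.items()}}
    # Otherwise the keys are test set names; values may still be bare arrays.
    out = {}
    for test_set, preds in vals.items():
        if isinstance(preds, dict):
            out[test_set] = {col: np.asarray(v) for col, v in preds.items()}
        else:
            out[test_set] = {target_cols[0]: np.asarray(preds)}
    return out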

@Andrewq11
Contributor

I like the cleanup done here! The complexity around the various prediction objects that can be passed for evaluation is making me think we should spend some time on this after we get a working version of competitions (and before we launch it).

I know you and @cwognum have already been sharing some good thoughts on this as well.

@kirahowe
Contributor Author

Sorry for this list ☝️ A sad attempt to peacefully resolve merge conflicts went awry...

@kirahowe merged commit 6704b6f into feat/competitions on May 27, 2024
@kirahowe deleted the km/evaluation branch on May 27, 2024 at 16:20
Andrewq11 added a commit that referenced this pull request Aug 19, 2024
* competition wip

* wip

* wip

* adding methods for interfacing w/ competitions

* Continuing to integrate polaris client with the Hub for comps

* comp wip

* updating date serializer

* Competition evaluation (#103)

* call hub evaluate endpoint from client evaluate_competitions method

* add super basic test for evaluating competitions

* be more specific in evaluate_benchmark signature

* Update polaris/hub/client.py

Co-authored-by: Andrew Quirke <[email protected]>

* start refactoring object dependencies out of evaluation logic

* refactor test subset object out of evaluation logic

* clean up as much as possible for now

* updating date serializer

* call hub evaluate endpoint from client evaluate_competitions method

* Update polaris/competition/_competition.py

Co-authored-by: Andrew Quirke <[email protected]>

* updating date serializer

* call hub evaluate endpoint from client evaluate_competitions method

* add super basic test for evaluating competitions

* comp wip

* updating date serializer

* call hub evaluate endpoint from client evaluate_competitions method

* fix bad merge resolution

* only send competition artifact ID to hub

---------

Co-authored-by: Andrew Quirke <[email protected]>
Co-authored-by: Andrew Quirke <[email protected]>

* Use evaluation logic directly in hub, no need for wrapper (#109)

* use evaluation logic directly in hub, no need for wrapper

* include evaluate_benchmark in package

* remove unnecessary imports

* read incoming scores sent as json

* light formatting updates

* updating fallback version for dev build

* integrating results for comps (#111)

* integrating results for comps

* Update polaris/hub/client.py

Co-authored-by: Cas Wognum <[email protected]>

* addressing comments & adding CompetitionResults class

* test competition evalution works for multi-column dataframes

* add single column test to competition evaluation

* fix multitask-single-test-set cases

* fix bug with multi-test-set benchmarks

* adding functions to serialize & deserialize pred objs for external eval

* updating return for evaluate_competition method in client

* updating evaluate_competition method to pass additional result info to hub

---------

Co-authored-by: Cas Wognum <[email protected]>
Co-authored-by: Kira McLean <[email protected]>

* updates to enable fetching & interacting with comps

* updating requirement for eval name

* Feat/competition/eval (#114)

* integrating results for comps

* Update polaris/hub/client.py

Co-authored-by: Cas Wognum <[email protected]>

* addressing comments & adding CompetitionResults class

* test competition evalution works for multi-column dataframes

* add single column test to competition evaluation

* fix multitask-single-test-set cases

* fix bug with multi-test-set benchmarks

* adding functions to serialize & deserialize pred objs for external eval

* updating return for evaluate_competition method in client

* updating evaluate_competition method to pass additional result info to hub

* refuse early to upload a competition with a zarr-based dataset

* removing merge conflicts

---------

Co-authored-by: Andrew Quirke <[email protected]>
Co-authored-by: Andrew Quirke <[email protected]>
Co-authored-by: Cas Wognum <[email protected]>

* test that all rows of a competition test set will have at least a value (#116)

* update competition evaluation to support y_prob

* run ruff on all files and fix issues

* fix wrong url printout after upload

* Clarifying typing for nested types

* removing if_exists arg from comps

* raising error for trying to make zarr comp

* updating name of ArtifactType to ArtifactSubtype

* updating comments & removing redundant class attributes

* moving split validator logic from comp spec to benchmark spec

* removing redundant checks from CompetitionDataset class

* creating pydantic model for comp predictions

* split validator logic, redundant pydantic checks, comp pred pydantic model

* changes for comps wrap up

* Adding CompetitionsPredictionsType

* adding conversion validator for comp prediction type

* setting predictions validator as class method

* Using self instead of cls for field validators

* removing model validation on fetch from hub

* Creating HubOwner object in comp result eval method

* Documentation & tutorials for competitions

* Removing create comp method, fixing failing tests, updating benchmark label struct

* Updating docs for create comp & benchmark pred structure

* tiny wording change in competition tutorial

* Addressing PR feedback

* fixing tests & removing dataset redefinition from CompetitionDataset class

* Commenting out line in tutorial to fix test

* fixing formatting

* small fixes & depending on tableContent for dataset storage info

---------

Co-authored-by: Andrew Quirke <[email protected]>
Co-authored-by: Andrew Quirke <[email protected]>
Co-authored-by: Cas Wognum <[email protected]>