PLUMBER #27
-
Thanks for sharing this exciting benchmark! The integration of BindingDB, ChEMBL, and BioLip2 as dataset sources is intriguing. Were there any particular challenges you faced during the curation process, especially when integrating data from such diverse sources? Looking forward to exploring the annotated code in the notebooks!
-
Hi @simonatoptic, thank you for your submission! This is a great resource for the community! Its sheer size makes it valuable for pre-training, and I'm excited to see what people will use it for. We also like that you've built on top of the Plinder splits. When certifying datasets on Polaris, given our focus on providing resources for model evaluation, we ensure that datasets meet the three main criteria outlined in our Dataset 101 guidelines. Based on what was submitted, PLUMBER does not appear to meet some of these criteria.
After reviewing your code base, it's clear that you've made some thoughtful decisions (e.g., binarizing and deduplicating the test set), and we appreciate the transparency. However, based on what was submitted and our Dataset 101 guidelines, the dataset cannot be certified as it currently stands due to issues related to model evaluation. Specifically, we worry that the high variability in the databases you pull from - even for the same assay - will introduce noise into your test set. That said, if there's additional context about the dataset or ways to address the issues we've flagged, we'd be happy to support you in resolving them, as we believe this could become an even more powerful tool for the community. Looking forward to your feedback!
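For readers who want a concrete picture, here is a minimal sketch of the kind of binarization and deduplication step the review refers to. The file name, column names, and the 1 µM activity threshold are assumptions for illustration, not PLUMBER's actual pipeline:

```python
import pandas as pd

# Hypothetical test-set frame: one row per (target sequence, ligand) measurement.
# Column names and the 1 uM threshold are illustrative assumptions.
df = pd.read_csv("test_set.csv")  # columns: sequence, smiles, affinity_nM

# Deduplicate: collapse replicate measurements of the same (sequence, ligand)
# pair to their median, so a single noisy replicate cannot decide the label.
dedup = df.groupby(["sequence", "smiles"], as_index=False)["affinity_nM"].median()

# Binarize: call a pair "active" if its median affinity is at or below 1 uM.
dedup["active"] = (dedup["affinity_nM"] <= 1000).astype(int)
```

Taking the median before thresholding is one way to blunt cross-database variability, though it cannot remove systematic disagreement between assays.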
-
Thank you @cwognum for your review! As a note, the current test set already comes from the BindingDB database only, although it still covers many different original sources. What @simonatoptic and I came up with is to select assays that are internally consistent and come from a single source. If we find such assays, we will create a subset of the current test set and update the dataset/benchmark. Do you think that would be enough? Also, if you know who would be best to talk to about this, please let us know!
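A hedged sketch of what that subsetting could look like, assuming a BindingDB-derived table with per-record assay and source annotations (all column names and the 10-fold consistency cut-off are assumptions, not the actual PLUMBER code):

```python
import pandas as pd

# Hypothetical BindingDB-derived test set; column names are assumptions.
df = pd.read_csv("bindingdb_test.csv")
# columns: assay_id, source, sequence, smiles, affinity_nM

# 1) Keep only assays whose records all trace back to a single source.
single_source = df.groupby("assay_id")["source"].transform("nunique") == 1
df = df[single_source]

# 2) Drop (target, ligand) pairs whose replicate measurements disagree by
#    more than 10-fold, an arbitrary consistency cut-off for this sketch.
pair = df.groupby(["sequence", "smiles"])["affinity_nM"]
consistent = pair.transform("max") <= 10 * pair.transform("min")
subset = df[consistent]
```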
-
Polaris Link
https://polarishub.io/datasets/optic/plumber
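If you want to inspect the data directly, it should be loadable with the Polaris Python client; a minimal sketch, assuming `pip install polaris-lib` and the slug from the link above:

```python
import polaris as po

# Load the dataset from the Polaris Hub by its "owner/name" slug.
dataset = po.load_dataset("optic/plumber")
print(dataset)  # inspect the metadata; see the Polaris docs for row access
```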
README
Hey everyone! We've created a new benchmark for sequence-based binding affinity models.
Big picture blog post: Bioptic shares ML benchmark for small molecule binding prediction on Polaris
Detailed README: GitHub repository
Annotated code: Notebooks on GitHub
Let me know if you have any questions.
Dataset Source
BindingDB, ChEMBL, BioLip2
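One recurring curation challenge when merging these sources is harmonizing units and endpoints before concatenation. A minimal sketch with made-up column names and toy values (note that mixing Ki and IC50 on one scale is itself a known source of the noise discussed above):

```python
import numpy as np
import pandas as pd

# Toy per-source frames; column names and values are illustrative assumptions.
bindingdb = pd.DataFrame({"smiles": ["CCO"], "sequence": ["MKT"], "ki_nM": [250.0]})
chembl = pd.DataFrame({"smiles": ["CCN"], "sequence": ["MKT"], "ic50_nM": [900.0]})

def to_p_affinity(nM: pd.Series) -> pd.Series:
    """Convert a nanomolar affinity to -log10(molar), the usual pKi/pIC50 scale."""
    return -np.log10(nM * 1e-9)

bindingdb = bindingdb.assign(p_affinity=to_p_affinity(bindingdb["ki_nM"]), source="BindingDB")
chembl = chembl.assign(p_affinity=to_p_affinity(chembl["ic50_nM"]), source="ChEMBL")

cols = ["smiles", "sequence", "p_affinity", "source"]
merged = pd.concat([bindingdb[cols], chembl[cols]], ignore_index=True)
```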
Dataset Curation
https://github.com/optic-inc/plumber/tree/041ddb780ca9001448d5f9b373b66547443a768a
Dataset Completeness
See the `readme`, `source`, and `curation_reference` fields for my Polaris dataset.
Anything else we should know?
No response