PLUMBER #27
-
Thanks for sharing this exciting benchmark! The integration of BindingDB, ChEMBL, and BioLip2 as dataset sources is intriguing. Were there any particular challenges you faced during the curation process, especially when integrating data from such diverse sources? Looking forward to exploring the annotated code in the notebooks!
-
Hi @simonatoptic, thank you for your submission! This is a great resource for the community! Its sheer size makes it valuable for pre-training, and I'm excited to see what people will use it for. We also like that you've built on top of the Plinder splits. When certifying datasets on Polaris, given our focus on providing resources for model evaluation, we ensure that datasets meet the three main criteria outlined in our Dataset 101 guidelines. Based on what was submitted, PLUMBER does not appear to meet some of these criteria.
After reviewing your code base, it's clear that you've made some thoughtful decisions (e.g., binarizing and deduplicating the test set), and we appreciate the transparency. However, based on what was submitted and our Dataset 101 guidelines, the dataset cannot be certified as it currently stands due to issues related to model evaluation. Specifically, we worry that the high variability in the databases you pull from - even for the same assay - will introduce noise into your test set. That said, if there's additional context about the dataset or ways to address the issues we've flagged, we'd be happy to support you in resolving them, as we believe this could become an even more powerful tool for the community. Looking forward to your feedback!
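For readers who want a concrete picture, here is a minimal sketch of the kind of binarization and deduplication step the review refers to. The file name, column names, and the 1 µM activity threshold are assumptions for illustration, not PLUMBER's actual pipeline:

```python
import pandas as pd

# Hypothetical test-set frame: one row per (target sequence, ligand) measurement.
# Column names and the 1 uM threshold are illustrative assumptions.
df = pd.read_csv("test_set.csv")  # columns: sequence, smiles, affinity_nM

# Deduplicate: collapse replicate measurements of the same (sequence, ligand)
# pair to their median, so a single noisy replicate cannot decide the label.
dedup = df.groupby(["sequence", "smiles"], as_index=False)["affinity_nM"].median()

# Binarize: call a pair "active" if its median affinity is at or below 1 uM.
dedup["active"] = (dedup["affinity_nM"] <= 1000).astype(int)
```

Taking the median before thresholding is one way to blunt cross-database variability, though it cannot remove systematic disagreement between assays.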
-
Thank you @cwognum for your review! As a note, the current test set already comes from the BindingDB database only, although it still covers many different original sources. What @simonatoptic and I came up with is to select assays that are internally consistent and come from a single source. If we find such assays, we will create a subset of the current test set and update the dataset/benchmark. Do you think that would be enough? Also, if you know who would be best to talk to about this, please let us know!
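A hedged sketch of what that subsetting could look like, assuming a BindingDB-derived table with per-record assay and source annotations (all column names and the 10-fold consistency cut-off are assumptions, not the actual PLUMBER code):

```python
import pandas as pd

# Hypothetical BindingDB-derived test set; column names are assumptions.
df = pd.read_csv("bindingdb_test.csv")
# columns: assay_id, source, sequence, smiles, affinity_nM

# 1) Keep only assays whose records all trace back to a single source.
single_source = df.groupby("assay_id")["source"].transform("nunique") == 1
df = df[single_source]

# 2) Drop (target, ligand) pairs whose replicate measurements disagree by
#    more than 10-fold, an arbitrary consistency cut-off for this sketch.
pair = df.groupby(["sequence", "smiles"])["affinity_nM"]
consistent = pair.transform("max") <= 10 * pair.transform("min")
subset = df[consistent]
```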
-
Polaris Link
https://polarishub.io/datasets/optic/plumber
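If you want to inspect the data directly, it should be loadable with the Polaris Python client; a minimal sketch, assuming `pip install polaris-lib` and the slug from the link above:

```python
import polaris as po

# Load the dataset from the Polaris Hub by its "owner/name" slug.
dataset = po.load_dataset("optic/plumber")
print(dataset)  # inspect the metadata; see the Polaris docs for row access
```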
README
Hey everyone! We've created a new benchmark for sequence-based binding affinity models.
Big picture blog post: Bioptic shares ML benchmark for small molecule binding prediction on Polaris
Detailed README: GitHub repository
Annotated code: Notebooks on GitHub
Let me know if you have any questions.
Dataset Source
BindingDB, ChEMBL, BioLip2
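One recurring curation challenge when merging these sources is harmonizing units and endpoints before concatenation. A minimal sketch with made-up column names and toy values (note that mixing Ki and IC50 on one scale is itself a known source of the noise discussed above):

```python
import numpy as np
import pandas as pd

# Toy per-source frames; column names and values are illustrative assumptions.
bindingdb = pd.DataFrame({"smiles": ["CCO"], "sequence": ["MKT"], "ki_nM": [250.0]})
chembl = pd.DataFrame({"smiles": ["CCN"], "sequence": ["MKT"], "ic50_nM": [900.0]})

def to_p_affinity(nM: pd.Series) -> pd.Series:
    """Convert a nanomolar affinity to -log10(molar), the usual pKi/pIC50 scale."""
    return -np.log10(nM * 1e-9)

bindingdb = bindingdb.assign(p_affinity=to_p_affinity(bindingdb["ki_nM"]), source="BindingDB")
chembl = chembl.assign(p_affinity=to_p_affinity(chembl["ic50_nM"]), source="ChEMBL")

cols = ["smiles", "sequence", "p_affinity", "source"]
merged = pd.concat([bindingdb[cols], chembl[cols]], ignore_index=True)
```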
Dataset Curation
https://github.com/optic-inc/plumber/tree/041ddb780ca9001448d5f9b373b66547443a768a
Dataset Completeness
See the `readme`, `source`, and `curation_reference` fields for my Polaris dataset.
Anything else we should know?
No response