MassSpecGym: A benchmark for the discovery and identification of molecules #28
Hi! We have developed a new benchmark for the discovery of molecules from biological and environmental samples, called MassSpecGym. We would be happy to include it as a certified dataset in the Polaris Hub. Could you please let us know if this is possible and, if so, whether it can be accomplished before NeurIPS 2024, where we are going to present it? Thank you!
Hi @roman-bushuiev, The work is an excellent contribution to the field, and we appreciate the effort behind it. As datasets are published to the community and the field continues to evolve, future discussions and questions may arise. We encourage you to stay engaged in the ongoing discourse and explore opportunities for further work.
Polaris Link
https://polarishub.io/datasets/roman-bushuiev/massspecgym
README
MassSpecGym: A benchmark for the discovery and identification of molecules
The discovery and identification of molecules in biological and environmental samples is crucial for advancing biomedical and chemical sciences. Tandem mass spectrometry (MS/MS) is the leading technique for high-throughput elucidation of molecular structures. However, decoding a molecular structure from its mass spectrum is exceptionally challenging, even when performed by human experts. As a result, the vast majority of acquired MS/MS spectra remain uninterpreted, thereby limiting our understanding of the underlying (bio)chemical processes. Despite decades of progress in machine learning applications for predicting molecular structures from MS/MS spectra, the development of new methods is severely hindered by the lack of standard datasets and evaluation protocols.
To address this problem, we propose MassSpecGym - the first comprehensive benchmark for the discovery and identification of molecules from MS/MS data. Our benchmark comprises the largest publicly available collection of high-quality labeled MS/MS spectra and defines three MS/MS annotation challenges: de novo molecular structure generation, molecule retrieval, and spectrum simulation. It includes new evaluation metrics and a generalization-demanding data split, therefore standardizing the MS/MS annotation tasks and rendering the problem accessible to the broad machine learning community.
🧪 MassSpecGym dataset
MassSpecGym comprises the largest publicly available collection of 231 thousand high-quality MS/MS spectra labeled with molecular structures. The dataset includes spectra exhaustively collected from well-established public repositories (i.e., MoNA, MassBank, and GNPS), as well as new spectra generated from our in-house mass spectrometry measurements.
The curation of the dataset involves a pipeline of cleaning and standardization steps, ensuring high-quality data while preserving a broad coverage of molecular structures and mass spectrometry settings. The table below introduces all the variables present in MassSpecGym. The main ones are mzs and intensities, representing MS/MS spectra, and smiles, representing the corresponding molecular structures.
identifier
mzs
intensities
smiles
inchikey
formula
precursor_formula
parent mass
precursor_mz
adduct
instrument_type
collision_energy
fold
simulation_challenge
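As a rough illustration of how the main variables fit together, the sketch below parses one row into a peak list and scales the intensities to the base peak. It assumes that mzs and intensities arrive as comma-separated strings of equal length; that storage format, the `Spectrum` class, and the helper names `parse_record` and `normalize` are assumptions made for this example, and the record shown is made up rather than taken from the dataset.

```python
from dataclasses import dataclass


@dataclass
class Spectrum:
    """One labeled MS/MS spectrum (hypothetical container for this sketch)."""
    mzs: list[float]
    intensities: list[float]
    smiles: str
    fold: str


def parse_record(record: dict[str, str]) -> Spectrum:
    """Parse one row whose values are all strings.

    Assumption: mzs and intensities are comma-separated numbers.
    """
    mzs = [float(x) for x in record["mzs"].split(",")]
    intensities = [float(x) for x in record["intensities"].split(",")]
    if len(mzs) != len(intensities):
        raise ValueError("mzs and intensities must have the same length")
    return Spectrum(mzs, intensities, record["smiles"], record["fold"])


def normalize(spectrum: Spectrum) -> list[tuple[float, float]]:
    """Return (m/z, intensity) pairs with the strongest peak scaled to 1.0."""
    top = max(spectrum.intensities)
    return [(mz, i / top) for mz, i in zip(spectrum.mzs, spectrum.intensities)]


# Made-up record for illustration only; not real MassSpecGym data.
example = {
    "mzs": "29.0,31.0,45.0",
    "intensities": "10,50,100",
    "smiles": "CCO",
    "fold": "train",  # assumed fold label
}
peaks = normalize(parse_record(example))
```

Scaling to the base peak is a common convention for comparing spectra acquired at different absolute intensities; whether the benchmark itself applies this step is not stated here.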
🏆 MassSpecGym benchmark
MassSpecGym provides three challenges for benchmarking the discovery and identification of new molecules from MS/MS spectra: de novo molecular structure generation, molecule retrieval, and spectrum simulation.
The provided challenges abstract the process of scientific discovery from biological and environmental samples into well-defined machine learning problems with pre-defined datasets, data splits, and evaluation metrics.
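To make the idea of a pre-defined evaluation metric concrete, here is a minimal sketch of a generic top-k hit rate for a retrieval task. It is a standard ranking metric and is not claimed to be the metric MassSpecGym defines; the function name `hit_rate_at_k` and the placeholder identifiers are hypothetical.

```python
def hit_rate_at_k(ranked_candidates: list[list[str]],
                  true_ids: list[str],
                  k: int) -> float:
    """Fraction of queries whose true molecule is among the top-k candidates.

    ranked_candidates[i] is the candidate list for query i, best first.
    true_ids[i] is the identifier of the correct molecule for query i.
    """
    if len(ranked_candidates) != len(true_ids):
        raise ValueError("one ranking is required per query")
    if not true_ids:
        raise ValueError("at least one query is required")
    hits = sum(
        1 for ranking, truth in zip(ranked_candidates, true_ids)
        if truth in ranking[:k]
    )
    return hits / len(true_ids)


# Placeholder identifiers standing in for molecule keys.
rankings = [["A", "B", "C"], ["B", "A", "C"], ["C", "B", "A"]]
truths = ["A", "A", "A"]
top1 = hit_rate_at_k(rankings, truths, k=1)  # only the first query hits
top3 = hit_rate_at_k(rankings, truths, k=3)  # every query hits
```

Fixing the candidate lists and the metric in advance is what lets different methods be compared on equal terms.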
📩 Contact
For any questions or suggestions, please contact the authors via email: [email protected].
🔗 References
📃 NeurIPS 2024 Spotlight paper: https://arxiv.org/abs/2410.23326.
💻 GitHub repository: https://github.com/pluskal-lab/MassSpecGym.
🤗 Hugging Face page: https://huggingface.co/datasets/roman-bushuiev/MassSpecGym.
If you use MassSpecGym in your work, please cite the following paper:
Dataset Source
https://doi.org/10.48550/arXiv.2410.23326
Dataset Curation
https://github.com/pluskal-lab/MassSpecGym
Dataset Completeness
readme, source and curation_reference fields for my Polaris dataset.
Anything else we should know?
No response