MassSpecGym: A benchmark for the discovery and identification of molecules #28

roman-bushuiev · 2024-11-26T22:11:52Z


Polaris Link

https://polarishub.io/datasets/roman-bushuiev/massspecgym

README

MassSpecGym: A benchmark for the discovery and identification of molecules

The discovery and identification of molecules in biological and environmental samples is crucial for advancing biomedical and chemical sciences. Tandem mass spectrometry (MS/MS) is the leading technique for high-throughput elucidation of molecular structures. However, decoding a molecular structure from its mass spectrum is exceptionally challenging, even when performed by human experts. As a result, the vast majority of acquired MS/MS spectra remain uninterpreted, thereby limiting our understanding of the underlying (bio)chemical processes. Despite decades of progress in machine learning applications for predicting molecular structures from MS/MS spectra, the development of new methods is severely hindered by the lack of standard datasets and evaluation protocols.

To address this problem, we propose MassSpecGym, the first comprehensive benchmark for the discovery and identification of molecules from MS/MS data. Our benchmark comprises the largest publicly available collection of high-quality labeled MS/MS spectra and defines three MS/MS annotation challenges: de novo molecular structure generation, molecule retrieval, and spectrum simulation. It includes new evaluation metrics and a generalization-demanding data split, thereby standardizing the MS/MS annotation tasks and making the problem accessible to the broad machine learning community.

🧪 MassSpecGym dataset

MassSpecGym comprises 231 thousand high-quality MS/MS spectra labeled with molecular structures, the largest publicly available collection of its kind. The dataset includes spectra exhaustively collected from well-established public repositories (i.e., MoNA, MassBank, and GNPS), as well as new spectra generated from our in-house mass spectrometry measurements.

Dataset	Spectra	High-quality spectra	Molecules	Split
GNPS [Wang et al., 2016]	322K	104K	16K	✗
MoNA [Fiehn lab]	98K	62K	10K	✗
MassBank [Horai et al., 2010]	62K	58K	4K	✗
MIST CANOPUS [Goldman et al., 2023]	11K	≤11K	≤9K	✓
MassSpecGym (ours)	231K	231K	29K	✓

The curation of the dataset involves a pipeline of cleaning and standardization steps, ensuring high-quality data while preserving a broad coverage of molecular structures and mass spectrometry settings. The table below introduces all the variables present in MassSpecGym. The main ones are `mzs` and `intensities`, representing MS/MS spectra, and `smiles`, representing the corresponding molecular structures.

Variable	Description	Data type	Num. unique values	Example
`identifier`	Unique entry identifier	string	231,104	MassSpecGymID0088683
`mzs`	Array of spectrum m/z values	n × float	231,104	[55.0542, 57.0699, ..., 238.0995]
`intensities`	Array of spectrum intensities	n × float	231,104	[0.0240, 1.0, ..., 0.5356]
`smiles`	SMILES string of molecule	string	31,602	CCCCOCN(C1=C(C=C...CCl
`inchikey`	2D InChI key	string	28,929	HKPHPIREJKHECO
`formula`	Chemical formula of molecule	string	17,634	C17H26ClNO2
`precursor_formula`	Chemical formula of precursor ion	string	21,653	C17H27ClNO2
`parent_mass`	Mass of molecule	float	32,228	311.1652
`precursor_mz`	M/z of precursor ion	float	32,275	312.1725
`adduct`	Ionization adduct	string	2	[M+H]+
`instrument_type`	Type of MS instrument	string	2	Orbitrap
`collision_energy`	Energy of CID fragmentation	float	9,737	30.0
`fold`	Split fold which entry belongs to	string	3	train
`simulation_challenge`	Entry is used for simulation challenge	boolean	2	True
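
As a rough illustration of how the variables above relate to one another, the sketch below builds a single record from the Example column and checks that `precursor_mz` is consistent with the molecule mass and the `[M+H]+` adduct. The `Entry` class and the `expected_precursor_mz` helper are hypothetical names for illustration, not part of the MassSpecGym code base. The field name `parent_mass` (written with an underscore) and the proton mass constant are assumptions, and only the peaks actually shown in the table example are used, since the rest are elided there.

```python
from dataclasses import dataclass

# Assumed constant: mass of a proton in Da, used for the [M+H]+ adduct.
PROTON_MASS = 1.007276


@dataclass
class Entry:
    """Hypothetical container for a subset of the MassSpecGym variables."""
    identifier: str
    mzs: list[float]          # spectrum m/z values
    intensities: list[float]  # spectrum intensities, same length as mzs
    formula: str
    parent_mass: float        # assumed field name for the molecule mass
    precursor_mz: float
    adduct: str
    fold: str


def expected_precursor_mz(parent_mass: float, adduct: str) -> float:
    """Return the expected precursor m/z for the adducts handled in this sketch."""
    if adduct == "[M+H]+":
        return parent_mass + PROTON_MASS
    raise ValueError(f"adduct not handled in this sketch: {adduct}")


# Values copied from the Example column; the elided peaks are left out.
entry = Entry(
    identifier="MassSpecGymID0088683",
    mzs=[55.0542, 57.0699, 238.0995],
    intensities=[0.0240, 1.0, 0.5356],
    formula="C17H26ClNO2",
    parent_mass=311.1652,
    precursor_mz=312.1725,
    adduct="[M+H]+",
    fold="train",
)

# Absolute difference between the stored and the expected precursor m/z.
delta = abs(entry.precursor_mz - expected_precursor_mz(entry.parent_mass, entry.adduct))
print(f"precursor m/z deviation: {delta:.6f}")
```

The same one-proton shift shows up in the formulas: `precursor_formula` (C17H27ClNO2) has one more hydrogen than `formula` (C17H26ClNO2).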

🏆 MassSpecGym benchmark

MassSpecGym provides three challenges for benchmarking the discovery and identification of new molecules from MS/MS spectra:

💥 De novo molecule generation (MS/MS spectrum → molecular structure)
- ✨ Bonus chemical formulae challenge (MS/MS spectrum + chemical formula → molecular structure)
💥 Molecule retrieval (MS/MS spectrum → ranked list of candidate molecular structures)
- ✨ Bonus chemical formulae challenge (MS/MS spectrum → ranked list of candidate molecular structures with ground-truth chemical formulae)
💥 Spectrum simulation (molecular structure → MS/MS spectrum)
- ✨ Bonus chemical formulae challenge (molecular structure → MS/MS spectrum; evaluated on the retrieval of molecular structures with ground-truth chemical formulae)

The provided challenges abstract the process of scientific discovery from biological and environmental samples into well-defined machine learning problems with pre-defined datasets, data splits, and evaluation metrics.
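
To make the retrieval challenge more concrete, the sketch below scores ranked candidate lists with a hit rate at k, i.e. the fraction of query spectra whose true molecule appears among the top k candidates. This README does not spell out the evaluation metrics, so the choice of hit rate at k is an assumption here, as is the use of 2D InChI keys to compare molecules. The function names are hypothetical, and all keys other than the one taken from the variable table are placeholders.

```python
def hit_at_k(ranked_candidates: list[str], true_key: str, k: int) -> float:
    """Return 1.0 if the true molecule is among the top-k candidates, else 0.0."""
    return 1.0 if true_key in ranked_candidates[:k] else 0.0


def mean_hit_rate_at_k(queries: list[tuple[list[str], str]], k: int) -> float:
    """Average hit@k over (ranked candidate list, true key) pairs."""
    if not queries:
        raise ValueError("at least one query is required")
    return sum(hit_at_k(ranked, truth, k) for ranked, truth in queries) / len(queries)


# Toy queries: each pairs a ranked candidate list with the true 2D InChI key.
# Only "HKPHPIREJKHECO" comes from the variable table; the rest are placeholders.
queries = [
    (["HKPHPIREJKHECO", "AAAAAAAAAAAAAA", "BBBBBBBBBBBBBB"], "HKPHPIREJKHECO"),
    (["AAAAAAAAAAAAAA", "BBBBBBBBBBBBBB", "CCCCCCCCCCCCCC"], "CCCCCCCCCCCCCC"),
]

for k in (1, 3):
    print(f"hit rate@{k}: {mean_hit_rate_at_k(queries, k):.2f}")
```

In this toy setup the first query is a hit at rank 1 and the second only at rank 3, so the score rises from 0.50 at k=1 to 1.00 at k=3.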

📩 Contact

For any questions or suggestions, please contact the authors via email: [email protected].

🔗 References

📃 NeurIPS 2024 Spotlight paper: https://arxiv.org/abs/2410.23326.

💻 GitHub repository: https://github.com/pluskal-lab/MassSpecGym.

🤗 Hugging Face page: https://huggingface.co/datasets/roman-bushuiev/MassSpecGym.

If you use MassSpecGym in your work, please cite the following paper:

@article{bushuiev2024massspecgym,
      title={MassSpecGym: A benchmark for the discovery and identification of molecules}, 
      author={Roman Bushuiev and Anton Bushuiev and Niek F. de Jonge and Adamo Young and Fleming Kretschmer and Raman Samusevich and Janne Heirman and Fei Wang and Luke Zhang and Kai Dührkop and Marcus Ludwig and Nils A. Haupt and Apurva Kalia and Corinna Brungs and Robin Schmid and Russell Greiner and Bo Wang and David S. Wishart and Li-Ping Liu and Juho Rousu and Wout Bittremieux and Hannes Rost and Tytus D. Mak and Soha Hassoun and Florian Huber and Justin J. J. van der Hooft and Michael A. Stravs and Sebastian Böcker and Josef Sivic and Tomáš Pluskal},
      year={2024},
      eprint={2410.23326},
      url={https://arxiv.org/abs/2410.23326},
      doi={10.48550/arXiv.2410.23326}
}

Dataset Source

https://doi.org/10.48550/arXiv.2410.23326

Dataset Curation

https://github.com/pluskal-lab/MassSpecGym

Dataset Completeness

I confirm that I filled out at least the readme, source and curation_reference fields for my Polaris dataset.

Anything else we should know?

No response

roman-bushuiev · 2024-11-26T22:16:46Z

Author

Hi! We have developed a new benchmark for the discovery of molecules from biological and environmental samples, called MassSpecGym. We would be happy to include it as a certified dataset in the Polaris Hub. Could you please let us know if this is possible and, if so, whether it can be accomplished before NeurIPS 2024, where we are going to present it? Thank you!

1 reply

cwognum Nov 29, 2024
Maintainer

Hi @roman-bushuiev , thanks for the submission! We'll get back to you before NeurIPS! 👍

zhu0619 · 2024-12-06T17:42:55Z

Maintainer

Hi @roman-bushuiev,
Thank you for submitting the dataset. After a thorough review, we are pleased to inform you that MassSpecGym is certified!

The work is an excellent contribution to the field, and we appreciate the effort behind it. As datasets are published to the community and the field continues to evolve, future discussions and questions may arise. We encourage you to stay engaged in the ongoing discourse and explore opportunities for further work.

1 reply

roman-bushuiev Dec 9, 2024
Author

Hi @zhu0619! This is great news, and we are very happy about it! We are looking forward to future collaborations!
