Authors: Adamo Young, Hannes Röst & Bo Wang
DOI: 10.1038/s42256-024-00816-8
Published: April 2024
Journal: Nature Machine Intelligence
Why discuss this paper?
- I chose this article because, even after decades of work in the field, the vast majority of small molecules still lack experimental reference spectra, which makes identifying new compounds difficult.
- It tackles the fundamental challenge in Mass Spectrometry (MS) of accurately simulating the fragmentation process.
- The results show promise for improved MS-based compound identification.
Context
Overview of Mass Spectrometry
Challenges in Mass Spectrometry
- Mass spectrometry faces several significant challenges that impact its effectiveness.
Overview of Existing Solutions and Limitations
Overview of MassFormer
Overview of Steps involved in MassFormer
Molecule to Spectrum
- Unlike traditional methods, MassFormer uses a graph-based approach to represent molecules.
- Incorporating spectral metadata, such as the collision energy used during the measurement, allows MassFormer to account for factors that might influence the fragmentation patterns in the final spectrum prediction.
- Multihead self-attention (MHA): allows an input to attend to relevant parts of itself.
- Multilayer perceptrons (MLPs): capture non-linear relationships between the resulting representation and the final output.
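To make the graph representation and the self-attention step more concrete, here is a minimal sketch in plain Python. It is not MassFormer's implementation: the toy molecule (the heavy atoms of ethanol), the three-element atom features, and the function names are all assumptions made for illustration. It uses a single head with no learned query/key/value projections, whereas a real MHA layer has several heads with trained weights.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def atom_features(atoms, bonds):
    """Hypothetical features per atom: [is_carbon, is_oxygen, degree]."""
    degree = [0] * len(atoms)
    for i, j in bonds:
        degree[i] += 1
        degree[j] += 1
    return [[float(a == "C"), float(a == "O"), float(d)]
            for a, d in zip(atoms, degree)]

def self_attention(features):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys and values are the raw features
    (identity projections), so nothing here is learned.
    """
    d = len(features[0])
    outputs, weights = [], []
    for q in features:
        # Similarity of this atom to every atom, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in features]
        w = softmax(scores)
        weights.append(w)
        # New representation: attention-weighted average of all atoms.
        outputs.append([sum(wj * v[i] for wj, v in zip(w, features))
                        for i in range(d)])
    return outputs, weights

# Toy molecular graph: nodes are atoms, edges are bonds (C-C-O).
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]

features = atom_features(atoms, bonds)
outputs, weights = self_attention(features)
```

Each row of `weights` sums to 1 and says how strongly one atom attends to every atom in the molecule, including itself.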
Results
Spectrum similarity experiments: Predicted vs. real spectra for two held-out compounds (Azlocillin and Flamprop-isopropyl)
- Accurate predictions: MassFormer can predict realistic mass spectra for held-out compounds. The cosine similarity values (around 0.6) indicate a close match between the predicted and actual spectra.
- Outperforms existing models: When compared to existing models (CFM, FP, WLN) on a larger dataset, MassFormer consistently performs better. It shows the highest average cosine similarity across different data splits.
Abbreviations:
- NIST: National Institute of Standards and Technology
- MoNA: MassBank of North America
- FP: Fingerprint neural network model
- WLN: Weisfeiler–Lehman graph neural network model
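The cosine similarity used to compare spectra can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the 1.0 m/z bin width, the absence of any intensity transform, and the example peak lists are assumptions.

```python
import math
from collections import defaultdict

def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins.

    `peaks` is a list of (m/z, intensity) pairs; the bin width is an
    assumed value, not one taken from the paper.
    """
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return dict(binned)

def cosine_similarity(spec_a, spec_b, bin_width=1.0):
    """Cosine of the angle between two binned spectra (0 = no shared peaks)."""
    a = bin_spectrum(spec_a, bin_width)
    b = bin_spectrum(spec_b, bin_width)
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up peak lists, for illustration only.
predicted = [(91.05, 100.0), (119.08, 40.0), (147.10, 10.0)]
measured = [(91.06, 80.0), (119.09, 50.0), (160.20, 20.0)]

similarity = cosine_similarity(predicted, measured)
```

Two spectra with identical binned peaks score 1, and spectra with no peaks in common score 0, so a value around 0.6 sits between those extremes.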
Collision energy experiments: Benzhydrylpiperazine
- MassFormer not only predicts mass spectra accurately but also captures the influence of collision energy on the fragmentation patterns observed in the spectra.
- This ability is important for real-world applications, where collision energy varies during data acquisition.
- NCEs: Normalized collision energies
Explainability using gradient attributions: Mass spectrum of propranolol
- Separation of nitrogen-containing peaks: The gradient attribution maps reveal that nitrogen-containing peaks can be linearly separated from those without nitrogen, demonstrating that the model captures meaningful chemical information related to nitrogen content.
- Improved classification accuracy: The linear classification accuracy is significantly higher when using nitrogen labeling compared to random labeling, with nearly half of the peaks being perfectly separable based on their gradient attributions, highlighting the effectiveness of the attributions in distinguishing chemical compositions.
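The idea of checking whether gradient attributions are linearly separable by a label can be sketched with a toy example. Everything here is assumed for illustration: the "model" is a hypothetical tanh of a weighted sum rather than MassFormer, gradients are approximated by central finite differences rather than automatic differentiation, and separability is tested with a plain perceptron rather than the classifier used in the paper.

```python
import math

# Hypothetical toy model: one row of weights per predicted peak,
# one column per input feature (think of one scalar per atom).
PEAK_WEIGHTS = [
    [0.1, 0.9, 0.0],
    [0.2, 0.8, 0.1],
    [0.7, 0.0, 0.3],
    [0.5, 0.1, 0.4],
]

def predict_peak(k, x):
    """Toy intensity of peak k for input features x."""
    return math.tanh(sum(w * xi for w, xi in zip(PEAK_WEIGHTS[k], x)))

def gradient_attribution(k, x, eps=1e-6):
    """Approximate d(peak k)/d(x) with central finite differences."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((predict_peak(k, up) - predict_peak(k, down)) / (2 * eps))
    return grad

def linearly_separable(vectors, labels, epochs=1000):
    """Return True if a perceptron reaches zero training errors.

    Labels are +1 or -1. Failing to converge within `epochs` is
    treated as "not separable", which is only a heuristic.
    """
    w = [0.0] * len(vectors[0])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(vectors, labels):
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                errors += 1
        if errors == 0:
            return True
    return False

x = [1.0, 2.0, 0.5]
attributions = [gradient_attribution(k, x) for k in range(len(PEAK_WEIGHTS))]
# Made-up labels: +1 if the peak's formula contains nitrogen, else -1.
contains_nitrogen = [1, 1, -1, -1]
separable = linearly_separable(attributions, contains_nitrogen)
```

Running the same check with randomly shuffled labels gives a baseline against which the nitrogen labeling can be compared, in the spirit of the random-labeling comparison described above.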
Advantages of MassFormer
Takeaways
- The current study introduces MassFormer, a novel method utilizing graph transformers for small molecule MS/MS spectra prediction.
- While it demonstrates strong performance, MassFormer’s applicability is currently limited by the range of data types it is compatible with.
- The model offers explainability for its predictions; however, further development is needed for detailed peak annotations.
- MassFormer holds significant promise for MS-based compound identification, potentially enhancing existing tools and even aiding spectrum-to-structure generation.
Data availability
All public data from the study have been uploaded to Zenodo at https://doi.org/10.5281/zenodo.8399738. Some data that support the findings of this study are available from the National Institute of Standards and Technology (NIST). However, access to these data is restricted and requires the purchase of an appropriate license or special permission from NIST.
Code availability
The code used in this study is open-source (BSD-2-Clause license) and can be found in a GitHub repository (https://github.com/Roestlab/massformer/) with a DOI of https://doi.org/10.5281/zenodo.10558852.