NMR shift prediction from small data quantities

Author

Noura Rayya

Published

January 22, 2025

Why discuss this paper?

I chose “NMR shift prediction from small data quantities” (Rull, Fischer, and Kuhn 2023) for the current topics in cheminformatics seminar because:
- Many of the team members and I work on developing the nmrXiv repository, and we would like to introduce NMR shift prediction there in the future, which makes the article very relevant.
- It is crucial for us to be familiar with the available NMR shift prediction models and to understand the conditions under which they perform best.
- We want to understand the effect of NMR experiment parameters (nuclei, solvents) on the prediction results.

Context

Machine Learning (ML) models perform best when trained with the maximum amount of data available. For heteronuclei NMR experiments, such an amount of data is not available. The paper demonstrates a novel ML model (called the 2023 model) that achieves good results for low amounts of data. They predict ¹³C and ¹⁹F NMR chemical shifts of small molecules in specific solvents.

Fig1: Graphic abstract of the paper

Introduction

NMR chemical shift prediction can be done in two ways:
- Ab initio: the chemical shifts are calculated from the molecular structure and theoretical principles. These methods require no experimental data but are computationally intensive.
- Using existing data: for example, ML models predict shifts for new compounds from the experimental data of other compounds. These methods are computationally efficient (Jonas, Kuhn, and Schlörer 2022).

Problem setting

The amount of experimental data in NMR for heteronuclei, specific classes of compounds, and particular solvents is limited.
The number of data points used is a significant factor in the quality of the predictions.
Some machine learning methods only show suitable predictive power with more than 5000 training examples.

Approach

The research was restricted to small organic molecules in solution, as other NMR fields require different and more specific models.
It focuses on the prediction of NMR chemical shifts using ML (graph neural networks).
The molecules were represented as a graph, where the atoms are nodes, and the bonds are edges. Each has its own features.
Atoms / Nodes features: Atomic number, Atomic radius, Number of neutrons, Electronegativity, Electron affinity.
Bonds / Edges features: Bond length, Bond type.
These features were selected based on preliminary experiments, which measured the impact of each feature on the final prediction.
The best-performing features were combined until the prediction quality stopped improving.
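
To make the graph representation concrete, here is a minimal sketch of how a molecule could be stored with the node and edge features listed above. The class and function names are hypothetical, the element properties are approximate textbook values, and which definition of "atomic radius" the paper uses is an assumption (covalent radii are used here).

```python
from dataclasses import dataclass

# Approximate per-element properties, for illustration only:
# (atomic number, covalent radius in pm, neutrons of the most abundant isotope,
#  Pauling electronegativity, electron affinity in eV)
ELEMENT_PROPERTIES = {
    "H": (1, 31, 0, 2.20, 0.754),
    "C": (6, 76, 6, 2.55, 1.262),
    "O": (8, 66, 8, 3.44, 1.461),
    "F": (9, 57, 10, 3.98, 3.401),
}


@dataclass
class MolecularGraph:
    node_features: list  # one feature tuple per atom
    edges: list          # (atom_i, atom_j) index pairs, one per bond
    edge_features: list  # (bond length in angstrom, bond order) per bond


def build_graph(symbols, bonds):
    """symbols: element symbols; bonds: (i, j, length_angstrom, order) tuples."""
    nodes = [ELEMENT_PROPERTIES[s] for s in symbols]
    edges = [(i, j) for i, j, _, _ in bonds]
    edge_feats = [(length, order) for _, _, length, order in bonds]
    return MolecularGraph(nodes, edges, edge_feats)


# Fluoromethane (CH3F): atom 0 is C, atom 1 is F, atoms 2 to 4 are H.
ch3f = build_graph(
    ["C", "F", "H", "H", "H"],
    [(0, 1, 1.38, 1), (0, 2, 1.09, 1), (0, 3, 1.09, 1), (0, 4, 1.09, 1)],
)
print(len(ch3f.node_features), "atoms,", len(ch3f.edges), "bonds")
```

In practice the structures and bond lengths would come from a cheminformatics toolkit rather than being typed by hand.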

Information Flow

Fig2: Schematic flow of information in the message-passing graph network. The workflow can be split into the following steps: encoding, message-passing, and prediction of shifts with an MLP.

The chosen features are encoded by multi-layer perceptrons (MLPs).
The encoded features are then processed through multiple rounds of message passing in a graph network (Duvenaud et al. 2015; Gilmer et al. 2017).
This results in new node features that get passed to another MLP that predicts the final chemical shifts.
The most important hyperparameters were the number of message-passing steps, the learning rate, and the weight decay.
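
The three steps (encoding, message passing, readout) can be sketched in plain Python. This is only an illustration of the general technique, not the paper's architecture: single linear layers with ReLU stand in for the MLPs, edge features are ignored, the sum aggregation is an assumption, and the weights are random and untrained, so the output values are meaningless.

```python
import random


def linear(weights, x):
    """Multiply a weight matrix (a list of rows) with a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(x):
    return [max(0.0, v) for v in x]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def predict_shifts(node_features, edges, params, steps=3):
    # 1. Encoding: map the raw atom features to hidden states.
    hidden = [relu(linear(params["encode"], x)) for x in node_features]
    # 2. Message passing: every atom sums its neighbours' states, then updates.
    for _ in range(steps):
        messages = [[0.0] * len(hidden[0]) for _ in hidden]
        for i, j in edges:  # bonds are undirected, so messages go both ways
            messages[i] = [m + h for m, h in zip(messages[i], hidden[j])]
            messages[j] = [m + h for m, h in zip(messages[j], hidden[i])]
        # h + m concatenates the atom's own state with its incoming message.
        hidden = [relu(linear(params["update"], h + m))
                  for h, m in zip(hidden, messages)]
    # 3. Readout: one scalar (the predicted shift) per atom.
    return [linear(params["readout"], h)[0] for h in hidden]


rng = random.Random(0)
n_features, n_hidden = 5, 8
params = {
    "encode": random_matrix(n_hidden, n_features, rng),
    "update": random_matrix(n_hidden, 2 * n_hidden, rng),
    "readout": random_matrix(1, n_hidden, rng),
}
nodes = [[6, 0.76, 6, 2.55, 1.26], [9, 0.57, 10, 3.98, 3.40]]  # a C-F fragment
shifts = predict_shifts(nodes, [(0, 1)], params, steps=3)
print(shifts)
```

The `steps` argument corresponds to the number of message-passing steps named above as a key hyperparameter: each extra step lets information travel one bond further.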

Performance comparison for ¹³C NMR shift prediction

Two other prediction methods were used for comparison:
1. Hierarchically Ordered Spherical Environment (HOSE) codes:
  - They describe atoms and their environments as strings.
  - Compounds with similar strings and known chemical shifts are looked up and used for prediction (Bremser 1978).
2. The 2019 model:
  - It uses a graph convolutional neural network (Jonas and Kuhn 2019).
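
The lookup idea behind HOSE-based prediction can be sketched as follows. This is a toy under stated assumptions: the environment strings are made-up stand-ins rather than real HOSE codes, the shift values are invented, and falling back to fewer spheres when no exact match exists is a common strategy that is assumed here, not taken from the paper.

```python
from statistics import mean


def build_lookup(training_atoms):
    """training_atoms: (environment, shift) pairs, where environment is a tuple
    of per-sphere strings (hypothetical stand-ins for HOSE code spheres)."""
    table = {}
    for environment, shift in training_atoms:
        # Index the atom under every prefix: 1 sphere, 2 spheres, ...
        for n in range(1, len(environment) + 1):
            table.setdefault(environment[:n], []).append(shift)
    return table


def predict(table, environment, min_spheres=1):
    """Average the shifts of the most specific matching environment.

    Returns None when not even the shortest allowed prefix is known."""
    for n in range(len(environment), min_spheres - 1, -1):
        shifts = table.get(environment[:n])
        if shifts:
            return mean(shifts)
    return None


# Invented toy data: (spheres around the atom, shift in ppm).
training = [
    (("C;HHH", "C", "O"), 14.0),
    (("C;HHH", "C", "C"), 12.0),
    (("C;HH", "CC", "O"), 30.0),
]
table = build_lookup(training)

exact = predict(table, ("C;HHH", "C", "O"))     # all three spheres match
fallback = predict(table, ("C;HHH", "C", "N"))  # only two spheres match
missing = predict(table, ("N;HH", "C", "C"))    # nothing similar is known
print(exact, fallback, missing)
```

The `None` result in the last case is the situation in which a lookup-based method cannot give a prediction at all.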
Data:
- All data was taken from NMRShiftDB2, an open NMR database (Oellien, Fechner, and Engel 2012), which contains lists of chemical shift values.
- The datasets consist of random selections of structures.
Evaluation:
- 75:25 training-to-test split.
- No separate validation set due to the small size of the datasets.
- The predictive performance of the different models was analyzed when trained on an increasing number of molecules.
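
The evaluation protocol can be sketched as a learning curve. The helper names are hypothetical, and one detail is an assumption: this sketch draws a random subset for each size and splits that subset 75:25, whereas the paper may construct its subsets and test sets differently.

```python
import random


def split_75_25(items, seed=0):
    """Shuffle and split into 75% training and 25% test data."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(0.75 * len(items))
    return items[:cut], items[cut:]


def learning_curve(molecules, sizes, train_fn, error_fn, seed=0):
    """Return {subset size: test error} for increasingly large random subsets."""
    rng = random.Random(seed)
    curve = {}
    for size in sizes:
        subset = rng.sample(molecules, size)
        train, test = split_75_25(subset, seed)
        model = train_fn(train)
        curve[size] = error_fn(model, test)
    return curve


# Toy stand-ins: each "molecule" is an (id, target) pair, the "model" is just
# the mean training target, and the error is the mean absolute error.
data = [(i, float(i % 10)) for i in range(400)]
mean_model = lambda train: sum(y for _, y in train) / len(train)
mae = lambda model, test: sum(abs(y - model) for _, y in test) / len(test)

curve = learning_curve(data, [100, 200, 400], mean_model, mae)
print(curve)
```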
Error metrics:
- The mean absolute error (MAE): less sensitive to large errors.
- The root mean squared error (RMSE): gives more weight to larger errors.
- The mean absolute scaled error (MASE): insensitive to the scale of the data, which helps when comparing different nuclei or solvents.
- The standard deviation σ of the error: the variability of the errors in a set of predictions.
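
A minimal sketch of the four metrics in plain Python follows. The MASE scaling is an assumption: here the MAE is divided by the MAE of a naive baseline that always predicts the training-set mean, and the paper's exact definition may differ. The example values are invented.

```python
import math


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def mase(y_true, y_pred, y_train):
    """MAE divided by the MAE of a naive baseline (assumed: the training mean)."""
    baseline = sum(y_train) / len(y_train)
    return mae(y_true, y_pred) / mae(y_true, [baseline] * len(y_true))


def error_std(y_true, y_pred):
    """Population standard deviation of the signed errors."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mean_error = sum(errors) / len(errors)
    return math.sqrt(sum((e - mean_error) ** 2 for e in errors) / len(errors))


# Invented shifts in ppm; the last prediction is off by 4 ppm.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [11.0, 19.0, 30.0, 44.0]

print(mae(y_true, y_pred))           # 1.5
print(rmse(y_true, y_pred))          # sqrt(4.5), about 2.12
print(mase(y_true, y_pred, y_true))  # 0.15, the baseline MAE is 10
print(error_std(y_true, y_pred))     # sqrt(3.5), about 1.87
```

The single 4 ppm outlier pulls the RMSE well above the MAE, which is the difference in sensitivity described in the list above.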

Results

MAE of a ¹³C NMR shift prediction, using increasing numbers of spectra

RMSE of a ¹³C NMR shift prediction, using increasing numbers of spectra

MASE of a ¹³C NMR shift prediction, using increasing numbers of spectra

σ of a ¹³C NMR shift prediction, using increasing numbers of spectra

The 2023 model outperforms the 2019 model when trained on up to 2500 data points, so it is the better choice for small data.
HOSE codes perform even better than the 2023 model for small data. However, a prediction based on HOSE codes is sometimes impossible because the training set contains no sufficiently similar examples.

Other heteronuclei

Besides ¹³C, ¹⁹F spectra were used to test the 2023 model.
In NMRShiftDB2, there were 957 structures with measured ¹⁹F spectra.

MAE of a ¹⁹F NMR shift prediction, using increasing number of samples, on a logarithmic scale

Results

HOSE codes are again the best method for small data, and the 2019 model is the worst.
As the number of spectra increases, the 2023 model almost reaches the performance of HOSE codes.
HOSE codes always have the disadvantage that they might not give a prediction at all.

Solvents

The solvent used is a major factor influencing the chemical shift values of a particular compound due to:
- its influence on the chemical environment of the molecule
- the possibility of forming hydrogen bonds
- changes in the charge state of the investigated molecule.

Accurate predictions require solvent information, but splitting the data by solvent makes the available data even scarcer. Using ¹³C spectra from NMRShiftDB2, the authors train a separate model for each solvent and compare the results with those achieved using all ¹³C spectra. The models are not optimized for solvent-specific prediction.
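
Training one model per solvent amounts to grouping the spectra before training. The sketch below assumes a hypothetical record layout (dicts with `solvent` and `shift` keys) and uses a trivial mean-value "model" as a stand-in for the real training routine; the shift values are invented.

```python
from collections import defaultdict


def train_per_solvent(spectra, train_fn, min_spectra=1):
    """Group spectra by solvent and train one model per group.

    spectra: dicts with a 'solvent' key (hypothetical record layout).
    Solvents with fewer than min_spectra records are skipped."""
    by_solvent = defaultdict(list)
    for spectrum in spectra:
        by_solvent[spectrum["solvent"]].append(spectrum)
    return {
        solvent: train_fn(group)
        for solvent, group in by_solvent.items()
        if len(group) >= min_spectra
    }


# Toy records with invented shift values.
spectra = [
    {"solvent": "CDCl3", "shift": 20.0},
    {"solvent": "CDCl3", "shift": 22.0},
    {"solvent": "DMSO-d6", "shift": 25.0},
]
mean_shift = lambda group: sum(s["shift"] for s in group) / len(group)

models = train_per_solvent(spectra, mean_shift)
overall = mean_shift(spectra)  # the "all spectra" model used for comparison
print(models, overall)
```

The grouping shows why solvent-specific training makes the small-data problem worse: each model only sees its own share of the spectra.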

Results

The overall tendency is similar to what we have seen before: the 2019 model starts off with high errors, and its predictive quality improves significantly beyond 1000 spectra. The 2023 model outperforms the 2019 model on smaller datasets. HOSE codes generally do well.

MAE of a ¹³C NMR shift prediction, using increasing numbers of samples

The solvent-specific training produces much better results compared to the overall model.

Model optimization options:

The 2023 model can possibly be optimized further by:
- Ensuring structural diversity in the training and test datasets, which was not done for the 2023 model; a random distribution of the data was assumed.
- Optimizing the model’s hyperparameters specifically for a nucleus or a solvent.

Takeaways

Predicting NMR shifts from small data is challenging, and small data is the norm for heteronuclei, even more so once the data are split by solvent.
This paper demonstrates a new ML model that, for small data, performs better than another model presented by the same group in 2019.
Even though HOSE codes can perform even better than the 2023 model for small data, they cannot always provide a prediction.

References

Bremser, W. 1978. “Hose—a Novel Substructure Code.” Analytica Chimica Acta 103 (4): 355–65.

Ding, Yuheng, Yusong Wang, Bo Qiang, Jie Yu, Qi Li, Yiran Zhou, and Zhenmin Liu. 2025. “NaFM: Pre-Training a Foundation Model for Small-Molecule Natural Products.” https://doi.org/10.48550/ARXIV.2503.17656.

Duvenaud, David K, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. 2015. “Convolutional Networks on Graphs for Learning Molecular Fingerprints.” Advances in Neural Information Processing Systems 28.

Gilmer, Justin, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. 2017. “Neural Message Passing for Quantum Chemistry,” 1263–72.

Guo, Jeff, and Philippe Schwaller. 2024. “It Takes Two to Tango: Directly Optimizing for Constrained Synthesizability in Generative Molecular Design.” arXiv. https://doi.org/10.48550/ARXIV.2410.11527.

Jonas, Eric, and Stefan Kuhn. 2019. “Rapid Prediction of NMR Spectral Properties with Quantified Uncertainty.” Journal of Cheminformatics 11: 1–7.

Jonas, Eric, Stefan Kuhn, and Nils Schlörer. 2022. “Prediction of Chemical Shift in NMR: A Review.” Magnetic Resonance in Chemistry 60 (11): 1021–31.

Landrum, Gregory A., Jessica Braun, Paul Katzberger, Marc T. Lehner, and Sereina Riniker. 2024. “Lwreg: A Lightweight System for Chemical Registration and Data Storage.” Journal of Chemical Information and Modeling 64 (16): 6247–52. https://doi.org/10.1021/acs.jcim.4c01133.

Oellien, Frank, Uli Fechner, and Thomas Engel. 2012. “7th German Conference on Chemoinformatics: 25 CIC-Workshop Goslar, Germany. 6-8 November 2011. Abstracts.” Journal of Cheminformatics 4 (Suppl 1): A1–P62.

Orsi, Markus, and Jean-Louis Reymond. 2024. “One Chiral Fingerprint to Find Them All.” Journal of Cheminformatics 16 (1): 53.

Rull, Herman, Markus Fischer, and Stefan Kuhn. 2023. “NMR Shift Prediction from Small Data Quantities.” Journal of Cheminformatics 15 (1): 114.
