LLMs in Chemistry
1. Introduction
- Three key challenges in chemistry:
- Property prediction: Predicting physical/chemical properties of compounds
- Property-directed molecule generation: Creating molecules with desired properties
- Synthesis prediction: Determining optimal ways to make target molecules
- Chemical space is vast: estimates of the number of possible stable compounds run as high as 10^180, far too many to enumerate or screen exhaustively, which motivates AI-driven exploration
- Automation: The fourth challenge connecting the other three
2. Large Language Models (LLMs) Fundamentals
- What are LLMs?: Deep neural networks trained on vast data, primarily using transformer architecture
- Transformer architecture consists of three key components (a minimal sketch follows this list):
- Input embeddings: Convert input tokens into numerical vector representations
- Attention mechanism: Allows the model to focus on relevant parts of input
- Positional encoding: Helps maintain sequence order information
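As a concrete illustration of the attention and positional-encoding components above, here is a minimal NumPy sketch of scaled dot-product self-attention plus sinusoidal positional encoding. The token count and embedding size are toy values, not taken from any particular chemistry model.

```python
# Toy sketch of scaled dot-product attention and sinusoidal positional encoding.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query attends to all keys; softmax weights sum to 1 per query."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax
    return weights @ V                                    # weighted mix of values

def sinusoidal_positions(seq_len, d_model):
    """Classic fixed positional encoding: alternating sine/cosine frequencies."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

# Toy example: 5 SMILES tokens embedded in 8 dimensions.
x = np.random.randn(5, 8) + sinusoidal_positions(5, 8)
out = scaled_dot_product_attention(x, x, x)               # self-attention
print(out.shape)                                          # (5, 8)
```

In a real transformer this computation runs per attention head with learned query/key/value projections and is stacked across many layers.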
- Training:
- Pretraining Approaches
- Self-supervised learning: Learning patterns from unlabeled chemical data (like SMILES strings) without explicit annotations
- Domain-specific Vocabulary: Creating specialized tokenization for chemical notation rather than using general language vocabularies
- Fine-tuning Approaches
- Supervised Fine-tuning: Training on labeled datasets of specific chemical properties or reactions
- Instruction Tuning: Teaching models to follow specific chemistry-related instructions through examples
- Parameter-Efficient Fine-Tuning (PEFT): Updating only a small subset of model parameters for chemistry tasks (see the LoRA sketch below)
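A minimal sketch of PEFT via LoRA on a pretrained chemistry encoder, assuming the Hugging Face transformers and peft libraries are available. The ChemBERTa checkpoint name is one publicly available option, and the LoRA hyperparameters are illustrative defaults rather than values from any specific paper.

```python
# LoRA fine-tuning sketch: freeze the pretrained weights and train only small
# low-rank adapters injected into the attention projections.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

model_name = "seyonec/ChemBERTa-zinc-base-v1"   # public ChemBERTa checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8, lora_alpha=16, lora_dropout=0.1,
    target_modules=["query", "value"],          # attention projections in BERT/RoBERTa
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()              # typically well under 1% of all weights

# Tokenized SMILES then flow through the usual training loop / Trainer.
batch = tokenizer(["CCO", "c1ccccc1O"], padding=True, return_tensors="pt")
outputs = model(**batch)
```

Only the adapter weights (and the small task head) are updated, which keeps fine-tuning feasible on the comparatively small labeled datasets typical of chemistry.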
- Three main transformer architectures:
- Encoder-only (bidirectional context) (e.g., BERT): Good for classification and property prediction (understanding)
- Decoder-only (causal context) (e.g., GPT): Excellent for generative tasks and molecule design (generation)
- Encoder-decoder (e.g., BART): Ideal for translation tasks like reaction prediction
3. The Data Challenge in Chemistry
- Limited chemical data: Chemistry datasets are orders of magnitude smaller than general language datasets
- General LLMs (e.g., LLaMA2): Trained on trillions of tokens
- Chemical datasets: Only billions of tokens (mostly hypothetical compounds)
- Verified, synthesized compounds: roughly five orders of magnitude less data than the corpora used to train general LLMs
- Data quality issues:
- Current benchmarks like MoleculeNet have errors and inconsistencies
- Many datasets include theoretical rather than experimental values
- Need for more reliable, experimentally-verified data
- Chemical representation: Models use text-based notations such as SMILES, SELFIES, and InChI (see the conversion sketch below)
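A minimal sketch of converting between these text notations, assuming the open-source RDKit and selfies packages; the aspirin SMILES is just an example input.

```python
# Convert a SMILES string to InChI and SELFIES and back again.
from rdkit import Chem
import selfies as sf

smiles = "CC(=O)Oc1ccccc1C(=O)O"        # aspirin
mol = Chem.MolFromSmiles(smiles)        # parse and validate the SMILES

inchi = Chem.MolToInchi(mol)            # layered, canonical InChI string
selfies_str = sf.encoder(smiles)        # SELFIES encoding of the same molecule
roundtrip = sf.decoder(selfies_str)     # back to SMILES

print(inchi)
print(selfies_str)
print(Chem.CanonSmiles(roundtrip) == Chem.CanonSmiles(smiles))  # True
```

SELFIES is often preferred for generative models because any syntactically valid SELFIES string decodes to a valid molecule, whereas generated SMILES can be unparsable.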
4. LLMs for Chemistry Applications
Property Prediction (Encoder-only Models)
Name | Architecture | Training Data | Aim | Achievements | Innovations |
---|---|---|---|---|---|
Mol-BERT | BERT | 4B SMILES from ZINC15/ChEMBL27 | Property prediction | Outperformed existing methods by 2%+ on benchmark datasets | Used Morgan fingerprint fragments as “words” and molecules as “sentences” |
MolFormer | BERT | 1.1B molecules from ZINC/PubChem | Molecular property prediction | Outperformed GNNs on MoleculeNet tasks | Implemented rotary positional embeddings for better sequence relationships |
ChemBERTa | RoBERTa | 10M SMILES from PubChem | Property prediction | Showed performance improved with dataset size | Explored impact of tokenization strategies and representation types |
ChemBERTa-2 | RoBERTa | 77M SMILES from PubChem | Foundational model for multiple tasks | Matched state-of-the-art on MoleculeNet | Added Multi-Task Regression to pretraining |
SELFormer | RoBERTa | 2M drug-like compounds | Property prediction | Outperformed competitors on several tasks | Used SELFIES instead of SMILES for better robustness |
SolvBERT | BERT | 1M solute-solvent pairs | Solubility/solvation prediction | Comparable to graph-based models | Predicted 3D-dependent properties from 1D representations |
MatSciBERT | BERT | 150K materials science papers | Material property extraction | 3% improvement over SciBERT | Fine-tuned for materials science domain |
Molecule Generation (Decoder-only Models; sampling sketch after the table)
Name | Architecture | Training Data | Aim | Achievements | Innovations |
---|---|---|---|---|---|
MolGPT | GPT | MOSES/GuacaMol datasets | Molecular generation | Better performance than VAE approaches | Used masked self-attention for long-range dependencies in SMILES |
ChemGPT | GPT-Neo | 10M molecules from PubChem | Explore effects of model and dataset scale | Refined generative models for specific domains | Explored impact of data scale on generation |
cMolGPT | GPT | MOSES | Target-specific molecule generation | Generated compounds with binding activity | Integrated protein-ligand interactions with conditional generation |
iupacGPT | GPT | IUPAC names | Generate molecules using IUPAC | Human-readable molecular generation | Used IUPAC names directly instead of SMILES |
Taiga | GPT + RL | Various | Property-directed molecule generation | Optimized molecules for targeted properties | Employed REINFORCE for property optimization |
LlaSMol | LLaMA/Mistral | SMolInstruct | Multi-task molecular modeling | State-of-the-art on property prediction | Used parameter-efficient fine-tuning on pretrained models |
GPTChem | GPT-3 | Multiple benchmarks | Property prediction and inverse design | Competed with specialized models | Showed generalist models can perform chemical tasks |
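A minimal sketch of sampling new SMILES from a decoder-only (GPT-style) chemical language model. The checkpoint name is a placeholder (hypothetical); any causal LM with a SMILES or SELFIES tokenizer would be used the same way.

```python
# Stochastic decoding from a causal chemistry LM, seeded with a scaffold fragment.
from transformers import AutoTokenizer, AutoModelForCausalLM

checkpoint = "your-org/smiles-gpt"               # hypothetical pretrained model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prompt = "c1ccccc1"                              # benzene ring as a starting scaffold
inputs = tokenizer(prompt, return_tensors="pt")
samples = model.generate(
    **inputs,
    do_sample=True,                              # sample rather than decode greedily
    temperature=1.0,
    top_k=50,
    max_length=64,
    num_return_sequences=5,
)
for ids in samples:
    print(tokenizer.decode(ids, skip_special_tokens=True))
```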
Synthesis Prediction (Encoder-decoder Models)
- Evolution from template-based to template-free approaches:
- Template-based: Limited to known reaction types
- Semi-template-based: More flexible but still constrained
- Template-free: Learn directly from data without predefined rules (see the seq2seq sketch below)
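A minimal sketch of the template-free, sequence-to-sequence framing: reactant SMILES in, product SMILES out, treated exactly like machine translation. The checkpoint name is a placeholder (hypothetical); any encoder-decoder fine-tuned on USPTO-style reaction SMILES slots in the same way.

```python
# Forward reaction prediction framed as SMILES-to-SMILES translation.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "your-org/reaction-seq2seq"         # hypothetical fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

# USPTO-style convention: reactants (and reagents) on the source side,
# separated by '.', with the product SMILES as the target sequence.
reactants = "CC(=O)Cl.OCC"                       # acetyl chloride + ethanol
inputs = tokenizer(reactants, return_tensors="pt")
ids = model.generate(**inputs, num_beams=5, max_length=128)
print(tokenizer.decode(ids[0], skip_special_tokens=True))  # predicted product SMILES
```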
Name | Architecture | Training Data | Aim | Achievements | Innovations |
---|---|---|---|---|---|
Molecular Transformer | Transformer | USPTO datasets | Synthesis prediction | Outperformed prior algorithms | Required no handcrafted rules for chemical transformations |
Chemformer | BART | 100M SMILES from ZINC-15 | Multi-task chemical modeling | State-of-the-art in synthesis and property tasks | Emphasized computational efficiency through pretraining |
Text+Chem T5 | T5 | 11.5M-33.5M samples | Multi-task chemical processing | Effective in diverse synthesis tasks | Combined reaction data from multiple sources |
MOLGEN | BART | 100M molecules (SELFIES) | Molecular generation | Addressed bias against natural product-like molecules | Used domain-specific molecular prefix tuning |
ReactionT5 | T5 | ZINC and ORD | Reaction prediction | Improved synthesis prediction | Combined property and reaction prediction |
T5Chem | T5 | USPTO | Versatile chemical modeling | Strong performance on multiple tasks | Integrated property and synthesis prediction |
5. Multi-modal Approaches
- Text2Mol (2021): Connects molecular representations with text descriptions
- MolT5 (2022): Generates molecular captions and predicts structures from descriptions
- Challenges:
- Building quality datasets pairing chemical structures with descriptions
- Multiple valid ways to describe molecules (therapeutic effects, structure, etc.)
- Need for chemical domain expertise to evaluate outputs
- Applications: Molecular retrieval from text queries, structure generation from descriptions (see the retrieval sketch after the table below)
Multi-modal Models
Name | Architecture | Training Data | Aim | Achievements | Innovations |
---|---|---|---|---|---|
Text2Mol | SciBERT w/ decoder | ChEBI-20 | Connecting text with molecules | Enabled retrieval of molecules using text | Created paired dataset linking molecules with descriptions |
MolT5 | T5 | C4 text corpus + ZINC SMILES | Molecule captioning and generation | Generated structures from descriptions | Used denoising objective to handle complex chemical descriptions |
BioT5+ | T5 | ZINC20, UniRef50, PubMed, etc. | Bridging text and molecules | Multi-task performance across domains | Integrated diverse biological and chemical data sources |
CLAMP | Combined model | Biochemical data | Predict biochemical activity | Enhanced prediction with language | Combined separate chemical and language modules |
GIT-Mol | Multi-modal | Graphs, images, text | Integrated chemical understanding | Improved accuracy with multiple data types | Combined three modalities for better chemical modeling |
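A minimal sketch of Text2Mol-style retrieval: embed a text query and candidate SMILES with separate encoders, then rank candidates by cosine similarity. Both checkpoint names are placeholders (hypothetical), and the simple mean pooling here stands in for whatever pooling/projection heads a real paired-encoder model uses.

```python
# Cross-modal retrieval sketch: text encoder + molecule encoder + cosine ranking.
import torch
from transformers import AutoTokenizer, AutoModel

text_ckpt = "your-org/chem-text-encoder"      # hypothetical text encoder
mol_ckpt = "your-org/smiles-encoder"          # hypothetical molecule encoder
text_tok = AutoTokenizer.from_pretrained(text_ckpt)
mol_tok = AutoTokenizer.from_pretrained(mol_ckpt)
text_enc = AutoModel.from_pretrained(text_ckpt)
mol_enc = AutoModel.from_pretrained(mol_ckpt)

def embed(tokenizer, model, texts):
    """Mean-pool the final hidden states into one vector per input string."""
    batch = tokenizer(texts, padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # (batch, tokens, dim)
    mask = batch["attention_mask"].unsqueeze(-1)
    return (hidden * mask).sum(1) / mask.sum(1)

query = embed(text_tok, text_enc, ["a common over-the-counter analgesic"])
candidates = ["CC(=O)Oc1ccccc1C(=O)O", "CCO", "c1ccccc1"]
mols = embed(mol_tok, mol_enc, candidates)
scores = torch.nn.functional.cosine_similarity(query, mols)
print(candidates[int(scores.argmax())])                  # best-matching SMILES
```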
6. Conclusion and Future Directions
- LLMs show tremendous potential for transforming chemical research
- Integration of multiple models can address different aspects of chemical discovery
- Key to progress: Better data, improved benchmarks, and enhanced interpretability
- Moving toward autonomous systems that combine the strengths of different models