LLMs in Chemistry

Author

Sri Ram Sagar Kanakam

Published

April 9, 2025

1. Introduction

  • Three key challenges in chemistry:
    1. Property prediction: Predicting physical/chemical properties of compounds
    2. Property-directed molecule generation: Creating molecules with desired properties
    3. Synthesis prediction: Determining optimal ways to make target molecules
  • Chemical space is vast: An estimated 10^180 possible stable compounds, far too many to enumerate by hand, so AI is needed to explore it effectively
  • Automation: The fourth challenge connecting the other three

2. Large Language Models (LLMs) Fundamentals

  • What are LLMs?: Deep neural networks trained on vast amounts of data, primarily built on the transformer architecture
  • Transformer architecture consists of:
    • Input embeddings: Convert input tokens into numerical vector representations
    • Attention mechanism: Allows the model to focus on relevant parts of input
    • Positional encoding: Helps maintain sequence order information
  • Training:
    1. Pretraining Approaches
    • Self-supervised learning: Learning patterns from unlabeled chemical data (like SMILES strings) without explicit annotations
    • Domain-specific Vocabulary: Creating specialized tokenization for chemical notation rather than using general language vocabularies (see the tokenizer sketch after this list)
    2. Fine-tuning Approaches
    • Supervised Fine-tuning: Training on labeled datasets of specific chemical properties or reactions
    • Instruction Tuning: Teaching models to follow specific chemistry-related instructions through examples
    • Parameter-Efficient Fine-Tuning (PEFT): Updating only a small subset of model parameters for chemistry tasks
  • Three main transformer architectures:
    1. Encoder-only (bidirectional context) (e.g., BERT): Good for classification and property prediction (understanding)
    2. Decoder-only (causal context) (e.g., GPT): Excellent for generative tasks and molecule design (generation)
    3. Encoder-decoder (e.g., BART): Ideal for translation tasks like reaction prediction
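
A minimal sketch of the domain-specific tokenization idea above: chemistry LLMs typically split SMILES strings with a hand-written regular expression so that multi-character atoms (Cl, Br), bracketed atoms, and ring-closure labels become single tokens. The pattern and token set below are illustrative and do not reproduce any particular model's vocabulary.

```python
import re

# Illustrative regex for splitting SMILES into chemically meaningful tokens:
# bracketed atoms, two-letter elements, ring closures, bonds, and branches.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|Si|Se|%\d{2}|\d|[BCNOSPFIbcnosp]|"
    r"[()=#+\-/\\.~:@\*\$])"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into tokens for a chemistry-specific vocabulary."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Sanity check: the tokens should reassemble into the original string.
    assert "".join(tokens) == smiles, f"Unrecognized characters in {smiles!r}"
    return tokens

print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))   # aspirin
# ['C', 'C', '(', '=', 'O', ')', 'O', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1', ...]
```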

3. The Data Challenge in Chemistry

  • Limited chemical data: Chemistry datasets are orders of magnitude smaller than general language datasets
    • General LLMs (e.g., LLaMA2): Trained on trillions of tokens
    • Chemical datasets: Only billions of tokens (mostly hypothetical compounds)
    • Verified synthesized compounds: ~5 orders of magnitude less data than what general LLMs are trained on
  • Data quality issues:
    • Current benchmarks like MoleculeNet have errors and inconsistencies
    • Many datasets include theoretical rather than experimental values
    • Need for more reliable, experimentally-verified data
  • Chemical representation: Models use text-based notations such as SMILES, SELFIES, and InChI (see the conversion sketch below)
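
A short sketch of converting between these representations, assuming the rdkit and selfies packages are installed (pip install rdkit selfies):

```python
from rdkit import Chem      # RDKit: cheminformatics toolkit
import selfies as sf        # SELFIES reference implementation

smiles = "CC(=O)Oc1ccccc1C(=O)O"   # aspirin

# SMILES -> RDKit molecule -> InChI (a layered, canonical identifier)
mol = Chem.MolFromSmiles(smiles)
inchi = Chem.MolToInchi(mol)

# SMILES -> SELFIES: every syntactically valid SELFIES string decodes to a
# valid molecule, which is why some generative models prefer it over SMILES.
selfies_str = sf.encoder(smiles)
roundtrip = sf.decoder(selfies_str)   # may differ in spelling, same molecule

print("SMILES: ", smiles)
print("InChI:  ", inchi)
print("SELFIES:", selfies_str)
print("Decoded:", roundtrip)
```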

4. LLMs for Chemistry Applications

Property Prediction (Encoder-only Models)

| Name | Architecture | Training Data | Aim | Achievements | Innovations |
|------|--------------|---------------|-----|--------------|-------------|
| Mol-BERT | BERT | 4B SMILES from ZINC15/ChEMBL27 | Property prediction | Outperformed existing methods by 2%+ on benchmark datasets | Used Morgan fingerprint fragments as “words” and molecules as “sentences” |
| MolFormer | BERT | 1.1B molecules from ZINC/PubChem | Molecular property prediction | Outperformed GNNs on MoleculeNet tasks | Implemented rotary positional embeddings for better sequence relationships |
| ChemBERTa | RoBERTa | 10M SMILES from PubChem | Property prediction | Showed performance improved with dataset size | Explored impact of tokenization strategies and representation types |
| ChemBERTa-2 | RoBERTa | 77M SMILES from PubChem | Foundational model for multiple tasks | Matched state-of-the-art on MoleculeNet | Added multi-task regression to pretraining |
| SELFormer | RoBERTa | 2M drug-like compounds | Property prediction | Outperformed competitors on several tasks | Used SELFIES instead of SMILES for better robustness |
| SolvBERT | BERT | 1M solute-solvent pairs | Solubility/solvation prediction | Comparable to graph-based models | Predicted 3D-dependent properties from 1D representations |
| MatSciBERT | BERT | 150K materials science papers | Material property extraction | 3% improvement over SciBERT | Fine-tuned for the materials science domain |
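
A common way to use encoder-only models like those above is as frozen featurizers: embed SMILES with the pretrained encoder, then fit a lightweight head on a labeled property dataset. The sketch below assumes the transformers, torch, and scikit-learn packages; the checkpoint name is one publicly shared ChemBERTa model and is used purely as an illustration, and the logP labels are toy values.

```python
import torch
from sklearn.linear_model import Ridge
from transformers import AutoModel, AutoTokenizer

# Assumed checkpoint name: a publicly shared ChemBERTa model (illustrative only).
CHECKPOINT = "seyonec/ChemBERTa-zinc-base-v1"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModel.from_pretrained(CHECKPOINT)
model.eval()

def embed(smiles_list):
    """Mean-pool the encoder's last hidden states into one vector per molecule."""
    batch = tokenizer(smiles_list, padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state            # (B, L, D)
    mask = batch["attention_mask"].unsqueeze(-1).float()     # (B, L, 1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()    # (B, D)

# Tiny regression head on top of the frozen encoder (labels are illustrative).
train_smiles = ["CCO", "CCCCO", "c1ccccc1", "CC(=O)O"]
train_logp = [-0.31, 0.88, 2.13, -0.17]

head = Ridge().fit(embed(train_smiles), train_logp)
print(head.predict(embed(["CCCO"])))   # predicted logP for 1-propanol
```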

Molecule Generation (Decoder-only Models)

| Name | Architecture | Training Data | Aim | Achievements | Innovations |
|------|--------------|---------------|-----|--------------|-------------|
| MolGPT | GPT | MOSES/GuacaMol datasets | Molecular generation | Better performance than VAE approaches | Used masked self-attention for long-range dependencies in SMILES |
| ChemGPT | GPT-Neo | 10M molecules from PubChem | Explore hyperparameter tuning | Refined generative models for specific domains | Explored impact of data scale on generation |
| cMolGPT | GPT | MOSES | Target-specific molecule generation | Generated compounds with binding activity | Integrated protein-ligand interactions with conditional generation |
| iupacGPT | GPT | IUPAC names | Generate molecules using IUPAC | Human-readable molecular generation | Used IUPAC names directly instead of SMILES |
| Taiga | GPT + RL | Various | Property-directed molecule generation | Optimized molecules for targeted properties | Employed REINFORCE for property optimization |
| LlaSMol | LLaMA/Mistral | SMolInstruct | Multi-task molecular modeling | State-of-the-art on property prediction | Used parameter-efficient fine-tuning on pretrained models |
| GPTChem | GPT-3 | Multiple benchmarks | Property prediction and inverse design | Competed with specialized models | Showed generalist models can perform chemical tasks |
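
The generation loop for decoder-only models is essentially standard autoregressive sampling followed by a chemistry-specific validity filter. The sketch below is generic: the checkpoint identifier is a placeholder for any causal LM pretrained on SMILES (a MolGPT-style model), and it assumes transformers, torch, and rdkit are installed.

```python
import torch
from rdkit import Chem
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder identifier: substitute any causal LM pretrained on SMILES.
CHECKPOINT = "path-or-hub-id-of-a-smiles-causal-lm"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForCausalLM.from_pretrained(CHECKPOINT)
model.eval()

# Start from the beginning-of-sequence token and sample complete SMILES strings.
input_ids = torch.tensor([[tokenizer.bos_token_id]])
with torch.no_grad():
    samples = model.generate(
        input_ids,
        do_sample=True,            # stochastic decoding for diverse candidates
        top_k=50,
        max_new_tokens=64,
        num_return_sequences=32,
        pad_token_id=tokenizer.eos_token_id,
    )

# Keep only strings RDKit can parse: the usual validity filter for SMILES output.
candidates = [tokenizer.decode(s, skip_special_tokens=True) for s in samples]
valid = [smi for smi in candidates if smi and Chem.MolFromSmiles(smi) is not None]
print(f"{len(valid)}/{len(candidates)} sampled SMILES are chemically valid")
```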

Synthesis Prediction (Encoder-decoder Models)

  • Evolution from template-based to template-free approaches:
    • Template-based: Limited to known reaction types
    • Semi-template-based: More flexible but still constrained
    • Template-free: Learn directly from data without predefined rules
| Name | Architecture | Training Data | Aim | Achievements | Innovations |
|------|--------------|---------------|-----|--------------|-------------|
| Molecular Transformer | Transformer | USPTO datasets | Synthesis prediction | Outperformed prior algorithms | Required no handcrafted rules for chemical transformations |
| Chemformer | BART | 100M SMILES from ZINC-15 | Multi-task chemical modeling | State-of-the-art in synthesis and property tasks | Emphasized computational efficiency through pretraining |
| Text+Chem T5 | T5 | 11.5M-33.5M samples | Multi-task chemical processing | Effective in diverse synthesis tasks | Combined reaction data from multiple sources |
| MOLGEN | BART | 100M molecules (SELFIES) | Molecular generation | Addressed bias against natural product-like molecules | Used domain-specific molecular prefix tuning |
| ReactionT5 | T5 | ZINC and ORD | Reaction prediction | Improved synthesis prediction | Combined property and reaction prediction |
| TSChem | T5 | USPTO | Versatile chemical modeling | Strong performance on multiple tasks | Integrated property and synthesis prediction |
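
Template-free synthesis prediction reduces to sequence-to-sequence translation: reactant (and reagent) SMILES in, product SMILES out. A minimal sketch with the transformers library is shown below; the checkpoint identifier is a placeholder for any encoder-decoder model fine-tuned on USPTO-style reaction data (a Chemformer- or ReactionT5-like model).

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder identifier: substitute an encoder-decoder checkpoint fine-tuned
# for forward reaction prediction on USPTO-style data.
CHECKPOINT = "path-or-hub-id-of-a-reaction-prediction-model"

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSeq2SeqLM.from_pretrained(CHECKPOINT)

# Forward prediction framed as translation: reactants (dot-separated) -> product.
reactants = "CC(=O)Cl.OCc1ccccc1"   # acetyl chloride + benzyl alcohol

inputs = tokenizer(reactants, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=5, max_new_tokens=128)
product = tokenizer.decode(outputs[0], skip_special_tokens=True)

print("Predicted product SMILES:", product)   # expect a benzyl acetate-like ester
```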

5. Multi-modal Approaches

  • Text2Mol (2021): Connects molecular representations with text descriptions
  • MolT5 (2022): Generates molecular captions and predicts structures from descriptions
  • Challenges:
    • Building quality datasets pairing chemical structures with descriptions
    • Multiple valid ways to describe molecules (therapeutic effects, structure, etc.)
    • Need for chemical domain expertise to evaluate outputs
  • Applications: Molecular retrieval from text queries, structure generation from descriptions

Multi-modal Models
| Name | Architecture | Training Data | Aim | Achievements | Innovations |
|------|--------------|---------------|-----|--------------|-------------|
| Text2Mol | SciBERT w/ decoder | ChEBI-20 | Connecting text with molecules | Enabled retrieval of molecules using text | Created paired dataset linking molecules with descriptions |
| MolT5 | T5 | C4 dataset | Molecule captioning and generation | Generated structures from descriptions | Used denoising objective to handle complex chemical descriptions |
| BioT5+ | T5 | ZINC20, UniRef50, PubMed, etc. | Bridging text and molecules | Multi-task performance across domains | Integrated diverse biological and chemical data sources |
| CLAMP | Combined model | Biochemical data | Predict biochemical activity | Enhanced prediction with language | Combined separate chemical and language modules |
| GIT-Mol | Multi-modal | Graphs, images, text | Integrated chemical understanding | Improved accuracy with multiple data types | Combined three modalities for better chemical modeling |
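
Text-to-molecule generation with a MolT5-style model follows the same seq2seq pattern, with a natural-language caption as input. The checkpoint name below is assumed to be one of the publicly released MolT5 caption-to-SMILES models; treat it as illustrative and swap in whichever checkpoint you use.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Assumed checkpoint name: a publicly released MolT5 caption-to-SMILES model.
CHECKPOINT = "laituan245/molt5-base-caption2smiles"

tokenizer = T5Tokenizer.from_pretrained(CHECKPOINT)
model = T5ForConditionalGeneration.from_pretrained(CHECKPOINT)

caption = ("The molecule is a monocarboxylic acid that is acetic acid in which "
           "one of the methyl hydrogens is replaced by a phenyl group.")

inputs = tokenizer(caption, return_tensors="pt")
outputs = model.generate(**inputs, num_beams=5, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Expect a phenylacetic-acid-like SMILES, e.g. OC(=O)Cc1ccccc1
```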

6. Conclusion and Future Directions

  • LLMs show tremendous potential for transforming chemical research
  • Integration of multiple models can address different aspects of chemical discovery
  • Key to progress: Better data, improved benchmarks, and enhanced interpretability
  • Moving toward autonomous systems that combine the strengths of different models