Leveraging language representation for materials exploration and discovery

Author

Nawaf Alampara

Published

May 7, 2024

Why discuss this paper?

I chose Qu et al.'s paper for our journal club because

  • LLMs have been applied successfully in other domains; it is interesting to see what they can do in materials science.
  • It is one of the few papers in which LLMs are applied to actual materials discovery.
  • Nice embedding figures.

Context

  • The material space is far from fully explored, and there is the possibility of finding better materials for many applications.

  • ML recommender systems for exploring material spaces already exist, but few of them use LLMs.

  • The paper proposes an LLM-based framework for recommending prototype crystal structures, later validated through first-principles calculations and experiments.

  • Why LLMs? They promise universal, task-agnostic representations.

Some Previous Language Models

MatSciBERT

MatSciBERT was pretrained with masked language modelling on whole sections of more than 1 million materials science articles.

MatBERT

MatBERT was trained with masked language modelling on 50 million paragraphs sampled from 2 million articles.
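As a rough illustration of the masked-language-modelling objective these models share, here is a minimal sketch using a generic BERT checkpoint; this is not the actual MatBERT/MatSciBERT training recipe:

```python
# Minimal masked-language-modelling sketch with a generic BERT checkpoint.
# This illustrates the objective only; it is not the MatBERT training recipe.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

text = "PbTe is a well known [MASK] material."
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    logits = mlm(**inputs).logits

# Find the masked position and decode the model's top prediction for it.
mask_pos = (inputs.input_ids == tok.mask_token_id).nonzero()[0, 1]
print(tok.decode([logits[0, mask_pos].argmax().item()]))
```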

Mat2Vec

Mat2Vec was trained in the same way as Word2Vec, using skip-gram with negative sampling; each word is embedded into a 200-dimensional vector.
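For intuition, a Mat2Vec-style model can be trained with gensim's Word2Vec in skip-gram mode; the corpus and hyperparameters below (other than the 200-dimensional vectors) are placeholders, not the original setup:

```python
# Sketch of Mat2Vec-style skip-gram training with gensim.
# The corpus and hyperparameters (except vector_size) are illustrative placeholders.
from gensim.models import Word2Vec

# Each "sentence" is a tokenized passage from a materials science abstract.
corpus = [
    ["thermoelectric", "materials", "such", "as", "PbTe", "show", "high", "zT"],
    ["PbTe", "is", "a", "narrow", "band", "gap", "semiconductor"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=200,  # 200-dimensional embeddings, as in Mat2Vec
    sg=1,             # skip-gram architecture
    negative=5,       # negative sampling
    min_count=1,
)

vector = model.wv["PbTe"]  # 200-dimensional vector for a token
```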

Problem setting

  • Hand-crafted features and specialized structural models have limitations in providing universal and task-agnostic representations within the vast material space.

  • Additional context is also very useful, e.g. doping, temperature, and synthesis conditions.

  • In the materials exploration and discovery context, one needs:

  1. effective representations of both chemical and structural complexity,
  2. successful recall of relevant candidates for the property of interest,
  3. accurate candidate ranking based on multiple desired functional properties.

Approach

The authors propose a two-step funnel-based approach:

    1. RECALL - Given a query material, find similar materials in a database.
    2. RANKING - Rank the recalled materials based on their functional properties.

Figure: Funnel-based recommender framework

Recalling similar materials

The authors use Robocrystallographer to produce a text description of each material, encode the description with pretrained MatBERT (other encoders were compared as well), and use the resulting embedding as a feature vector:

  • Use a query material (a well-studied known material with the property of interest).

  • Encode all materials in the database, as well as the query material (Robocrystallographer + MatBERT).

  • Compute the cosine similarity between the feature vector of each database material and that of the query material (see the sketch below).

Figure: Recall of materials based on cosine similarity
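A minimal sketch of this recall step, assuming a Hugging Face-style MatBERT checkpoint; the checkpoint id and the mean-pooling over tokens are my assumptions, not necessarily the paper's exact setup:

```python
# Sketch of the recall step: embed text descriptions, rank by cosine similarity.
# The checkpoint id is hypothetical; mean-pooling is an assumed readout.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "some-org/matbert-base"  # hypothetical checkpoint id
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

def embed(description: str) -> np.ndarray:
    """Encode a Robocrystallographer text description into a feature vector."""
    inputs = tokenizer(description, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0).numpy()    # mean-pool over tokens

def recall(query_desc: str, database_descs: list[str], k: int = 100) -> np.ndarray:
    """Return indices of the top-k database materials by cosine similarity."""
    q = embed(query_desc)
    db = np.stack([embed(d) for d in database_descs])
    sims = db @ q / (np.linalg.norm(db, axis=1) * np.linalg.norm(q))
    return np.argsort(-sims)[:k]
```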

Ranking potential materials

The recalled materials are ranked based on multiple properties.

For most applications, and for the thermoelectric materials in this paper, many properties matter simultaneously; hence the ranker scores performance across several functional aspects.

  • The authors train a Multi-gate Mixture-of-Experts (MMoE) model with multitask learning to rank the materials (a sketch follows the figure below).

Figure: Ranking of the recalled materials based on predicted properties
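For intuition, here is a minimal MMoE sketch in PyTorch; the number of experts, layer sizes, and task count are assumptions for illustration, not the paper's configuration:

```python
# Minimal Multi-gate Mixture-of-Experts (MMoE) sketch.
# Expert count, hidden sizes, and task count are illustrative assumptions.
import torch
import torch.nn as nn

class MMoE(nn.Module):
    def __init__(self, in_dim: int, n_experts: int = 4, n_tasks: int = 6, hidden: int = 128):
        super().__init__()
        # Shared experts, each a small feed-forward block.
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU()) for _ in range(n_experts)]
        )
        # One softmax gate per task decides how to mix the shared experts.
        self.gates = nn.ModuleList([nn.Linear(in_dim, n_experts) for _ in range(n_tasks)])
        # One small tower per task maps the mixture to a scalar property prediction.
        self.towers = nn.ModuleList(
            [nn.Sequential(nn.Linear(hidden, hidden // 2), nn.ReLU(), nn.Linear(hidden // 2, 1))
             for _ in range(n_tasks)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, H)
        outputs = []
        for gate, tower in zip(self.gates, self.towers):
            weights = torch.softmax(gate(x), dim=-1).unsqueeze(-1)  # (B, E, 1)
            mixed = (weights * expert_out).sum(dim=1)               # (B, H)
            outputs.append(tower(mixed))
        return torch.cat(outputs, dim=-1)  # (B, n_tasks) property predictions
```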

Results

The authors perform ablations to understand the importance of the different components of their model. While there are some differences, they are not drastic.

Ablation on Representation suitable for RECALL step

Two sets of models were compared:

  1. Composition only (baseline: Mat2Vec):

     Composition → MatBERT

  2. Composition and structure (baseline: CrystalNN fingerprint):

     Material → Robocrystallographer → MatBERT

Figure: Embeddings from composition-only and composition + structure representations

Structure-level representations exhibit more distinct separation (well-defined domains) between material groups.

Figure: UMAP of embeddings from the different representations
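Such maps can be reproduced with the umap-learn package; a minimal sketch, where the embedding array is a random stand-in for the real feature vectors:

```python
# Sketch: 2-D UMAP projection of material embeddings for visualization.
# The embedding array is a random stand-in for the real feature vectors.
import numpy as np
import umap

embeddings = np.random.rand(1000, 768)  # stand-in for MatBERT embeddings

reducer = umap.UMAP(n_components=2, metric="cosine", random_state=42)
coords = reducer.fit_transform(embeddings)  # (1000, 2), ready for a scatter plot
```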

To further evaluate the embeddings, the authors tested them on downstream property prediction tasks.

The task models were multi-layer perceptrons (MLPs) trained with a mean-absolute-error (MAE) loss.

The tasks consisted of band gap, energy per atom, bulk modulus, shear modulus, Debye temperature, and coefficient of thermal expansion, all from the AFLOW dataset.
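A minimal sketch of such a probe in PyTorch; the 768-dimensional input and the layer sizes are assumptions, not the paper's architecture:

```python
# Sketch of a downstream probe: an MLP trained with MAE (L1) loss on frozen embeddings.
# Input dimension and hidden size are assumptions.
import torch
import torch.nn as nn

probe = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 1))
loss_fn = nn.L1Loss()  # mean absolute error
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

def train_step(embeddings: torch.Tensor, targets: torch.Tensor) -> float:
    """embeddings: (B, 768) frozen features; targets: (B, 1) property values."""
    optimizer.zero_grad()
    loss = loss_fn(probe(embeddings), targets)
    loss.backward()
    optimizer.step()
    return loss.item()
```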

Figure: Property prediction using embeddings from the different representations

Finding similar materials

Starting from known materials with favorable thermoelectric (TE) properties, such as PbTe, the authors analyzed the top recalled candidates and found that their predicted figure-of-merit zT distributions differ significantly from those obtained with the baseline representations.

Figure: Distributions of predicted zT of the top-100 recalled candidates, with PbTe as the query material, predicted by MatBERT

Ranking potential materials

Learning from multiple related tasks provides superior performance over single-task learning, since it models both task-specific objectives and cross-task relationships.

In addition to the embeddings derived from the language models, the authors added contextual information (one-hot-encoded temperature); a sketch of this step follows.
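A minimal sketch of this step; the temperature grid below is an illustrative assumption, not the paper's binning:

```python
# Sketch: append a one-hot temperature encoding to a language-model embedding.
# The temperature grid is an illustrative assumption.
import numpy as np

TEMPERATURES_K = [300, 400, 500, 600, 700, 800]

def with_temperature(embedding: np.ndarray, temp_k: int) -> np.ndarray:
    """Concatenate a one-hot temperature vector onto the embedding."""
    one_hot = np.zeros(len(TEMPERATURES_K))
    one_hot[TEMPERATURES_K.index(temp_k)] = 1.0
    return np.concatenate([embedding, one_hot])

# e.g. with_temperature(np.random.rand(768), 500).shape -> (774,)
```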

Figure: Multi-task learning framework for material property prediction

Figure: Performance on 6 material property prediction tasks, comparing single-task models and MMoE using composition or structure embeddings

Takeaways

  • A language model might not be needed for this task.
  • Good to see that some of the recommended materials were later tested in the lab.
  • The composition vs. composition + structure comparison is not convincing.