MatText Representations

Creating Text Representations for Crystal Structures

Structures can be converted into text representations with the TextRep class:

from mattext.representations import TextRep
from pymatgen.core import Structure

# Load structure from a CIF file
from_file = "InCuS2_p1.cif"
structure = Structure.from_file(from_file, "cif")

# Initialize TextRep Class
text_rep = TextRep.from_input(structure)

get_requested_text_reps Method

The get_requested_text_reps method retrieves the requested text representations of the crystal structure and returns them in a dictionary.

For example, to obtain the cif_p1, slice, atoms_params, crystal_llm_rep, and zmatrix representations, use the following code:

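# Build the TextRep instance as in the previous example
text_rep = TextRep.from_input(structure)

# Define a list of requested representations
requested_reps = ["cif_p1", "slice", "atoms_params", "crystal_llm_rep", "zmatrix"]

# Get the requested text representations
requested_text_reps = text_rep.get_requested_text_reps(requested_reps)
print(requested_text_reps)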
{'cif_p1': "data_InCuS2\n_symmetry_space_group_name_H-M   'P 1'\n_cell_length_a   5.52\n_cell_length_b   5.52\n_cell_length_c   6.8\n_cell_angle_alpha   113.96\n_cell_angle_beta   113.96\n_cell_angle_gamma   90.0\n_symmetry_Int_Tables_number   1\n_chemical_formula_structural   InCuS2\n_chemical_formula_sum   'In2 Cu2 S4'\n_cell_volume   169.53\n_cell_formula_units_Z   2\nloop_\n _symmetry_equiv_pos_site_id\n _symmetry_equiv_pos_as_xyz\n  1  'x, y, z'\nloop_\n _atom_type_symbol\n _atom_type_oxidation_number\n  In3+  3.0\n  Cu+  1.0\n  S2-  -2.0\nloop_\n _atom_site_type_symbol\n _atom_site_label\n _atom_site_symmetry_multiplicity\n _atom_site_fract_x\n _atom_site_fract_y\n _atom_site_fract_z\n _atom_site_occupancy\n  Cu+  Cu4  1  0.25  0.75  0.5  1.0\n  Cu+  Cu5  1  0.0  0.0  0.0  1.0\n  In3+  In0  1  0.5  0.5  0.0  1.0\n  In3+  In1  1  0.75  0.25  0.5  1.0\n  S2-  S8  1  0.9  0.88  0.25  1.0\n  S2-  S9  1  0.62  0.1  0.75  1.0\n  S2-  S10  1  0.35  0.38  0.25  1.0\n  S2-  S11  1  0.12  0.65  0.75  1.0\n", 'slice': 'Cu Cu In In S S S S 0 7 o o o 0 4 - o o 0 6 o o o 0 5 o + o 1 4 - - o 1 5 - o - 1 7 o - - 1 6 o o o 2 6 o o o 2 7 o o - 2 5 o o - 2 4 o o o 3 5 o o o 3 6 o o o 3 4 o - o 3 7 + o o ', 'atoms_params': 'Cu Cu In In S S S S 5.52 5.52 6.8 113 113 90', 'crystal_llm_rep': '5.5 5.5 6.8\n113 113 90\nCu+\n0.25 0.75 0.50\nCu+\n0.00 0.00 0.00\nIn3+\n0.50 0.50 0.00\nIn3+\n0.75 0.25 0.50\nS2-\n0.90 0.87 0.25\nS2-\n0.62 0.10 0.75\nS2-\n0.35 0.37 0.25\nS2-\n0.12 0.65 0.75', 'zmatrix': 'Cu\nCu 1 3.9\nIn 2 3.9 1 60\nIn 1 3.9 2 60 3 -71\nS 3 2.5 4 90 1 93\nS 4 2.5 2 90 1 93\nS 1 2.3 2 32 3 -35\nS 1 2.3 7 111 6 -32'}
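The crystal_llm_rep string in the output above has a simple line layout: lattice lengths, lattice angles, then alternating lines of species label and fractional coordinates. As an illustration only, the following sketch splits such a string back into its parts. The helper name parse_crystal_llm_rep is hypothetical and is not part of MatText; the layout is inferred from the example output, not from a format specification.

```python
def parse_crystal_llm_rep(text):
    """Split a crystal_llm_rep string into lengths, angles, and sites.

    Hypothetical helper: the layout is inferred from the example output.
    """
    lines = text.strip().split("\n")
    lengths = [float(x) for x in lines[0].split()]
    angles = [float(x) for x in lines[1].split()]
    sites = []
    # Remaining lines alternate: species label, then fractional coordinates.
    for species, coords in zip(lines[2::2], lines[3::2]):
        sites.append((species, tuple(float(x) for x in coords.split())))
    return lengths, angles, sites


# The crystal_llm_rep value from the InCuS2 output above
rep = (
    "5.5 5.5 6.8\n113 113 90\n"
    "Cu+\n0.25 0.75 0.50\nCu+\n0.00 0.00 0.00\n"
    "In3+\n0.50 0.50 0.00\nIn3+\n0.75 0.25 0.50\n"
    "S2-\n0.90 0.87 0.25\nS2-\n0.62 0.10 0.75\n"
    "S2-\n0.35 0.37 0.25\nS2-\n0.12 0.65 0.75"
)
lengths, angles, sites = parse_crystal_llm_rep(rep)
print(lengths, angles, len(sites))
```

Because the representation rounds its numbers, the parsed values are approximations of the original structure rather than an exact round trip.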

Supported Text Representations

The TextRep class currently supports the following text representations:

  • SLICES (slices): SLICES representation of the crystal structure.
  • Composition (composition): Chemical composition in Hill notation.
  • CIF Symmetrized (cif_symmetrized): Multi-line CIF representation with symmetrized structure and rounded float numbers.
  • CIF \(P_1\) (cif_p1): Multi-line CIF representation with the conventional unit cell and rounded float numbers.
  • Crystal-text-LLM Representation (crystal_text_llm): Representation following the format specified in the cited work.
  • Robocrystallographer Representation (robocrys_rep): Representation generated by Robocrystallographer.
  • Atom Sequences (atom_sequences): List of atoms inside the unit cell.
  • Atom Sequences++ (atom_sequences_plusplus): List of atoms with lattice parameters.
  • Z-Matrix (zmatrix): Z-Matrix representation of the crystal structure.
  • Local-Env (local_env): List of Wyckoff labels and SMILES strings, separated by line breaks, one entry per local environment.

For more details on each representation and how to obtain them, refer to the respective method documentation in the TextRep class.
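To make the Hill notation used by the composition representation concrete, here is a minimal sketch of the ordering rule: if carbon is present, C comes first, then H, then all other elements alphabetically; without carbon, all elements (including H) are sorted alphabetically. The function hill_formula is a hypothetical illustration, not MatText's implementation, and the exact spacing or reduction of counts in MatText's output may differ.

```python
def hill_formula(counts):
    """Return a formula string with elements ordered by the Hill system.

    `counts` maps element symbols to integer counts. Hypothetical helper.
    """
    symbols = sorted(counts)
    if "C" in counts:
        # With carbon present: C first, then H, then the rest alphabetically.
        rest = [s for s in symbols if s not in ("C", "H")]
        symbols = ["C"] + (["H"] if "H" in counts else []) + rest
    # Without carbon, plain alphabetical order (H included) is kept.
    parts = []
    for symbol in symbols:
        count = counts[symbol]
        parts.append(symbol if count == 1 else f"{symbol}{count}")
    return "".join(parts)


# Unit-cell contents of the InCuS2 example (In2 Cu2 S4)
print(hill_formula({"In": 2, "Cu": 2, "S": 4}))
```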


The TextRep class supports various transformations that can be applied to the input structure.

Permute Structure

The permute_structure transformation randomly permutes the order of atoms in a structure.

from mattext.representations import TextRep
from pymatgen.core.structure import Structure

structure_1 = Structure.from_file("N2_p1.cif", "cif")

transformations = [("permute_structure", {"seed": 42})]

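text_rep = TextRep.from_input(structure_1, transformations)
text_representations_requested = ["atoms", "crystal_llm_rep"]
print("Permuted Pymatgen Structure:")
# Assumption: the transformed structure is exposed as the `structure` attribute
print(text_rep.structure)
print("Permuted Text Representations:")
print(text_rep.get_requested_text_reps(text_representations_requested))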
Permuted Pymatgen Structure:
Full Formula (N4)
Reduced Formula: N2
abc   :   5.605409   5.605409   5.605409
angles:  90.000000  90.000000  90.000000
pbc   :       True       True       True
Sites (4)
#  SP          a        b        c
---  ----  -------  -------  -------
0  N0+   0.02321  0.02321  0.02321
1  N0+   0.97679  0.52321  0.47679
2  N0+   0.52321  0.47679  0.97679
3  N0+   0.47679  0.97679  0.52321
Permuted Text Representations:
{'atoms': 'N N N N', 'crystal_llm_rep': '5.6 5.6 5.6\n90 90 90\nN0+\n0.02 0.02 0.02\nN0+\n0.98 0.52 0.48\nN0+\n0.52 0.48 0.98\nN0+\n0.48 0.98 0.52'}
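The idea behind a seeded permutation can be sketched without MatText: the site order is shuffled by a random generator initialised with the seed, so the set of sites is unchanged and the same seed always gives the same order. The function permute_sites and the site list below are hypothetical illustrations, not the library's internals.

```python
import random


def permute_sites(sites, seed):
    """Return the sites in a random order determined by `seed`."""
    rng = random.Random(seed)
    order = list(range(len(sites)))
    rng.shuffle(order)
    return [sites[i] for i in order]


# Illustrative (species, fractional coordinates) pairs
sites = [
    ("Cu", (0.25, 0.75, 0.50)),
    ("Cu", (0.00, 0.00, 0.00)),
    ("In", (0.50, 0.50, 0.00)),
    ("In", (0.75, 0.25, 0.50)),
]
permuted = permute_sites(sites, seed=42)
print(permuted)
```

Only the ordering changes, which matters for text representations because they list atoms sequentially.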

Translate Structure

The translate_structure transformation translates all atoms in a structure by a specified vector. This can be used to simulate small displacements of the structure.

transformations = [("translate_structure", {"seed": 42, "vector": [0.1, 0.1, 0.1]})]

text_rep = TextRep.from_input(structure_1, transformations)
text_representations_requested = ["crystal_llm_rep"]
print("Translated Crystal-text-LLM Representations:")
print(text_rep.get_requested_text_reps(text_representations_requested))
Translated Crystal-text-LLM Representations:
{'crystal_llm_rep': '5.6 5.6 5.6\n90 90 90\nN0+\n0.58 0.08 0.62\nN0+\n0.08 0.62 0.58\nN0+\n0.12 0.12 0.12\nN0+\n0.62 0.58 0.08'}
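The coordinates in this output are consistent with adding the vector to each fractional coordinate and wrapping the result back into the unit cell. A minimal sketch of that arithmetic, assuming the vector is given in fractional coordinates (translate_frac is a hypothetical helper, not MatText's code):

```python
def translate_frac(frac_coords, vector):
    """Shift fractional coordinates by `vector` and wrap them into [0, 1)."""
    return [
        tuple((x + dx) % 1.0 for x, dx in zip(site, vector))
        for site in frac_coords
    ]


# Fractional coordinates of the N2 structure printed earlier
n2_sites = [
    (0.02321, 0.02321, 0.02321),
    (0.97679, 0.52321, 0.47679),
    (0.52321, 0.47679, 0.97679),
    (0.47679, 0.97679, 0.52321),
]
shifted = translate_frac(n2_sites, [0.1, 0.1, 0.1])
for site in shifted:
    print(" ".join(f"{x:.2f}" for x in site))
```

Rounded to two decimals, the shifted coordinates coincide with those in the output above, site order aside.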

Augmenting Data

In principle, we can generate valid text representations by applying random transformations with physically meaningful parameters. A dummy example is shown below:

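import numpy as np
from mattext.representations import TextRep

# structure_2 is assumed to be a pymatgen Structure loaded beforehand
# (the output below corresponds to a Sr-Ti-O cell).

# Draw three random translation vectors and run the same pipeline with each
translation_vectors = np.random.uniform(low=0.1, high=0.5, size=(3, 3))
for vector in translation_vectors:
    transformations = [
        ("permute_structure", {"seed": 42}),
        ("perturb_structure", {"seed": 42, "max_distance": 0.1}),
        ("translate_structure", {"seed": 42, "vector": vector.tolist()}),
    ]
    text_rep = TextRep.from_input(structure_2, transformations)
    text_representations_requested = ["crystal_llm_rep"]
    print("Translated Text Representations:")
    print(text_rep.get_requested_text_reps(text_representations_requested))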
Translated Text Representations:
{'crystal_llm_rep': '3.9 3.9 3.9\n90 90 90\nO2-\n0.76 0.98 0.41\nTi4+\n0.77 0.98 0.89\nO2-\n0.76 0.49 0.89\nO2-\n0.26 0.97 0.88\nSr2+\n0.25 0.47 0.38'}
Translated Text Representations:
{'crystal_llm_rep': '3.9 3.9 3.9\n90 90 90\nO2-\n0.85 0.66 0.18\nTi4+\n0.86 0.66 0.66\nO2-\n0.85 0.17 0.66\nO2-\n0.35 0.65 0.65\nSr2+\n0.34 0.15 0.15'}
Translated Text Representations:
{'crystal_llm_rep': '3.9 3.9 3.9\n90 90 90\nO2-\n0.63 0.94 0.35\nTi4+\n0.64 0.94 0.84\nO2-\n0.64 0.45 0.84\nO2-\n0.13 0.94 0.83\nSr2+\n0.12 0.43 0.33'}
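The list of (name, parameters) tuples used above suggests a simple pipeline: each entry names a transformation, and the entries are applied in order. The sketch below imitates that pattern with plain Python to show how several augmented variants of one structure can be produced. The registry, the function names, and the idealised perovskite coordinates are all hypothetical; this is not how MatText is implemented internally.

```python
import random


def permute(sites, seed):
    """Return the sites in a seeded random order."""
    return random.Random(seed).sample(sites, k=len(sites))


def translate(sites, vector, seed=None):
    """Shift every fractional coordinate by `vector`, wrapping into [0, 1)."""
    # `seed` is accepted only to keep a uniform signature; it is unused here.
    return [
        (species, tuple((x + d) % 1.0 for x, d in zip(coords, vector)))
        for species, coords in sites
    ]


# Hypothetical registry mapping transformation names to functions
REGISTRY = {"permute_structure": permute, "translate_structure": translate}


def apply_transformations(sites, transformations):
    """Apply each (name, kwargs) entry in order and return the new sites."""
    for name, kwargs in transformations:
        sites = REGISTRY[name](sites, **kwargs)
    return sites


# Idealised cubic perovskite sites, used purely as illustrative input
sites = [
    ("Sr", (0.0, 0.0, 0.0)),
    ("Ti", (0.5, 0.5, 0.5)),
    ("O", (0.5, 0.5, 0.0)),
    ("O", (0.5, 0.0, 0.5)),
    ("O", (0.0, 0.5, 0.5)),
]

rng = random.Random(0)
variants = []
for _ in range(3):
    vector = [rng.uniform(0.1, 0.5) for _axis in range(3)]
    pipeline = [
        ("permute_structure", {"seed": 42}),
        ("translate_structure", {"seed": 42, "vector": vector}),
    ]
    variants.append(apply_transformations(sites, pipeline))

for variant in variants:
    print(variant)
```

Each variant describes the same crystal with a different atom order and origin, which is what makes such transformations usable for data augmentation.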

More examples are available as notebooks in the repository.

The following transformations are available for transforming structures:

Randomly Permute Structure

permute_structure randomly permutes the order of atoms in a structure.

Randomly Translate Single Atom

translate_single_atom randomly translates one or more atoms in a structure.

Randomly Perturb Structure

perturb_structure randomly perturbs atoms in a structure.
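One common way to perturb atoms is to move each one in a random direction by a random distance that does not exceed a given maximum. The sketch below shows this in Cartesian coordinates with the standard library only. It is an assumption about what such a perturbation looks like, not a description of MatText's code; the function perturb and its parameters are hypothetical, with max_distance borrowed from the example parameters used earlier.

```python
import math
import random


def perturb(cart_coords, max_distance, seed):
    """Displace each point by at most `max_distance` in a random direction."""
    rng = random.Random(seed)
    moved = []
    for x, y, z in cart_coords:
        # A normalised Gaussian sample gives a uniformly random direction.
        dx, dy, dz = (rng.gauss(0.0, 1.0) for _ in range(3))
        norm = math.sqrt(dx * dx + dy * dy + dz * dz) or 1.0
        # Displacement length drawn from [0, max_distance].
        r = rng.uniform(0.0, max_distance)
        moved.append((x + r * dx / norm, y + r * dy / norm, z + r * dz / norm))
    return moved


# Illustrative Cartesian positions in angstrom
coords = [(0.0, 0.0, 0.0), (1.95, 1.95, 1.95), (1.95, 1.95, 0.0)]
moved = perturb(coords, max_distance=0.1, seed=42)
for before, after in zip(coords, moved):
    print(before, "->", after)
```

Reusing the same seed reproduces the same displacements, which keeps augmented datasets reproducible.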

Randomly Translate Structure

translate_structure randomly translates the atoms in a structure.


This transformation supports additional keyword arguments for fine-tuning the translation.

MatText relies on methods from pymatgen here and supports all keyword arguments of the Structure.translate_sites method.

All transformations accept a seed value for reproducibility, along with additional parameters for customization.

For more details on each transformation and its parameters, refer to the respective function documentation.