# MatText Tokenizers
MatText comes with custom tokenizers suitable for modelling text representations. The currently available tokenizers are `SliceTokenizer`, `CompositionTokenizer`, `CifTokenizer`, `RobocrysTokenizer`, and `SmilesTokenizer`. All MatText representations can be translated to tokens using one of these tokenizers.
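The examples in this section use `SliceTokenizer`. As a minimal sketch, the other tokenizers are assumed to be importable from the same `mattext.tokenizer` module and to expose the same interface; this is an assumption, not something shown in this section:

```python
# Sketch only: assumes CompositionTokenizer is exported from mattext.tokenizer
# and follows the same constructor/interface as SliceTokenizer shown below.
from mattext.tokenizer import CompositionTokenizer

comp_tokenizer = CompositionTokenizer(model_max_length=512)
print(comp_tokenizer.tokenize("Ga2 P2"))  # exact output depends on the representation
```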
## Using MatText Tokenizers
By default, the tokenizer is initialized with `[CLS]` and `[SEP]` tokens. For an example, see the `SliceTokenizer` usage:
```python
from mattext.tokenizer import SliceTokenizer

tokenizer = SliceTokenizer(
    model_max_length=512,
    truncation=True,
    padding="max_length",
    max_length=512
)
print(tokenizer.cls_token)
print(tokenizer.tokenize("Ga Ga P P 0 3 - - o 0 2 - o - 0 1 o - -"))
```
Output:

```
[CLS]
['[CLS]', 'Ga', 'Ga', 'P', 'P', '0', '3', '- - o', '0', '2', '- o -', '0', '1', 'o - -', '[SEP]']
```
> **Tip:** You can access the `[CLS]` token using the `cls_token` attribute of the tokenizer. During decoding, you can use the `skip_special_tokens` parameter to skip these special tokens.
```python
token_ids = tokenizer.encode("Ga Ga P P 0 3 - - o 0 2 - o - 0 1 o - -")
print(token_ids)

decoded = tokenizer.decode(token_ids, skip_special_tokens=True)
print(decoded)
```
Output:

```
[149, 57, 57, 41, 41, 139, 142, 24, 139, 141, 20, 139, 140, 8, 150]
Ga Ga P P 0 3 - - o 0 2 - o - 0 1 o - -
```
## Initializing Tokenizers With Custom Special Tokens
In scenarios where the `[CLS]` token is not required, you can initialize the tokenizer with an empty `special_tokens` dictionary. Initialization without `[CLS]` and `[SEP]` tokens:
```python
tokenizer = SliceTokenizer(
    model_max_length=512,
    special_tokens={},
    truncation=True,
    padding="max_length",
    max_length=512
)
```
All MatText tokenizer instances inherit from `PreTrainedTokenizer` and accept arguments compatible with the Hugging Face tokenizers.
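Because they subclass `PreTrainedTokenizer`, the usual Hugging Face calling conventions should also apply. The following is a minimal sketch under that assumption (batch encoding via the tokenizer's `__call__`), not an API documented in this section:

```python
# Sketch: batch-encode two strings using the inherited Hugging Face interface.
# Assumption: SliceTokenizer behaves like any PreTrainedTokenizer here.
from mattext.tokenizer import SliceTokenizer

tokenizer = SliceTokenizer(model_max_length=512)
batch = tokenizer(
    ["Ga Ga P P 0 3 - - o", "H He"],
    padding="max_length",
    truncation=True,
    max_length=32,
)
print(batch["input_ids"])       # padded token-id lists
print(batch["attention_mask"])  # 1 for real tokens, 0 for padding
```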
## Tokenizers With Special Number Tokenization
The `special_num_token` argument (`False` by default) can be set to `True` to tokenize numbers in the special way designed and implemented by RegressionTransformer, where numbers are split into digit-level tokens that also encode the decimal place.
```python
tokenizer = SliceTokenizer(
    special_num_token=True,
    model_max_length=512,
    special_tokens={},
    truncation=True,
    padding="max_length",
    max_length=512
)
tokenizer.tokenize("H2SO4")
```
Output:

```
['H', '_2_0_', 'S', 'O', '_4_0_']
```
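For multi-digit numbers, the same scheme is expected to emit one token per digit together with its decimal place. The expected output in the sketch below is an assumption based on the RegressionTransformer scheme, not a documented result:

```python
# Sketch: multi-digit numbers under special_num_token=True.
# The commented output is an assumption, not taken from the documentation.
print(tokenizer.tokenize("Fe12O19"))
# e.g. ['Fe', '_1_1_', '_2_0_', 'O', '_1_1_', '_9_0_']
```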
## Updating Vocabulary in the Tokenizers
The MatText tokenizers allow you to update the vocabulary or load a custom vocabulary file. The default vocabulary can be inspected via the `vocab` attribute.
```python
tokenizer = SliceTokenizer()
print(tokenizer.vocab)
```
Output:

```
{'o o o': 0, 'o o +': 1, 'o o -': 2, 'o + o': 3, 'o + +': 4, 'o + -': 5, 'o - o': 6, 'o - +': 7, 'o - -': 8, '+ o o': 9, '+ o +': 10, '+ o -': 11, '+ + o': 12, '+ + +': 13, '+ + -': 14, '+ - o': 15, '+ - +': 16, '+ - -': 17, '- o o': 18, '- o +': 19, '- o -': 20, '- + o': 21, '- + +': 22, '- + -': 23, '- - o': 24, '- - +': 25, '- - -': 26, 'H': 27, 'He': 28, 'Li': 29, 'Be': 30, 'B': 31, 'C': 32, 'N': 33, 'O': 34, 'F': 35, 'Ne': 36, 'Na': 37, 'Mg': 38, 'Al': 39, 'Si': 40, 'P': 41, 'S': 42, 'Cl': 43, 'K': 44, 'Ar': 45, 'Ca': 46, 'Sc': 47, 'Ti': 48, 'V': 49, 'Cr': 50, 'Mn': 51, 'Fe': 52, 'Ni': 53, 'Co': 54, 'Cu': 55, 'Zn': 56, 'Ga': 57, 'Ge': 58, 'As': 59, 'Se': 60, 'Br': 61, 'Kr': 62, 'Rb': 63, 'Sr': 64, 'Y': 65, 'Zr': 66, 'Nb': 67, 'Mo': 68, 'Tc': 69, 'Ru': 70, 'Rh': 71, 'Pd': 72, 'Ag': 73, 'Cd': 74, 'In': 75, 'Sn': 76, 'Sb': 77, 'Te': 78, 'I': 79, 'Xe': 80, 'Cs': 81, 'Ba': 82, 'La': 83, 'Ce': 84, 'Pr': 85, 'Nd': 86, 'Pm': 87, 'Sm': 88, 'Eu': 89, 'Gd': 90, 'Tb': 91, 'Dy': 92, 'Ho': 93, 'Er': 94, 'Tm': 95, 'Yb': 96, 'Lu': 97, 'Hf': 98, 'Ta': 99, 'W': 100, 'Re': 101, 'Os': 102, 'Ir': 103, 'Pt': 104, 'Au': 105, 'Hg': 106, 'Tl': 107, 'Pb': 108, 'Bi': 109, 'Th': 110, 'Pa': 111, 'U': 112, 'Np': 113, 'Pu': 114, 'Am': 115, 'Cm': 116, 'Bk': 117, 'Cf': 118, 'Es': 119, 'Fm': 120, 'Md': 121, 'No': 122, 'Lr': 123, 'Rf': 124, 'Db': 125, 'Sg': 126, 'Bh': 127, 'Hs': 128, 'Mt': 129, 'Ds': 130, 'Rg': 131, 'Cn': 132, 'Nh': 133, 'Fl': 134, 'Mc': 135, 'Lv': 136, 'Ts': 137, 'Og': 138, '0': 139, '1': 140, '2': 141, '3': 142, '4': 143, '5': 144, '6': 145, '7': 146, '8': 147, '9': 148, '[CLS]': 149, '[SEP]': 150}
```
```python
# Path to your custom vocabulary file (JSON or TXT format)
vocab_file_path = "path/to/your/vocab_file.json"

tokenizer = SliceTokenizer(
    special_num_token=True,
    model_max_length=512,
    special_tokens={},
    truncation=True,
    padding="max_length",
    max_length=512,
    vocab_file=vocab_file_path
)
```
Here is an example format for the vocabulary JSON file:
```python
import json
import tempfile

# New vocabulary mapping tokens to ids
new_vocab = {
    "H": 1,
    "He": 2,
    "New_Atom": 3,
    "1": 4,
    "2": 5,
}

# Write the vocabulary dictionary to a temporary JSON file
with tempfile.NamedTemporaryFile(delete=False, mode="w", suffix=".json") as temp_file:
    json.dump(new_vocab, temp_file)
    temp_file_path = temp_file.name

tokenizer = SliceTokenizer(
    model_max_length=512,
    truncation=True,
    padding="max_length",
    max_length=512,
    vocab_file=temp_file_path
)
print(tokenizer.tokenize("H He Na New_Atom 1 2 9"))
print(tokenizer.vocab)
```
Note that the vocabulary dictionary is written to a temporary file here because the tokenizer class accepts only file paths.
Output:

```
['[CLS]', 'H', 'He', 'New_Atom', '1', '2', '[SEP]']
{'H': 1, 'He': 2, 'New_Atom': 3, '1': 4, '2': 5, '[CLS]': 6, '[SEP]': 7}
```

Atoms that are not in the vocabulary are ignored, while newly added atoms are tokenized correctly. As the printed vocabulary shows, the `[CLS]` and `[SEP]` tokens are appended to the custom vocabulary automatically.
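If you want to extend the default vocabulary rather than replace it, one option is to merge new tokens into the dictionary returned by `vocab` and load the result the same way. The following is a sketch under the assumption that tokens appended to the vocabulary file behave like the built-in ones; the new token name is purely illustrative:

```python
# Sketch: extend the default SliceTokenizer vocabulary with extra tokens.
# Assumption: appended tokens are handled like the built-in vocabulary entries.
import json
import tempfile
from mattext.tokenizer import SliceTokenizer

base_vocab = dict(SliceTokenizer().vocab)      # default vocabulary
for token in ["New_Atom"]:                     # illustrative new token(s)
    if token not in base_vocab:
        base_vocab[token] = len(base_vocab)    # append with the next free id

with tempfile.NamedTemporaryFile(delete=False, mode="w", suffix=".json") as f:
    json.dump(base_vocab, f)
    merged_vocab_path = f.name

tokenizer = SliceTokenizer(vocab_file=merged_vocab_path)
print(tokenizer.tokenize("Ga New_Atom P"))
```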