Lessons Learned in Tuning and Using LLMs¶
LoRA Fine-tuning for Large Language Models¶
- Adjusting the LoRA rank is essential, and so is selecting an apt alpha value. A good heuristic is to set alpha to twice the rank: `scaling = alpha / r`, `weight += (lora_B @ lora_A) * scaling` (see the code sketch at the end of this section).
- Read "LoRA Learns Less and Forgets Less", which concludes: "To conclude, based on our main results and hyperparameter sweeps, we recommend: (a) using LoRA for instruction finetuning and not continued pretraining; (b) if GPU memory allows, targeting “All” transformer modules with a rank of 256, since ranks 16-64 tend not to suffice for code tasks; (c) using α = 2r, and (d) sweeping over learning rates between [1e-5, 5e-4], picking the highest value that enables stable training."
- When merging LoRA adapters into the base model, take care with the numeric precision (e.g. merge into a full- or half-precision copy of the base weights rather than a quantized one).
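A minimal sketch of these two points, assuming the Hugging Face `peft` and `transformers` libraries (the model name and the adapter/output paths are placeholders, and `target_modules="all-linear"` requires a recent `peft` version): the adapter is configured with α = 2r on all linear modules, and the merge back into the base model happens in bfloat16 rather than into a quantized copy.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, PeftModel, get_peft_model

base_name = "meta-llama/Llama-3.1-8B"  # placeholder model

# alpha = 2 * r heuristic; "all-linear" targets every linear layer.
lora_cfg = LoraConfig(r=256, lora_alpha=512, target_modules="all-linear", task_type="CAUSAL_LM")

base = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.bfloat16)
model = get_peft_model(base, lora_cfg)
# ... run your SFT loop / Trainer here and save the adapter ...

# Merge the adapter into a bf16 copy of the base weights, not a quantized model.
base_fp = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base_fp, "path/to/adapter").merge_and_unload()
merged.save_pretrained("path/to/merged-model")
```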
Learning rate (LR) and LR schedule¶
- LR, LR warmup ratio, and LR schedule seem to be very sensitive hyperparameters.
- A typical choice is to start with a small learning rate (e.g. 5e-5) and some warmup steps (a warmup ratio of 0.1 means 10% of training is dedicated to a linear warmup during which the learning rate gradually increases).
- A warmup ratio of 0.03 is also common (e.g. the OLMo-2 models); I (Nawaf) think we can go for a smaller ratio like this if we train for more epochs or more tokens.
- Warmup is usually followed by cosine annealing of the learning rate.
- Note: some libraries (e.g. transformers) expose both a warmup ratio and a number of warmup steps. It is up to the user to decide which one to use; what is important is to set only one of them, because if both are set and they disagree, one takes precedence over the other.
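As a sketch, here is how such a setup looks with the Hugging Face `transformers` `TrainingArguments` (the values are illustrative, not a recommendation for any particular run); note that only `warmup_ratio` is set, not `warmup_steps`:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",            # placeholder path
    learning_rate=5e-5,          # small starting LR
    warmup_ratio=0.1,            # 10% of training steps used for linear warmup
    lr_scheduler_type="cosine",  # cosine annealing after the warmup
    num_train_epochs=3,
)
```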
Completion only loss¶
- During SFT, there is an option to compute the loss only on the completion (i.e. only on the answer tokens, with the prompt masked out).
- For MatText I (Nawaf) tried both completion-only and full-prompt loss. Completion-only worked better for me (generations were numbers, as expected; in the other case the model sometimes generated the prompt/material representation instead).
- To set up the completion-only loss, some libraries (e.g. TRL) ask for a response template string; the loss is then computed only on the tokens that follow this template, and everything before it is masked. We need to be careful here, because with some tokenizers the token ids of this template string change depending on the surrounding context, for example if the preceding tokens are special tokens or if there is a space before the template. How this works is that the library first tokenizes the template string on its own to get its token ids, and during training it searches for those ids in each example; it will not find them if the ids changed because of, say, a leading space (see the sketch below).
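A minimal sketch of one workaround, assuming TRL's `DataCollatorForCompletionOnlyLM` (the model name and response template are placeholders, and the exact slice depends on the tokenizer): tokenize the template with some preceding context, drop the context ids, and pass the resulting ids to the collator instead of the raw string.

```python
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder model
response_template = "### Answer:"  # placeholder template marking where the completion starts

# Tokenized in isolation, the template can get different token ids than when it
# appears after a newline inside the full prompt. Encode it with a preceding
# newline, drop the context tokens, and hand the ids to the collator.
template_ids = tokenizer.encode("\n" + response_template, add_special_tokens=False)[2:]
collator = DataCollatorForCompletionOnlyLM(response_template=template_ids, tokenizer=tokenizer)
```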
Template and tokenizers¶
- Check `tokenizer_config.json`: the EOS token, BOS token, etc. are sometimes set incorrectly. There are also cases where models were trained without some of these tokens, but the libraries still set them because their framework needs them to be defined.
- Check the padding side (`padding_side=left/right`).
- Apply the recommended chat/prompt template when doing instruction fine-tuning.
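A quick sanity-check sketch with `transformers` (the model name and prompt are placeholders): print the special tokens and padding side that actually ended up in the tokenizer config, and build prompts via the tokenizer's own chat template rather than hand-written formatting.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # placeholder model

# Verify what is really set in tokenizer_config.json.
print(tok.bos_token, tok.eos_token, tok.pad_token, tok.padding_side)

# Use the model's recommended template instead of hand-rolled prompt strings.
messages = [{"role": "user", "content": "What is the band gap of silicon?"}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```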
DP and DDP¶
- If we are using multiple GPUs, it is better to use DDP than naive pipeline parallelism or DataParallel (DP).
- Note: DP gets extremely slow if the CPUs are shared with someone else (for example, if you do not reserve the CPUs when launching the training and someone else launches CPU jobs on the same node); the GPUs then spend most of their time waiting. You can see this in `nvidia-smi`: `Volatile GPU-Util` will be 0% even though a process is running on that GPU. With DDP, training can even crash: if a process waits for communication longer than the (configurable) timeout because the CPU is busy with the shared jobs, you run into a timeout error.
- With DDP, configure `wandb` to run only in the main process; otherwise you get multiple runs, which is confusing (see the sketch below). Even better, there are configurations to group runs from multiple processes.
- Ensure that each device contributes the same number of samples per global batch.
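A minimal sketch of the `wandb` point, assuming a `torchrun`-style DDP launch where every process sees a `RANK` environment variable (the project and run names are placeholders):

```python
import os
import wandb

# Under torchrun, each DDP process gets a RANK; only rank 0 should create a run.
is_main_process = int(os.environ.get("RANK", "0")) == 0

if is_main_process:
    wandb.init(project="llm-finetuning", name="sft-ddp-run")  # placeholder names

# ... training loop ...
if is_main_process:
    wandb.log({"loss": 0.0})  # log metrics only from the main process
```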
Managing configs¶
- Hydra is very nice for managing configs, especially for hyperparameter sweeps. It also supports plugins like `submitit`, which makes it easier to launch multiple Slurm jobs.
- One issue that I (Nawaf) faced, however, is that the Hydra CLI would sometimes override the `torchtune` or `accelerate` CLI and its variables if you try to use both of them together.
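A minimal Hydra sketch (the config keys and paths are placeholders), with the sweep dispatched as Slurm jobs via the `hydra-submitit-launcher` plugin:

```python
import hydra
from omegaconf import DictConfig

@hydra.main(config_path="conf", config_name="train", version_base=None)
def main(cfg: DictConfig) -> None:
    # cfg.lr and cfg.lora.r are placeholder keys defined in conf/train.yaml
    print(cfg.lr, cfg.lora.r)

if __name__ == "__main__":
    main()

# Sweep over learning rates, one Slurm job per value (requires hydra-submitit-launcher):
#   python train.py -m lr=1e-5,5e-5,1e-4 hydra/launcher=submitit_slurm
```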
PEFT¶
- When training a quantized (k-bit) model with adapters, remember to call `prepare_model_for_kbit_training` on the base model (typically before attaching the adapters with `get_peft_model`).
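A minimal sketch of the usual k-bit + adapter flow with `peft` and `bitsandbytes` (the model name and LoRA settings are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",  # placeholder model
    quantization_config=bnb,
)

# Casts layer norms to fp32, enables gradient checkpointing / input grads, etc.
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
```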
During Evaluation¶
- For deterministic generation the temperature should be zero; some libraries expect you to set `do_sample=False` to get greedy (temperature-zero) decoding.
- Set the model to evaluation mode (`model.eval()`).
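A minimal sketch with a `transformers` causal LM (the model name and prompt are placeholders): the model is switched to evaluation mode and decoding is greedy via `do_sample=False`.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()  # evaluation mode: disables dropout

inputs = tok("The band gap of silicon is", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, do_sample=False, max_new_tokens=32)  # greedy decoding
print(tok.decode(out[0], skip_special_tokens=True))
```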
Checkpointing¶
- Also save the optimizer state if you want to be able to restart (resume) training.
- For LoRA, the base model weights need not be saved; the adapter weights are enough (see the sketch below).
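A sketch of the corresponding `transformers` `TrainingArguments` (values and paths are placeholders): periodic checkpoints include the optimizer and scheduler state, so an interrupted run can be resumed; with a `peft`-wrapped model, `save_pretrained` on the model writes only the adapter weights, not the full base model.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="checkpoints/sft-run",  # placeholder path
    save_strategy="steps",
    save_steps=500,                    # each checkpoint includes optimizer/scheduler state
    save_total_limit=2,
)
# Resume later with: trainer.train(resume_from_checkpoint=True)
```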
Determinism¶
Note that there is quite some nuance to non-determinism when using LLMs; even the batch size you use can matter. Please read this blog post.