Lessons Learned in Tuning and Using LLMs¶
LoRA Fine-tuning for Large Language Models¶
- Adjusting the LoRA rank is essential, and so is selecting an apt alpha value. A good heuristic is to set alpha to twice the rank: `scaling = alpha / r`, `weight += (lora_B @ lora_A) * scaling` (see the code sketch at the end of this section).
- Read "LoRA Learns Less and Forgets Less", which concludes: "To conclude, based on our main results and hyperparameter sweeps, we recommend: (a) using LoRA for instruction finetuning and not continued pretraining; (b) if GPU memory allows, targeting “All” transformer modules with a rank of 256, since ranks 16-64 tend not to suffice for code tasks; (c) using α = 2r, and (d) sweeping over learning rates between [1e-5, 5e-4], picking the highest value that enables stable training."
- When merging LoRA adapters into the base model, take care with the numeric precision (e.g. merge into a full- or half-precision copy of the base weights rather than a quantized one).
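A minimal sketch of these two points, assuming the Hugging Face `peft` and `transformers` libraries (the model name and the adapter/output paths are placeholders, and `target_modules="all-linear"` requires a recent `peft` version): the adapter is configured with α = 2r on all linear modules, and the merge back into the base model happens in bfloat16 rather than into a quantized copy.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, PeftModel, get_peft_model

base_name = "meta-llama/Llama-3.1-8B"  # placeholder model

# alpha = 2 * r heuristic; "all-linear" targets every linear layer.
lora_cfg = LoraConfig(r=256, lora_alpha=512, target_modules="all-linear", task_type="CAUSAL_LM")

base = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.bfloat16)
model = get_peft_model(base, lora_cfg)
# ... run your SFT loop / Trainer here and save the adapter ...

# Merge the adapter into a bf16 copy of the base weights, not a quantized model.
base_fp = AutoModelForCausalLM.from_pretrained(base_name, torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base_fp, "path/to/adapter").merge_and_unload()
merged.save_pretrained("path/to/merged-model")
```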
Learning rate (LR) and LR schedule¶
- LR, LR warmup ratio, and LR schedule seem to be very sensitive hyperparameters.
- A typical choice is to start with a small learning rate (e.g. 5e-5) and some warmup steps (a warmup ratio of 0.1 means 10% of training is dedicated to a linear warmup during which the learning rate gradually increases).
- A warmup ratio of 0.03 is also common (e.g. the OLMo-2 models); I (Nawaf) think we can go for a smaller ratio like this if we train for more epochs or more tokens.
- Warmup is usually followed by cosine annealing of the learning rate.
- Note: some libraries (e.g. transformers) expose both a warmup ratio and a number of warmup steps. It is up to the user to decide which one to use; what is important is to set only one of them, because if both are set and they disagree, one takes precedence over the other.
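As a sketch, here is how such a setup looks with the Hugging Face `transformers` `TrainingArguments` (the values are illustrative, not a recommendation for any particular run); note that only `warmup_ratio` is set, not `warmup_steps`:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="out",            # placeholder path
    learning_rate=5e-5,          # small starting LR
    warmup_ratio=0.1,            # 10% of training steps used for linear warmup
    lr_scheduler_type="cosine",  # cosine annealing after the warmup
    num_train_epochs=3,
)
```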
Completion only loss¶
- During SFT, there is an option to compute the loss only on the completion (i.e. only on the answer tokens, with the prompt masked out).
- For MatText I (Nawaf) tried both completion-only and full-prompt loss. Completion-only worked better for me (generations were numbers, as expected; in the other case the model sometimes generated the prompt/material representation instead).
- To set up the completion-only loss, some libraries (e.g. TRL) ask for a response template string; the loss is then computed only on the tokens that follow this template, and everything before it is masked. We need to be careful here, because with some tokenizers the token ids of this template string change depending on the surrounding context, for example if the preceding tokens are special tokens or if there is a space before the template. How this works is that the library first tokenizes the template string on its own to get its token ids, and during training it searches for those ids in each example; it will not find them if the ids changed because of, say, a leading space (see the sketch below).
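A minimal sketch of one workaround, assuming TRL's `DataCollatorForCompletionOnlyLM` (the model name and response template are placeholders, and the exact slice depends on the tokenizer): tokenize the template with some preceding context, drop the context ids, and pass the resulting ids to the collator instead of the raw string.

```python
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder model
response_template = "### Answer:"  # placeholder template marking where the completion starts

# Tokenized in isolation, the template can get different token ids than when it
# appears after a newline inside the full prompt. Encode it with a preceding
# newline, drop the context tokens, and hand the ids to the collator.
template_ids = tokenizer.encode("\n" + response_template, add_special_tokens=False)[2:]
collator = DataCollatorForCompletionOnlyLM(response_template=template_ids, tokenizer=tokenizer)
```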
Template and tokenizers¶
- Check `tokenizer_config.json`: the EOS token, BOS token, etc. are sometimes set incorrectly. There are also cases where models were trained without some of these tokens, but the libraries still set them because their framework needs them to be defined.
- Check the padding side (`padding_side=left/right`).
- Apply the recommended chat/prompt template when doing instruction fine-tuning.
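A quick sanity-check sketch with `transformers` (the model name and prompt are placeholders): print the special tokens and padding side that actually ended up in the tokenizer config, and build prompts via the tokenizer's own chat template rather than hand-written formatting.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # placeholder model

# Verify what is really set in tokenizer_config.json.
print(tok.bos_token, tok.eos_token, tok.pad_token, tok.padding_side)

# Use the model's recommended template instead of hand-rolled prompt strings.
messages = [{"role": "user", "content": "What is the band gap of silicon?"}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```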
DP and DDP¶
- If we are using multiple GPUs, it is better to use DDP than naive pipeline parallelism or DataParallel (DP).
- Note: DP gets extremely slow if the CPUs are shared with someone else (for example, if you do not reserve the CPUs when launching the training and someone else launches CPU jobs on the same node); the GPUs then spend most of their time waiting. You can see this in `nvidia-smi`: `Volatile GPU-Util` will be 0% even though a process is running on that GPU. With DDP, training can even crash: if a process waits for communication longer than the (configurable) timeout because the CPU is busy with the shared jobs, you run into a timeout error.
- With DDP, configure `wandb` to run only in the main process; otherwise you get multiple runs, which is confusing (see the sketch below). Even better, there are configurations to group runs from multiple processes.
- Ensure that each device contributes the same number of samples per global batch.
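A minimal sketch of the `wandb` point, assuming a `torchrun`-style DDP launch where every process sees a `RANK` environment variable (the project and run names are placeholders):

```python
import os
import wandb

# Under torchrun, each DDP process gets a RANK; only rank 0 should create a run.
is_main_process = int(os.environ.get("RANK", "0")) == 0

if is_main_process:
    wandb.init(project="llm-finetuning", name="sft-ddp-run")  # placeholder names

# ... training loop ...
if is_main_process:
    wandb.log({"loss": 0.0})  # log metrics only from the main process
```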
Managing configs¶
- Hydra is very nice for managing configs, especially for hyperparameter sweeps. It also supports plugins like `submitit`, which makes it easier to launch multiple Slurm jobs.
- One issue that I (Nawaf) faced, however, is that the Hydra CLI would sometimes override the `torchtune` or `accelerate` CLI and its variables if you try to use both of them together.
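A minimal Hydra sketch (the config keys and paths are placeholders), with the sweep dispatched as Slurm jobs via the `hydra-submitit-launcher` plugin:

```python
import hydra
from omegaconf import DictConfig

@hydra.main(config_path="conf", config_name="train", version_base=None)
def main(cfg: DictConfig) -> None:
    # cfg.lr and cfg.lora.r are placeholder keys defined in conf/train.yaml
    print(cfg.lr, cfg.lora.r)

if __name__ == "__main__":
    main()

# Sweep over learning rates, one Slurm job per value (requires hydra-submitit-launcher):
#   python train.py -m lr=1e-5,5e-5,1e-4 hydra/launcher=submitit_slurm
```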
PEFT¶
- When training a quantized (k-bit) model with adapters, remember to call `prepare_model_for_kbit_training` on the base model (typically before attaching the adapters with `get_peft_model`).
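A minimal sketch of the usual k-bit + adapter flow with `peft` and `bitsandbytes` (the model name and LoRA settings are placeholders):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",  # placeholder model
    quantization_config=bnb,
)

# Casts layer norms to fp32, enables gradient checkpointing / input grads, etc.
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
```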
During Evaluation¶
- For deterministic generation the temperature should be zero; some libraries expect you to set `do_sample=False` to get greedy (temperature-zero) decoding.
- Set the model to evaluation mode (`model.eval()`).
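A minimal sketch with a `transformers` causal LM (the model name and prompt are placeholders): the model is switched to evaluation mode and decoding is greedy via `do_sample=False`.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()  # evaluation mode: disables dropout

inputs = tok("The band gap of silicon is", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, do_sample=False, max_new_tokens=32)  # greedy decoding
print(tok.decode(out[0], skip_special_tokens=True))
```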
Checkpointing¶
- Also save the optimizer state if you want to be able to restart (resume) training.
- For LoRA, the base model weights need not be saved; the adapter weights are enough (see the sketch below).
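A sketch of the corresponding `transformers` `TrainingArguments` (values and paths are placeholders): periodic checkpoints include the optimizer and scheduler state, so an interrupted run can be resumed; with a `peft`-wrapped model, `save_pretrained` on the model writes only the adapter weights, not the full base model.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="checkpoints/sft-run",  # placeholder path
    save_strategy="steps",
    save_steps=500,                    # each checkpoint includes optimizer/scheduler state
    save_total_limit=2,
)
# Resume later with: trainer.train(resume_from_checkpoint=True)
```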
Determinism¶
Note that there is quite some nuance to non-determinism when using LLMs; even the batch size you use can matter. Please read this blog post.