
👑 ChemBenchmark

| Rank | Model | Fraction Correct |
| --- | --- | --- |
| 1 | o1 | 0.643 |
| 2 | Claude-3.5 (Sonnet) | 0.626 |
| 3 | Claude-3.5 (Sonnet) React | 0.625 |
| 4 | GPT-4o | 0.611 |
| 5 | Llama-3.1-405B-Instruct | 0.579 |
| 6 | Mistral-Large-2 | 0.57 |
| 7 | Claude-3 (Opus) | 0.57 |
| 8 | PaperQA2 | 0.569 |
| 9 | Llama-3.1-70B-Instruct | 0.534 |
| 10 | Llama-3-70B-Instruct | 0.518 |
| 11 | Llama-3-70B-Instruct (Temperature 1.0) | 0.516 |
| 12 | Llama-3.1-70B-Instruct (Temperature 1.0) | 0.511 |
| 13 | GPT-4o React | 0.505 |
| 14 | Gemma-2-9B-it | 0.482 |
| 15 | Gemma-2-9B-it (Temperature 1.0) | 0.48 |
| 16 | Phi-3-Medium-4k-Instruct | 0.475 |
| 17 | Claude-2 | 0.473 |
| 18 | Llama-3.1-8B-Instruct | 0.472 |
| 19 | GPT-3.5 Turbo | 0.466 |
| 20 | Llama-3.1-8B-Instruct (Temperature 1.0) | 0.461 |
| 21 | Llama-3-8B-Instruct (Temperature 1.0) | 0.457 |
| 22 | Llama-3-8B-Instruct | 0.456 |
| 23 | Gemini-Pro | 0.453 |
| 24 | Command-R+ | 0.448 |
| 25 | Mixtral-8x7b-Instruct | 0.424 |
| 26 | Mixtral-8x7b-Instruct (Temperature 1.0) | 0.418 |
| 27 | GPT-4 | 0.413 |
| 28 | Llama-2-70B Chat | 0.266 |
| 29 | Gemma-1.1-7B-it | 0.192 |
| 30 | Gemma-1.1-7B-it (Temperature 1.0) | 0.191 |
| 31 | Galactica-120b | 0.015 |

Leaderboard Plot

The following plot visualizes the leaderboard by the fraction of correctly answered questions, i.e., the number of correct answers divided by the total number of questions. Models are sorted in descending order of fraction correct.
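To make the metric concrete, here is a minimal sketch of how the fraction correct and the resulting ranking could be computed. The data and variable names below are illustrative assumptions, not part of the benchmark's API.

```python
# Sketch: compute "fraction correct" per model and rank models by it.
# The results dictionary is made-up example data, not real benchmark output.

results = {
    # model name -> per-question outcomes (True = answered correctly)
    "model_a": [True, True, False, True],
    "model_b": [True, False, False, True],
}

def fraction_correct(outcomes):
    """Number of correct answers divided by the total number of questions."""
    return sum(outcomes) / len(outcomes)

# Sort models by fraction correct in descending order, as in the table above.
leaderboard = sorted(
    ((name, fraction_correct(outcomes)) for name, outcomes in results.items()),
    key=lambda item: item[1],
    reverse=True,
)

for rank, (name, score) in enumerate(leaderboard, start=1):
    print(f"{rank}  {name}  {score:.3f}")
```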