# ChemBenchmark
| Rank | Model | Fraction Correct |
| --- | --- | --- |
| 1 | o1 | 0.643 |
| 2 | Claude-3.5 (Sonnet) | 0.626 |
| 3 | Claude-3.5 (Sonnet) React | 0.625 |
| 4 | GPT-4o | 0.611 |
| 5 | Llama-3.1-405B-Instruct | 0.579 |
| 6 | Mistral-Large-2 | 0.570 |
| 7 | Claude-3 (Opus) | 0.570 |
| 8 | PaperQA2 | 0.569 |
| 9 | Llama-3.1-70B-Instruct | 0.534 |
| 10 | Llama-3-70B-Instruct | 0.518 |
| 11 | Llama-3-70B-Instruct (Temperature 1.0) | 0.516 |
| 12 | Llama-3.1-70B-Instruct (Temperature 1.0) | 0.511 |
| 13 | GPT-4o React | 0.505 |
| 14 | Gemma-2-9B-it | 0.482 |
| 15 | Gemma-2-9B-it (Temperature 1.0) | 0.480 |
| 16 | Phi-3-Medium-4k-Instruct | 0.475 |
| 17 | Claude-2 | 0.473 |
| 18 | Llama-3.1-8B-Instruct | 0.472 |
| 19 | GPT-3.5 Turbo | 0.466 |
| 20 | Llama-3.1-8B-Instruct (Temperature 1.0) | 0.461 |
| 21 | Llama-3-8B-Instruct (Temperature 1.0) | 0.457 |
| 22 | Llama-3-8B-Instruct | 0.456 |
| 23 | Gemini-Pro | 0.453 |
| 24 | Command-R+ | 0.448 |
| 25 | Mixtral-8x7b-Instruct | 0.424 |
| 26 | Mixtral-8x7b-Instruct (Temperature 1.0) | 0.418 |
| 27 | GPT-4 | 0.413 |
| 28 | Llama-2-70B Chat | 0.266 |
| 29 | Gemma-1.1-7B-it | 0.192 |
| 30 | Gemma-1.1-7B-it (Temperature 1.0) | 0.191 |
| 31 | Galactica-120b | 0.015 |
## Leaderboard Plot
The following plot shows the leaderboard of the models based on the fraction of correctly answered questions. This fraction is the number of correct answers divided by the total number of answers, and the leaderboard is sorted in descending order of fraction correct.
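The ranking above can be reproduced with a few lines of code. This is a minimal sketch using hypothetical per-question results (the model names and data are placeholders, not the benchmark's actual records): each model's score is its number of correct answers divided by its total number of answers, and models are sorted in descending order of that fraction.

```python
# Hypothetical per-question correctness records for two models.
results = {
    "model_a": [True, True, False, True],
    "model_b": [True, False, False, True],
}

def fraction_correct(answers):
    """Number of correct answers divided by the total number of answers."""
    return sum(answers) / len(answers)

# Sort models in descending order of fraction correct, as in the leaderboard.
leaderboard = sorted(
    ((model, fraction_correct(answers)) for model, answers in results.items()),
    key=lambda row: row[1],
    reverse=True,
)

for rank, (model, frac) in enumerate(leaderboard, start=1):
    print(f"{rank} | {model} | {frac:.3f}")
```

With the placeholder data above, `model_a` ranks first at 0.750 and `model_b` second at 0.500.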