
👑 ChemBenchmark

| Rank | Model | Fraction Correct |
| --- | --- | --- |
| 1 | o1 | 0.643 |
| 2 | Claude-3.5 (Sonnet) | 0.626 |
| 3 | Claude-3.5 (Sonnet) React | 0.625 |
| 4 | GPT-4o | 0.611 |
| 5 | Llama-3.1-405B-Instruct | 0.579 |
| 6 | Mistral-Large-2 | 0.57 |
| 7 | Claude-3 (Opus) | 0.57 |
| 8 | PaperQA2 | 0.569 |
| 9 | Llama-3.1-70B-Instruct | 0.534 |
| 10 | Llama-3-70B-Instruct | 0.518 |
| 11 | Llama-3-70B-Instruct (Temperature 1.0) | 0.516 |
| 12 | Llama-3.1-70B-Instruct (Temperature 1.0) | 0.511 |
| 13 | GPT-4o React | 0.505 |
| 14 | Gemma-2-9B-it | 0.482 |
| 15 | Gemma-2-9B-it (Temperature 1.0) | 0.48 |
| 16 | Phi-3-Medium-4k-Instruct | 0.475 |
| 17 | Claude-2 | 0.473 |
| 18 | Llama-3.1-8B-Instruct | 0.472 |
| 19 | GPT-3.5 Turbo | 0.466 |
| 20 | Llama-3.1-8B-Instruct (Temperature 1.0) | 0.461 |
| 21 | Llama-3-8B-Instruct (Temperature 1.0) | 0.457 |
| 22 | Llama-3-8B-Instruct | 0.456 |
| 23 | Gemini-Pro | 0.453 |
| 24 | Command-R+ | 0.448 |
| 25 | Mixtral-8x7b-Instruct | 0.424 |
| 26 | Mixtral-8x7b-Instruct (Temperature 1.0) | 0.418 |
| 27 | GPT-4 | 0.413 |
| 28 | Llama-2-70B Chat | 0.266 |
| 29 | Gemma-1.1-7B-it | 0.192 |
| 30 | Gemma-1.1-7B-it (Temperature 1.0) | 0.191 |
| 31 | Galactica-120b | 0.015 |

Leaderboard Plot

The following plot visualizes the leaderboard by the fraction of correctly answered questions, i.e., the number of correct answers divided by the total number of questions. Models are sorted in descending order of fraction correct.
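To make the metric concrete, here is a minimal sketch of how the fraction correct and the resulting ranking could be computed. The data and variable names below are illustrative assumptions, not part of the benchmark's API.

```python
# Sketch: compute "fraction correct" per model and rank models by it.
# The results dictionary is made-up example data, not real benchmark output.

results = {
    # model name -> per-question outcomes (True = answered correctly)
    "model_a": [True, True, False, True],
    "model_b": [True, False, False, True],
}

def fraction_correct(outcomes):
    """Number of correct answers divided by the total number of questions."""
    return sum(outcomes) / len(outcomes)

# Sort models by fraction correct in descending order, as in the table above.
leaderboard = sorted(
    ((name, fraction_correct(outcomes)) for name, outcomes in results.items()),
    key=lambda item: item[1],
    reverse=True,
)

for rank, (name, score) in enumerate(leaderboard, start=1):
    print(f"{rank}  {name}  {score:.3f}")
```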