# Model Leaderboard
| Rank | Model | Score |
|---|---|---|
| 1 | Claude-3.5-Sonnet | 0.706 |
| 2 | GPT-4o | 0.567 |
| 3 | Gemini-1.5-Pro | 0.471 |
| 4 | Llama 3.2 90B Vision | 0.406 |
| 5 | Baseline | 0.240 |
## Overall Performance
Average model performance across all tasks. The plot below shows the leaderboard based on each model's fraction of correctly answered questions, sorted in descending order.
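As a sketch, the ranking can be reproduced from the per-model scores: the values below are taken from the table above, while the sorting and formatting logic is an assumption, not the benchmark's actual plotting code.

```python
# Per-model fraction of correctly answered questions (from the table above).
scores = {
    "Claude-3.5-Sonnet": 0.706,
    "GPT-4o": 0.567,
    "Gemini-1.5-Pro": 0.471,
    "Llama 3.2 90B Vision": 0.406,
    "Baseline": 0.240,
}

# Sort models by score in descending order and assign 1-based ranks.
leaderboard = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
for rank, (model, score) in enumerate(leaderboard, start=1):
    print(f"{rank} | {model} | {score:.3f}")
```

The same sorted pairs could be fed to any plotting library to draw the bar chart described here.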