# Model Leaderboard
| Rank | Model | Score |
|---|---|---|
| 1 | Claude-3.5-Sonnet | 0.706 |
| 2 | GPT-4o | 0.567 |
| 3 | Gemini-1.5-Pro | 0.471 |
| 4 | Llama 3.2 90B Vision | 0.406 |
| 5 | Baseline | 0.240 |
## Overall Performance
Average model performance across all tasks. The plot below shows the leaderboard based on each model's fraction of correctly answered questions, sorted in descending order.
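As a sketch, the ranking can be reproduced from the per-model scores: the values below are taken from the table above, while the sorting and formatting logic is an assumption, not the benchmark's actual plotting code.

```python
# Per-model fraction of correctly answered questions (from the table above).
scores = {
    "Claude-3.5-Sonnet": 0.706,
    "GPT-4o": 0.567,
    "Gemini-1.5-Pro": 0.471,
    "Llama 3.2 90B Vision": 0.406,
    "Baseline": 0.240,
}

# Sort models by score in descending order and assign 1-based ranks.
leaderboard = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
for rank, (model, score) in enumerate(leaderboard, start=1):
    print(f"{rank} | {model} | {score:.3f}")
```

The same sorted pairs could be fed to any plotting library to draw the bar chart described here.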