A framework for the science of agents and agents for science

Utilities for researching agent methodologies and for creating, deploying, and evaluating scientific agents and environments.


How Corral Works

A microservice architecture ensuring flexibility, scalability, and robust isolation.

Environments

The "world" the agent interacts with. Defines the task space, available tools, and provides observable feedback. From chemistry labs to HPC clusters.
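As a rough illustration of what an environment bundles together (tools plus observable feedback), here is a minimal sketch. The `ToyEnvironment` class, its `register_tool` and `step` methods, and the `molar_mass_water` tool are hypothetical names chosen for this example; they are not Corral's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyEnvironment:
    """Hypothetical environment: a registry of tools plus a log of observations."""
    tools: Dict[str, Callable[..., object]] = field(default_factory=dict)
    history: List[tuple] = field(default_factory=list)

    def register_tool(self, name: str, fn: Callable[..., object]) -> None:
        self.tools[name] = fn

    def step(self, tool_name: str, **kwargs) -> dict:
        """Run one tool call and return an observation the agent can read."""
        if tool_name not in self.tools:
            obs = {"ok": False, "error": f"unknown tool: {tool_name}"}
        else:
            try:
                obs = {"ok": True, "result": self.tools[tool_name](**kwargs)}
            except Exception as exc:  # report failures as feedback, not a crash
                obs = {"ok": False, "error": str(exc)}
        self.history.append((tool_name, kwargs, obs))
        return obs


env = ToyEnvironment()
env.register_tool("molar_mass_water", lambda: 2 * 1.008 + 15.999)
obs = env.step("molar_mass_water")
missing = env.step("not_a_tool")
```

Returning errors as observations, rather than raising, lets an agent see that a call failed and try something else.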

Agents

Modular entities for perception and decision-making. Built with LLMs using scaffolds like ReAct, ToolCalling, LLMPlanner, and Reflection.
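To make the idea of a scaffold concrete, the following is a minimal ReAct-style loop (thought, action, observation) with a scripted function standing in for the LLM. The names `scripted_policy`, `react_loop`, and the `square` tool are assumptions made for this sketch and do not reflect Corral's actual classes.

```python
from typing import Callable, Dict, List, Tuple

# An action is ("call", tool_name, argument) or ("finish", answer, None).
Action = Tuple[str, str, object]


def scripted_policy(transcript: List[str]) -> Tuple[str, Action]:
    """Stand-in for an LLM: returns (thought, action) given the transcript so far."""
    observations = [line for line in transcript if line.startswith("Observation:")]
    if not observations:
        return "I need the square of 12.", ("call", "square", 12)
    value = observations[-1].split(":", 1)[1].strip()
    return "I have the value, so I can answer.", ("finish", value, None)


def react_loop(policy, tools: Dict[str, Callable], max_steps: int = 5):
    """Alternate thoughts, tool calls, and observations until the policy finishes."""
    transcript: List[str] = []
    for _ in range(max_steps):
        thought, (kind, name, arg) = policy(transcript)
        transcript.append(f"Thought: {thought}")
        if kind == "finish":
            return name, transcript  # for "finish", the second slot holds the answer
        observation = tools[name](arg)
        transcript.append(f"Action: {name}({arg!r})")
        transcript.append(f"Observation: {observation}")
    return None, transcript  # iteration limit reached without an answer


answer, transcript = react_loop(scripted_policy, {"square": lambda x: x * x})
```

In a real scaffold the policy would be an LLM call that reads the transcript; the loop structure stays the same.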

Tasks

Define problems for agents to solve with scoring functions for evaluation. Chain tasks into TaskGroups for complex multi-stage challenges.
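A task paired with a scoring function, and a group that evaluates several tasks in order, might look like the sketch below. `Task`, `TaskGroup`, `exact_match`, and `within_tolerance` are illustrative names and signatures assumed for this example, not Corral's own definitions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    prompt: str
    score: Callable[[str], float]  # maps an agent's answer to a score in [0, 1]


@dataclass
class TaskGroup:
    tasks: List[Task]

    def evaluate(self, answers: List[str]) -> List[float]:
        """Score one answer per task, in stage order."""
        if len(answers) != len(self.tasks):
            raise ValueError("one answer per task is required")
        return [task.score(ans) for task, ans in zip(self.tasks, answers)]


def exact_match(expected: str) -> Callable[[str], float]:
    return lambda answer: 1.0 if answer.strip().lower() == expected.lower() else 0.0


def within_tolerance(expected: float, rel_tol: float) -> Callable[[str], float]:
    def score(answer: str) -> float:
        try:
            value = float(answer)
        except ValueError:  # non-numeric answers score zero
            return 0.0
        return 1.0 if abs(value - expected) <= rel_tol * abs(expected) else 0.0
    return score


group = TaskGroup([
    Task("Give the chemical formula of water.", exact_match("H2O")),
    Task("Give the molar mass of water in g/mol.", within_tolerance(18.015, 0.01)),
])
scores = group.evaluate(["h2o", "18.02"])
bad_scores = group.evaluate(["CO2", "unknown"])
```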

Decoupled Architecture

Corral separates agents from environments via a client-server design with REST API communication.

CorralServer: Hosts and manages environments, provides the interaction interface via CorralRouter.
CorralRunner: Executes agents, orchestrates their lifecycle, feeds observations and relays actions.
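The client-server split can be sketched with nothing but the Python standard library: one thread hosts an environment behind an HTTP endpoint, and a runner loop posts actions and reads observations back. Everything here is an assumption for illustration, including the `/step` route, the JSON payload shape, and the `CounterEnvironment` class; Corral's real endpoints and classes may differ.

```python
import json
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class CounterEnvironment:
    """Toy environment: the state is a counter the agent can increment."""
    def __init__(self):
        self.value = 0

    def step(self, action: dict) -> dict:
        self.value += int(action.get("increment", 0))
        return {"value": self.value, "done": self.value >= 3}


ENV = CounterEnvironment()


class StepHandler(BaseHTTPRequestHandler):
    """Server side: receives an action as JSON, returns an observation as JSON."""
    def do_POST(self):
        if self.path != "/step":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        action = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps(ENV.step(action)).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass


def post_action(host: str, port: int, action: dict) -> dict:
    """Runner side: send one action and return the environment's observation."""
    conn = HTTPConnection(host, port, timeout=5)
    try:
        conn.request("POST", "/step", body=json.dumps(action).encode(),
                     headers={"Content-Type": "application/json"})
        return json.loads(conn.getresponse().read())
    finally:
        conn.close()


server = HTTPServer(("127.0.0.1", 0), StepHandler)  # port 0 picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

observations = []
obs = {"done": False}
while not obs["done"]:  # the runner relays actions until the environment is done
    obs = post_action("127.0.0.1", server.server_port, {"increment": 1})
    observations.append(obs)

server.shutdown()
server.server_close()
```

Because the agent only ever sees JSON over HTTP, the environment can run in a separate process, container, or machine without changes to the agent code.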

[Figure: Corral architecture. A CorralRunner hosting the Agent communicates over a REST API with a CorralServer hosting the CorralRouter and the Environment.]

Environments

Pre-built scientific environments spanning chemistry, physics, materials science, and more.

Foundational Principles

Cite this work

If you use Corral in your research, please consider citing:

@article{ríos-garcía2026ai,
  title   = {AI scientists produce results without reasoning scientifically},
  author  = {Martiño Ríos-García and Nawaf Alampara and Chandan Gupta and Indrajeet Mandal and Sajid Mannan and Ali Asghar Aghajani and N. M. Anoop Krishnan and Kevin Maik Jablonka},
  year    = {2026},
  journal = {arXiv preprint arXiv: 2604.18805}
}

Ready to benchmark?

Start evaluating AI agents on scientific tasks in minutes.


Node Types (epistemological graph)

Hypothesis

Test

Evidence

Judgment

Update

Commitment


Behavioral Markers

Positive

validation_attempt — explicit validation

backtrack_trigger — recognizes dead end, pivots

planning_statement — explicit plan/subgoal

reasoning_statement — joins hypotheses and evidence

correct_submission — follows required format

todo_list — creates/checks a todo list

Neutral

neutral — nothing notable happens

iteration_limit — max iterations reached

Negative

missing_validation — lack of validation

unnecessary_tool_use — unneeded tool use

non_sense — incoherent content

loop_instance — repeated tool pattern

hallucination — fabricated content

wrong_planning — incorrect planning

wrong_reasoning — incorrect reasoning

syntax_error — syntax is incorrect

early_final_answer — premature submission

give_up — agent gives up

inefficient_tool_call — vague/incomplete tool use
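One way to work with these labels programmatically is to map each marker to its polarity and tally the labels in an annotated trace. The marker names below are copied from the list above; the `MARKER_POLARITY` table, the `tally` helper, and the example trace are hypothetical and only sketch how such annotations could be summarized.

```python
from collections import Counter
from typing import Iterable

POSITIVE = ["validation_attempt", "backtrack_trigger", "planning_statement",
            "reasoning_statement", "correct_submission", "todo_list"]
NEUTRAL = ["neutral", "iteration_limit"]
NEGATIVE = ["missing_validation", "unnecessary_tool_use", "non_sense",
            "loop_instance", "hallucination", "wrong_planning",
            "wrong_reasoning", "syntax_error", "early_final_answer",
            "give_up", "inefficient_tool_call"]

# Flat lookup table: marker name -> polarity.
MARKER_POLARITY = {
    **{m: "positive" for m in POSITIVE},
    **{m: "neutral" for m in NEUTRAL},
    **{m: "negative" for m in NEGATIVE},
}


def tally(markers: Iterable[str]) -> Counter:
    """Count annotated markers by polarity, rejecting unknown labels."""
    counts = Counter()
    for marker in markers:
        if marker not in MARKER_POLARITY:
            raise KeyError(f"unknown marker: {marker}")
        counts[MARKER_POLARITY[marker]] += 1
    return counts


# Made-up annotation sequence for a single trace.
example_trace = ["planning_statement", "validation_attempt",
                 "loop_instance", "neutral"]
counts = tally(example_trace)
```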
