Evaluation
This guide provides an overview of how to integrate your own evaluation benchmark into the OpenHands framework.
Setup Environment and LLM Configuration
Please follow instructions here to setup your local development environment.
OpenHands in development mode uses config.toml
to keep track of most configurations.
Here's an example configuration file you can use to define and use multiple LLMs:
[llm]
# IMPORTANT: add your API key here, and set the model to the one you want to evaluate
model = "claude-3-5-sonnet-20241022"
api_key = "sk-XXX"
[llm.eval_gpt4_1106_preview_llm]
model = "gpt-4-1106-preview"
api_key = "XXX"
temperature = 0.0
[llm.eval_some_openai_compatible_model_llm]
model = "openai/MODEL_NAME"
base_url = "https://OPENAI_COMPATIBLE_URL/v1"
api_key = "XXX"
temperature = 0.0
How to use OpenHands in the command line
OpenHands can be run from the command line using the following format:
poetry run python ./openhands/core/main.py \
-i <max_iterations> \
-t "<task_description>" \
-c <agent_class> \
-l <llm_config>
For example:
poetry run python ./openhands/core/main.py \
-i 10 \
-t "Write me a bash script that prints hello world." \
-c CodeActAgent \
-l llm
This command runs OpenHands with:
- A maximum of 10 iterations
- The specified task description
- Using the CodeActAgent
- With the LLM configuration defined in the
llm
section of yourconfig.toml
file
How does OpenHands work
The main entry point for OpenHands is in openhands/core/main.py
. Here's a simplified flow of how it works:
- Parse command-line arguments and load the configuration
- Create a runtime environment using
create_runtime()
- Initialize the specified agent
- Run the controller using
run_controller()
, which:- Attaches the runtime to the agent
- Executes the agent's task
- Returns a final state when complete
The run_controller()
function is the core of OpenHands's execution. It manages the interaction between the agent, the runtime, and the task, handling things like user input simulation and event processing.
Easiest way to get started: Exploring Existing Benchmarks
We encourage you to review the various evaluation benchmarks available in the evaluation/
directory of our repository.
To integrate your own benchmark, we suggest starting with the one that most closely resembles your needs. This approach can significantly streamline your integration process, allowing you to build upon existing structures and adapt them to your specific requirements.
How to create an evaluation workflow
To create an evaluation workflow for your benchmark, follow these steps:
-
Import relevant OpenHands utilities:
import openhands.agenthub
from evaluation.utils.shared import (
EvalMetadata,
EvalOutput,
make_metadata,
prepare_dataset,
reset_logger_for_multiprocessing,
run_evaluation,
)
from openhands.controller.state.state import State
from openhands.core.config import (
AppConfig,
SandboxConfig,
get_llm_config_arg,
parse_arguments,
)
from openhands.core.logger import openhands_logger as logger
from openhands.core.main import create_runtime, run_controller
from openhands.events.action import CmdRunAction
from openhands.events.observation import CmdOutputObservation, ErrorObservation
from openhands.runtime.runtime import Runtime -
Create a configuration:
def get_config(instance: pd.Series, metadata: EvalMetadata) -> AppConfig:
config = AppConfig(
default_agent=metadata.agent_class,
runtime='eventstream',
max_iterations=metadata.max_iterations,
sandbox=SandboxConfig(
base_container_image='your_container_image',
enable_auto_lint=True,
timeout=300,
),
)
config.set_llm_config(metadata.llm_config)
return config -
Initialize the runtime and set up the evaluation environment:
def initialize_runtime(runtime: Runtime, instance: pd.Series):
# Set up your evaluation environment here
# For example, setting environment variables, preparing files, etc.
pass -
Create a function to process each instance:
from openhands.utils.async_utils import call_async_from_sync
def process_instance(instance: pd.Series, metadata: EvalMetadata) -> EvalOutput:
config = get_config(instance, metadata)
runtime = create_runtime(config)
call_async_from_sync(runtime.connect)
initialize_runtime(runtime, instance)
instruction = get_instruction(instance, metadata)
state = run_controller(
config=config,
task_str=instruction,
runtime=runtime,
fake_user_response_fn=your_user_response_function,
)
# Evaluate the agent's actions
evaluation_result = await evaluate_agent_actions(runtime, instance)
return EvalOutput(
instance_id=instance.instance_id,
instruction=instruction,
test_result=evaluation_result,
metadata=metadata,
history=compatibility_for_eval_history_pairs(state.history),
metrics=state.metrics.get() if state.metrics else None,
error=state.last_error if state and state.last_error else None,
) -
Run the evaluation:
metadata = make_metadata(llm_config, dataset_name, agent_class, max_iterations, eval_note, eval_output_dir)
output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
instances = prepare_dataset(your_dataset, output_file, eval_n_limit)
await run_evaluation(
instances,
metadata,
output_file,
num_workers,
process_instance
)
This workflow sets up the configuration, initializes the runtime environment, processes each instance by running the agent and evaluating its actions, and then collects the results into an EvalOutput
object. The run_evaluation
function handles parallelization and progress tracking.
Remember to customize the get_instruction
, your_user_response_function
, and evaluate_agent_actions
functions according to your specific benchmark requirements.
By following this structure, you can create a robust evaluation workflow for your benchmark within the OpenHands framework.
Understanding the user_response_fn
The user_response_fn
is a crucial component in OpenHands's evaluation workflow. It simulates user interaction with the agent, allowing for automated responses during the evaluation process. This function is particularly useful when you want to provide consistent, predefined responses to the agent's queries or actions.
Workflow and Interaction
The correct workflow for handling actions and the user_response_fn
is as follows:
- Agent receives a task and starts processing
- Agent emits an Action
- If the Action is executable (e.g., CmdRunAction, IPythonRunCellAction):
- The Runtime processes the Action
- Runtime returns an Observation
- If the Action is not executable (typically a MessageAction):
- The
user_response_fn
is called - It returns a simulated user response
- The
- The agent receives either the Observation or the simulated response
- Steps 2-5 repeat until the task is completed or max iterations are reached
Here's a more accurate visual representation:
[Agent]
|
v
[Emit Action]
|
v
[Is Action Executable?]
/ \
Yes No
| |
v v
[Runtime] [user_response_fn]
| |
v v
[Return Observation] [Simulated Response]
\ /
\ /
v v
[Agent receives feedback]
|
v
[Continue or Complete Task]
In this workflow:
- Executable actions (like running commands or executing code) are handled directly by the Runtime
- Non-executable actions (typically when the agent wants to communicate or ask for clarification) are handled by the
user_response_fn
- The agent then processes the feedback, whether it's an Observation from the Runtime or a simulated response from the
user_response_fn
This approach allows for automated handling of both concrete actions and simulated user interactions, making it suitable for evaluation scenarios where you want to test the agent's ability to complete tasks with minimal human intervention.
Example Implementation
Here's an example of a user_response_fn
used in the SWE-Bench evaluation:
def codeact_user_response(state: State | None) -> str:
msg = (
'Please continue working on the task on whatever approach you think is suitable.\n'
'If you think you have solved the task, please first send your answer to user through message and then <execute_bash> exit </execute_bash>.\n'
'IMPORTANT: YOU SHOULD NEVER ASK FOR HUMAN HELP.\n'
)
if state and state.history:
# check if the agent has tried to talk to the user 3 times, if so, let the agent know it can give up
user_msgs = [
event
for event in state.history
if isinstance(event, MessageAction) and event.source == 'user'
]
if len(user_msgs) >= 2:
# let the agent know that it can give up when it has tried 3 times
return (
msg
+ 'If you want to give up, run: <execute_bash> exit </execute_bash>.\n'
)
return msg
This function does the following:
- Provides a standard message encouraging the agent to continue working
- Checks how many times the agent has attempted to communicate with the user
- If the agent has made multiple attempts, it provides an option to give up
By using this function, you can ensure consistent behavior across multiple evaluation runs and prevent the agent from getting stuck waiting for human input.