This guide provides an overview of how to integrate your own evaluation benchmark into the OpenHands framework.
Please follow the instructions here to set up your local development environment. In development mode, OpenHands uses `config.toml` to keep track of most configurations.

Here’s an example configuration file you can use to define and use multiple LLMs:
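The values below are a sketch with placeholder model names, keys, and group names (`eval_gpt4o` and `eval_local_model` are arbitrary labels, not required values); the `[llm]` table is the default model and each `[llm.<name>]` table is a named configuration you can select for a particular run:

```toml
# Illustrative sketch only; model names, keys, and group names are placeholders.
[llm]
# Default LLM used when no named config is selected.
model = "gpt-4o"
api_key = "sk-..."
temperature = 0.0

# A named LLM config; select it by name (here "eval_gpt4o") when running.
[llm.eval_gpt4o]
model = "gpt-4o"
api_key = "sk-..."
temperature = 0.0

# Another named config, e.g. an OpenAI-compatible endpoint.
[llm.eval_local_model]
model = "openai/my-local-model"
base_url = "http://localhost:8000/v1"
api_key = "dummy"
temperature = 0.0
```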
OpenHands can be run directly from the command line.
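A sketch of a typical invocation is shown below; treat the flags as assumptions to verify against the argument parser in `openhands/core/main.py` for your checkout (`-i` caps the number of iterations, `-t` supplies the task, `-c` names the agent class, and `-l` selects an LLM config group from `config.toml`):

```bash
# Illustrative invocation; flag names may differ between versions.
poetry run python ./openhands/core/main.py \
    -i 10 \
    -t "Write me a bash script that prints hello world." \
    -c CodeActAgent \
    -l llm
```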
This command runs OpenHands with the LLM configuration defined in the `llm` section of your `config.toml` file.

The main entry point for OpenHands is in `openhands/core/main.py`. Here’s a simplified flow of how it works:

1. A runtime environment is created with `create_runtime()`
2. The agent is run against the task with `run_controller()`
The `run_controller()` function is the core of OpenHands’s execution. It manages the interaction between the agent, the runtime, and the task, handling things like user input simulation and event processing.
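As a rough sketch of the same flow driven programmatically (exact signatures vary between versions; `initial_user_action` versus an older `task_str` argument, and how `runtime.connect` is invoked, are version-dependent assumptions):

```python
# Rough sketch of the flow in openhands/core/main.py; treat argument names as
# version-dependent assumptions (older releases used task_str instead of
# initial_user_action).
import asyncio

from openhands.core.config import AppConfig  # named OpenHandsConfig in newer releases
from openhands.core.main import create_runtime, run_controller
from openhands.events.action import MessageAction
from openhands.utils.async_utils import call_async_from_sync

config = AppConfig(default_agent='CodeActAgent', max_iterations=10)

runtime = create_runtime(config)        # 1. build the sandboxed runtime
call_async_from_sync(runtime.connect)   # connect to it before use

state = asyncio.run(                    # 2. run the agent until it stops
    run_controller(
        config=config,
        initial_user_action=MessageAction(content='Write a bash script that prints hello world.'),
        runtime=runtime,
    )
)
print(state.agent_state if state else 'no final state')
```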
We encourage you to review the various evaluation benchmarks available in the `evaluation/benchmarks/` directory of our repository.
To integrate your own benchmark, we suggest starting with the one that most closely resembles your needs. This approach can significantly streamline your integration process, allowing you to build upon existing structures and adapt them to your specific requirements.
To create an evaluation workflow for your benchmark, follow these steps (a consolidated sketch of all five steps follows the list):

1. Import the relevant OpenHands utilities.
2. Create a configuration.
3. Initialize the runtime and set up the evaluation environment.
4. Create a function to process each instance.
5. Run the evaluation.
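The skeleton below ties these five steps together. It is a sketch under stated assumptions rather than a drop-in harness: `get_instruction`, `initialize_runtime`, `evaluate_agent_actions`, and `your_user_response_function` are placeholders you implement for your benchmark, the container image and dataset are placeholders, and details such as the `run_controller` argument names (`initial_user_action` here, `task_str` in older releases) and the `AppConfig` class name vary between OpenHands versions.

```python
# Skeleton of a benchmark evaluation workflow. Everything marked "customize" or
# "placeholder" is an illustrative assumption; exact APIs may differ between versions.
import asyncio
import os

import pandas as pd

from evaluation.utils.shared import (
    EvalMetadata,
    EvalOutput,
    make_metadata,
    prepare_dataset,
    run_evaluation,
)
from openhands.core.config import AppConfig, SandboxConfig, get_llm_config_arg
from openhands.core.main import create_runtime, run_controller
from openhands.events.action import MessageAction
from openhands.utils.async_utils import call_async_from_sync


def get_config(instance: pd.Series, metadata: EvalMetadata) -> AppConfig:
    # Step 2: build a per-instance configuration.
    config = AppConfig(
        default_agent=metadata.agent_class,
        max_iterations=metadata.max_iterations,
        sandbox=SandboxConfig(base_container_image='your_container_image'),
    )
    config.set_llm_config(metadata.llm_config)
    return config


def initialize_runtime(runtime, instance: pd.Series) -> None:
    # Step 3: prepare the evaluation environment (copy files, set env vars, ...).
    pass


def get_instruction(instance: pd.Series, metadata: EvalMetadata) -> str:
    # Customize: turn a dataset row into the task text given to the agent.
    return f'Solve benchmark instance {instance.instance_id}.'


def your_user_response_function(state) -> str:
    # Customize: simulated user reply (see the user_response_fn section below).
    return 'Please continue working on the task on your own.'


def evaluate_agent_actions(runtime, instance: pd.Series) -> dict:
    # Customize: inspect the runtime/workspace and score the agent's work.
    return {'success': False}


def process_instance(instance: pd.Series, metadata: EvalMetadata, reset_logger: bool = True) -> EvalOutput:
    # Step 4: run the agent on one instance and collect the result.
    config = get_config(instance, metadata)
    runtime = create_runtime(config)
    call_async_from_sync(runtime.connect)
    initialize_runtime(runtime, instance)

    instruction = get_instruction(instance, metadata)
    state = asyncio.run(
        run_controller(
            config=config,
            # Older releases took task_str=instruction instead.
            initial_user_action=MessageAction(content=instruction),
            runtime=runtime,
            fake_user_response_fn=your_user_response_function,
        )
    )
    test_result = evaluate_agent_actions(runtime, instance)

    return EvalOutput(
        instance_id=str(instance.instance_id),
        instruction=instruction,
        test_result=test_result,
        metadata=metadata,
        history=[],  # serialize state.history however your benchmark needs
        metrics=state.metrics.get() if state and state.metrics else None,
        error=state.last_error if state and state.last_error else None,
    )


if __name__ == '__main__':
    # Step 5: fan the instances out over parallel workers.
    llm_config = get_llm_config_arg('eval_gpt4o')  # a named [llm.<name>] group
    your_dataset = pd.DataFrame([{'instance_id': 'example-0'}])  # placeholder

    metadata = make_metadata(llm_config, 'your-benchmark', 'CodeActAgent', 30, '', 'evaluation/outputs')
    output_file = os.path.join(metadata.eval_output_dir, 'output.jsonl')
    instances = prepare_dataset(your_dataset, output_file, 1)
    run_evaluation(instances, metadata, output_file, 1, process_instance)
```

Each instance gets its own configuration and runtime, which keeps runs isolated and lets `run_evaluation` parallelize them and track progress.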
This workflow sets up the configuration, initializes the runtime environment, processes each instance by running the agent and evaluating its actions, and then collects the results into an `EvalOutput` object. The `run_evaluation` function handles parallelization and progress tracking.

Remember to customize the `get_instruction`, `your_user_response_function`, and `evaluate_agent_actions` functions according to your specific benchmark requirements.
By following this structure, you can create a robust evaluation workflow for your benchmark within the OpenHands framework.
## `user_response_fn`
The `user_response_fn` is a crucial component in OpenHands’s evaluation workflow. It simulates user interaction with the agent, allowing for automated responses during the evaluation process. This function is particularly useful when you want to provide consistent, predefined responses to the agent’s queries or actions.
The workflow for handling actions and the `user_response_fn` is as follows:

1. The agent emits an action.
2. If the action is a concrete, executable one (such as running a command or code), the runtime executes it and returns an observation.
3. If the agent instead addresses the user (for example, to ask a question), the `user_response_fn` is called and returns a simulated user response.
4. The agent processes whichever feedback it receives and continues until the task is complete or the iteration limit is reached.

In this workflow, concrete actions are handled directly by the runtime, while messages to the user are routed to the `user_response_fn`.
This approach allows for automated handling of both concrete actions and simulated user interactions, making it suitable for evaluation scenarios where you want to test the agent’s ability to complete tasks with minimal human intervention.
In the SWE-Bench evaluation, for example, the `user_response_fn` returns a canned message that tells the agent to keep working on the task on its own rather than wait for human input.
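A minimal sketch of such a function is shown below. It is not the verbatim SWE-Bench implementation: the message wording, the two-reply threshold for letting the agent give up, and the assumption that `state.history` is an iterable of events are illustrative choices.

```python
from openhands.controller.state.state import State
from openhands.events.action import MessageAction


def fake_user_response(state: State | None) -> str:
    # Canned reply: push the agent to keep working instead of waiting for a human.
    msg = (
        'Please continue working on the task using the approach you think is best.\n'
        'When you are confident the task is solved, report your answer and finish.\n'
        'IMPORTANT: YOU SHOULD NEVER ASK FOR HUMAN HELP.\n'
    )
    if state and state.history:
        # Count how often the fake user has already had to reply.
        # (Illustrative: assumes history is an iterable of events.)
        user_msgs = [
            event
            for event in state.history
            if isinstance(event, MessageAction) and event.source == 'user'
        ]
        if len(user_msgs) >= 2:
            # After repeated back-and-forth, let the agent stop gracefully.
            return msg + 'If you cannot make further progress, you may give up and finish.\n'
    return msg
```

The pattern is simple: always nudge the agent to continue, and once the fake user has had to reply a few times, give the agent an explicit way out so runs terminate.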
By using this function, you can ensure consistent behavior across multiple evaluation runs and prevent the agent from getting stuck waiting for human input.