OpenHands uses config.toml to keep track of most configurations.
Here’s an example configuration file you can use to define and use multiple LLMs:
The LLM settings are read from the llm section of your config.toml file.
The entry point is openhands/core/main.py. Here’s a simplified flow of how it works:
Create a runtime environment with create_runtime().
Run the controller with run_controller().
The run_controller() function is the core of OpenHands’s execution. It manages the interaction between the agent, the runtime, and the task, handling things like user input simulation and event processing.
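To make that division of labour concrete, here is a toy loop in the spirit of the description above. It is not OpenHands code: Action, ToyAgent, toy_runtime_execute, and toy_run_controller are hypothetical stand-ins, and the real run_controller() signature and event types are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Action:
    kind: str      # "run" (executable), "message", or "finish"
    content: str


class ToyAgent:
    """Hypothetical agent: runs one command, asks one question, then finishes."""

    def __init__(self) -> None:
        self._script = [
            Action("run", "echo hello"),
            Action("message", "Should I continue?"),
            Action("finish", "done"),
        ]

    def step(self, last_input: str) -> Action:
        # A real agent would decide based on last_input; this one follows a script.
        return self._script.pop(0)


def toy_runtime_execute(command: str) -> str:
    """Stand-in for a runtime: pretends to execute and returns an observation."""
    return f"observation: ran {command!r}"


def toy_run_controller(
    agent: ToyAgent,
    task: str,
    user_response_fn: Callable[[str], str],
    max_steps: int = 10,
) -> list[str]:
    """Drive the agent until it finishes, recording every event as a string."""
    events = [f"task: {task}"]
    last_input = task
    for _ in range(max_steps):
        action = agent.step(last_input)
        events.append(f"action: {action.kind}: {action.content}")
        if action.kind == "finish":
            break
        if action.kind == "run":
            # Executable actions go to the runtime.
            last_input = toy_runtime_execute(action.content)
        else:
            # Messages are answered by the simulated user.
            last_input = user_response_fn(action.content)
        events.append(last_input)
    return events


events = toy_run_controller(
    ToyAgent(), "say hello", lambda msg: "user: please continue"
)
for event in events:
    print(event)
```

The point of the sketch is the branching: executable actions produce observations from the runtime, while plain messages are routed to a simulated user.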
Existing benchmarks live in the evaluation/benchmarks/ directory of our repository.
To integrate your own benchmark, we suggest starting with the one that most closely resembles your needs. This approach can significantly streamline your integration process, allowing you to build upon existing structures and adapt them to your specific requirements.
The result for each instance is collected into an EvalOutput object. The run_evaluation function handles parallelization and progress tracking.
Remember to customize the get_instruction, your_user_response_function, and evaluate_agent_actions functions according to your specific benchmark requirements.
By following this structure, you can create a robust evaluation workflow for your benchmark within the OpenHands framework.
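The overall shape of such a workflow can be sketched as follows. The three function names get_instruction, your_user_response_function, and evaluate_agent_actions come from the text above, but their bodies here are placeholders. ToyEvalOutput, fake_agent_run, and toy_run_evaluation are hypothetical stand-ins for EvalOutput, the agent run, and run_evaluation; the real OpenHands signatures are not reproduced and may differ.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class ToyEvalOutput:
    """Hypothetical stand-in for EvalOutput: one record per benchmark instance."""
    instance_id: str
    instruction: str
    test_result: dict = field(default_factory=dict)


def get_instruction(instance: dict) -> str:
    # Benchmark-specific: turn a dataset row into a prompt for the agent.
    return f"Solve the following task: {instance['question']}"


def your_user_response_function(agent_message: str) -> str:
    # Benchmark-specific: canned reply whenever the agent asks the user something.
    return "Please continue working on the task."


def evaluate_agent_actions(instance: dict, answer: str) -> dict:
    # Benchmark-specific: score the agent's final answer.
    return {"correct": answer.strip() == instance["expected"]}


def fake_agent_run(instruction: str, user_response_fn) -> str:
    """Placeholder for running the agent; it always 'answers' 4."""
    user_response_fn("Any hints?")
    return "4"


def process_instance(instance: dict) -> ToyEvalOutput:
    instruction = get_instruction(instance)
    answer = fake_agent_run(instruction, your_user_response_function)
    result = evaluate_agent_actions(instance, answer)
    return ToyEvalOutput(instance["id"], instruction, result)


def toy_run_evaluation(instances, process_fn, num_workers: int = 2):
    """Process instances in parallel and print simple progress lines."""
    results = []
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map yields results in input order as they become available.
        for done, out in enumerate(pool.map(process_fn, instances), start=1):
            print(f"[{done}/{len(instances)}] finished {out.instance_id}")
            results.append(out)
    return results


dataset = [
    {"id": "a", "question": "2 + 2?", "expected": "4"},
    {"id": "b", "question": "3 + 3?", "expected": "6"},
]
outputs = toy_run_evaluation(dataset, process_instance)
```

Only the three benchmark-specific functions would normally change between benchmarks; the per-instance processing and the parallel driver stay the same.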
user_response_fn
The user_response_fn is a crucial component in OpenHands’s evaluation workflow. It simulates user interaction with the agent, allowing for automated responses during the evaluation process. This function is particularly useful when you want to provide consistent, predefined responses to the agent’s queries or actions.
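A minimal sketch of such a function is shown below. It assumes the function receives the agent's latest message as a string and returns the simulated reply as a string. The real signature in OpenHands may take a state object instead, so the parameter, the factory make_user_response_fn, and the canned texts are all assumptions for illustration.

```python
from typing import Callable


def make_user_response_fn(max_turns: int = 2) -> Callable[[str], str]:
    """Build a simulated user that nudges the agent, then asks it to wrap up.

    Hypothetical helper: the turn counter lives in the closure so each
    evaluation instance can get its own fresh simulated user.
    """
    turns = 0

    def user_response_fn(agent_message: str) -> str:
        nonlocal turns
        turns += 1
        if turns > max_turns:
            # After enough nudges, push the agent to conclude.
            return "If you think you have solved the task, please finish the interaction."
        # Default canned reply: keep going without asking for help.
        return "Please continue working on the task. Do NOT ask for human help."

    return user_response_fn


respond = make_user_response_fn(max_turns=2)
replies = [respond("What should I do next?") for _ in range(3)]
for reply in replies:
    print(reply)
```

Because the replies depend only on the turn count, every run of the benchmark sees the same simulated user, which keeps results comparable across agents.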
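The workflow involving user_response_fn is as follows:
When the agent asks for user input instead of emitting an executable action, user_response_fn is called, and its return value is passed back to the agent as the simulated user reply.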
Here’s an example of a user_response_fn used in the SWE-Bench evaluation: