evaluation.benchmarks.discoverybench.run_infer
get_dv_query_for_real
def get_dv_query_for_real(datasets,
                          question,
                          domain_knowledge=None,
                          workflow_tags=None)
Prepare a structured query for the agent to execute on the specified datasets.
This function constructs a query by compiling metadata from the provided datasets, along with any relevant domain knowledge and workflow tags.
Arguments:
datasets
- List of datasets
question
- Query to be answered
domain_knowledge
- Domain knowledge, if any
workflow_tags
- Workflow tags, if any
Returns:
query_to_dv
- Query to be run on the dataset
dataset_meta
- Metadata of the dataset
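A minimal usage sketch; the dataset metadata schema shown here is illustrative, not the exact DiscoveryBench format:

from evaluation.benchmarks.discoverybench.run_infer import get_dv_query_for_real

# Hypothetical dataset metadata; field names are assumptions for illustration.
datasets = [
    {
        'name': 'ses_bmi.csv',
        'description': 'Socioeconomic status and BMI records',
        'columns': [{'name': 'income', 'description': 'Annual income in USD'}],
    }
]
query_to_dv, dataset_meta = get_dv_query_for_real(
    datasets=datasets,
    question='Is socioeconomic status associated with BMI?',
    domain_knowledge='SES is commonly proxied by income and education.',
    workflow_tags='regression',
)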
initialize_runtime
def initialize_runtime(runtime: Runtime, data_files: list[str])
Initialize the runtime for the agent.
This function is called before the runtime is used to run the agent, and makes the instance's data files available inside the sandbox.
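A plausible sketch of what this initialization does, assuming the OpenHands Runtime API (CmdRunAction, run_action, copy_to) used by other benchmark harnesses; the sandbox path is illustrative, and this is not the exact implementation:

from openhands.events.action import CmdRunAction

def initialize_runtime(runtime: Runtime, data_files: list[str]):
    # Create a working directory inside the sandbox for the data files.
    runtime.run_action(CmdRunAction(command='mkdir -p /workspace/data'))
    # Copy each local data file into the sandbox so the agent can read it.
    for data_file in data_files:
        runtime.copy_to(data_file, '/workspace/data')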
process_instance
def process_instance(instance: pd.Series,
                     metadata: EvalMetadata,
                     reset_logger: bool = True)
Process and evaluate a single instance of the dataset.
This function executes the OpenHands agent for a specific instance of the dataset. It retrieves the agent's results and evaluates them against the gold hypothesis.
Arguments:
instance
- A single row of the dataset
metadata
- Metadata for the evaluation
reset_logger
- Whether to reset the logger
Returns:
output
- EvalOutput object
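A hypothetical single-instance run; it assumes the dataset DataFrame comes from create_dataset (below) and that an EvalMetadata object named metadata has already been constructed for this run:

from evaluation.benchmarks.discoverybench.run_infer import create_dataset, process_instance

df = create_dataset('/path/to/discoverybench', split='test')  # placeholder path
instance = df.iloc[0]
# metadata: an EvalMetadata built elsewhere for this evaluation run.
output = process_instance(instance, metadata, reset_logger=False)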
create_dataset
def create_dataset(repo_location: str, split: str = 'test')
Create a dataset from the discoverybench repository by walking through the repository and extracting metadata from the metadata_*.json files.
Arguments:
repo_location
- Location of the repository
split
- Split of the dataset to use
Returns:
df
- DataFrame containing the dataset instances
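An illustrative sketch of the repository walk, under stated assumptions: the metadata file naming convention and the fields collected per row are guesses, not the exact implementation.

import json
import os

import pandas as pd

def create_dataset_sketch(repo_location: str, split: str = 'test') -> pd.DataFrame:
    rows = []
    # Walk the chosen split of the repository and collect one row per task.
    for root, _, files in os.walk(os.path.join(repo_location, split)):
        for file_name in files:
            # Assumed naming convention: one metadata_<id>.json per task.
            if file_name.startswith('metadata_') and file_name.endswith('.json'):
                with open(os.path.join(root, file_name)) as f:
                    meta = json.load(f)
                rows.append({'task_dir': root, 'metadata': meta})
    return pd.DataFrame(rows)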