evaluation.benchmarks.discoverybench.run_infer

get_dv_query_for_real

def get_dv_query_for_real(datasets,
                          question,
                          domain_knowledge=None,
                          workflow_tags=None)

Prepare a structured query for the agent to execute on the specified datasets.

This function constructs a query by compiling metadata from the provided datasets, along with any relevant domain knowledge and workflow tags.

Arguments:

  • datasets - List of datasets to query
  • question - The question to be answered
  • domain_knowledge - Optional domain knowledge to include in the query
  • workflow_tags - Optional workflow tags to include in the query

Returns:

  • query_to_dv - Query to be run on the dataset
  • dataset_meta - Metadata of the dataset
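
For orientation, a hypothetical call is sketched below. The dataset entry structure and all field values are placeholders, not the exact DiscoveryBench schema:

from evaluation.benchmarks.discoverybench.run_infer import get_dv_query_for_real

# Placeholder dataset metadata; real entries come from the DiscoveryBench
# metadata files, so these field names and values are assumptions.
datasets = [
    {
        'name': 'ses_survey.csv',
        'description': 'Socio-economic survey responses',
        'columns': [{'name': 'education_years', 'description': 'Years of schooling'}],
    }
]

query_to_dv, dataset_meta = get_dv_query_for_real(
    datasets=datasets,
    question='Does education level predict income?',
    domain_knowledge='Incomes are inflation-adjusted to 2010 dollars.',
    workflow_tags='regression',
)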

initialize_runtime

def initialize_runtime(runtime: Runtime, data_files: list[str])

Initialize the runtime for the agent.

This function is called before the runtime is used to run the agent.

Arguments:

  • runtime - The runtime to initialize
  • data_files - List of data file paths to make available to the agent
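
A sketch of what this step typically involves in OpenHands-style harnesses, assuming the Runtime.copy_to and CmdRunAction APIs; the exact commands and paths here are illustrative, not the benchmark's actual implementation:

from openhands.events.action import CmdRunAction

def initialize_runtime_sketch(runtime, data_files):
    # Create a working directory inside the sandbox (path is an assumption).
    runtime.run_action(CmdRunAction(command='mkdir -p /workspace/data'))
    # Copy each data file from the host into the sandbox so the agent can
    # read it during the run.
    for path in data_files:
        runtime.copy_to(path, '/workspace/data')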

process_instance

def process_instance(instance: pd.Series,
                     metadata: EvalMetadata,
                     reset_logger: bool = True)

Process and evaluate a single instance of the dataset.

This function executes the OpenHands agent for a specific instance of the dataset. It retrieves the agent's results and evaluates them against the gold hypothesis.

Arguments:

  • instance - A single row of the dataset
  • metadata - Metadata for the evaluation
  • reset_logger - Whether to reset the logger

Returns:

  • output - EvalOutput object
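
The function is designed to be driven row by row over the dataset. A hedged sketch of that contract follows; the real harness uses its own shared driver with batching and output files, so this loop is only illustrative:

import pandas as pd

def evaluate_all(dataset: pd.DataFrame, metadata) -> list:
    # Each row of the dataset is one benchmark task; process_instance runs
    # the agent on it and returns an EvalOutput with the graded result.
    outputs = []
    for _, instance in dataset.iterrows():
        outputs.append(process_instance(instance, metadata, reset_logger=True))
    return outputs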

create_dataset

def create_dataset(repo_location: str, split: str = 'test')

Create a dataset from the DiscoveryBench repository by walking the repository tree and extracting metadata from the metadata_*.json files.

Arguments:

  • repo_location - Location of the repository
  • split - Split of the dataset to use

Returns:

  • df - DataFrame containing the dataset instances
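
Putting the pieces together, a hypothetical end-to-end call; the repository path below is a placeholder for a local checkout of the DiscoveryBench repository:

df = create_dataset(repo_location='/path/to/discoverybench', split='test')
print(len(df), 'instances loaded')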