evaluation.benchmarks.discoverybench.run_infer

get_dv_query_for_real

def get_dv_query_for_real(datasets,
                          question,
                          domain_knowledge=None,
                          workflow_tags=None)

Prepare a structured query for the agent to execute on the specified datasets.

This function constructs a query by compiling metadata from the provided datasets, along with any relevant domain knowledge and workflow tags.

Arguments:

  • datasets - List of datasets to query
  • question - The question to be answered
  • domain_knowledge - Optional domain knowledge to include in the query
  • workflow_tags - Optional workflow tags to include in the query

Returns:

  • query_to_dv - Query to be run on the dataset
  • dataset_meta - Metadata of the dataset
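
For orientation, a hypothetical call is sketched below. The dataset entry structure and all field values are placeholders, not the exact DiscoveryBench schema:

from evaluation.benchmarks.discoverybench.run_infer import get_dv_query_for_real

# Placeholder dataset metadata; real entries come from the DiscoveryBench
# metadata files, so these field names and values are assumptions.
datasets = [
    {
        'name': 'ses_survey.csv',
        'description': 'Socio-economic survey responses',
        'columns': [{'name': 'education_years', 'description': 'Years of schooling'}],
    }
]

query_to_dv, dataset_meta = get_dv_query_for_real(
    datasets=datasets,
    question='Does education level predict income?',
    domain_knowledge='Incomes are inflation-adjusted to 2010 dollars.',
    workflow_tags='regression',
)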

initialize_runtime

def initialize_runtime(runtime: Runtime, data_files: list[str])

Initialize the runtime for the agent.

This function is called before the runtime is used to run the agent.

Arguments:

  • runtime - The runtime to initialize
  • data_files - List of data file paths to make available to the agent
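
A sketch of what this step typically involves in OpenHands-style harnesses, assuming the Runtime.copy_to and CmdRunAction APIs; the exact commands and paths here are illustrative, not the benchmark's actual implementation:

from openhands.events.action import CmdRunAction

def initialize_runtime_sketch(runtime, data_files):
    # Create a working directory inside the sandbox (path is an assumption).
    runtime.run_action(CmdRunAction(command='mkdir -p /workspace/data'))
    # Copy each data file from the host into the sandbox so the agent can
    # read it during the run.
    for path in data_files:
        runtime.copy_to(path, '/workspace/data')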

process_instance

def process_instance(instance: pd.Series,
                     metadata: EvalMetadata,
                     reset_logger: bool = True)

Process and evaluate a single instance of the dataset.

This function executes the OpenHands agent for a specific instance of the dataset. It retrieves the agent's results and evaluates them against the gold hypothesis.

Arguments:

  • instance - A single row of the dataset
  • metadata - Metadata for the evaluation
  • reset_logger - Whether to reset the logger

Returns:

  • output - EvalOutput object
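
The function is designed to be driven row by row over the dataset. A hedged sketch of that contract follows; the real harness uses its own shared driver with batching and output files, so this loop is only illustrative:

import pandas as pd

def evaluate_all(dataset: pd.DataFrame, metadata) -> list:
    # Each row of the dataset is one benchmark task; process_instance runs
    # the agent on it and returns an EvalOutput with the graded result.
    outputs = []
    for _, instance in dataset.iterrows():
        outputs.append(process_instance(instance, metadata, reset_logger=True))
    return outputs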

create_dataset

def create_dataset(repo_location: str, split: str = 'test')

Create a dataset from the DiscoveryBench repository by walking the repository tree and extracting metadata from the metadata_*.json files.

Arguments:

  • repo_location - Location of the repository
  • split - Split of the dataset to use

Returns:

  • df - DataFrame containing the dataset instances
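
Putting the pieces together, a hypothetical end-to-end call; the repository path below is a placeholder for a local checkout of the DiscoveryBench repository:

df = create_dataset(repo_location='/path/to/discoverybench', split='test')
print(len(df), 'instances loaded')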