evaluation.benchmarks.ml_bench.run_infer

Implements evaluation of agents on ML-Bench, a benchmark for assessing the effectiveness of Large Language Models (LLMs) in leveraging existing functions in open-source libraries for machine learning tasks. The benchmark is introduced in the paper "ML-Bench: Evaluating Large Language Models for Code Generation in Repository-Level Machine Learning Tasks" (https://arxiv.org/abs/2311.09835).

Please see https://ghcr.io/super-dainiu/ml_bench and https://huggingface.co/datasets/super-dainiu/ml-bench for more details on the Docker image and dataset used in this evaluation script.
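
As a quick sanity check, the Hugging Face dataset linked above can be loaded and inspected directly. This is only a sketch; no split or configuration name is assumed here, so the available splits should be confirmed on the dataset card.

from datasets import load_dataset

# Load whatever splits the dataset exposes and inspect the first example.
ds = load_dataset("super-dainiu/ml-bench")
print(ds)                      # shows the available splits and their sizes
first_split = next(iter(ds))
print(ds[first_split][0])      # first example of the first split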

TODOs:

  • Support additional evaluation settings, such as providing raw README content or using a retriever to extract relevant segments.
  • Clean up the code and docker image used for evaluation.

initialize_runtime

def initialize_runtime(runtime: Runtime, instance: pd.Series)

Initialize the runtime for the agent.

This function is called before the runtime is used to run the agent.
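
A minimal sketch of what such an initialization might look like, assuming the general OpenHands CmdRunAction/run_action runtime interface. The specific command and the instance field it reads ("github_id") are illustrative assumptions, not the actual ML-Bench setup.

import pandas as pd

from openhands.events.action import CmdRunAction
from openhands.runtime.base import Runtime


def initialize_runtime(runtime: Runtime, instance: pd.Series):
    """Prepare the sandbox before the agent starts on this instance."""
    # Illustrative setup step: move into the repository the task targets.
    # The 'github_id' field name is an assumption about the instance schema.
    action = CmdRunAction(command=f'cd /workspace/{instance["github_id"]}')
    obs = runtime.run_action(action)
    assert obs.exit_code == 0, f'Setup command failed: {obs.content}'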

complete_runtime

def complete_runtime(runtime: Runtime, instance: pd.Series) -> dict[str, Any]

Complete the runtime for the agent.

This function is called after the agent has finished running, before the runtime is closed. If you need to do something in the sandbox to compute the correctness metric after the agent has run, modify this function.
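
A hedged sketch of a possible completion step, again assuming the CmdRunAction/run_action interface: run a command in the sandbox after the agent has finished and return the pieces needed to score the instance. The file path and the returned keys are illustrative, not the real ML-Bench evaluation logic.

from typing import Any

import pandas as pd

from openhands.events.action import CmdRunAction
from openhands.runtime.base import Runtime


def complete_runtime(runtime: Runtime, instance: pd.Series) -> dict[str, Any]:
    """Collect evaluation artifacts from the sandbox after the agent has run."""
    # Illustrative check: read the output the agent was asked to produce.
    # The path '/workspace/output.txt' is an assumption, not the real location.
    obs = runtime.run_action(CmdRunAction(command='cat /workspace/output.txt'))
    return {'exit_code': obs.exit_code, 'output': obs.content}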