evaluation.benchmarks.humanevalfix.run_infer

Implements evaluation of agents on HumanEvalFix from the HumanEvalPack benchmark introduced in "OctoPack: Instruction Tuning Code Large Language Models" (https://arxiv.org/abs/2308.07124). Please see https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/bigcode_eval/tasks/humanevalpack.py for the reference implementation used in the paper.
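
For orientation, here is a minimal sketch of how the HumanEvalFix split could be loaded with the Hugging Face datasets library. The dataset id ("bigcode/humanevalpack") and the field names come from the public HumanEvalPack release and are assumptions about what this module consumes, not part of it:

    from datasets import load_dataset

    # HumanEvalFix ships as part of HumanEvalPack on the Hugging Face Hub; the
    # "python" config matches the only language currently supported here.
    dataset = load_dataset("bigcode/humanevalpack", "python", split="test")
    df = dataset.to_pandas()

    # Each row pairs a buggy implementation with the unit tests that expose the bug.
    instance = df.iloc[0]
    print(instance["task_id"])         # e.g. "Python/0"
    print(instance["buggy_solution"])  # implementation the agent must repair
    print(instance["test"])            # tests used to judge the fix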

TODOs:

  • Potentially support other HumanEvalPack datasets (Explain & Synthesize)
  • Support other languages (currently only Python)

initialize_runtime

def initialize_runtime(runtime: Runtime, instance: pd.Series)

Initialize the runtime for the agent.

This function is called once, before the agent starts running, and performs any per-instance setup needed in the sandbox.
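
As a rough, hypothetical illustration of the kind of setup this entails (not this module's actual code), initialization could amount to materialising the buggy program inside the sandbox workspace so the agent has a file to repair. The workspace path, file naming, and dataset field names below are all assumptions:

    import os

    import pandas as pd

    # Assumed sandbox mount point; the real path comes from the Runtime configuration.
    WORKSPACE_DIR = "/workspace"

    def write_buggy_file(workspace_dir: str, instance: pd.Series) -> str:
        """Hypothetical helper: place the buggy HumanEvalFix program in the workspace.

        The agent is later asked to repair this file so that the benchmark's
        unit tests pass. Field names follow the public HumanEvalPack dataset.
        """
        source = instance["declaration"] + instance["buggy_solution"]
        path = os.path.join(workspace_dir, f"{instance['entry_point']}.py")
        with open(path, "w", encoding="utf-8") as f:
            f.write(source)
        return path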

complete_runtime

def complete_runtime(runtime: Runtime, instance: pd.Series) -> dict[str, Any]

Complete the runtime for the agent.

This function is called after the agent has finished running. If you need to do something in the sandbox to compute the correctness metric once the agent is done, modify this function.
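
One way to obtain that metric, sketched below under the same assumptions about HumanEvalPack field names (test, entry_point), is to execute the benchmark's unit tests against the repaired source and treat a zero exit status as a pass. The real function would route this through the Runtime's sandbox rather than a local subprocess:

    import subprocess
    import sys

    import pandas as pd

    def tests_pass(fixed_source: str, instance: pd.Series, timeout: float = 10.0) -> bool:
        """Hypothetical check: run the instance's unit tests against the repaired code.

        Follows the HumanEvalPack convention of appending the test suite and a
        check(entry_point) call; a zero exit status means every assertion passed.
        """
        program = (
            fixed_source
            + "\n"
            + instance["test"]
            + f"\ncheck({instance['entry_point']})\n"
        )
        try:
            proc = subprocess.run(
                [sys.executable, "-c", program],
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
        return proc.returncode == 0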