evaluation.gpqa.run_infer
Overview: This module implements the evaluation of agents on the GPQA benchmark in the open-book setting.
- The benchmark consists of 448 high-quality and extremely difficult multiple-choice questions in the domains of biology, physics, and chemistry. The questions are intentionally designed to be "Google-proof," meaning that even highly skilled non-expert validators achieve only 34% accuracy despite unrestricted access to the web.
- Even experts in the corresponding domains achieve only 65% accuracy.
- State-of-the-art AI systems achieve only 39% accuracy on this challenging dataset.
Solving these graduate-level questions accurately is expected to require both tool use (e.g., Python for calculations) and web search to find relevant facts, since the information needed to answer may not be part of the LLM's knowledge or training data.
Further references:
- https://arxiv.org/pdf/2311.12022
- https://paperswithcode.com/dataset/gpqa
- https://github.com/idavidrein/gpqa
TODOs:
- Add evaluation on other Agent classes
- Support batch inference and evaluation of agents on the GPQA benchmark.
parse_final_answer
def parse_final_answer(final_answer: str | None) -> str | None
Parse the final message generated by the agent and extract the final answer. The final answer is usually enclosed in the format: <<FINAL_ANSWER|| <insert correct answer here> ||FINAL_ANSWER>>
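A minimal sketch of how such parsing might be done, based only on the delimiter format described above (the actual implementation in the repository may differ):

```python
import re

def parse_final_answer(final_answer: str | None) -> str | None:
    # Extract the text between the <<FINAL_ANSWER|| ... ||FINAL_ANSWER>> delimiters.
    if final_answer is None:
        return None
    match = re.search(
        r"<<FINAL_ANSWER\|\|(.*?)\|\|FINAL_ANSWER>>", final_answer, re.DOTALL
    )
    return match.group(1).strip() if match else None
```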
compare_answers
def compare_answers(model_output: str | None, ground_truth: str)
Compare the model's predicted answer with the ground-truth answer.
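A rough sketch of such a comparison, reusing `parse_final_answer` from above and assuming a simple case- and whitespace-insensitive string match (the real implementation may normalize answers differently):

```python
def compare_answers(model_output: str | None, ground_truth: str) -> bool:
    # Extract the agent's final answer and compare it to the ground truth
    # after trimming whitespace and lowercasing (assumed normalization).
    predicted = parse_final_answer(model_output)
    if predicted is None:
        return False
    return predicted.strip().lower() == ground_truth.strip().lower()
```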
convert_instance_dict
def convert_instance_dict(instance)
Preprocess an instance of the Hugging Face dataset into a format that can be used by the agent, reading and extracting the relevant fields from the dataset instance.
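An illustrative sketch of such preprocessing. The column names (`Question`, `Correct Answer`, `Incorrect Answer 1..3`) are assumed from the GPQA Hugging Face dataset and the output keys are hypothetical; the actual function may extract different fields:

```python
import random

def convert_instance_dict(instance: dict) -> dict:
    # Collect the four answer options and shuffle them so the correct
    # option does not always appear in the same position.
    choices = [
        instance['Correct Answer'],
        instance['Incorrect Answer 1'],
        instance['Incorrect Answer 2'],
        instance['Incorrect Answer 3'],
    ]
    random.shuffle(choices)
    return {
        'question': instance['Question'],
        'choices': choices,
        'correct_answer': instance['Correct Answer'],
    }
```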