evaluation.benchmarks.gpqa.run_infer
Overview: This code implements the evaluation of agents on the GPQA Benchmark in the Open Book setting.
- The benchmark consists of 448 high-quality and extremely difficult multiple-choice questions in the domains of biology, physics, and chemistry. The questions are intentionally designed to be "Google-proof," meaning that even highly skilled non-expert validators achieve only 34% accuracy despite unrestricted access to the web.
- Even experts in the corresponding domains achieve only 65% accuracy.
- State-of-the-art AI systems achieve only 39% accuracy on this challenging dataset.
Accurately solving these graduate-level questions requires both tool use (e.g., Python for calculations) and web search to find related facts, since the information needed to answer them may not be part of the LLM's knowledge or training data.
Further references:
- https://arxiv.org/pdf/2311.12022
- https://paperswithcode.com/dataset/gpqa
- https://github.com/idavidrein/gpqa
TODOs:
- Add evaluation on other Agent classes
- Add batch inference and evaluation of agents on the GPQA Benchmark.
parse_final_answer
def parse_final_answer(final_answer: str | None) -> str | None
Parse the final message generated by the agent to extract the final answer. The answer is usually enclosed in the format: <<FINAL_ANSWER|| <insert correct answer here> ||FINAL_ANSWER>>
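A minimal sketch of the parsing logic implied by the docstring, assuming the answer is delimited by the <<FINAL_ANSWER|| ... ||FINAL_ANSWER>> markers; the actual implementation may handle additional edge cases.

```python
import re


def parse_final_answer(final_answer: str | None) -> str | None:
    """Extract the text enclosed in <<FINAL_ANSWER|| ... ||FINAL_ANSWER>> markers."""
    if final_answer is None:
        return None
    match = re.search(
        r'<<FINAL_ANSWER\|\|(.*?)\|\|FINAL_ANSWER>>', final_answer, re.DOTALL
    )
    # Return the stripped answer text, or None if no marker block is found.
    return match.group(1).strip() if match else None
```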
compare_answers
def compare_answers(model_output: str | None, ground_truth: str)
Compare the predicted answer with the ground-truth answer.
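A hedged sketch of the comparison step: it assumes the raw model output is first parsed with parse_final_answer and then matched against the ground truth after simple normalization. The real implementation may instead compare raw option letters (A/B/C/D).

```python
def compare_answers(model_output: str | None, ground_truth: str) -> bool:
    """Return True when the parsed prediction matches the ground-truth answer."""
    predicted = parse_final_answer(model_output)
    if predicted is None:
        return False
    # Case-insensitive, whitespace-normalized comparison (assumption).
    return predicted.strip().lower() == ground_truth.strip().lower()
```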
convert_instance_dict
def convert_instance_dict(instance)
Used for preprocessing the HF dataset into a format that can be used by the agent. Reads and extracts the relevant fields from the dataset instance.
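A minimal sketch of such preprocessing, assuming the column names of the GPQA HuggingFace dataset ('Question', 'Correct Answer', 'Incorrect Answer 1'..'Incorrect Answer 3'); the output keys ('question', 'choices', 'correct_solution') are illustrative, not the actual schema used downstream.

```python
import random


def convert_instance_dict(instance: dict) -> dict:
    """Flatten a raw GPQA instance into a question, shuffled answer choices,
    and the correct solution for later scoring."""
    choices = [
        instance['Correct Answer'],
        instance['Incorrect Answer 1'],
        instance['Incorrect Answer 2'],
        instance['Incorrect Answer 3'],
    ]
    # Shuffle so the correct answer does not always appear in the same position.
    random.shuffle(choices)
    return {
        'question': instance['Question'],
        'choices': choices,
        'correct_solution': instance['Correct Answer'],
    }
```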