evaluation.benchmarks.the_agent_company.scripts.summarise_results

calculate_cost

def calculate_cost(model: str, prompt_tokens: int,
                   completion_tokens: int) -> float

Calculate the cost of the model call.
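As a rough illustration of how a token-based cost calculation can work, here is a minimal sketch. The price table, the model name `example-model`, and the function name `calculate_cost_sketch` are all hypothetical; the actual pricing data and lookup logic used by `calculate_cost` are not described on this page.

```python
# Hypothetical prices in USD per one million tokens (assumption, not real pricing).
EXAMPLE_PRICES = {
    "example-model": {"prompt": 3.00, "completion": 15.00},
}


def calculate_cost_sketch(model: str, prompt_tokens: int,
                          completion_tokens: int) -> float:
    """Estimate the cost of one model call from its token counts."""
    prices = EXAMPLE_PRICES[model]
    total = (prompt_tokens * prices["prompt"]
             + completion_tokens * prices["completion"])
    # Prices are quoted per million tokens, so scale back down.
    return total / 1_000_000


cost = calculate_cost_sketch("example-model", prompt_tokens=2000,
                             completion_tokens=1000)
```

Prompt and completion tokens are priced separately because providers typically charge different rates for input and output.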

analyze_eval_json_file

def analyze_eval_json_file(filepath: str) -> Tuple[int, int]

Analyze a single eval JSON file and extract the total and result from final_score.

Arguments:

  • filepath - Path to the JSON file

Returns:

Tuple containing (total, result) from final_score
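The extraction described above might look like the following sketch. It assumes the eval file is a JSON object whose `final_score` field is itself an object with `total` and `result` keys; that layout and the function name `read_final_score` are assumptions for illustration.

```python
import json
from typing import Tuple


def read_final_score(filepath: str) -> Tuple[int, int]:
    """Return (total, result) from the final_score field of an eval file."""
    with open(filepath, "r", encoding="utf-8") as f:
        data = json.load(f)
    # Assumed layout: {"final_score": {"total": ..., "result": ...}}
    final_score = data["final_score"]
    return final_score["total"], final_score["result"]
```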

analyze_traj_json_file

def analyze_traj_json_file(filepath: str) -> Tuple[int, float]

Analyze a single trajectory JSON file and extract the steps and the tokens for each step, then estimate the cost from the token counts and the model type. Note: the estimate assumes no prompt caching at all, so every prompt token is counted at the full rate.
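To make the no-caching assumption concrete, here is a minimal sketch that works on already-extracted token counts rather than on a trajectory file, since the file layout is not documented here. The function name, the `(prompt_tokens, completion_tokens)` pair representation, and the per-token prices are hypothetical.

```python
from typing import List, Tuple


def estimate_trajectory_cost(
    step_tokens: List[Tuple[int, int]],
    prompt_price_per_token: float,
    completion_price_per_token: float,
) -> Tuple[int, float]:
    """Return (steps, cost) for a list of (prompt_tokens, completion_tokens)."""
    steps = len(step_tokens)
    # Without prompt caching, every step pays the full price for its whole
    # prompt, even when most of it repeats the previous step's context.
    cost = sum(
        prompt * prompt_price_per_token
        + completion * completion_price_per_token
        for prompt, completion in step_tokens
    )
    return steps, cost


steps, cost = estimate_trajectory_cost(
    [(1000, 100), (2000, 200)],  # hypothetical token counts for two steps
    prompt_price_per_token=1e-6,
    completion_price_per_token=2e-6,
)
```

Because agent prompts usually grow with each step, an estimate of this kind tends to be an upper bound when the provider does apply prompt caching.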

analyze_folder

def analyze_folder(
    folder_path: str
) -> Tuple[Dict[str, Tuple[int, int]], Dict[str, Tuple[int, float]]]

Analyze all eval_*.json and traj_*.json files in the specified folder.

Arguments:

  • folder_path - Path to the folder containing JSON files

Returns:

Tuple of two dictionaries:

  • eval_results: Dictionary with filename as key and (total, result) tuple as value
  • traj_results: Dictionary with filename as key and (steps, cost) tuple as value
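File discovery for a folder like this could be sketched as follows, assuming the result files match the glob patterns `eval_*.json` and `traj_*.json`. The helper name `find_result_files` is hypothetical, and the sketch only locates the files; parsing them is left to the per-file functions.

```python
import glob
import os
from typing import List, Tuple


def find_result_files(folder_path: str) -> Tuple[List[str], List[str]]:
    """Return sorted lists of eval and trajectory JSON paths in a folder."""
    # Assumed naming convention: eval_<task>.json and traj_<task>.json.
    eval_files = sorted(glob.glob(os.path.join(folder_path, "eval_*.json")))
    traj_files = sorted(glob.glob(os.path.join(folder_path, "traj_*.json")))
    return eval_files, traj_files
```

Keying the resulting dictionaries by `os.path.basename(path)` would give the filename-to-tuple mapping described in the return value.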

get_task_nature_category

def get_task_nature_category(task_name: str) -> str

Get the nature category of the task.

calculate_score

def calculate_score(total: int, result: int) -> float

Calculate the score as a number between 0 and 1.

Formula: score = (result / total) * 0.5 + (result // total) * 0.5

Explanation:

  • (result / total) * 0.5: This is the completion ratio, scaled down to a 0-0.5 range.
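  • (result // total) * 0.5: This is a binary bonus for full completion. Integer division yields 1 only when result equals total and 0 otherwise (assuming result never exceeds total), so the term is either 0.5 or 0.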

Arguments:

  • total - Total possible points
  • result - Actual points achieved

Returns:

Score as a number between 0 and 1
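The formula can be checked with a few worked values. The sketch below transcribes it directly; the name `calculate_score_sketch` is used to make clear this is an illustration, not necessarily the exact implementation, and it assumes `total` is greater than zero.

```python
def calculate_score_sketch(total: int, result: int) -> float:
    """Half credit for partial progress, plus a 0.5 bonus for full completion."""
    # Assumes total > 0; a zero total would raise ZeroDivisionError.
    return (result / total) * 0.5 + (result // total) * 0.5


half_done = calculate_score_sketch(total=4, result=2)   # 0.25 + 0.0 = 0.25
almost = calculate_score_sketch(total=4, result=3)      # 0.375 + 0.0 = 0.375
all_done = calculate_score_sketch(total=4, result=4)    # 0.5 + 0.5 = 1.0
```

Note the jump from 0.375 to 1.0 between three and four checkpoints: a task that is not fully completed can never score 0.5 or higher.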

is_perfect_completion

def is_perfect_completion(total: int, result: int) -> bool

Check if the task achieved perfect completion.

Arguments:

  • total - Total possible points
  • result - Actual points achieved

Returns:

True if result equals total, False otherwise
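A check like this is typically used to count fully solved tasks across a result set. The sketch below shows one way to do that; the function name, the `eval_results` sample data, and the file names in it are hypothetical.

```python
def is_perfect_completion_sketch(total: int, result: int) -> bool:
    """True only when every available point was achieved."""
    return total == result


# Hypothetical results keyed by filename, as (total, result) tuples.
eval_results = {
    "eval_task_a.json": (4, 4),
    "eval_task_b.json": (5, 3),
}

perfect_tasks = [
    name
    for name, (total, result) in eval_results.items()
    if is_perfect_completion_sketch(total, result)
]
```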