evaluation.benchmarks.the_agent_company.scripts.summarise_results
calculate_cost
def calculate_cost(model: str, prompt_tokens: int,
completion_tokens: int) -> float
Calculate the cost of the model call.
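The reference does not say how the cost is derived. As a rough illustration only, the sketch below assumes per-token pricing with separate prompt and completion rates. The function name `estimate_cost`, the table `PLACEHOLDER_RATES`, the model name, and the rates themselves are all hypothetical placeholders, not real provider pricing and not this module's actual logic.

```python
from typing import Dict, Tuple

# Hypothetical (prompt, completion) prices in USD per one million tokens.
# These numbers are placeholders for illustration, not real pricing.
PLACEHOLDER_RATES: Dict[str, Tuple[float, float]] = {
    "example-model": (2.0, 8.0),
}


def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate the USD cost of one call from its token counts."""
    if model not in PLACEHOLDER_RATES:
        raise ValueError(f"no rate configured for model: {model}")
    prompt_rate, completion_rate = PLACEHOLDER_RATES[model]
    # Prompt and completion tokens are billed at different rates.
    total = prompt_rate * prompt_tokens + completion_rate * completion_tokens
    return total / 1_000_000


cost = estimate_cost("example-model", prompt_tokens=1000, completion_tokens=500)
```

Raising on an unknown model, rather than returning zero, keeps a missing rate from silently lowering an aggregated cost.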
analyze_eval_json_file
def analyze_eval_json_file(filepath: str) -> Tuple[int, int]
Analyze a single eval JSON file and extract the total and result from final_score.
Arguments:
filepath
- Path to the JSON file
Returns:
Tuple containing (total, result) from final_score
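A minimal sketch of the extraction described above. It assumes the JSON file has a top-level `final_score` object with integer `total` and `result` fields; that layout and the helper name `read_final_score` are assumptions made for illustration.

```python
import json
import os
import tempfile
from typing import Tuple


def read_final_score(filepath: str) -> Tuple[int, int]:
    """Return (total, result) from the file's final_score entry."""
    with open(filepath, "r", encoding="utf-8") as f:
        data = json.load(f)
    # Assumed layout: {"final_score": {"total": ..., "result": ...}}
    final_score = data["final_score"]
    return int(final_score["total"]), int(final_score["result"])


# Usage with a throwaway file that follows the assumed layout.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "eval_example.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"final_score": {"total": 5, "result": 3}}, f)
    demo = read_final_score(path)

print(demo)  # (5, 3)
```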
analyze_traj_json_file
def analyze_traj_json_file(filepath: str) -> Tuple[int, float]
Analyze a single trajectory JSON file and extract the number of steps and the tokens used at each step, then estimate the cost from the token counts and the model type. Note: the estimate assumes no prompt caching.
analyze_folder
def analyze_folder(
folder_path: str
) -> Tuple[Dict[str, Tuple[int, int]], Dict[str, Tuple[int, float]]]
Analyze all eval_*.json and traj_*.json files in the specified folder.
Arguments:
folder_path
- Path to the folder containing JSON files
Returns:
Tuple of two dictionaries:
- eval_results: Dictionary with filename as key and (total, result) tuple as value
- traj_results: Dictionary with filename as key and (steps, cost) tuple as value
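The file-discovery step can be sketched as below, assuming the result files are named `eval_*.json` and `traj_*.json` and sit directly in the folder. The helper name `find_result_files` is hypothetical; each discovered path would then go to the matching per-file analyzer.

```python
import glob
import os
import tempfile
from typing import List, Tuple


def find_result_files(folder_path: str) -> Tuple[List[str], List[str]]:
    """Return sorted lists of eval and trajectory JSON paths in a folder."""
    eval_files = sorted(glob.glob(os.path.join(folder_path, "eval_*.json")))
    traj_files = sorted(glob.glob(os.path.join(folder_path, "traj_*.json")))
    return eval_files, traj_files


# Usage with a throwaway folder holding two matching files and one other file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("eval_task_a.json", "traj_task_a.json", "notes.txt"):
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as f:
            f.write("{}")
    found_eval, found_traj = find_result_files(tmp)
    eval_names = [os.path.basename(p) for p in found_eval]
    traj_names = [os.path.basename(p) for p in found_traj]

print(eval_names, traj_names)
```

Sorting the matches makes the order of the resulting entries reproducible across runs, since `glob` does not guarantee one.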
get_task_nature_category
def get_task_nature_category(task_name: str) -> str
Get the nature category of the task.
calculate_score
def calculate_score(total: int, result: int) -> float
Calculate the score as a number between 0 and 1.
Formula: score = (result / total) * 0.5 + (result // total) * 0.5
Explanation:
- (result / total) * 0.5: The completion ratio, scaled to the range 0 to 0.5.
- (result // total) * 0.5: A binary completion bonus. Integer division yields 1 only when result equals total (assuming result never exceeds total) and 0 otherwise.
Arguments:
total
- Total possible points
result
- Actual points achieved by the agent
Returns:
Score as a number between 0 and 1
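The formula above can be checked with a short worked example. This is a direct transcription of the stated formula, not necessarily the module's exact code, and it assumes total is positive and result lies between 0 and total.

```python
def calculate_score(total: int, result: int) -> float:
    """Half credit for the completion ratio, half for full completion."""
    ratio_part = (result / total) * 0.5    # partial credit, 0 to 0.5
    bonus_part = (result // total) * 0.5   # 0.5 only when result == total
    return ratio_part + bonus_part


print(calculate_score(4, 4))  # 1.0
print(calculate_score(4, 3))  # 0.375
print(calculate_score(4, 0))  # 0.0
```

A task with three of four checkpoints passed scores 0.375 rather than 0.75, because the second half of the score is awarded only on full completion. As written, the formula divides by total, so total = 0 would raise `ZeroDivisionError`.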
is_perfect_completion
def is_perfect_completion(total: int, result: int) -> bool
Check if the task achieved perfect completion.
Arguments:
total
- Total possible points
result
- Actual points achieved
Returns:
True if result equals total, False otherwise