evaluation.benchmarks.the_agent_company.scripts.summarise_results

calculate_cost

def calculate_cost(model: str, prompt_tokens: int,
                   completion_tokens: int) -> float

Calculate the cost of the model call.

analyze_eval_json_file

def analyze_eval_json_file(filepath: str) -> Tuple[int, int]

Analyze a single eval JSON file and extract the total and result from final_score.

Arguments:

filepath - Path to the JSON file

Returns:

Tuple containing (total, result) from final_score
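
The extraction described above can be sketched as follows. This is a minimal illustration, not the module's actual implementation: the helper name `read_final_score` is hypothetical, and the JSON layout (a top-level `final_score` object with integer `total` and `result` keys) is an assumption inferred from the description.

```python
import json
import os
import tempfile
from typing import Tuple


def read_final_score(filepath: str) -> Tuple[int, int]:
    """Hypothetical helper: return (total, result) from a file's final_score."""
    with open(filepath, encoding="utf-8") as f:
        data = json.load(f)
    # Assumed layout: {"final_score": {"total": <int>, "result": <int>}}
    final_score = data["final_score"]
    return final_score["total"], final_score["result"]


# Demo with a throwaway file standing in for an eval JSON file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "eval_example.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"final_score": {"total": 5, "result": 3}}, f)
    demo = read_final_score(path)

print(demo)  # (5, 3)
```

If the real files use a different layout, only the two key lookups would need to change.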

analyze_traj_json_file

def analyze_traj_json_file(filepath: str) -> Tuple[int, float]

Analyze a single trajectory JSON file and extract the number of steps and the tokens used in each step, then estimate the cost from the token counts and the model type. Note: the estimate assumes no prompt caching at all.

analyze_folder

def analyze_folder(
    folder_path: str
) -> Tuple[Dict[str, Tuple[int, int]], Dict[str, Tuple[int, float]]]

Analyze all eval_*.json and traj_*.json files in the specified folder.

Arguments:

folder_path - Path to the folder containing JSON files

Returns:

Tuple of two dictionaries:

eval_results: Dictionary with filename as key and (total, result) tuple as value
traj_results: Dictionary with filename as key and (steps, cost) tuple as value

get_task_nature_category

def get_task_nature_category(task_name: str) -> str

Get the nature category of the task.

calculate_score

def calculate_score(total: int, result: int) -> float

Calculate the score as a number between 0 and 1.

Formula: score = (result / total) * 0.5 + (result // total) * 0.5

Explanation:

(result / total) * 0.5: This is the completion ratio, scaled down to a 0-0.5 range.
(result // total) * 0.5: This is a binary score indicating whether the task was fully completed. The floor division yields 1 only when result equals total, and 0 for any partial result.

Arguments:

total - Total possible points
result - Actual points achieved

Returns:

Score as a number between 0 and 1
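
The formula above can be checked with a short sketch. The function name `score_sketch` is hypothetical, and the sketch assumes `total` is positive and `0 <= result <= total`. A `total` of zero would raise `ZeroDivisionError`, and the source does not say how the real function handles that case.

```python
def score_sketch(total: int, result: int) -> float:
    """Hypothetical re-statement of the documented formula."""
    # Partial credit: completion ratio scaled into the 0-0.5 range.
    partial = (result / total) * 0.5
    # Completion bonus: floor division is 1 only when result == total.
    bonus = (result // total) * 0.5
    return partial + bonus


print(score_sketch(4, 0))  # 0.0
print(score_sketch(4, 2))  # 0.25
print(score_sketch(4, 4))  # 1.0
```

Because the bonus is all-or-nothing, a task with every checkpoint but one passed scores just under 0.5, while a fully completed task scores 1.0.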

is_perfect_completion

def is_perfect_completion(total: int, result: int) -> bool

Check if the task achieved perfect completion.

Arguments:

total - Total possible points
result - Actual points achieved

Returns:

True if result equals total, False otherwise
