evaluation.benchmarks.mint.utils

check_correctness

def check_correctness(solution_code: str,
                      test_code: str,
                      timeout: float = 10,
                      completion_id: int | None = None) -> dict

Evaluates the functional correctness of a completion by running the test suite provided in the problem.

Arguments:

completion_id: an optional completion ID so we can match the results later even if execution finishes asynchronously.
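To make the description above concrete, here is a minimal sketch of one way such a check could be structured. It is not the module's implementation: the name `check_correctness_sketch`, the use of a child interpreter via `subprocess`, and the result keys `passed` and `result` are all assumptions made for illustration. Only the parameter names come from the signature above.

```python
import subprocess
import sys


def check_correctness_sketch(solution_code, test_code, timeout=10, completion_id=None):
    """Run solution_code followed by test_code in a child Python interpreter.

    Hypothetical helper: the returned keys are illustrative, not the real API.
    """
    program = solution_code + "\n\n" + test_code + "\n"
    try:
        proc = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # The child was killed because it exceeded the time budget.
        return {"passed": False, "result": "timed out", "completion_id": completion_id}

    if proc.returncode == 0:
        result = "passed"
    else:
        # Keep only the last stderr line, typically the exception name.
        lines = proc.stderr.strip().splitlines()
        result = "failed: " + (lines[-1] if lines else "unknown error")

    # completion_id is passed through untouched so callers can match results later.
    return {"passed": result == "passed", "result": result, "completion_id": completion_id}
```

Calling `check_correctness_sketch("def add(a, b):\n    return a + b", "assert add(1, 2) == 3", completion_id=7)` would return a dict whose `completion_id` is `7`, which is what lets a caller pair results with requests when many checks finish out of order.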

WriteOnlyStringIO Objects

class WriteOnlyStringIO(io.StringIO)

A StringIO that raises an exception when it is read from.

readable

def readable(*args, **kwargs)

Returns True if the IO object can be read.
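A write-only buffer of this kind can be sketched as a small subclass of `io.StringIO`. The class name `WriteOnlyStringIOSketch`, the choice of `OSError`, and the decision to have `readable` return `False` are assumptions for illustration; the actual class may differ in which methods it overrides.

```python
import io


class WriteOnlyStringIOSketch(io.StringIO):
    """In-memory text buffer that accepts writes but refuses reads (illustrative)."""

    def read(self, *args, **kwargs):
        raise OSError("this buffer is write-only")

    def readline(self, *args, **kwargs):
        raise OSError("this buffer is write-only")

    def readlines(self, *args, **kwargs):
        raise OSError("this buffer is write-only")

    def readable(self, *args, **kwargs):
        # Assumption: a write-only buffer reports that it cannot be read.
        return False
```

One plausible use is as a stand-in for standard input while generated code runs, so that a stray `input()` call fails fast instead of blocking.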

reliability_guard

def reliability_guard(maximum_memory_bytes: int | None = None)

Disables various destructive functions and prevents the generated code from interfering with the test (e.g. fork bombs, killing other processes, or removing filesystem files).

Warnings:

This function is NOT a security sandbox. Untrusted code, including model-generated code, should not be blindly executed outside of one. See the Codex paper for more information about OpenAI's code sandbox, and proceed with caution.
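The idea of disabling destructive functions can be illustrated with a small context manager that replaces chosen module attributes with `None`. This is a sketch under stated assumptions, not the module's code: the name `disabled_functions_sketch` is hypothetical, the functions to disable are chosen by the caller, and unlike the function documented above this version restores the originals on exit. The memory cap controlled by `maximum_memory_bytes` is not shown.

```python
import contextlib
import os


@contextlib.contextmanager
def disabled_functions_sketch(module, names):
    """Temporarily replace module.<name> with None for each existing name."""
    # Remember the originals so they can be put back afterwards.
    saved = {name: getattr(module, name) for name in names if hasattr(module, name)}
    try:
        for name in saved:
            setattr(module, name, None)
        yield
    finally:
        for name, original in saved.items():
            setattr(module, name, original)
```

Inside `with disabled_functions_sketch(os, ["system", "remove"]):`, a call such as `os.system("...")` raises `TypeError` because `None` is not callable. As the warning above says, this only guards against accidents; code that wants to bypass it can, so it is no substitute for a real sandbox.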
