evaluation.benchmarks.mint.utils
check_correctness
def check_correctness(solution_code: str,
test_code: str,
timeout: float = 10,
completion_id: int | None = None) -> dict
Evaluates the functional correctness of a completion by running the test
suite provided in the problem.
Arguments:
completion_id: an optional completion ID, so that results can be matched later even if execution finishes asynchronously.
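As a rough illustration of the idea described above, the following minimal sketch runs a solution together with its tests in a child interpreter and reports the outcome. It is not the module's implementation: the function name `run_solution_with_tests`, the use of `subprocess`, and the dictionary keys `completion_id`, `success`, and `result` are all assumptions made for this example.

```python
import subprocess
import sys


def run_solution_with_tests(solution_code, test_code, timeout=10, completion_id=None):
    """Hypothetical sketch: run solution + tests in a separate Python process."""
    # Concatenate the candidate solution and the test suite into one program.
    program = solution_code + "\n\n" + test_code
    try:
        proc = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # The child process is killed by subprocess.run when the timeout expires.
        return {"completion_id": completion_id, "success": False, "result": "timed out"}

    if proc.returncode == 0:
        return {"completion_id": completion_id, "success": True, "result": "passed"}

    # Report the last line of stderr (usually the exception message), if any.
    stderr_lines = proc.stderr.strip().splitlines()
    reason = stderr_lines[-1] if stderr_lines else "non-zero exit status"
    return {"completion_id": completion_id, "success": False, "result": "failed: " + reason}


if __name__ == "__main__":
    outcome = run_solution_with_tests(
        "def add(a, b):\n    return a + b",
        "assert add(1, 2) == 3",
        completion_id=0,
    )
    print(outcome)
```

Echoing `completion_id` back in the result is what lets a caller that launches many evaluations concurrently pair each result with the completion that produced it.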
WriteOnlyStringIO Objects
class WriteOnlyStringIO(io.StringIO)
A StringIO that raises an exception when it is read from.
readable
def readable(*args, **kwargs)
Returns True if the IO object can be read.
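A write-only stream of this kind might look like the sketch below. The class name `WriteOnlyStringIOSketch` and the choice of `OSError` as the exception type are assumptions for illustration; the documented class may differ in both.

```python
import io


class WriteOnlyStringIOSketch(io.StringIO):
    """Hypothetical sketch of a StringIO that accepts writes but refuses reads."""

    def read(self, *args, **kwargs):
        raise OSError("stream is write-only")

    def readline(self, *args, **kwargs):
        raise OSError("stream is write-only")

    def readlines(self, *args, **kwargs):
        raise OSError("stream is write-only")

    def readable(self, *args, **kwargs):
        # This sketch never supports reading, so it always reports False.
        return False


if __name__ == "__main__":
    stream = WriteOnlyStringIOSketch()
    stream.write("captured output")
    print(stream.readable())
    try:
        stream.read()
    except OSError as exc:
        print("read refused:", exc)
```

A stream like this can be substituted for standard input while untrusted code runs, so that any attempt to read input fails immediately instead of blocking.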
reliability_guard
def reliability_guard(maximum_memory_bytes: int | None = None)
Disables various destructive functions and prevents the generated code from interfering with the test (e.g. fork bombs, killing other processes, removing filesystem files).
Warnings:
This function is NOT a security sandbox. Untrusted code, including model-generated code, should not be blindly executed outside of one. See the Codex paper for more information about OpenAI's code sandbox, and proceed with caution.
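The general technique of such a guard can be sketched as follows: optionally cap the process's address space, then overwrite selected destructive callables with `None` so that calling them fails. This is a minimal sketch under stated assumptions, not the module's code: the helper names `disable_callables`, `restore_callables`, and `limit_memory`, and the short list in `DEFAULT_TARGETS`, are hypothetical, and the memory limit relies on the Unix-only `resource` module.

```python
import os
import shutil

# Hypothetical, deliberately short list of (module, attribute names) to disable.
DEFAULT_TARGETS = [
    (os, ("kill", "fork", "remove", "rmdir")),
    (shutil, ("rmtree",)),
]


def limit_memory(maximum_memory_bytes):
    """Lower the soft address-space limit of the current process (Unix only)."""
    import resource  # imported lazily because it is unavailable on Windows

    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, (maximum_memory_bytes, hard))


def disable_callables(targets=DEFAULT_TARGETS):
    """Replace each named attribute with None and return the saved originals."""
    saved = []
    for namespace, names in targets:
        for name in names:
            if hasattr(namespace, name):
                saved.append((namespace, name, getattr(namespace, name)))
                setattr(namespace, name, None)
    return saved


def restore_callables(saved):
    """Undo disable_callables by putting the saved attributes back."""
    for namespace, name, original in saved:
        setattr(namespace, name, original)
```

Code that later calls a disabled function, for example `os.remove(path)`, fails with a `TypeError` because `None` is not callable. As the warning above notes, this only guards against accidental damage: code that deliberately looks for a way around the disabled names can still find one, so a guard of this kind belongs inside a real sandbox, not in place of one.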