Benchmarks

Honest benchmarks. Including the parts we lose.

Corvid optimizes for correctness, not raw throughput. Here is how that plays out on the canonical reference apps.

Compile-time rejection

Bug classes that Corvid rejects at compile time and Python/TypeScript accept silently.

Source: benches/moat/compile_time_rejection/RESULTS.md → Reproduce: corvid bench compare python

Lines of code required to express the same agent safely in Corvid vs Python vs TypeScript.

Stack	Feature lines	Governance lines	Total	Governance %
corvid	15.0	15.0	30.0	50.0%
python	45.0	35.0	80.0	43.8%
typescript	22.0	52.0	74.0	70.3%

Governance lines Corvid saves vs Python: +20.0
Governance lines Corvid saves vs TypeScript: +37.0
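The derived columns can be rechecked by hand. The sketch below assumes that Total is feature lines plus governance lines, that Governance % is governance lines divided by Total, and that the "saves" figure is the other stack's governance lines minus Corvid's. Those assumptions match the numbers in the first table above. The names StackCount and governance_saved are hypothetical and are not part of the Corvid bench harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StackCount:
    """Line counts for one stack (hypothetical helper, not a Corvid API)."""
    feature: float
    governance: float

    @property
    def total(self) -> float:
        # Assumption: Total = feature lines + governance lines.
        return self.feature + self.governance

    @property
    def governance_pct(self) -> float:
        # Assumption: Governance % = governance lines / Total.
        return 100.0 * self.governance / self.total


def governance_saved(corvid: StackCount, other: StackCount) -> float:
    """Governance lines the other stack needs beyond Corvid's."""
    return other.governance - corvid.governance


# Counts copied from the first table above.
first_table = {
    "corvid": StackCount(feature=15.0, governance=15.0),
    "python": StackCount(feature=45.0, governance=35.0),
    "typescript": StackCount(feature=22.0, governance=52.0),
}

for name, count in first_table.items():
    print(f"{name}\t{count.total:.1f}\t{count.governance_pct:.1f}%")

for name in ("python", "typescript"):
    saved = governance_saved(first_table["corvid"], first_table[name])
    print(f"Governance lines Corvid saves vs {name}: {saved:+.1f}")
```

Under the same assumptions, the other two tables can be checked by swapping in their counts.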

Stack	Feature lines	Governance lines	Total	Governance %
corvid	18.0	9.0	27.0	33.3%
python	42.0	31.0	73.0	42.5%
typescript	25.0	43.0	68.0	63.2%

Governance lines Corvid saves vs Python: +22.0
Governance lines Corvid saves vs TypeScript: +34.0

Stack	Feature lines	Governance lines	Total	Governance %
corvid	19.0	16.0	35.0	45.7%
python	48.0	39.0	87.0	44.8%
typescript	27.0	55.0	82.0	67.1%

Governance lines Corvid saves vs Python: +23.0
Governance lines Corvid saves vs TypeScript: +39.0

Source: benches/moat/governance_lines/RESULTS.md → Reproduce: corvid bench compare python

How often a model-derived string makes it into a downstream call without a citation.

Source: benches/moat/provenance_preservation/RESULTS.md → Reproduce: corvid bench compare python

Whether re-running a captured trace produces byte-identical output across compilers, runtimes, and hosts.

Corvid (cargo run -p refund_bot_demo): 190/190 byte-identical pairs (rate = 1.000, N = 20)
Python (LangChain + LangSmith): bounty-open (no _summary.json)
TypeScript (Vercel AI SDK + OTEL): bounty-open (no _summary.json)
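The pair count in the Corvid result is consistent with comparing every unordered pair among N = 20 replays, since 20 choose 2 is 190. The sketch below shows one way such a rate could be computed. It is an illustration under that assumption, not the harness's actual implementation. The function name byte_identical_rate and the sample payload are hypothetical.

```python
from itertools import combinations


def byte_identical_rate(outputs: list[bytes]) -> tuple[int, int, float]:
    """Compare every unordered pair of replay outputs byte for byte.

    Returns (identical_pairs, total_pairs, rate).
    Hypothetical helper: the real bench may pair runs differently.
    """
    if len(outputs) < 2:
        raise ValueError("need at least two replay outputs to form a pair")
    pairs = list(combinations(outputs, 2))
    identical = sum(1 for a, b in pairs if a == b)
    return identical, len(pairs), identical / len(pairs)


# Hypothetical payload: 20 replays that all produced the same bytes.
runs = [b'{"refund": "approved"}\n'] * 20
identical, total, rate = byte_identical_rate(runs)
print(f"{identical}/{total} byte-identical pairs (rate = {rate:.3f}, N = {len(runs)})")

# A single differing replay breaks every pair that includes it.
drifted = runs[:-1] + [b'{"refund": "approved" }\n']
print(byte_identical_rate(drifted)[:2])
```

With one replay out of 20 differing, 19 of the 190 pairs fail, so a rate of 1.000 requires every replay to match every other.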

Source: benches/moat/replay_determinism/RESULTS.md → Reproduce: corvid bench compare python

How long it takes a reviewer to confirm a specific safety property holds for a given agent.

Lines of audit-logic code required to answer all 5 representative audit questions against the stack's canonical trace surface (lower is better):

Source: benches/moat/time_to_audit/RESULTS.md → Reproduce: corvid bench compare python