Benchmarks

Honest benchmarks. Including the parts we lose.

Corvid optimizes for correctness, not raw throughput. Here is how that plays out on the canonical reference apps.

Compile-time rejection

Bug classes that Corvid rejects at compile time and Python/TypeScript accept silently.

  • Cases run: 50
  • Rejected by Corvid: 50/50
  • Rejected by Python (mypy --strict + pydantic): 0/50
  • Rejected by TypeScript (tsc --strict + zod): 0/50
Source: benches/moat/compile_time_rejection/RESULTS.md → Reproduce: corvid bench compare python

Governance line count

Lines of code required to express the same agent safely in Corvid vs Python vs TypeScript.

App: rag_qa_bot

Stack      | Feature lines | Governance lines | Total | Governance %
corvid     | 15.0          | 15.0             | 30.0  | 50.0%
python     | 45.0          | 35.0             | 80.0  | 43.8%
typescript | 22.0          | 52.0             | 74.0  | 70.3%

  • Governance lines Corvid saves vs Python: +20.0
  • Governance lines Corvid saves vs TypeScript: +37.0

App: refund_bot

Stack      | Feature lines | Governance lines | Total | Governance %
corvid     | 18.0          | 9.0              | 27.0  | 33.3%
python     | 42.0          | 31.0             | 73.0  | 42.5%
typescript | 25.0          | 43.0             | 68.0  | 63.2%

  • Governance lines Corvid saves vs Python: +22.0
  • Governance lines Corvid saves vs TypeScript: +34.0

App: support_escalation_bot

Stack      | Feature lines | Governance lines | Total | Governance %
corvid     | 19.0          | 16.0             | 35.0  | 45.7%
python     | 48.0          | 39.0             | 87.0  | 44.8%
typescript | 27.0          | 55.0             | 82.0  | 67.1%

  • Governance lines Corvid saves vs Python: +23.0
  • Governance lines Corvid saves vs TypeScript: +39.0

Source: benches/moat/governance_lines/RESULTS.md → Reproduce: corvid bench compare python
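
The derived columns in these tables follow from the two raw counts: Total is feature plus governance lines, Governance % is governance over total, and the "saves" figure is the difference in governance lines only. A minimal Python sketch that recomputes the rag_qa_bot row values is shown here; the `StackLines` class and `governance_saved` helper are illustrative names, not part of the Corvid bench harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StackLines:
    """Raw line counts for one stack (illustrative type, not a Corvid API)."""
    feature: float
    governance: float

    @property
    def total(self) -> float:
        return self.feature + self.governance

    @property
    def governance_pct(self) -> float:
        return 100.0 * self.governance / self.total


# Counts copied from the rag_qa_bot table.
rag_qa_bot = {
    "corvid": StackLines(feature=15.0, governance=15.0),
    "python": StackLines(feature=45.0, governance=35.0),
    "typescript": StackLines(feature=22.0, governance=52.0),
}


def governance_saved(app: dict, baseline: str, other: str) -> float:
    """Governance lines the baseline stack avoids relative to another stack."""
    return app[other].governance - app[baseline].governance


for name, lines in rag_qa_bot.items():
    print(f"{name:<11} total={lines.total:.1f} governance={lines.governance_pct:.1f}%")

print("saved vs python:", governance_saved(rag_qa_bot, "corvid", "python"))
print("saved vs typescript:", governance_saved(rag_qa_bot, "corvid", "typescript"))
```

The percentage column can favor a stack that simply has more feature lines: on rag_qa_bot, Python's 43.8% sits below Corvid's 50.0% even though Python needs 20 more governance lines. The absolute "saves" figures are the more direct comparison.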

Provenance preservation

How often a model-derived string reaches a downstream call with its citation still attached.

  • Chains run: 10
  • Provenance preserved by Corvid: 10/10
  • Provenance preserved by Python (LangChain + pydantic): 0/10
  • Provenance preserved by TypeScript (Vercel AI SDK + zod): 0/10
Source: benches/moat/provenance_preservation/RESULTS.md → Reproduce: corvid bench compare python

Replay determinism

Whether re-running a captured trace produces byte-identical output across compilers, runtimes, and hosts.

  • Corvid (cargo run -p refund_bot_demo): 190/190 byte-identical pairs (rate = 1.000, N = 20)
  • Python (LangChain + LangSmith): bounty-open (no _summary.json)
  • TypeScript (Vercel AI SDK + OTEL): bounty-open (no _summary.json)
Source: benches/moat/replay_determinism/RESULTS.md → Reproduce: corvid bench compare python
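
One reading of the Corvid figure is that N = 20 replay outputs are compared in every unordered pair, which gives 20 × 19 / 2 = 190 pairs. The Python sketch below computes a pairwise byte-identical rate under that assumption. The function name and the stand-in outputs are hypothetical; the actual harness in benches/moat/replay_determinism may count pairs differently.

```python
from itertools import combinations


def byte_identical_rate(outputs: list[bytes]) -> tuple[int, int, float]:
    """Compare every unordered pair of outputs byte for byte.

    Returns (identical_pairs, total_pairs, rate).
    """
    if len(outputs) < 2:
        raise ValueError("need at least two outputs to form a pair")
    pairs = list(combinations(outputs, 2))
    identical = sum(1 for a, b in pairs if a == b)
    return identical, len(pairs), identical / len(pairs)


# Stand-in data: 20 replays that all produced the same bytes.
outputs = [b'{"refund": "approved"}\n'] * 20

identical, total, rate = byte_identical_rate(outputs)
print(f"{identical}/{total} byte-identical pairs (rate = {rate:.3f}, N = {len(outputs)})")
# prints: 190/190 byte-identical pairs (rate = 1.000, N = 20)
```

Counting pairs rather than runs makes divergence visible quickly: with 20 outputs, a single divergent replay would break 19 of the 190 pairs.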

Time to audit

How long it takes a reviewer to confirm a specific safety property holds for a given agent.

Measured here as the lines of audit-logic code required to answer all 5 representative audit questions against each stack's canonical trace surface (lower is better):

  • Corvid (JSONL trace under target/trace/): 65 LOC (all 5 queries correct)
  • Python (LangChain + LangSmith): bounty-open (5/5 queries unimplemented)
  • TypeScript (Vercel AI SDK + OTEL): bounty-open (5/5 queries unimplemented)
Source: benches/moat/time_to_audit/RESULTS.md → Reproduce: corvid bench compare python
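
To give a sense of what "audit-logic code" over a JSONL trace can look like, here is a short Python sketch of a single query. It is not one of the five benchmark queries, and the event fields (`kind`, `tool`, `approved`) are invented for illustration; the real Corvid trace schema under target/trace/ is not described here.

```python
import io
import json


def unapproved_tool_calls(trace_lines):
    """Return tool-call events with no truthy approval flag.

    Assumes one JSON object per line and a hypothetical schema with
    "kind", "tool", and "approved" fields.
    """
    findings = []
    for raw in trace_lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines in the trace
        event = json.loads(raw)
        if event.get("kind") == "tool_call" and not event.get("approved", False):
            findings.append(event)
    return findings


# In-memory stand-in for a trace file; a real run would open a file instead.
sample_trace = io.StringIO(
    '{"kind": "tool_call", "tool": "issue_refund", "approved": true}\n'
    '{"kind": "llm_call", "model": "example"}\n'
    '{"kind": "tool_call", "tool": "send_email", "approved": false}\n'
)

flagged = unapproved_tool_calls(sample_trace)
print([event["tool"] for event in flagged])
# prints: ['send_email']
```

A line-oriented trace keeps queries like this to a single pass with the standard library, which is the kind of cost the LOC figure is meant to capture.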