Honest benchmarks. Including the parts we lose.
Corvid optimizes for correctness, not raw throughput. Here's how that plays out on the canonical reference apps.
Compile-time rejection
Bug classes that Corvid rejects at compile time and Python/TypeScript accept silently.
- Cases run: 50
- Rejected by Corvid: 50/50
- Rejected by Python (mypy --strict + pydantic): 0/50
- Rejected by TypeScript (tsc --strict + zod): 0/50
corvid bench compare python
Governance line count
Lines of code required to express the same agent safely in Corvid vs Python vs TypeScript.
App: rag_qa_bot
| Stack | Feature lines | Governance lines | Total | Governance % |
|---|---|---|---|---|
| corvid | 15.0 | 15.0 | 30.0 | 50.0% |
| python | 45.0 | 35.0 | 80.0 | 43.8% |
| typescript | 22.0 | 52.0 | 74.0 | 70.3% |
- Governance lines Corvid saves vs Python: +20.0
- Governance lines Corvid saves vs TypeScript: +37.0
App: refund_bot
| Stack | Feature lines | Governance lines | Total | Governance % |
|---|---|---|---|---|
| corvid | 18.0 | 9.0 | 27.0 | 33.3% |
| python | 42.0 | 31.0 | 73.0 | 42.5% |
| typescript | 25.0 | 43.0 | 68.0 | 63.2% |
- Governance lines Corvid saves vs Python: +22.0
- Governance lines Corvid saves vs TypeScript: +34.0
App: support_escalation_bot
| Stack | Feature lines | Governance lines | Total | Governance % |
|---|---|---|---|---|
| corvid | 19.0 | 16.0 | 35.0 | 45.7% |
| python | 48.0 | 39.0 | 87.0 | 44.8% |
| typescript | 27.0 | 55.0 | 82.0 | 67.1% |
- Governance lines Corvid saves vs Python: +23.0
- Governance lines Corvid saves vs TypeScript: +39.0
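The percentages and savings in the three tables follow from two simple formulas: governance % is governance lines divided by total lines, and lines saved is the other stack's governance lines minus Corvid's. Here is a minimal Python sketch that recomputes them from the counts printed above. The names `APPS`, `governance_pct`, and `lines_saved` are illustrative and are not part of the `corvid bench` tooling.

```python
# Counts copied from the three tables above: (feature lines, governance lines).
# APPS, governance_pct and lines_saved are hypothetical names for this sketch.
APPS = {
    "rag_qa_bot": {
        "corvid": (15, 15), "python": (45, 35), "typescript": (22, 52),
    },
    "refund_bot": {
        "corvid": (18, 9), "python": (42, 31), "typescript": (25, 43),
    },
    "support_escalation_bot": {
        "corvid": (19, 16), "python": (48, 39), "typescript": (27, 55),
    },
}


def governance_pct(feature: int, governance: int) -> float:
    """Share of the total line count spent on governance, in percent."""
    return round(100 * governance / (feature + governance), 1)


def lines_saved(app: str, baseline: str) -> int:
    """Governance lines the baseline stack needs beyond Corvid's."""
    return APPS[app][baseline][1] - APPS[app]["corvid"][1]


for app, stacks in APPS.items():
    for stack, (feature, governance) in stacks.items():
        pct = governance_pct(feature, governance)
        print(f"{app:24} {stack:11} total={feature + governance:3} governance={pct}%")
    print(f"{app:24} saved vs python: +{lines_saved(app, 'python')}, "
          f"vs typescript: +{lines_saved(app, 'typescript')}")
```

Note that governance % can be higher for Corvid than for Python (50.0% vs 43.8% on rag_qa_bot) even though the absolute governance line count is lower, because Corvid's total is much smaller.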
corvid bench compare python
Provenance preservation
How often a model-derived string makes it into a downstream call without a citation. The counts below report the chains in which the citation survived.
- Chains run: 10
- Provenance preserved by Corvid: 10/10
- Provenance preserved by Python (LangChain + pydantic): 0/10
- Provenance preserved by TypeScript (Vercel AI SDK + zod): 0/10
corvid bench compare python
Replay determinism
Whether re-running a captured trace produces byte-identical output across compilers, runtimes, and hosts.
- Corvid (cargo run -p refund_bot_demo): 190/190 byte-identical pairs (rate = 1.000, N = 20)
- Python (LangChain + LangSmith): bounty-open (no _summary.json)
- TypeScript (Vercel AI SDK + OTEL): bounty-open (no _summary.json)
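The 190 figure is consistent with comparing every unordered pair among N = 20 replays, since 20 choose 2 = 190. The Python sketch below shows that pairwise check under that assumption. It does not reproduce the actual harness: `identical_pair_rate` is a hypothetical helper, and the replay bytes are stand-in data rather than real trace output.

```python
from itertools import combinations


def identical_pair_rate(outputs: list[bytes]) -> tuple[int, int, float]:
    """Compare every unordered pair of replay outputs byte for byte.

    Returns (identical pairs, total pairs, rate).
    """
    if len(outputs) < 2:
        raise ValueError("need at least two replay outputs to form a pair")
    pairs = list(combinations(outputs, 2))
    identical = sum(1 for a, b in pairs if a == b)
    return identical, len(pairs), identical / len(pairs)


# Stand-in data: 20 replays that all produced the same bytes.
runs = [b'{"status":"refund_approved"}\n'] * 20

identical, total, rate = identical_pair_rate(runs)
print(f"{identical}/{total} byte-identical pairs (rate = {rate:.3f}, N = {len(runs)})")
# prints: 190/190 byte-identical pairs (rate = 1.000, N = 20)
```

A single divergent replay out of 20 would break 19 of the 190 pairs, so the pairwise rate drops sharply on any nondeterminism.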
corvid bench compare python
Time to audit
How long it takes a reviewer to confirm a specific safety property holds for a given agent.
Lines of audit-logic code required to answer all 5 representative audit questions against the stack's canonical trace surface (lower is better):
- Corvid (JSONL trace under target/trace/): 65 LOC (all 5 queries correct)
- Python (LangChain + LangSmith): bounty-open (5/5 queries unimplemented)
- TypeScript (Vercel AI SDK + OTEL): bounty-open (5/5 queries unimplemented)
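To give a sense of what "audit-logic code" over a JSONL trace can look like, here is a short Python sketch of one audit question: did any approval-gated tool call run without a prior approval? Everything about the event shape is an assumption made for illustration. The field names `kind`, `call_id`, and `requires_approval` are hypothetical and are not Corvid's actual trace schema, and this is not one of the five benchmark queries.

```python
import json


def unapproved_calls(trace_lines):
    """Return call_ids of gated tool calls that ran with no earlier approval.

    Assumes a hypothetical schema: one JSON object per line, each with a
    "kind" field; approvals and tool calls share a "call_id".
    """
    approved = set()
    violations = []
    for line in trace_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL stream
        event = json.loads(line)
        if event["kind"] == "approval":
            approved.add(event["call_id"])
        elif event["kind"] == "tool_call" and event.get("requires_approval"):
            if event["call_id"] not in approved:
                violations.append(event["call_id"])
    return violations


# Stand-in trace: call c1 is approved before it runs, call c2 is not.
sample_trace = [
    '{"kind": "approval", "call_id": "c1"}',
    '{"kind": "tool_call", "call_id": "c1", "requires_approval": true}',
    '{"kind": "tool_call", "call_id": "c2", "requires_approval": true}',
    '{"kind": "tool_call", "call_id": "c3", "requires_approval": false}',
]

print(unapproved_calls(sample_trace))
```

Because each line is a self-contained JSON object, a query like this is a single streaming pass with no trace-specific SDK, which is the kind of property a lines-of-code count is meant to capture.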
corvid bench compare python