Methodology

How we measure, and what we don't.

What "ratio" means, how a benchmark is reproduced, and the non-promise list of things our numbers explicitly do not claim.

What "ratio" means

Every archive session's ratios.json reports a Corvid-to-Python ratio (and a Corvid-to-TypeScript ratio) for two scenarios: tool_loop (an agent that issues a sequence of tool calls) and retry_workflow (the same agent under a retry-on-error policy). For each scenario, the ratio is the median of multiple runs, with a 95% confidence interval and p50/p90/p99 percentiles disclosed alongside.
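As an illustration of the statistics named above, here is a minimal Python sketch that summarizes a list of per-run ratio samples. It is not the project's runner code: the function names are hypothetical, and the bootstrap used for the 95% confidence interval is an assumption, since the page does not say how the interval is computed.

```python
import random
import statistics


def percentile(sorted_values, q):
    """Linearly interpolated percentile of pre-sorted values, q in [0, 100]."""
    if not sorted_values:
        raise ValueError("no samples")
    pos = (len(sorted_values) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def summarize(ratios, resamples=2000, seed=0):
    """Median, bootstrap 95% CI of the median, and p50/p90/p99.

    Assumption: `ratios` holds one ratio sample per run; the bootstrap
    is a stand-in for whatever interval method the real runners use.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch repeatable
    ordered = sorted(ratios)
    # Resample with replacement and record the median of each resample.
    medians = sorted(
        statistics.median(rng.choices(ratios, k=len(ratios)))
        for _ in range(resamples)
    )
    return {
        "median": statistics.median(ordered),
        "ci95": (percentile(medians, 2.5), percentile(medians, 97.5)),
        "p50": percentile(ordered, 50),
        "p90": percentile(ordered, 90),
        "p99": percentile(ordered, 99),
    }
```

Calling summarize on the samples from one scenario yields the same kinds of fields the paragraph describes; the actual key names in ratios.json may differ.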

The ratio is calculated excluding model latency. Both the Corvid and the Python/TypeScript runners issue the same sequence of prompts to the same provider; we subtract the model's wall-clock latency from each run and compare only the orchestration overhead: the time the runtime spends scheduling, deserializing, type-checking, and shuttling values between calls. End-user wall-clock numbers are also captured in the raw data (raw.jsonl), but the headline ratio is orchestration-only because model latency dominates the user-perceived time and would smear the meaningful comparison.
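The subtraction described above can be sketched in a few lines of Python. The record fields wall_clock_ms and model_latency_ms are hypothetical names, not the documented schema of raw.jsonl, and pairing the i-th Corvid run with the i-th reference run is likewise an assumption made for illustration.

```python
def orchestration_ms(run):
    """Orchestration overhead: wall-clock time minus model latency.

    `wall_clock_ms` and `model_latency_ms` are hypothetical field names.
    """
    overhead = run["wall_clock_ms"] - run["model_latency_ms"]
    if overhead < 0:
        raise ValueError("model latency exceeds wall-clock time")
    return overhead


def paired_ratios(corvid_runs, reference_runs):
    """One Corvid-to-reference ratio per pair of runs (assumed pairing)."""
    if len(corvid_runs) != len(reference_runs):
        raise ValueError("run counts differ")
    ratios = []
    for corvid, reference in zip(corvid_runs, reference_runs):
        reference_overhead = orchestration_ms(reference)
        if reference_overhead <= 0:
            raise ValueError("reference overhead must be positive")
        ratios.append(orchestration_ms(corvid) / reference_overhead)
    return ratios
```

For example, a Corvid run with 2500 ms wall-clock and 2000 ms of model latency against a reference run with 2020 ms wall-clock and the same model latency gives 500 / 20 = 25, even though the two end-to-end times differ by less than a quarter.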

Where the ratio is greater than 1, Corvid is slower on that scenario. Where it is less than 1, Corvid is faster. A ratio of 25 means Corvid takes 25× as long on orchestration as the reference implementation. The launch post's quoted "~25–36× slower than Python LangChain" is the range observed across both scenarios in the canonical 2026-04-16-ratio-session.

How to reproduce a benchmark

The runners are checked into the Corvid-lang repo. To reproduce:

corvid bench compare python runs the Python-comparison runner against your local Corvid install and produces a ratios.json in the same shape as the archive sessions. corvid bench compare typescript does the same for the TypeScript comparison.

The runners are deterministic — model responses are recorded as fixtures so re-runs hit the same prompts in the same order. A drift gate in the Corvid CI fails the build if the recorded fixtures stop matching the runners' expected order, so the comparison stays apples-to-apples across runtime changes.
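To make the idea of a drift gate concrete, here is a rough Python sketch of an order check between recorded fixtures and the prompts a runner expects to send. It is not the check used in the Corvid CI; the function name and the representation of fixtures as plain lists of prompt strings are assumptions.

```python
def find_drift(recorded_prompts, expected_prompts):
    """Return the index of the first mismatch, or None if the orders agree.

    Hypothetical helper: both arguments are lists of prompt strings in
    the order they were recorded / are expected to be issued.
    """
    for index, (recorded, expected) in enumerate(
        zip(recorded_prompts, expected_prompts)
    ):
        if recorded != expected:
            return index
    # Same prefix but different lengths: drift starts where the shorter ends.
    if len(recorded_prompts) != len(expected_prompts):
        return min(len(recorded_prompts), len(expected_prompts))
    return None
```

A CI step built on this would fail the build whenever find_drift returns anything other than None.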

Each moat sub-benchmark under benches/moat/ has its own runner and its own RESULTS.md documenting the exact cases and methodology. The moat numbers (50/50 compile-time rejection, governance line counts, etc.) are independent of orchestration overhead; they're structural measurements of what each language can and cannot reject before runtime.

What we don't measure

The benchmarks intentionally leave several axes unmeasured. We list them here so readers aren't tempted to read in meaning that isn't there:

Model output quality. Corvid runs the same prompts as the reference implementations; the model returns the same text. We don't measure whether the model's answer is right, only what the runtime does with it.
End-user latency including network. Network time to the model API is captured in the raw data but not in the ratio. A 36× orchestration ratio on a workload where the model takes 30 seconds can amount to only a ~3% wall-clock difference end-to-end, because the absolute orchestration time is small next to the model's. Use the raw data when this matters.
Cold-start time. Each scenario is measured after the runtime is warm. Cold-start numbers are a separate axis and are not represented in ratios.json.
Memory / RSS. Memory profiling is outside the scope of the ratio benchmarks. Some archive sessions (e.g. perf-investigation) include memory snapshots in their investigation.json; the headline ratio does not.
Comparison to LangGraph, LlamaIndex, AutoGen, etc. The Python reference is LangChain because it's the closest competitor in shape. We do not claim Corvid is faster or slower than every Python agent framework — only than the one we benchmarked against.
Concurrency under load. All benchmarks run single-threaded against a single agent. Multi-agent throughput is a separate concern; the ratio does not predict it.
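The end-user latency item above turns on simple arithmetic, which the following Python sketch works through. The 25 ms reference overhead in the usage note is an illustrative assumption, not a measured value; the function name is hypothetical.

```python
def wall_clock_difference(model_s, reference_overhead_s, ratio):
    """Relative end-to-end slowdown implied by an orchestration ratio.

    model_s: model latency in seconds (identical for both runtimes).
    reference_overhead_s: reference orchestration overhead in seconds
        (an assumed input, not taken from the archive).
    ratio: Corvid-to-reference orchestration ratio.
    """
    reference_total = model_s + reference_overhead_s
    corvid_total = model_s + reference_overhead_s * ratio
    return (corvid_total - reference_total) / reference_total
```

With a 30-second model call, an assumed 25 ms of reference overhead, and a ratio of 36, the Corvid overhead is 0.9 s and the end-to-end difference is 0.875 / 30.025, or roughly 3%. A different absolute overhead would give a different percentage, which is why the raw data is the thing to consult.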

If you need a measurement we don't surface, the raw inputs are all in benches/; the runners are small and forkable. We'd rather you compute the number than that we publish one we can't defend.