Methodology
v0.1 — published 2026-05-06
What we score
Picked publishes a confidence score per (tool, task) cell — a 0–100 number expressing how strong the evidence is that a given AI coding tool wins a given coding task. Five tasks, nine tools, 45 cells.
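For concreteness, one cell might look like the sketch below. Field names are our own choosing for illustration, not the published schema:

```python
from dataclasses import dataclass

@dataclass
class Cell:
    """One (tool, task) cell in the nine-by-five grid. Illustrative fields only."""
    tool: str                  # one of the nine tools
    task: str                  # one of the five coding tasks
    score: int                 # 0-100: strength of evidence that tool wins task
    band: str                  # "high" | "medium" | "low" (see Confidence bands)
    methodology_version: str   # e.g. "v0.1"; sticky, see Calibration
```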
How the score is computed
Every signal a source emits passes through three stages: statistical confidence (Wilson lower bound for pass rates, Bradley-Terry MLE for rank-orderings), time decay with a model-era step penalty, and a source-trust modifier. Per-cell signals then combine via a Bayesian-flavored weighted average. The same math runs for every cell, in public.
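A compressed sketch of that pipeline for pass-rate signals (the Bradley-Terry fit for rank-orderings is omitted). The z value, half-life, and era penalty below are placeholder constants, not the published ones:

```python
import math

def wilson_lower_bound(successes: int, n: int, z: float = 1.96) -> float:
    """Stage 1: lower bound of the Wilson score interval for a pass rate."""
    if n == 0:
        return 0.0
    p = successes / n
    center = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - margin) / (1 + z * z / n)

def freshness(age_days: float, eras_behind: int,
              half_life_days: float = 180.0, era_penalty: float = 0.5) -> float:
    """Stage 2: smooth exponential decay times a step penalty per model era
    the result lags behind the current generation."""
    return 0.5 ** (age_days / half_life_days) * era_penalty ** eras_behind

def cell_score(signals: list[tuple[float, float, float]]) -> float:
    """Combine: weighted average of per-signal confidences, where each
    weight = freshness * source trust. signals = [(confidence, freshness, trust)]."""
    weights = [f * t for _, f, t in signals]
    if not signals or sum(weights) == 0:
        return 0.0
    return sum(c * w for (c, _, _), w in zip(signals, weights)) / sum(weights)
```

Scaling the result to the published 0-100 number, and the prior behind the "Bayesian-flavored" average, are left out for brevity.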
Confidence bands
Every cell publishes one of three bands (high, medium, or low) based on accumulated evidence weight. A score built on a thin sample or correlated sources gets a lower band even if the point estimate is high. The band is part of the score, not a footnote.
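One way the correlated-source discount and the band cutoffs could work, with made-up thresholds (the real cutoffs are not the ones below):

```python
def effective_weight(trust_weights: list[float], avg_correlation: float) -> float:
    """Design-effect-style discount: correlated sources add less evidence
    than their raw sum. avg_correlation in [0, 1] is an assumption."""
    n = len(trust_weights)
    if n == 0:
        return 0.0
    return sum(trust_weights) / (1 + avg_correlation * (n - 1))

def band(evidence_weight: float) -> str:
    """Illustrative cutoffs only."""
    if evidence_weight >= 2.5:
        return "high"
    if evidence_weight >= 1.0:
        return "medium"
    return "low"
```

Under this sketch, two perfectly correlated 0.8-trust sources carry the evidence of one (0.8, a low band), while the same pair taken as independent carries 1.6 (a medium band).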
Sources
v0.1 ingests from five public data sources. Each has a published trust weight (0.2–1.0) based on independence, reproducibility, and methodological rigor. Vendor self-reported numbers are capped at 0.2 and dropped entirely if a third-party leaderboard already covers the same cell.
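The vendor rules from the paragraph above, sketched with an assumed signal shape:

```python
from dataclasses import dataclass

@dataclass
class SourceSignal:
    cell: tuple[str, str]   # (tool, task)
    trust: float            # published trust weight, 0.2-1.0
    vendor: bool            # True if the number is vendor self-reported

VENDOR_TRUST_CAP = 0.2

def admissible(signals: list[SourceSignal],
               third_party_cells: set[tuple[str, str]]) -> list[SourceSignal]:
    """Cap vendor trust at 0.2; drop vendor signals outright when a
    third-party leaderboard already covers the same cell."""
    kept = []
    for s in signals:
        if s.vendor:
            if s.cell in third_party_cells:
                continue  # third-party coverage exists: vendor number is dropped
            s.trust = min(s.trust, VENDOR_TRUST_CAP)
        kept.append(s)
    return kept
```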
What we don't include
Saturated benchmarks (HumanEval, MBPP). Vendor self-reported numbers beyond the 0.2-capped role above: we read them as a positioning signal (what task type the vendor emphasizes), never as a standalone ranking input. LinkedIn (TOS-blocked). SEO-affiliate listicles.
Calibration
v0.1 trust weights are reasoned, not measured. We will recalibrate them at week 8, once we have real cell-vs-cell variance data. Recalibration deltas are logged to the changelog. Methodology versions are sticky per cell: a cell scored under v0.1 keeps that citation even after v0.2 ships.
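Sticky versioning, in terms of the Cell sketch above: recalibrating weights never rewrites an existing cell's citation; only a fresh scoring pass does. The version handling here is an assumption, not the published mechanism:

```python
CURRENT_METHODOLOGY = "v0.1"  # bumped to "v0.2" when the new pipeline ships

def rescore(cell: "Cell", new_score: int, new_band: str) -> "Cell":
    """A cell's methodology_version changes only when the cell itself is
    rescored under the current pipeline, never as a side effect of a
    trust-weight recalibration."""
    cell.score = new_score
    cell.band = new_band
    cell.methodology_version = CURRENT_METHODOLOGY
    return cell
```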