Methodology

How the forecast is built, how accurate it is, and what it can't do.

How it works

Step 1 — Elo ratings. Every team starts from a seeded rating drawn from eloratings.net (January 2018), then the full match history from 2018 through the most recent international break is replayed match by match. Each result updates both teams' ratings with two modifications: a margin-of-victory multiplier (larger wins move ratings more) and a 60-point home-advantage adjustment. K factors vary by match type (20 for friendlies, 30 for qualifiers, 50 for major tournaments, 60 for the World Cup), so competitive results move ratings more than friendlies do.
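The update in Step 1 can be sketched roughly as follows. The K factors and the 60-point home advantage come from the description above. The shape of the margin-of-victory multiplier is an assumption (it follows the scheme published by eloratings.net, and the exact form used here is not stated), and the `neutral` flag is a hypothetical parameter added for illustration.

```python
# K factors by match type and home advantage, as listed in the text above.
K_FACTORS = {"friendly": 20, "qualifier": 30, "major_tournament": 50, "world_cup": 60}
HOME_ADVANTAGE = 60  # rating points


def mov_multiplier(goal_diff: int) -> float:
    """ASSUMPTION: eloratings.net-style multiplier; the site's exact form is not stated."""
    n = abs(goal_diff)
    if n <= 1:
        return 1.0
    if n == 2:
        return 1.5
    return 1.75 + (n - 3) / 8


def update_elo(home, away, home_goals, away_goals, match_type, neutral=False):
    """Return the new (home, away) ratings after one match."""
    # The home bonus only shifts the expected result; it is not stored in the rating.
    bonus = 0 if neutral else HOME_ADVANTAGE
    expected_home = 1 / (1 + 10 ** (-(home + bonus - away) / 400))
    # Actual result from the home side's point of view: win 1, draw 0.5, loss 0.
    if home_goals > away_goals:
        actual_home = 1.0
    elif home_goals == away_goals:
        actual_home = 0.5
    else:
        actual_home = 0.0
    k = K_FACTORS[match_type] * mov_multiplier(home_goals - away_goals)
    delta = k * (actual_home - expected_home)
    # Zero-sum update: what one side gains, the other loses.
    return home + delta, away - delta
```

Under these assumptions, two 1800-rated teams meeting at a neutral World Cup venue with a 2–0 result give update_elo(1800, 1800, 2, 0, "world_cup", neutral=True) = (1845.0, 1755.0): the expected score is 0.5, the effective K is 60 × 1.5 = 90, and the winner gains 45 points.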

Step 2 — Dixon-Coles goal model. Elo ratings are converted into per-match expected-goal parameters using a mapping fit on recent international history. A 2D Poisson probability grid over goal counts (0–8 each side) gives the probability of every scoreline up to 8–8; summing the relevant cells yields P(home win), P(draw), and P(away win). The model uses the Dixon-Coles (1997) low-score correction: probabilities in the 0–0, 1–0, 0–1, and 1–1 cells are slightly adjusted via a single parameter ρ to correct the independent-Poisson model's mild under-prediction of low-scoring draws. ρ was estimated by maximum likelihood on 5,882 competitive matches before June 2024.
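A minimal sketch of the scoreline grid in Step 2, using the standard Dixon-Coles adjustment factor for the four low-score cells. The expected-goal rates are taken as inputs because the Elo-to-goals mapping is not specified here. Renormalising the truncated 0–8 grid so it sums to 1 is an assumption, and the function names are illustrative.

```python
import math


def poisson_pmf(k: int, rate: float) -> float:
    """Probability of exactly k goals under a Poisson distribution."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def dc_tau(x: int, y: int, lam: float, mu: float, rho: float) -> float:
    """Dixon-Coles (1997) adjustment; x = home goals, y = away goals."""
    if x == 0 and y == 0:
        return 1 - lam * mu * rho
    if x == 0 and y == 1:
        return 1 + lam * rho
    if x == 1 and y == 0:
        return 1 + mu * rho
    if x == 1 and y == 1:
        return 1 - rho
    return 1.0


def scoreline_grid(lam: float, mu: float, rho: float, max_goals: int = 8):
    """grid[x][y] = P(home scores x, away scores y), for 0..max_goals each side."""
    grid = [
        [poisson_pmf(x, lam) * poisson_pmf(y, mu) * dc_tau(x, y, lam, mu, rho)
         for y in range(max_goals + 1)]
        for x in range(max_goals + 1)
    ]
    # ASSUMPTION: renormalise so the truncated grid sums to exactly 1.
    total = sum(map(sum, grid))
    return [[p / total for p in row] for row in grid]


def outcome_probs(grid):
    """Sum cells below, on, and above the diagonal: (home win, draw, away win)."""
    n = len(grid)
    home = sum(grid[x][y] for x in range(n) for y in range(n) if x > y)
    draw = sum(grid[x][x] for x in range(n))
    away = sum(grid[x][y] for x in range(n) for y in range(n) if x < y)
    return home, draw, away
```

With a negative ρ, the 0–0 and 1–1 cells gain probability and the 1–0 and 0–1 cells lose it, so outcome_probs(scoreline_grid(1.4, 1.4, -0.1)) returns a higher draw probability than the same call with ρ = 0. The rates and ρ in that example are made-up values, not fitted ones.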

Step 3 — Monte Carlo simulation. The entire tournament bracket is simulated 10,000 times. Each run plays out all 104 matches from the group stage through the final, sampling from the Dixon-Coles distributions and applying FIFA's actual tiebreaker rules and the Annex C third-place bracket table. The result is a per-team probability distribution over every round — P(advance from group), P(reach R16), P(reach QF), P(SF), P(Final), P(Champion). Snapshots are saved after each update so probability movements are visible over time.
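The counting logic behind Step 3 can be illustrated on a deliberately tiny bracket. This sketch is a simplification under stated assumptions: it plays a single-elimination bracket only (no group stage, no FIFA tiebreakers, no Annex C table), it uses a plain Elo logistic win probability in place of sampling scorelines from the Dixon-Coles distributions, and the team names and ratings are hypothetical.

```python
import random
from collections import Counter


def win_prob(elo_a: float, elo_b: float) -> float:
    """ASSUMPTION: plain Elo logistic stands in for the goal model's win probability."""
    return 1 / (1 + 10 ** (-(elo_a - elo_b) / 400))


def play_knockout(teams, ratings, rng):
    """Play one single-elimination bracket; teams are paired in list order.

    The number of teams must be a power of two.
    """
    alive = list(teams)
    while len(alive) > 1:
        next_round = []
        for a, b in zip(alive[::2], alive[1::2]):
            winner = a if rng.random() < win_prob(ratings[a], ratings[b]) else b
            next_round.append(winner)
        alive = next_round
    return alive[0]


def simulate(teams, ratings, n_runs=10_000, seed=0):
    """Estimate P(Champion) per team as the share of runs each team wins."""
    rng = random.Random(seed)
    titles = Counter(play_knockout(teams, ratings, rng) for _ in range(n_runs))
    return {team: titles[team] / n_runs for team in teams}


# Hypothetical four-team bracket for illustration.
ratings = {"A": 2000, "B": 1700, "C": 1650, "D": 1600}
champion_probs = simulate(["A", "B", "C", "D"], ratings)
```

The same tallying extends to every round: record how far each team gets in each run, then divide the counts by the number of runs to obtain P(reach QF), P(SF), and so on.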

Step 4 — Three-way fusion and AI commentary. For every fixture the model probability sits alongside two market-based estimates: sportsbook odds (vig stripped by proportional normalisation across 37–44 books, sourced from The Odds API) and Polymarket prediction-market prices (Polymarket Gamma API, per-match markets posted close to kickoff). Where all three agree, confidence is higher; where they diverge, the gap is a signal worth examining. A divergence threshold of 15 percentage points flags the most notable gaps, categorised as model-over-concentrated, model-under-concentrated, or disagree-on-favourite. Claude writes a match preview for every fixture and a short divergence note for flagged matches — see below.
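As a rough illustration of Step 4, the sketch below strips the vig from one book's decimal odds by proportional normalisation and applies the 15-point threshold. How the 37–44 books are combined is not described above, so only a single book is shown. The category rules in `classify_divergence` are an assumption: they are one plausible reading of the three labels, not a confirmed definition, and both function names are illustrative.

```python
def strip_vig(decimal_odds):
    """Proportional normalisation: scale raw implied probabilities to sum to 1."""
    implied = [1 / o for o in decimal_odds]
    overround = sum(implied)  # exceeds 1 by the bookmaker's margin
    return [p / overround for p in implied]


def classify_divergence(model, market, threshold=0.15):
    """Return a category label, or None if the largest gap is under the threshold.

    Inputs are (home, draw, away) probabilities.
    ASSUMPTION: the category rules below are a guess at what the labels mean.
    """
    gap = max(abs(m - k) for m, k in zip(model, market))
    if gap < threshold:
        return None
    model_fav = max(range(len(model)), key=model.__getitem__)
    market_fav = max(range(len(market)), key=market.__getitem__)
    if model_fav != market_fav:
        return "disagree-on-favourite"
    if model[model_fav] > market[market_fav]:
        return "model-over-concentrated"
    return "model-under-concentrated"


# Hypothetical decimal odds for home win, draw, away win.
market = strip_vig([2.0, 3.5, 4.0])
label = classify_divergence([0.65, 0.20, 0.15], market)
```

Here the raw implied probabilities sum to about 1.036, so the stripped market estimate is roughly (0.483, 0.276, 0.241). A model estimate of 0.65 for the home win sits about 17 points above the market, which crosses the threshold and is labelled model-over-concentrated under the assumed rules.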

How accurate is it

53.0% Match accuracy · random ≈ 37%
0.583 Brier score · lower is better · random ≈ 0.667
0.978 Log loss · lower is better · random ≈ 1.099

Backtest on Euro 2024 and Copa América 2024 — n = 83 matches the model never saw during training. See the calibration page for the full reliability diagram.
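For readers who want to check the random baselines for Brier score and log loss, here is a short sketch. It assumes the multi-class Brier score (squared error summed over the three outcomes) and natural-log loss. Those definitions reproduce the quoted baselines of 0.667 and 1.099 for a uniform one-third forecast, which suggests, but does not confirm, that the same forms are used here.

```python
import math


def brier_score(forecasts, outcomes):
    """Mean over matches of the squared error summed across the three outcomes."""
    total = 0.0
    for probs, actual in zip(forecasts, outcomes):
        total += sum((p - (1.0 if i == actual else 0.0)) ** 2
                     for i, p in enumerate(probs))
    return total / len(forecasts)


def log_loss(forecasts, outcomes):
    """Mean negative natural log of the probability given to the actual outcome."""
    logs = [math.log(probs[actual]) for probs, actual in zip(forecasts, outcomes)]
    return -sum(logs) / len(logs)


# A forecaster that always says 1/3 home win, 1/3 draw, 1/3 away win.
uniform = [[1 / 3, 1 / 3, 1 / 3]] * 3
results = [0, 1, 2]  # index of the actual outcome: home win, draw, away win

print(round(brier_score(uniform, results), 3))  # 0.667
print(round(log_loss(uniform, results), 3))     # 1.099
```

The uniform Brier score is (2/3)² + 2 × (1/3)² = 6/9, and the uniform log loss is ln 3. A perfect forecaster would score 0 on both.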

Euro 2024 alone — the closest analog for World Cup group-stage matches, where many teams have similar Elo ratings — produces 49.0% match accuracy. Good international-football models top out around 55–60% in ideal conditions; the World Cup group stage is not ideal conditions. The value here is calibrated probabilities and tournament simulation, not nailing every individual game.

Known limitations

The AI commentary experiment

Every match page includes a three-paragraph preview written by Claude (Anthropic's AI), generated from the numerical inputs — the model probabilities, sportsbook odds, and Polymarket prices. The 14 most statistically divergent fixtures also carry a short commentary note explaining the shape of the gap between sources.

This is explicitly an experiment. The goal is to see whether AI-written synthesis adds genuine value — distilling what the numbers collectively imply about a match — or whether it is decorative. The forecasts themselves are not AI-generated; the probabilities come entirely from Elo ratings, Dixon-Coles, and Monte Carlo simulation. AI writes prose, not predictions.

Known limitation: the commentary can produce imprecise comparatives and occasionally uses language that isn't grounded in the data inputs. It is not a substitute for reading the actual probability numbers. Judge for yourself whether the prose adds anything.

Data & sources