3 Months In: The Model Converged — Here's What the Data Actually Shows
Back in January, I published our 16-day honest report. At the time, we were 63.8% accurate against a 66.2% backtest — a gap I attributed to small-sample noise and the expected friction between historical training data and real-world conditions. Since then, we made one meaningful model adjustment and accumulated 587 live predictions on the updated version. The calibration picture has changed — and not in the direction you'd expect.
The Headline Number
Here's where we stand on the current model version: 396 of 587 predictions correct, for 67.5% live accuracy against a 67.25% backtest.
A note on the numbers: all figures here reflect only the current model version. The original model ran earlier in the season; mixing both would obscure rather than illuminate performance. These 587 predictions are a clean read on what the current system actually does.
Live accuracy exceeding the backtest by 0.25 points is unusual — models almost always give some ground when moving from historical data to real-time conditions. It's a small margin, well within statistical noise, but the direction matters. The model isn't degrading. It's holding.
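How small is that margin? A rough binomial sanity check (treating the 587 predictions as independent, which is an approximation) puts the standard error of the live accuracy at

$$
\mathrm{SE} \approx \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.675 \times 0.325}{587}} \approx 0.019,
$$

or about 1.9 percentage points, so a 0.25-point gap between live and backtest accuracy sits comfortably inside sampling noise.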
We didn't diverge. We improved.
The Calibration Picture Has Changed - Again
After applying Platt scaling, the calibration profile tightened significantly — but also became more honest about where the model still struggles.
| Confidence Tier | Predictions | Correct | Actual % | Model Said | Gap (Model − Actual) |
|---|---|---|---|---|---|
| High (74%+) | 172 | 146 | 84.9% | 78.9% | −6.0 pp |
| Medium (66-74%) | 96 | 69 | 71.9% | 70.1% | −1.8 pp |
| Low (55-66%) | 213 | 134 | 62.9% | 60.2% | −2.7 pp |
| Very Low (<55%) | 106 | 47 | 44.3% | 52.4% | +8.1 pp |
Three tiers show the model being underconfident — actual outcomes are consistently better than predicted probabilities. One tier (Very Low Certainty) shows overconfidence, where the model assigns ~52% probability but games resolve at 44%.
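If you track your own predictions and want to build the same kind of breakdown, here is a minimal sketch. The file name, column names (`model_prob`, `won`) and the exact tier edges are assumptions for illustration; they mirror the table above rather than describe our internal pipeline.

```python
import pandas as pd

# Hypothetical prediction log: one row per game, with the model's stated
# win probability and whether the pick actually won (1) or lost (0).
df = pd.read_csv("predictions.csv")  # assumed columns: model_prob, won

# Tier edges mirroring the table above: <55%, 55-66%, 66-74%, 74%+.
bins = [0.0, 0.55, 0.66, 0.74, 1.01]
labels = ["Very Low (<55%)", "Low (55-66%)", "Medium (66-74%)", "High (74%+)"]
df["tier"] = pd.cut(df["model_prob"], bins=bins, labels=labels, right=False)

summary = (
    df.groupby("tier", observed=True)
      .agg(predictions=("won", "size"),
           correct=("won", "sum"),
           actual=("won", "mean"),
           model_said=("model_prob", "mean"))
      .assign(gap_pp=lambda t: 100 * (t["model_said"] - t["actual"]))
)

# Negative gap = underconfident (outcomes beat the stated probability),
# positive gap = overconfident.
print(summary.round(3))
```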
What This Means in Practice
High Certainty Tier: Strong Signal, Still Conservative
At nearly 85% realized accuracy, the model’s highest-confidence predictions remain its strongest asset. The −6pp gap shows consistent underconfidence — the model is still slightly “holding back” on its best edges.
Medium & Low Certainty Tiers: Calibration Success
This is where Platt scaling did its job.
Both tiers are now within ~2pp of perfect calibration — exactly what you want from a probabilistic model. These predictions are now trustworthy in a literal sense: when the model says 70%, it means ~70%.
Very Low Certainty: The Only Real Problem Left
This is now clearly isolated.
When the model outputs sub-55% probabilities, outcomes land at 44%, not 52%. That’s not noise — that’s structural overconfidence in marginal edges.
Importantly, this is no longer a system-wide issue. It’s concentrated in a single tier — which makes it fixable.
Why Calibration Matters More Than Raw Accuracy
Most prediction platforms show you one number. We show you the full breakdown: the reliability diagram, ROC curve, and per-tier performance are all live on our Calibration Dashboard. How accuracy is distributed across confidence levels changes everything about how you should act on a prediction.
Consider two hypothetical services, both at 67% overall accuracy:
Service A: Accuracy matches stated confidence in every tier
Every confidence tier lands close to its predicted probability. When they say 70%, games win roughly 70% of the time.
Service B: 90% accurate on easy favorites, 45% on everything else
They count all of it the same way. The headline number looks similar, but the underlying signal is worthless for any real decision.
Raw accuracy is a vanity metric. Calibration is what tells you whether you can actually act on the numbers.
The table above is our version of Service A. You can see exactly which tiers are performing, by how much, and in which direction.
What Changed Between January and Now
In the 16-day report, high-confidence tiers were underperforming their predicted probabilities. The model was directionally correct, but miscalibrated — especially at the extremes.
Today, that pattern has reversed.
Not because the model got “smarter” about basketball — but because we changed how its probabilities are interpreted.
From Raw Scores to Calibrated Probabilities
The key change we introduced was Platt scaling — a standard probabilistic calibration technique.
Instead of using raw model outputs directly as probabilities, we now pass them through a logistic transformation trained on observed outcomes. In simple terms:
What Platt scaling does
- Takes raw model confidence scores
- Learns how those scores map to real-world win frequencies
- Outputs probabilities that better reflect actual outcomes
It doesn’t change predictions. It changes how confident we say we are.
This is an important distinction.
We did not change what the model predicts. We changed how those predictions are expressed probabilistically.
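For readers who want to see what this looks like mechanically, here is a minimal sketch of a Platt-style calibration step. The variable names, the logit-space formulation, and the use of scikit-learn are illustrative assumptions, not a description of our exact production code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(raw_probs, outcomes):
    """Fit a sigmoid mapping raw win probabilities to calibrated ones.

    raw_probs -- uncalibrated probabilities from the model (held-out set)
    outcomes  -- 0/1 results for the same games
    """
    eps = 1e-6
    p = np.clip(raw_probs, eps, 1 - eps)
    z = np.log(p / (1 - p)).reshape(-1, 1)   # work in logit space
    lr = LogisticRegression(C=1e6)           # effectively unregularized 1-D fit
    lr.fit(z, outcomes)
    return lr

def apply_platt(lr, raw_probs):
    eps = 1e-6
    p = np.clip(raw_probs, eps, 1 - eps)
    z = np.log(p / (1 - p)).reshape(-1, 1)
    return lr.predict_proba(z)[:, 1]

# Usage sketch: fit on a held-out calibration set, then transform new outputs.
# platt = fit_platt(calib_probs, calib_results)
# calibrated = apply_platt(platt, new_raw_probs)
```

Because the fitted map is monotonic, the ranking of games is untouched: a game the raw model preferred over another is still preferred after scaling. Only the stated probability moves, which is exactly the distinction above.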
Why the Old Model Looked Overconfident
Before calibration, the model systematically misstated its confidence — especially in high- and low-probability games.
When it said:
- 75–80% → outcomes were closer to mid-60s
- 50–55% → outcomes were closer to high-60s
This is a classic pattern in machine learning models:
Raw models are not naturally calibrated
Tree-based models (like XGBoost or Random Forests) are optimized for ranking and classification accuracy — not probability accuracy.
They learn who is more likely to win, not how likely they are to win in absolute terms.
That’s why calibration is a separate step — not a built-in guarantee.
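You can see this for yourself on synthetic data. The toy sketch below (not our model, just an illustration) compares a raw random forest with the same forest wrapped in scikit-learn's sigmoid calibration, scoring both with the Brier score, which penalizes badly calibrated probabilities:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import brier_score_loss

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

raw = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# method="sigmoid" applies Platt scaling inside a cross-validation loop.
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=200, random_state=0),
    method="sigmoid", cv=5,
).fit(X_train, y_train)

for name, clf in [("raw forest", raw), ("platt-calibrated", calibrated)]:
    probs = clf.predict_proba(X_test)[:, 1]
    print(f"{name}: Brier score = {brier_score_loss(y_test, probs):.4f}")
```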
What Platt Scaling Actually Changed
After applying Platt scaling, the probability distribution didn't compress; it reshaped, amplifying the separation between predictions.
This is a crucial distinction.
1. Probability Expansion, Not Compression
Instead of pulling probabilities toward the center, the scaling function pushed confident predictions further away from 50%.
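For intuition, write the transform in logit space (an illustrative formulation; $a$ and $b$ stand in for the fitted coefficients):

$$
p_{\text{cal}} = \sigma\!\bigl(a \cdot \operatorname{logit}(p_{\text{raw}}) + b\bigr), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}.
$$

A fitted slope $a > 1$ pushes probabilities further from 50% (expansion); $a < 1$ would pull them toward 50% (compression). Which one you get is decided entirely by the observed outcomes the fit is trained on.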
Real examples from the model:
- 83% → 92%
- 74% → 89%
- 66% → 81%
- 56% → 68%
- 54% → 65% or ~51% (depending on signal strength)
What this means
The model didn’t become more conservative — it became more decisive.
Stronger signals were amplified. Weaker signals were either slightly elevated or pushed back toward coin-flip territory.
2. Tier Migration Upward
This reshaping caused a structural shift in how predictions are distributed across tiers.
Examples from real outputs:
- Medium → High (72% → 88%)
- Low → High (63% → 76–79%)
- Very Low → either Low or stays Very Low depending on signal
What this means
The High Certainty tier is now:
- Larger
- More selective
- Composed of genuinely strong signals
At the same time, weaker edges are no longer artificially clustered in the middle.
3. Separation of Signal vs Noise
Before scaling:
- Many predictions sat in the 55–70% range
- Different-quality signals were numerically similar
After scaling:
- Strong edges → clearly high probabilities (80%+)
- Weak edges → remain near 50–55%
What this means
The model now better distinguishes:
- “This should win” vs
- “This might win”
Instead of flattening everything into the middle.
4. Calibration Errors Became Visible — Not Smoothed
This is the most important effect.
Before:
- Calibration errors were inconsistent across tiers
- Some tiers underconfident, others overconfident
Now:
- Mid tiers → tightly calibrated
- High tier → consistently underconfident
- Very Low tier → consistently overconfident
This is not a side effect of scaling.
It’s what happens when you:
- Apply a global monotonic transformation
- And remove noise from the middle
You expose the true structure of model error.
The Calibration Shift Explained
The key change isn’t just numerical — it’s structural.
Before calibration, probabilities were compressed and noisy:
| Tier (Model Said) | Behavior | Gap (Model − Actual) |
|---|---|---|
| 70–80% | Overconfident | +11 pp |
| 60–70% | Overconfident | +6 pp |
| 55–60% | Underconfident | −5 pp |
| 50–55% | Severely underconfident | −16 pp |
This created a distorted picture:
- Strong and weak signals were mixed together
- Mid-range probabilities were overloaded
- Calibration errors cancelled each other out
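To make "cancelled each other out" concrete with illustrative numbers (the tier weights below are assumed for the example, not the actual historical mix):

$$
\text{overall gap} \approx 0.30(+11) + 0.30(+6) + 0.20(-5) + 0.20(-16) = +0.9 \text{ pp}.
$$

Four tiers that are individually off by 5 to 16 points can still produce a headline gap of under one point, which is why a single accuracy number told us so little.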
After Platt scaling:
Strong signals were separated
Games that were already likely winners moved clearly into the 80–90%+ range.
These now form a cleaner, more reliable High Certainty tier.
Mid-range noise was reduced
The overloaded 55–70% region was redistributed:
- Some games upgraded (true edges)
- Some downgraded (false edges)
This is why Medium and Low tiers are now well calibrated.
Weak edges were exposed
Games near coin-flip didn’t disappear.
They became clearly identifiable:
- Remaining in the <55% range
- And consistently underperforming
This is where the model still struggles — but now it’s visible.
What This Means for the Model
Platt scaling didn’t just “fix calibration.”
It changed how the model expresses confidence — and made its structure easier to interpret.
What improved
- Strong predictions are now clearly separated from marginal ones
- Mid-tier probabilities are now statistically reliable
- High-confidence tier contains cleaner, higher-quality signals
- Probability outputs are more actionable for decision-making
What remains
- Very Low tier shows systematic overconfidence (~+8pp gap)
- High tier is still slightly conservative relative to outcomes
- Global scaling introduces trade-offs at distribution extremes
Why This Matters More Than Earlier Model Adjustments
Earlier in the season, we adjusted model inputs (like home-court advantage). That improved prediction quality.
Platt scaling operates on a different level:
- Model adjustments → improve what the model learns
- Platt scaling → improves how confidence is expressed
But in practice, this change is just as important: a probability you can't trust is a probability you can't act on, no matter how good the underlying pick is.
The Honest Summary at 3 Months
What's working
- Live accuracy exceeds backtest — 67.5% vs 67.25% is validation, not drift
- Calibration is now correct where it matters most — mid tiers are near-perfect
- High Certainty delivers elite results — 84.9% on 172 predictions
- Probabilities are now interpretable — not just directional
- Transparency improved — errors are visible, not hidden
What needs work
- Very Low tier overconfidence — +8pp gap is now the primary issue
- High tier underconfidence — model still slightly conservative at the top
What We're Doing About It
Platt scaling solved the broad calibration problem. Now we can focus on the specific remaining weakness.
Short-Term:
- Analyze composition of Very Low tier
- Identify structural drivers:
  - Back-to-backs
  - Travel fatigue
  - Specific team profiles
- Evaluate whether these games should be filtered rather than fixed
Medium-Term:
- Test non-linear calibration methods such as isotonic regression (a rough sketch follows this roadmap)
- Explore tier-specific calibration instead of global scaling
- Revisit whether sub-55% predictions should be surfaced at all
Long-Term:
- Achieve full reliability alignment across all probability bands
- Build calibration-aware decision frameworks (not just predictions)
- Extend calibrated modeling approach to other sports
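For the isotonic option flagged in the medium-term roadmap above, here is a rough sketch of what swapping the calibration step would look like. The synthetic data and variable names are assumptions; the point is the shape of the tool, not a benchmark:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for a held-out calibration set: raw probabilities and
# simulated 0/1 outcomes (illustrative only).
raw_probs = rng.uniform(0.40, 0.90, size=2000)
outcomes = (rng.uniform(size=2000) < np.clip(raw_probs - 0.05, 0, 1)).astype(int)

# Isotonic regression learns a monotone, piecewise-constant map from raw
# probability to observed frequency, so it can bend differently in different
# regions (for example, only below 55%), where a single global sigmoid cannot.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(raw_probs, outcomes)

print(iso.predict(np.array([0.50, 0.60, 0.75, 0.85])).round(3))
```

The usual caveat applies: isotonic regression needs considerably more calibration data than a two-parameter sigmoid before it stops overfitting, which is a real constraint when each tier only sees one or two hundred games a season.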
Why This Approach Wins
The sports prediction industry is filled with snake oil. Unrealistic promises. Hidden track records. Inflated claims designed to sell subscriptions, not provide value.
After 3 months, the data validates a different path:
Radical Transparency:
Show everything, even when it's uncomfortable
Honest Expectations:
67.5% live is real. 80% is not.
Continuous Improvement:
695 live predictions tracked to date are the foundation for real calibration work
User Education:
Teach you to evaluate predictions, not just consume them
Credibility compounds over time. It can't be bought — only earned through consistent, transparent reporting.
The Bottom Line
After 3 months and 695 live predictions (587 on the current model version):
- 67.5% live accuracy — Exceeds backtest, validated on 587 current-model predictions
- Full calibration breakdown — Four tiers, all tracked, all public on Calibration Dashboard
- High Certainty tier delivers — 84.9% on 172 predictions
- Honest about weaknesses — VLC gap documented and being investigated
- No false promises — We still won't claim 80% to sell subscriptions
If you want inflated accuracy claims and black-box predictions, there are plenty of alternatives.
If you want honest performance, transparent methodology, and genuine continuous improvement built on real data — you're in the right place.
This is why we show every prediction.
Join the Journey
Preview Tier
See 1-2 predictions daily
Free
Core Tier
Access all predictions with confidence levels
€9.90/month
Insight Tier
Full methodology, advanced analytics, calibration data
€24.90/month
Every prediction is timestamped. Every result is tracked. Every performance metric is public.
Because in a world of 80% claims, 67.5% honesty is the competitive advantage.