3 Months In: The Model Converged — Here's What the Data Actually Shows
Back in January, I published our 16-day honest report. At the time, we were 63.8% accurate against a 66.2% backtest — a gap I attributed to small-sample noise and the expected friction between historical training data and real-world conditions. Since then, we made one meaningful model adjustment and accumulated 587 live predictions on the updated version. The calibration picture has changed — and not in the direction you'd expect.
The Headline Number
Here's where we stand on the current model version: 396 of 587 predictions correct, for 67.5% live accuracy against a 67.25% backtest.
A note on the numbers: all figures here reflect only the current model version. The original model ran earlier in the season; mixing both would obscure rather than illuminate performance. These 587 predictions are a clean read on what the current system actually does.
Live accuracy exceeding the backtest by 0.25 points is unusual — models almost always give some ground when moving from historical data to real-time conditions. It's a small margin, well within statistical noise, but the direction matters. The model isn't degrading. It's holding.
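How small is that margin? A rough binomial sanity check (treating the 587 predictions as independent, which is an approximation) puts the standard error of the live accuracy at

$$
\mathrm{SE} \approx \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.675 \times 0.325}{587}} \approx 0.019,
$$

or about 1.9 percentage points, so a 0.25-point gap between live and backtest accuracy sits comfortably inside sampling noise.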
We didn't diverge. We improved.
The Calibration Picture Has Changed - Again
After applying Platt scaling, the calibration profile tightened significantly — but also became more honest about where the model still struggles.
| Confidence Tier | Predictions | Correct | Actual % | Model Said | Gap (Model − Actual) |
|---|---|---|---|---|---|
| High (74%+) | 172 | 146 | 84.9% | 78.9% | −6.0 pp |
| Medium (66-74%) | 96 | 69 | 71.9% | 70.1% | −1.8 pp |
| Low (55-66%) | 213 | 134 | 62.9% | 60.2% | −2.7 pp |
| Very Low (<55%) | 106 | 47 | 44.3% | 52.4% | +8.1 pp |
Three tiers show the model being underconfident — actual outcomes are consistently better than predicted probabilities. One tier (Very Low Certainty) shows overconfidence, where the model assigns ~52% probability but games resolve at 44%.
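If you track your own predictions and want to build the same kind of breakdown, here is a minimal sketch. The file name, column names (`model_prob`, `won`) and the exact tier edges are assumptions for illustration; they mirror the table above rather than describe our internal pipeline.

```python
import pandas as pd

# Hypothetical prediction log: one row per game, with the model's stated
# win probability and whether the pick actually won (1) or lost (0).
df = pd.read_csv("predictions.csv")  # assumed columns: model_prob, won

# Tier edges mirroring the table above: <55%, 55-66%, 66-74%, 74%+.
bins = [0.0, 0.55, 0.66, 0.74, 1.01]
labels = ["Very Low (<55%)", "Low (55-66%)", "Medium (66-74%)", "High (74%+)"]
df["tier"] = pd.cut(df["model_prob"], bins=bins, labels=labels, right=False)

summary = (
    df.groupby("tier", observed=True)
      .agg(predictions=("won", "size"),
           correct=("won", "sum"),
           actual=("won", "mean"),
           model_said=("model_prob", "mean"))
      .assign(gap_pp=lambda t: 100 * (t["model_said"] - t["actual"]))
)

# Negative gap = underconfident (outcomes beat the stated probability),
# positive gap = overconfident.
print(summary.round(3))
```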
What This Means in Practice
High Certainty Tier: Strong Signal, Still Conservative
At nearly 85% realized accuracy, the model’s highest-confidence predictions remain its strongest asset. The −6pp gap shows consistent underconfidence — the model is still slightly “holding back” on its best edges.
Medium & Low Certainty Tiers: Calibration Success
This is where Platt scaling did its job.
Both tiers are now within ~2pp of perfect calibration — exactly what you want from a probabilistic model. These predictions are now trustworthy in a literal sense: when the model says 70%, it means ~70%.
Very Low Certainty: The Only Real Problem Left
This is now clearly isolated.
When the model outputs sub-55% probabilities, outcomes land at 44%, not 52%. That’s not noise — that’s structural overconfidence in marginal edges.
Importantly, this is no longer a system-wide issue. It’s concentrated in a single tier — which makes it fixable.
Why Calibration Matters More Than Raw Accuracy
Most prediction platforms show you one number. We show you the full breakdown: the reliability diagram, ROC curve, and per-tier performance are all live on our Calibration Dashboard. How accuracy is distributed across confidence levels changes everything about how you should act on a prediction.
Consider two hypothetical services, both at 67% overall accuracy:
Service A: Accuracy matches stated confidence in every tier
Every confidence tier lands close to its predicted probability. When they say 70%, games win roughly 70% of the time.
Service B: 90% accurate on easy favorites, 45% on everything else
They count all of it the same way. The headline number looks similar, but the underlying signal is worthless for any real decision.
Raw accuracy is a vanity metric. Calibration is what tells you whether you can actually act on the numbers.
The table above is our version of Service A. You can see exactly which tiers are performing, by how much, and in which direction.
What Changed Between January and Now
In the 16-day report, high-confidence tiers were underperforming their predicted probabilities. The model was directionally correct, but miscalibrated — especially at the extremes.
Today, that pattern has reversed.
Not because the model got “smarter” about basketball — but because we changed how its probabilities are interpreted.
From Raw Scores to Calibrated Probabilities
The key change we introduced was Platt scaling — a standard probabilistic calibration technique.
Instead of using raw model outputs directly as probabilities, we now pass them through a logistic transformation trained on observed outcomes. In simple terms:
What Platt scaling does
- Takes raw model confidence scores
- Learns how those scores map to real-world win frequencies
- Outputs probabilities that better reflect actual outcomes
It doesn’t change predictions. It changes how confident we say we are.
This is an important distinction.
We did not change what the model predicts. We changed how those predictions are expressed probabilistically.
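For readers who want to see what this looks like mechanically, here is a minimal sketch of a Platt-style calibration step. The variable names, the logit-space formulation, and the use of scikit-learn are illustrative assumptions, not a description of our exact production code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(raw_probs, outcomes):
    """Fit a sigmoid mapping raw win probabilities to calibrated ones.

    raw_probs -- uncalibrated probabilities from the model (held-out set)
    outcomes  -- 0/1 results for the same games
    """
    eps = 1e-6
    p = np.clip(raw_probs, eps, 1 - eps)
    z = np.log(p / (1 - p)).reshape(-1, 1)   # work in logit space
    lr = LogisticRegression(C=1e6)           # effectively unregularized 1-D fit
    lr.fit(z, outcomes)
    return lr

def apply_platt(lr, raw_probs):
    eps = 1e-6
    p = np.clip(raw_probs, eps, 1 - eps)
    z = np.log(p / (1 - p)).reshape(-1, 1)
    return lr.predict_proba(z)[:, 1]

# Usage sketch: fit on a held-out calibration set, then transform new outputs.
# platt = fit_platt(calib_probs, calib_results)
# calibrated = apply_platt(platt, new_raw_probs)
```

Because the fitted map is monotonic, the ranking of games is untouched: a game the raw model preferred over another is still preferred after scaling. Only the stated probability moves, which is exactly the distinction above.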
Why the Old Model Looked Overconfident
Before calibration, the model systematically misstated its confidence — especially in high- and low-probability games.
When it said:
- 75–80% → outcomes were closer to mid-60s
- 50–55% → outcomes were closer to high-60s
This is a classic pattern in machine learning models:
Raw models are not naturally calibrated
Tree-based models (like XGBoost or Random Forests) are optimized for ranking and classification accuracy — not probability accuracy.
They learn who is more likely to win, not how likely they are to win in absolute terms.
That’s why calibration is a separate step — not a built-in guarantee.
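You can see this for yourself on synthetic data. The toy sketch below (not our model, just an illustration) compares a raw random forest with the same forest wrapped in scikit-learn's sigmoid calibration, scoring both with the Brier score, which penalizes badly calibrated probabilities:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import brier_score_loss

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

raw = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# method="sigmoid" applies Platt scaling inside a cross-validation loop.
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=200, random_state=0),
    method="sigmoid", cv=5,
).fit(X_train, y_train)

for name, clf in [("raw forest", raw), ("platt-calibrated", calibrated)]:
    probs = clf.predict_proba(X_test)[:, 1]
    print(f"{name}: Brier score = {brier_score_loss(y_test, probs):.4f}")
```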
What Platt Scaling Actually Changed
After applying Platt scaling, the probability distribution didn't compress; it reshaped, amplifying the separation between predictions.
This is a crucial distinction.
1. Probability Expansion, Not Compression
Instead of pulling probabilities toward the center, the scaling function pushed confident predictions further away from 50%.
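For intuition, write the transform in logit space (an illustrative formulation; $a$ and $b$ stand in for the fitted coefficients):

$$
p_{\text{cal}} = \sigma\!\bigl(a \cdot \operatorname{logit}(p_{\text{raw}}) + b\bigr), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}.
$$

A fitted slope $a > 1$ pushes probabilities further from 50% (expansion); $a < 1$ would pull them toward 50% (compression). Which one you get is decided entirely by the observed outcomes the fit is trained on.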
Real examples from the model:
- 83% → 92%
- 74% → 89%
- 66% → 81%
- 56% → 68%
- 54% → 65% or ~51% (depending on signal strength)
What this means
The model didn’t become more conservative — it became more decisive.
Stronger signals were amplified. Weaker signals were either slightly elevated or pushed back toward coin-flip territory.
2. Tier Migration Upward
This reshaping caused a structural shift in how predictions are distributed across tiers.
Examples from real outputs:
- Medium → High (72% → 88%)
- Low → High (63% → 76–79%)
- Very Low → either Low or stays Very Low depending on signal
What this means
The High Certainty tier is now:
- Larger
- More selective
- Composed of genuinely strong signals
At the same time, weaker edges are no longer artificially clustered in the middle.
3. Separation of Signal vs Noise
Before scaling:
- Many predictions sat in the 55–70% range
- Different-quality signals were numerically similar
After scaling:
- Strong edges → clearly high probabilities (80%+)
- Weak edges → remain near 50–55%
What this means
The model now better distinguishes:
- “This should win” vs
- “This might win”
Instead of flattening everything into the middle.
4. Calibration Errors Became Visible — Not Smoothed
This is the most important effect.
Before:
- Calibration errors were inconsistent across tiers
- Some tiers underconfident, others overconfident
Now:
- Mid tiers → tightly calibrated
- High tier → consistently underconfident
- Very Low tier → consistently overconfident
This is not a side effect of scaling.
It’s what happens when you:
- Apply a global monotonic transformation
- And remove noise from the middle
You expose the true structure of model error.
The Calibration Shift Explained
The key change isn’t just numerical — it’s structural.
Before calibration, probabilities were compressed and noisy:
| Tier (Model Said) | Behavior | Gap (Model − Actual) |
|---|---|---|
| 70–80% | Overconfident | +11 pp |
| 60–70% | Overconfident | +6 pp |
| 55–60% | Underconfident | −5 pp |
| 50–55% | Severely underconfident | −16 pp |
This created a distorted picture:
- Strong and weak signals were mixed together
- Mid-range probabilities were overloaded
- Calibration errors cancelled each other out
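To make "cancelled each other out" concrete with illustrative numbers (the tier weights below are assumed for the example, not the actual historical mix):

$$
\text{overall gap} \approx 0.30(+11) + 0.30(+6) + 0.20(-5) + 0.20(-16) = +0.9 \text{ pp}.
$$

Four tiers that are individually off by 5 to 16 points can still produce a headline gap of under one point, which is why a single accuracy number told us so little.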
After Platt scaling:
Strong signals were separated
Games that were already likely winners moved clearly into the 80–90%+ range.
These now form a cleaner, more reliable High Certainty tier.
Mid-range noise was reduced
The overloaded 55–70% region was redistributed:
- Some games upgraded (true edges)
- Some downgraded (false edges)
This is why Medium and Low tiers are now well calibrated.
Weak edges were exposed
Games near coin-flip didn’t disappear.
They became clearly identifiable:
- Remaining in the <55% range
- And consistently underperforming
This is where the model still struggles — but now it’s visible.
What This Means for the Model
Platt scaling didn’t just “fix calibration.”
It changed how the model expresses confidence — and made its structure easier to interpret.
What improved
- Strong predictions are now clearly separated from marginal ones
- Mid-tier probabilities are now statistically reliable
- High-confidence tier contains cleaner, higher-quality signals
- Probability outputs are more actionable for decision-making
What remains
- Very Low tier shows systematic overconfidence (~+8pp gap)
- High tier is still slightly conservative relative to outcomes
- Global scaling introduces trade-offs at distribution extremes
Why This Matters More Than Earlier Model Adjustments
Earlier in the season, we adjusted model inputs (like home-court advantage). That improved prediction quality.
Platt scaling operates on a different level:
- Model adjustments → improve what the model learns
- Platt scaling → improves how confidence is expressed
But in practice, this change is just as important: a probability you can't trust is a probability you can't act on, no matter how good the underlying pick is.
The Honest Summary at 3 Months
What's working
- Live accuracy exceeds backtest — 67.5% vs 67.25% is validation, not drift
- Calibration is now correct where it matters most — mid tiers are near-perfect
- High Certainty delivers elite results — 84.9% on 172 predictions
- Probabilities are now interpretable — not just directional
- Transparency improved — errors are visible, not hidden
What needs work
- Very Low tier overconfidence — +8pp gap is now the primary issue
- High tier underconfidence — model still slightly conservative at the top
What We're Doing About It
Platt scaling solved the broad calibration problem. Now we can focus on the specific remaining weakness.
Short-Term:
- Analyze composition of Very Low tier
- Identify structural drivers:
  - Back-to-backs
  - Travel fatigue
  - Specific team profiles
- Evaluate whether these games should be filtered rather than fixed
Medium-Term:
- Test non-linear calibration methods such as isotonic regression (a rough sketch follows this roadmap)
- Explore tier-specific calibration instead of global scaling
- Revisit whether sub-55% predictions should be surfaced at all
Long-Term:
- Achieve full reliability alignment across all probability bands
- Build calibration-aware decision frameworks (not just predictions)
- Extend calibrated modeling approach to other sports
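For the isotonic option flagged in the medium-term roadmap above, here is a rough sketch of what swapping the calibration step would look like. The synthetic data and variable names are assumptions; the point is the shape of the tool, not a benchmark:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for a held-out calibration set: raw probabilities and
# simulated 0/1 outcomes (illustrative only).
raw_probs = rng.uniform(0.40, 0.90, size=2000)
outcomes = (rng.uniform(size=2000) < np.clip(raw_probs - 0.05, 0, 1)).astype(int)

# Isotonic regression learns a monotone, piecewise-constant map from raw
# probability to observed frequency, so it can bend differently in different
# regions (for example, only below 55%), where a single global sigmoid cannot.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(raw_probs, outcomes)

print(iso.predict(np.array([0.50, 0.60, 0.75, 0.85])).round(3))
```

The usual caveat applies: isotonic regression needs considerably more calibration data than a two-parameter sigmoid before it stops overfitting, which is a real constraint when each tier only sees one or two hundred games a season.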
Why This Approach Wins
The sports prediction industry is filled with snake oil. Unrealistic promises. Hidden track records. Inflated claims designed to sell subscriptions, not provide value.
After 3 months, the data validates a different path:
Radical Transparency:
Show everything, even when it's uncomfortable
Honest Expectations:
67.5% live is real. 80% is not.
Continuous Improvement:
695 live predictions tracked to date are the foundation for real calibration work
User Education:
Teach you to evaluate predictions, not just consume them
Credibility compounds over time. It can't be bought — only earned through consistent, transparent reporting.
The Bottom Line
After 3 months and 695 live predictions (587 on the current model version):
- 67.5% live accuracy — Exceeds backtest, validated on 587 current-model predictions
- Full calibration breakdown — Four tiers, all tracked, all public on Calibration Dashboard
- High Certainty tier delivers — 84.9% on 172 predictions
- Honest about weaknesses — VLC gap documented and being investigated
- No false promises — We still won't claim 80% to sell subscriptions
If you want inflated accuracy claims and black-box predictions, there are plenty of alternatives.
If you want honest performance, transparent methodology, and genuine continuous improvement built on real data — you're in the right place.
This is why we show every prediction.
Join the Journey
Preview Tier
See 1-2 predictions daily
Free
Core Tier
Access all predictions with confidence levels
€9.90/month
Insight Tier
Full methodology, advanced analytics, calibration data
€24.90/month
Every prediction is timestamped. Every result is tracked. Every performance metric is public.
Because in a world of 80% claims, 67.5% honesty is the competitive advantage.