Rating IQ

Modified on Sun, 29 Mar at 5:12 PM

TL;DR: All divisions should keep Rating IQ enabled. There is no downside to leaving Rating IQ on, and in almost every league setup it produces ratings that better reflect real player ability.

Rating IQ keeps player averages fair by combining three protections:

Evaluator calibration: normalizes strict vs generous scoring styles so one evaluator's scale does not overpower another's.
Outlier protection in averaging: uses median/trimmed mean so one extreme score cannot swing a player's result.
Bias removal: excludes an evaluator's score from the all-evaluator average when that evaluator is confirmed as the player's guardian or coach, unless it is the player's only rating.

1) Evaluator Calibration

The issue

Plain averages assume all evaluators interpret the same score scale the same way. They do not.

Everyone is on the same official scale (for example, 1-5), but evaluators still use different internal standards.
A strict evaluator might think, "our best player is a 3; a true 5 is college-level."
A generous evaluator might rate that same level of player as a 5.
Without calibration, score differences partly reflect evaluator style instead of player skill.
In low-overlap formats (for example, one coach rating only their own team), this can inflate or deflate a player's result based on who rated them.
In full-overlap formats (everyone rates everyone), it mostly changes evaluator influence: wide-scale raters move score gaps more than narrow-scale raters.

How Rating IQ fixes it

Rating IQ calibrates per evaluator, per numeric category:

Collect that evaluator's scores in that category.
Convert each raw score to its relative position inside that evaluator's distribution.
Slightly tone down extreme highs and lows when an evaluator has only a few ratings; do less of that when they have many ratings.
Map back to the division's score scale.

Rules:

If sample size is < 3, calibration is skipped (raw value is used).
The tone-down is bounded: it is never less than 5% and never more than 30%.

Why this step exists:

Small samples are noisy, so extreme highs/lows are less trustworthy.
Larger samples are more stable, so we trust those extremes more.
This keeps ranking order intact, but avoids overreacting to thin data.

Quick example (1, 2, 2, 5, 5):

Raw 1 -> about 1.60
Raw 2 -> about 2.65
Raw 5 -> about 4.05

Ranking stays the same. Extremes are simply less overconfident.
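
The calibration steps can be sketched in Python. This is a minimal illustration, not the actual Rating IQ implementation: the mid-rank percentile and the `1 / (n + 3)` tone-down schedule in the hypothetical `shrink_for` helper are assumptions, chosen only because they satisfy the rules above (skip below 3 scores, tone-down kept between 5% and 30%) and happen to reproduce the quick example. The function names and the default 1-5 scale are also assumptions.

```python
def shrink_for(n: int) -> float:
    """Hypothetical tone-down schedule: smaller for larger samples, clamped to 5%-30%."""
    return min(0.30, max(0.05, 1.0 / (n + 3)))


def calibrate(scores, scale_min=1.0, scale_max=5.0):
    """Calibrate one evaluator's scores in one numeric category."""
    n = len(scores)
    if n < 3:
        # Too few scores to calibrate: use the raw values as entered.
        return [float(s) for s in scores]

    keep = 1.0 - shrink_for(n)
    calibrated = []
    for s in scores:
        below = sum(1 for x in scores if x < s)
        ties = sum(1 for x in scores if x == s)
        # Relative position inside this evaluator's own distribution (0..1).
        position = (below + 0.5 * ties) / n
        # Pull extreme positions slightly toward the middle.
        position = 0.5 + (position - 0.5) * keep
        # Map back onto the division's score scale.
        calibrated.append(scale_min + position * (scale_max - scale_min))
    return calibrated


print([round(v, 2) for v in calibrate([1, 2, 2, 5, 5])])
# prints [1.6, 2.65, 2.65, 4.05, 4.05]
```

Because equal raw scores share the same position and higher raw scores always get a higher position, the evaluator's ranking order is unchanged.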

What this means in practice:

We first translate each evaluator's scores into relative standing (bottom/middle/top for that evaluator).
We then apply a confidence adjustment: smaller sample sizes get more smoothing, larger sample sizes get less smoothing.
That smoothing is intentionally limited, so calibration helps but does not flatten everyone to the middle.
Finally, we convert results back onto the normal league scale (usually 1-5) so output stays familiar.

2) Outlier Protection In Averaging

The issue

Plain averages are easy to swing in real leagues because outlier scores happen for both strategic and non-strategic reasons:

An evaluator might intentionally low-score a strong player to improve draft odds.
An evaluator might intentionally high-score players already on another coach's team so that coach is more likely to be assigned weaker players overall.
Some outliers are accidental (limited view, data-entry mistakes, rushed evaluations).

A simple average gives all of those extremes full power, so one rating can move a player more than it should.

Scenario	Raw 1-5 scores	Simple average	Rating IQ result	What happens
Hero booster (one inflated 5)	1, 1, 1, 1, 5	1.8	1.0	A single rave review lifts a player too far. Rating IQ removes the distortion.
Hater voter (one crushing 1)	1, 4, 4, 4, 4	3.4	4.0	One grudge score drags a player down. Rating IQ restores consensus.
Tiny sample	4, 5	4.5	4.5	With two ratings, every datapoint is kept.
Split trio	2, 3, 5	3.3	3.0	Median lands on the middle and ignores edge skew.

How Rating IQ fixes it

After calibration, Rating IQ uses an outlier-resistant aggregator based on sample size:

Ratings on file	Math used	Why
1-2	Mean	Too little data to remove values safely.
3-4	Median	Blocks one high/low from tipping the result.
5+	Trimmed mean (drop one highest and one lowest, average the rest)	Enough depth to remove outliers safely.

In plain terms:

With 1-2 ratings, we use a normal average.
With 3-4 ratings, we use the middle value (median) so one extreme score cannot swing the result.
With 5+ ratings, we remove one highest and one lowest score, then average the rest.
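
The sample-size rule above can be sketched as a small Python function. This is an illustration under assumptions rather than the product's code: the name `robust_average` is hypothetical, and for four ratings it assumes the median is the mean of the two middle values (the behavior of Python's `statistics.median`).

```python
from statistics import median


def robust_average(scores):
    """Outlier-resistant average chosen by how many ratings are on file."""
    if not scores:
        raise ValueError("at least one rating is required")
    n = len(scores)
    if n <= 2:
        # 1-2 ratings: plain mean, every datapoint is kept.
        return sum(scores) / n
    if n <= 4:
        # 3-4 ratings: median, so one extreme score cannot tip the result.
        return float(median(scores))
    # 5+ ratings: drop one highest and one lowest, then average the rest.
    trimmed = sorted(scores)[1:-1]
    return sum(trimmed) / len(trimmed)


for scores in ([1, 1, 1, 1, 5], [1, 4, 4, 4, 4], [4, 5], [2, 3, 5]):
    print(scores, "->", robust_average(scores))
```

Run against the four scenarios in the table above, it returns 1.0, 4.0, 4.5, and 3.0, matching the Rating IQ result column.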

3) Bias Removal

The issue

If a player's guardian/coach is also an evaluator, that score can introduce conflict-of-interest bias into the all-evaluator average.

How Rating IQ fixes it

When Rating IQ is enabled, all-evaluator averages automatically exclude scores submitted by an evaluator who is linked to the player as a guardian or coach.

This protects the all-evaluator view from family/coach bias while preserving raw submitted scores. If that guardian/coach-linked score is the only rating available, it is kept (not removed) so the player does not lose their only score.
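
A minimal Python sketch of that rule follows. The `Rating` type, the `ratings_for_average` function, and the idea of passing in a set of linked evaluator IDs are all hypothetical; how guardian/coach matches are confirmed and stored is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rating:
    evaluator_id: str  # hypothetical identifier for the evaluator
    score: float


def ratings_for_average(ratings, linked_evaluator_ids):
    """Return the ratings that should feed the all-evaluator average.

    linked_evaluator_ids: evaluators confirmed as this player's guardian/coach.
    """
    unlinked = [r for r in ratings if r.evaluator_id not in linked_evaluator_ids]
    # If nothing else is available, keep the linked score(s) so the
    # player does not lose their only rating.
    return unlinked if unlinked else list(ratings)


ratings = [Rating("coach_a", 5), Rating("eval_b", 3), Rating("eval_c", 4)]
print([r.score for r in ratings_for_average(ratings, {"coach_a"})])
# prints [3, 4]
```

The input list is never modified, which mirrors the point that raw submitted scores are preserved and only the calculated average changes.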

FAQ

Does trimming two scores at exactly five ratings lose too much data?

Usually no. Once you have at least five ratings, removing one highest and one lowest score typically improves fairness more than it removes useful signal.

Can a determined coach still game the system?

One manipulated score usually will not change much because Rating IQ protects against single-score outliers.

Coordinated manipulation by multiple evaluators is harder to pull off, and because raw submitted scores are preserved, it can still be reviewed.

Will scores change when I flip it on?

Yes, calculated averages update immediately. No, raw evaluator-entered scores are not rewritten.

If I turn it off, do I get raw averages back?

Yes. Rating IQ is computed at runtime, so disabling it returns standard averaging behavior.

What happens when evaluator sample size is too small for calibration?

If an evaluator has only 1-2 scores in a category, Rating IQ does not calibrate those scores yet. It uses them as entered until there is enough data.

What if the only rating comes from a guardian/coach-linked evaluator?

That score stays. Rating IQ only removes guardian/coach-linked scores when other valid ratings exist.

Bottom Line

Keeps every rating when data is scarce (1-2 ratings).
Calibrates strict vs generous evaluators for fairer cross-evaluator comparisons.
Uses the median for mid-size samples (3-4 ratings).
Trims the highest and lowest score once there are 5+ ratings.
Removes confirmed guardian/coach-match bias from all-evaluator averages.

Enable Rating IQ and let the math keep your league honest.