Ratings Experiment Log
Experiment 1: Per-Player Ratings (10 shards)
Date: 2026-02-09
Data: logged_in_games/gameevents_00[0-9].csv.gz (10 shards, ~9.5K games, 1.8M samples)
Split: 50/50 chronological by game_id
Model: LightGBM, 200 leaves, 200 trees
Rating features: queen mu plus 4 per-worker mus per team (10 rating features in total, taking the 52-feature baseline to 62).
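For reference, a minimal sketch of how these 10 rating features could be materialized per game. `game.roster` and the flat `ratings` dict are hypothetical stand-ins for the real pipeline, and the default mu is a placeholder:

```python
import numpy as np

DEFAULT_MU = 25.0  # placeholder prior for unrated players; actual value may differ

def rating_features(game, ratings):
    """Build the 10 rating features for one game: queen mu plus the four
    per-worker mus for each team, in seat order."""
    feats = []
    for team in ("blue", "gold"):
        for player_id in game.roster(team):  # queen first, then 4 workers by seat
            feats.append(ratings.get(player_id, DEFAULT_MU))
    return np.asarray(feats, dtype=np.float32)
```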
| Metric | Baseline (52) | Ratings (62) | Diff |
|---|---|---|---|
| Log Loss | 0.5831 | 0.6035 | +0.0204 |
| Accuracy | 69.10% | 69.81% | +0.71% |
Ratings improve accuracy slightly (+0.7%) but hurt log loss (+0.02), suggesting overconfidence on some predictions. Per-worker ratings may be too noisy — workers swap positions and the per-worker mu is keyed to seat position, not identity.
Experiment 2: Queen + Avg Worker Ratings (10 shards)
Date: 2026-02-09
Data: same as Experiment 1
Setup: Collapse the 4 per-worker mus into a single team average before materializing. Still uses the 62-feature pipeline (all worker rating slots receive the same average value).
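A sketch of the condensing step, assuming the 10-slot layout from the Experiment 1 sketch (queen then 4 workers per team):

```python
import numpy as np

def condense_worker_mus(feats):
    """Replace each team's four per-worker mus with their mean, keeping the
    62-feature layout: every worker slot receives the same average value."""
    out = np.array(feats, dtype=np.float32)
    for team_start in (0, 5):                    # blue block, gold block
        workers = slice(team_start + 1, team_start + 5)
        out[workers] = out[workers].mean()
    return out
```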
| Metric | Baseline (52) | Per-Player (62) | Condensed (62) | PP Diff | C Diff |
|---|---|---|---|---|---|
| Log Loss | 0.5831 | 0.6035 | 0.6617 | +0.0204 | +0.0787 |
| Accuracy | 69.10% | 69.81% | 68.04% | +0.71% | -1.06% |
Condensed ratings are worse than baseline on both metrics; per-player ratings at least improve accuracy. This suggests:
1. Individual worker identity matters more than team-average skill.
2. Averaging away per-worker signal destroys useful information.
3. The log loss degradation across both rating variants may indicate the model is overfitting to rating features in the training half.
~~Experiment 3: Proper 56-Feature Condensed Ratings (10 shards)~~
Removed — condensed ratings consistently worse than baseline across experiments 2 and 3.
Experiment 4: Per-Cabinet Anonymous Ratings (10 shards)
Date: 2026-02-10
Data: same as Experiment 1
Ratings: ratings_queen_drone.pkl — computed with per-cabinet anonymous ratings that fix rating deflation. Previously, anonymous players were excluded from model.rate(), so mu leaked out of the system (mean mu dropped 19.0 → 4.4 over 84K rated games). Now each anonymous position uses a shared per-cabinet rating, ensuring teams are always 5v5. Deflation reduced from -14.6 to -2.7 mu, and all 183K games contribute ratings (was 84K).
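The key-resolution logic boils down to something like the sketch below; field names (`user_id`, `position`) are assumptions, but the keying scheme matches the description above:

```python
def rating_key(seat, cabinet_id):
    """Resolve the rating-table key for one of the 10 seats in a game.
    Logged-in players are rated by identity; anonymous seats share one
    rating per (cabinet, position), so model.rate() always sees a full
    5v5 and no mu leaks out of the system."""
    if seat.user_id is not None:
        return ("user", seat.user_id)
    return ("anon", cabinet_id, seat.position)
```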
| Metric | Baseline (52) | Per-Player (62) | Diff |
|---|---|---|---|
| Log Loss | 0.5831 | 0.5885 | +0.0055 |
| Accuracy | 69.10% | 70.62% | +1.53% |
Substantial improvement over Experiment 1's ratings (+1.5% accuracy vs +0.7%). Log loss penalty is also much smaller (+0.005 vs +0.020). Fixing the deflation leak makes per-player mu values more meaningful — a player's mu now reflects actual skill rather than being dragged down by untracked losses to anonymous opponents.
Experiment 5: Anonymous Sigma Floor Sweep
Date: 2026-02-10
Data: same as Experiment 1
Motivation: Cabinet anonymous ratings let sigma decay naturally (8.33 → ~1-2), but anonymous positions represent different people each game — uncertainty should stay high. With decayed sigma, anonymous positions barely absorb mu changes (mu += (sigma²/team_sigma²) * omega), forcing disproportionate updates onto logged-in players. A min_anon_sigma parameter floors anonymous sigma after each game's update.
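A sketch of where the floor lands in the update loop. The mu step is the rule quoted above; the sigma shrink is shown schematically, and the exact terms belong to the rating model, not this sketch:

```python
from dataclasses import dataclass

@dataclass
class Rating:
    mu: float
    sigma: float

def post_game_update(rating, team_sigma_sq, omega, delta,
                     is_anonymous, min_anon_sigma=5.0):
    """Each seat absorbs a share of the team update proportional to its
    variance, then anonymous seats get their sigma re-floored so they keep
    absorbing future updates instead of decaying toward ~1-2."""
    share = rating.sigma ** 2 / team_sigma_sq
    rating.mu += share * omega
    rating.sigma = (rating.sigma ** 2 * max(1.0 - share * delta, 1e-4)) ** 0.5
    if is_anonymous:
        rating.sigma = max(rating.sigma, min_anon_sigma)
    return rating
```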
Sigma floor sweep (correlation metric)
| min_sigma | correlation |
|---|---|
| 0 | 0.1874 |
| 1 | 0.1874 |
| 2 | 0.1905 |
| 3 | 0.1941 |
| 4 | 0.1958 |
| 5 | 0.1961 |
| 6 | 0.1957 |
| 7 | 0.1949 |
| 8 | 0.1939 |
Peak at sigma=5. Correlation improves +0.009 over no floor.
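The log doesn't pin down what "correlation" is measured against; one plausible reading, used here only as a hedged sketch, is the Pearson correlation between the pre-game team-mu difference and the binary outcome:

```python
import numpy as np

def rating_outcome_correlation(team_mu_diffs, blue_won):
    """Pearson correlation between pre-game mu difference (blue minus gold)
    and whether blue actually won. This is an assumed definition, not the
    pipeline's."""
    x = np.asarray(team_mu_diffs, dtype=float)
    y = np.asarray(blue_won, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])
```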
LightGBM A/B with min_anon_sigma=5
| Metric | Baseline (52) | Per-Player (62) | Diff |
|---|---|---|---|
| Log Loss | 0.5765 | 0.6001 | +0.0236 |
| Accuracy | 69.10% | 70.12% | +1.02% |
Sigma floor at 5 improves raw rating correlation but the downstream LightGBM accuracy dips slightly vs Experiment 4 (70.1% vs 70.6%). The model may have been exploiting decayed sigma as a proxy for cabinet activity level. Deflation remains similar (-2.6 mu).
LightGBM A/B with min_anon_sigma=2
| Metric | Baseline (52) | Per-Player (62) | Diff |
|---|---|---|---|
| Log Loss | 0.5765 | 0.5951 | +0.0186 |
| Accuracy | 69.10% | 70.29% | +1.19% |
Summary across sigma floor values
| min_sigma | Correlation | Accuracy | Log Loss |
|---|---|---|---|
| 0 (Exp 4) | 0.1874 | 70.62% | 0.5885 |
| 2 | 0.1905 | 70.29% | 0.5951 |
| 5 | 0.1961 | 70.12% | 0.6001 |
Correlation and LightGBM accuracy pull in opposite directions. Higher sigma floors improve raw rating signal but the model benefits from sharper (lower sigma) anonymous ratings — likely using decayed sigma patterns as a proxy for cabinet activity. No floor (Experiment 4) remains the best downstream result.
Experiment 6: Anonymous Discount Sweep → Remove Discount
Date: 2026-02-10
Data: same as Experiment 1
Swept anonymous_discount (the mu penalty applied to first-time anonymous cabinet ratings):
| discount | correlation |
|---|---|
| 0 | 0.1883 |
| 1 | 0.1883 |
| 2 | 0.1883 |
| 3 | 0.1882 |
| 4 | 0.1880 |
| 5 | 0.1877 |
| 6 | 0.1874 |
| 7 | 0.1870 |
| 8 | 0.1866 |
| 9 | 0.1860 |
| 10 | 0.1854 |
| 12 | 0.1840 |
| 15 | 0.1815 |
Correlation is non-increasing in the discount: flat at 0.1883 for discounts 0-2, then falling steadily. Best at 0 (0.1883, vs 0.1874 at the previous default of 6).
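For reference, the discount was applied only at first contact with an anonymous cabinet seat, roughly as below (DEFAULT_MU is a placeholder value):

```python
DEFAULT_MU = 25.0  # placeholder; whatever default the rating model uses

def initial_anon_rating_mu(anonymous_discount):
    """First-time mu for an anonymous per-cabinet seat: the shared default
    minus a fixed penalty. With discount=0 the parameter is a no-op."""
    return DEFAULT_MU - anonymous_discount
```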
LightGBM A/B with discount=0
| Metric | Baseline (52) | Per-Player (62) | Diff |
|---|---|---|---|
| Log Loss | 0.5765 | 0.5890 | +0.0125 |
| Accuracy | 69.10% | 70.54% | +1.44% |
Essentially identical to Experiment 4's 70.6% (within noise). Since discount=0 means anonymous players start at the same default mu as everyone else, the anonymous_discount parameter was removed from the code entirely. Deflation with discount=0: -1.5 mu (vs -2.7 at discount=6).