Ratings Experiment Log

Experiment 1: Per-Player Ratings (10 shards)

Date: 2026-02-09
Data: logged_in_games/gameevents_00[0-9].csv.gz (10 shards, ~9.5K games, 1.8M samples)
Split: 50/50 chronological by game_id
Model: LightGBM, 200 leaves, 200 trees

Rating features: queen mu + 4 per-worker mu per team (10 total, 62 features).
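A sketch of the 62-feature layout described above (function and feature names are hypothetical; the real pipeline and feature order are not shown in this log): the 52 baseline features are extended with 10 rating features, one queen mu plus four worker mus per team.

```python
import numpy as np

def build_feature_row(base_features, blue_mus, gold_mus):
    """base_features: length-52 vector; *_mus: (queen_mu, w1, w2, w3, w4)."""
    base = np.asarray(base_features, dtype=float)
    assert base.shape == (52,) and len(blue_mus) == 5 and len(gold_mus) == 5
    # 52 baseline features + 10 rating mus = 62 total
    return np.concatenate([base, blue_mus, gold_mus])
```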

| Metric   | Baseline (52) | Ratings (62) | Diff    |
|----------|---------------|--------------|---------|
| Log Loss | 0.5831        | 0.6035       | +0.0204 |
| Accuracy | 69.10%        | 69.81%       | +0.71%  |

Ratings improve accuracy slightly (+0.7%) but hurt log loss (+0.02), suggesting overconfidence on some predictions. Per-worker ratings may be too noisy: workers swap positions between games, and the per-worker mu is keyed to seat position, not player identity.
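The A/B tables throughout this log report log loss and accuracy; a minimal sketch of both metrics, assuming p is the model's predicted win probability for the labeled team and y the 0/1 outcome:

```python
import numpy as np

def log_loss(y, p, eps=1e-15):
    # clip to avoid log(0) on overconfident predictions
    y, p = np.asarray(y, float), np.clip(np.asarray(p, float), eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def accuracy(y, p):
    # threshold at 0.5 and compare against the binary outcome
    return float(np.mean((np.asarray(p, float) >= 0.5) == np.asarray(y, float)))
```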

Experiment 2: Queen + Avg Worker Ratings (10 shards)

Date: 2026-02-09
Data: same as Experiment 1
Setup: Collapse the 4 per-worker mus into a single team average before materializing. Still uses the 62-feature pipeline (worker rating slots all get the same average value).
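The condensing step can be sketched as follows, assuming the 10 rating slots are ordered [queen, w1, w2, w3, w4] per team (the actual slot order is not shown in this log): the four worker mus are replaced in place by their team average so the 62-feature pipeline is reused unchanged.

```python
import numpy as np

def condense_worker_mus(rating_slots):
    """rating_slots: 10 mus ordered [queen, w1..w4] x 2 teams."""
    slots = np.asarray(rating_slots, dtype=float).reshape(2, 5)
    # overwrite each team's 4 worker slots with their mean; queens untouched
    slots[:, 1:] = slots[:, 1:].mean(axis=1, keepdims=True)
    return slots.ravel()
```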

| Metric   | Baseline (52) | Per-Player (62) | Condensed (62) | PP Diff | C Diff  |
|----------|---------------|-----------------|----------------|---------|---------|
| Log Loss | 0.5831        | 0.6035          | 0.6617         | +0.0204 | +0.0787 |
| Accuracy | 69.10%        | 69.81%          | 68.04%         | +0.71%  | -1.06%  |

Condensed ratings are worse than baseline on both metrics, while per-player ratings at least improve accuracy. This suggests:

1. Individual worker identity matters more than team average skill.
2. Averaging away per-worker signal destroys useful information.
3. The log loss degradation across both rating variants may indicate the model is overfitting to rating features in the training half.

~~Experiment 3: Proper 56-Feature Condensed Ratings (10 shards)~~

Removed: condensed ratings were consistently worse than baseline across Experiments 2 and 3.

Experiment 4: Per-Cabinet Anonymous Ratings (10 shards)

Date: 2026-02-10
Data: same as Experiment 1
Ratings: ratings_queen_drone.pkl, computed with per-cabinet anonymous ratings that fix rating deflation.

Previously, anonymous players were excluded from model.rate(), so mu leaked out of the system (mean mu dropped 19.0 → 4.4 over 84K rated games). Now each anonymous position uses a shared per-cabinet rating, ensuring teams are always rated 5v5. Deflation is reduced from -14.6 to -2.7 mu, and all 183K games contribute ratings (previously only 84K).
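The fix can be sketched as a change of rating key, assuming a TrueSkill-style (mu, sigma) store (all names and defaults here are illustrative, not the actual code): logged-in players key their rating by player id, while every anonymous seat shares one rating per (cabinet, position), so rate() always sees a full 5v5 and mu no longer leaks out of the system.

```python
DEFAULT_MU, DEFAULT_SIGMA = 25.0, 25.0 / 3  # assumed defaults

ratings = {}

def rating_key(player_id, cabinet_id, position):
    if player_id is not None:
        return ("player", player_id)           # logged-in: identity-keyed
    return ("cabinet", cabinet_id, position)   # anonymous: shared per-cabinet slot

def get_rating(key):
    # lazily create a default rating the first time a key is seen
    return ratings.setdefault(key, (DEFAULT_MU, DEFAULT_SIGMA))
```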

| Metric   | Baseline (52) | Per-Player (62) | Diff    |
|----------|---------------|-----------------|---------|
| Log Loss | 0.5831        | 0.5885          | +0.0055 |
| Accuracy | 69.10%        | 70.62%          | +1.53%  |

Substantial improvement over Experiment 1's ratings (+1.5% accuracy vs +0.7%). The log loss penalty is also much smaller (+0.005 vs +0.020). Fixing the deflation leak makes per-player mu values more meaningful: a player's mu now reflects actual skill rather than being dragged down by untracked losses to anonymous opponents.

Experiment 5: Anonymous Sigma Floor Sweep

Date: 2026-02-10
Data: same as Experiment 1
Motivation: Cabinet anonymous ratings let sigma decay naturally (8.33 → ~1-2), but anonymous positions represent different people each game, so their uncertainty should stay high. With decayed sigma, anonymous positions barely absorb mu changes (mu += (sigma²/team_sigma²) * omega), forcing disproportionate updates onto logged-in players. A min_anon_sigma parameter floors anonymous sigma after each game's update.
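A minimal sketch of the floor, assuming the additive team update quoted above: each member's mu moves by (sigma² / team_sigma²) * omega, so a decayed anonymous sigma pushes almost the whole update onto logged-in players, and re-flooring anonymous sigma after every game keeps those seats absorbing their share. Structures and names here are illustrative, not the actual rating code.

```python
def apply_team_update(team, omega, min_anon_sigma=5.0):
    """team: list of {"mu", "sigma", "anon"} dicts; omega: team mu delta."""
    team_var = sum(p["sigma"] ** 2 for p in team)
    for p in team:
        # each member absorbs a share of omega proportional to sigma^2
        p["mu"] += (p["sigma"] ** 2 / team_var) * omega
        # clamp anonymous uncertainty back up after the update
        if p["anon"]:
            p["sigma"] = max(p["sigma"], min_anon_sigma)
    return team
```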

Sigma floor sweep (correlation metric)

| min_sigma | correlation |
|-----------|-------------|
| 0         | 0.1874      |
| 1         | 0.1874      |
| 2         | 0.1905      |
| 3         | 0.1941      |
| 4         | 0.1958      |
| 5         | 0.1961      |
| 6         | 0.1957      |
| 7         | 0.1949      |
| 8         | 0.1939      |

Peak at min_sigma=5, improving correlation by +0.009 over no floor.

LightGBM A/B with min_anon_sigma=5

| Metric   | Baseline (52) | Per-Player (62) | Diff    |
|----------|---------------|-----------------|---------|
| Log Loss | 0.5765        | 0.6001          | +0.0236 |
| Accuracy | 69.10%        | 70.12%          | +1.02%  |

Sigma floor at 5 improves raw rating correlation but the downstream LightGBM accuracy dips slightly vs Experiment 4 (70.1% vs 70.6%). The model may have been exploiting decayed sigma as a proxy for cabinet activity level. Deflation remains similar (-2.6 mu).

LightGBM A/B with min_anon_sigma=2

| Metric   | Baseline (52) | Per-Player (62) | Diff    |
|----------|---------------|-----------------|---------|
| Log Loss | 0.5765        | 0.5951          | +0.0186 |
| Accuracy | 69.10%        | 70.29%          | +1.19%  |

Summary across sigma floor values

| min_sigma | Correlation | Accuracy | Log Loss |
|-----------|-------------|----------|----------|
| 0 (Exp 4) | 0.1874      | 70.62%   | 0.5885   |
| 2         | 0.1905      | 70.29%   | 0.5951   |
| 5         | 0.1961      | 70.12%   | 0.6001   |

Correlation and LightGBM accuracy pull in opposite directions. Higher sigma floors improve raw rating signal but the model benefits from sharper (lower sigma) anonymous ratings — likely using decayed sigma patterns as a proxy for cabinet activity. No floor (Experiment 4) remains the best downstream result.

Experiment 6: Anonymous Discount Sweep → Remove Discount

Date: 2026-02-10
Data: same as Experiment 1

Swept anonymous_discount (the mu penalty applied to first-time anonymous cabinet ratings):

| discount | correlation |
|----------|-------------|
| 0        | 0.1883      |
| 1        | 0.1883      |
| 2        | 0.1883      |
| 3        | 0.1882      |
| 4        | 0.1880      |
| 5        | 0.1877      |
| 6        | 0.1874      |
| 7        | 0.1870      |
| 8        | 0.1866      |
| 9        | 0.1860      |
| 10       | 0.1854      |
| 12       | 0.1840      |
| 15       | 0.1815      |

Correlation is flat for discounts 0-2 and decreases monotonically beyond that. Best at 0-1 (0.1883, vs 0.1874 at the previous default of 6).
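The sweep pattern used in Experiments 5 and 6 can be sketched generically: rebuild ratings per parameter value, score each run by the correlation metric, and keep the best. Here run() is a stand-in for the real rating pipeline, which is not shown in this log; the toy values come from the discount table above.

```python
def sweep(values, run):
    """Evaluate run(v) for each candidate value; return best value and all scores."""
    results = {v: run(v) for v in values}
    best = max(results, key=results.get)  # highest correlation wins
    return best, results
```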

LightGBM A/B with discount=0

| Metric   | Baseline (52) | Per-Player (62) | Diff    |
|----------|---------------|-----------------|---------|
| Log Loss | 0.5765        | 0.5890          | +0.0125 |
| Accuracy | 69.10%        | 70.54%          | +1.44%  |

Essentially identical to Experiment 4's 70.6% (within noise). Since discount=0 means anonymous players start at the same default mu as everyone else, the anonymous_discount parameter was removed from the code entirely. Deflation with discount=0: -1.5 mu (vs -2.7 at discount=6).