Game Quality Classifier — Experiment Report
Goal
Build a binary classifier to distinguish competitive Killer Queen games from junk/practice/button-check games, using only in-game event features (no login metadata at inference time).
Labels: games with many hivemind logins are labeled 1 (a proxy for quality); unfiltered games are labeled 0 (a mixed pool). Validation anchor: late tournament games should almost all pass the quality filter.
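A minimal sketch of this labeling rule, assuming labels come from which pool a game was sampled from (the pool names here are illustrative, not the actual pipeline's identifiers):

```python
# Labels are assigned by sampling pool, not by per-game logic at inference time.
def proxy_label(pool: str) -> int:
    if pool == "many_hivemind_logins":
        return 1  # proxy positive: likely a real competitive game
    if pool == "unfiltered":
        return 0  # mixed pool: practice, button checks, and some good games
    raise ValueError(f"unknown pool: {pool}")
```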
Architecture
LightGBM binary classifier. 69 hand-crafted features from game event streams. Training: 16K games per class (32K total), 80/20 train/val split. Early stopping on validation AUC.
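A minimal training sketch consistent with this setup, using synthetic placeholder data in place of the real 69-dimensional feature matrix (the tuned parameter values are listed under Final Configuration below):

```python
import lightgbm as lgb
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: the real pipeline builds a (32000, 69) feature matrix from
# game event streams, with label 1 = many-logins proxy positive, 0 = unfiltered.
X = np.random.rand(32_000, 69)
y = np.random.randint(0, 2, size=32_000)

# 80/20 train/validation split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
train_set = lgb.Dataset(X_train, label=y_train)
val_set = lgb.Dataset(X_val, label=y_val, reference=train_set)

model = lgb.train(
    {"objective": "binary", "metric": "auc", "verbose": -1},
    train_set,
    num_boost_round=500,
    valid_sets=[val_set],
    callbacks=[lgb.early_stopping(stopping_rounds=20)],  # stop on validation AUC
)
```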
Feature Engineering Evolution
| Commit | Features | Change |
|---|---|---|
| ceecec5 | 41 | Initial: duration, counts, rates, engagement, gate triples |
| 3d90bb3 | 51 | +10 per-cabinet-position first-event features |
| bb8ffca | 61 | +per-team and per-PID first maiden use |
| 6be5314 | 77 | +time_to_6_carry, per-maiden and per-maiden-per-team counts |
| 5b0f7b2 | 65 | Replace per-maiden counts with time-to-Nth-bless features |
| b98ebb7 | 63 | Remove frac_to_first_kill and frac_to_first_carry |
| 166bbdd | 66 | +snail escape count, rate, and time-to-first |
| 17746f9 | 69 | +get-off-snail count, rate, and time-to-first |
Feature categories (69 total):
- Basic game info (10): duration, event count, bot count, victory condition, map
- Event counts (12): kills, carries, deposits, blesses, snail actions
- Event rates (10): per-second rates for the above
- Temporal (7): time-to-first for key game actions
- Engagement (4): active player count, worker objective engagement
- Gate triples (2): fastest 3-gate-touch window
- Per-cabinet first event (10): first action timing per position
- Per-team/PID maiden use (10): maiden engagement by team and player
- Time-to-Nth milestones (4): carry and bless progression markers
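To make the feature families concrete, here is a sketch of a few representative features computed from a game's event stream. The event schema is an assumption; the real extractor produces all 69 features.

```python
from typing import Dict, List, Optional

def extract_example_features(events: List[Dict]) -> Dict[str, Optional[float]]:
    """Compute a few illustrative features from a time-ordered event list.
    Assumed schema: each event looks like
    {"t": seconds_since_game_start, "type": "kill" | "carry" | "bless" | ...}.
    """
    duration = events[-1]["t"] - events[0]["t"] if events else 0.0
    kills = sum(1 for e in events if e["type"] == "kill")
    carries = sum(1 for e in events if e["type"] == "carry")
    first_kill_t = next((e["t"] for e in events if e["type"] == "kill"), None)

    return {
        "duration": duration,                                # basic game info
        "kill_count": float(kills),                          # event count
        "kill_rate": kills / duration if duration else 0.0,  # event rate
        "carry_count": float(carries),
        "time_to_first_kill": first_kill_t,                  # temporal
    }
```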
Data Pipeline Improvements
Tournament leakage removal (d77ea21): Tournament games appeared in both logged_in and unfiltered datasets. Removed from training to prevent inflated metrics.
Data size sweep (869f36b): Swept 1K–32K training size. 16K per class chosen as default (diminishing returns beyond this).
Separate eval data (7dc349e): Previously, unfiltered pass-rate metrics were computed on data overlapping with training, artificially deflating Unf@95% (7-15% vs the true ~21%). Evaluation now uses separate eval shards (X02, X03), distinct from the training shards (X00, X01).
Temporal stride (7dc349e): Shards are time-ordered. Changed from contiguous (000-019) to strided (X00, X01 for X=0-9) selection for better temporal coverage across the full dataset.
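A minimal sketch of the strided shard selection (shard naming follows the X00/X01 pattern above; the exact path layout and shard count are assumptions):

```python
def select_shards(mode: str = "train") -> list[str]:
    """Pick time-ordered shards with a temporal stride.

    Shards are assumed to be named 000..999 in time order. Taking X00/X01 for
    X in 0..9 spreads training data across the full time range, while X02/X03
    are reserved for evaluation so eval shards never overlap training shards.
    """
    suffixes = ("00", "01") if mode == "train" else ("02", "03")
    return [f"{x}{suffix}" for x in range(10) for suffix in suffixes]

# select_shards("train") -> ['000', '001', '100', '101', ..., '900', '901']
# select_shards("eval")  -> ['002', '003', '102', '103', ..., '902', '903']
```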
Hyperparameter Sweep Results
Swept 6 LightGBM params one-at-a-time, 16K training per class, 69 features, evaluated on held-out unfiltered data.
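A minimal sketch of the one-at-a-time sweep structure (the value grids below are illustrative, chosen to include the values discussed in this report; training and scoring of each config are elided):

```python
BASE = {"objective": "binary", "metric": "auc", "num_leaves": 63,
        "min_child_samples": 20, "learning_rate": 0.05, "verbose": -1}

GRID = {  # illustrative value lists, one parameter varied at a time
    "num_leaves": [23, 63, 95, 127, 255],
    "min_child_samples": [20, 50, 75, 100],
    "learning_rate": [0.01, 0.03, 0.05],
    "feature_fraction": [0.5, 0.7, 1.0],
    "bagging_fraction": [0.8, 0.9, 1.0],
    "reg_lambda": [0.0, 1.0, 10.0],
}

def sweep_configs(base: dict, grid: dict):
    """Yield (param, value, full_param_dict), varying one parameter at a time
    while holding the others at the baseline defaults."""
    for param, values in grid.items():
        for value in values:
            yield param, value, {**base, param: value}

# Each yielded config would be trained on 16K games per class with the 69
# features, then scored on held-out unfiltered data (AUC, LogLoss, Unf@99/95%).
```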
Baseline (old defaults): num_leaves=63, min_child_samples=20
| Param | Value | AUC | LogLoss | Unf@99% | Unf@95% |
|---|---|---|---|---|---|
| num_leaves | 63 | 0.9050 | 0.3453 | 27.9% | 22.3% |
| min_child_samples | 20 | 0.9050 | 0.3453 | 27.9% | 22.3% |
Key findings (sweep with old defaults)
- num_leaves: AUC peaked at 23 leaves (0.9074), but Unf filtering was flat (20-22% across all values); higher leaf counts gave slightly worse AUC.
- min_child_samples: Clear improvement from 20→75-100 (AUC 0.905→0.908, LogLoss 0.345→0.342); higher values regularize against noisy proxy labels.
- learning_rate: 0.01-0.03 slightly better than 0.05 (LogLoss improvement), with auto-scaled num_boost_round=2000.
- feature_fraction: Mild improvement at 0.5-0.7 (decorrelates trees built from partially redundant features).
- bagging_fraction=0.9: Small AUC/LogLoss improvement.
- reg_lambda=10: Best AUC (0.9088) in the initial sweep.
After retuning defaults to num_leaves=127, min_child_samples=75
| Param | Value | AUC | LogLoss | Unf@99% | Unf@95% | Note |
|---|---|---|---|---|---|---|
| num_leaves | 127 | 0.9078 | 0.3432 | 29.3% | 22.2% | current |
| num_leaves | 95 | 0.9089 | 0.3413 | 28.3% | 21.4% | slightly better |
| min_child_samples | 75 | 0.9078 | 0.3432 | 29.3% | 22.2% | current |
| min_child_samples | 100 | 0.9075 | 0.3435 | 29.4% | 21.8% | |
The resweep showed all parameters landing in a narrow band (AUC 0.904-0.909, Unf@95% 20.5-22.5%); the model appears to be near its ceiling given this feature set and the noise in the proxy labels.
Final Configuration
```python
DEFAULT_PARAMS = {
    'objective': 'binary',
    'metric': 'auc',
    'num_leaves': 127,
    'min_child_samples': 75,
    'learning_rate': 0.05,
    'verbose': -1,
}
# num_boost_round=500 (2000 if learning_rate < 0.05), early_stopping=20
```
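A sketch of how these defaults might be consumed at train time, applying the boost-round rule from the comment above (dataset construction is elided; the helper name is illustrative):

```python
import lightgbm as lgb

def train_with_defaults(train_set: lgb.Dataset, val_set: lgb.Dataset,
                        params: dict = DEFAULT_PARAMS) -> lgb.Booster:
    """Train with the shared defaults, scaling up boosting rounds for small
    learning rates and stopping early on validation AUC."""
    num_boost_round = 2000 if params.get("learning_rate", 0.05) < 0.05 else 500
    return lgb.train(
        params,
        train_set,
        num_boost_round=num_boost_round,
        valid_sets=[val_set],
        callbacks=[lgb.early_stopping(stopping_rounds=20)],
    )
```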
Current Performance
| Metric | Value |
|---|---|
| Validation AUC | ~0.908 |
| Validation LogLoss | ~0.343 |
| Unfiltered pass @ 99% tournament recall | ~28% |
| Unfiltered pass @ 95% tournament recall | ~22% |
Interpretation: at a threshold that keeps 99% of tournament games, ~28% of unfiltered games also pass (the rest are filtered out as low quality). At the stricter threshold that keeps 95% of tournament games, only ~22% of unfiltered games pass.
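A minimal sketch of how the Unf@recall metrics can be computed from model scores (array names are illustrative):

```python
import numpy as np

def unfiltered_pass_rate(tournament_scores: np.ndarray,
                         unfiltered_scores: np.ndarray,
                         recall: float = 0.95) -> float:
    """Pick the score threshold that keeps `recall` of tournament games,
    then report the fraction of unfiltered games above that threshold."""
    threshold = np.quantile(tournament_scores, 1.0 - recall)
    return float((unfiltered_scores >= threshold).mean())

# unf99 = unfiltered_pass_rate(tour_scores, unf_scores, recall=0.99)
# unf95 = unfiltered_pass_rate(tour_scores, unf_scores, recall=0.95)
```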
Downstream Validation: Win-Probability Training
To test whether the classifier identifies games that are actually more useful for training, we compared win-probability models trained on the quality_filtered (QF) vs logged_in_games (LI) datasets at multiple scales (5K-40K games, 10 variance runs each). Both datasets contain ~182-183K games with ~108K overlap; each has ~75K exclusive games.
Exclusive comparison (equal-states): Trained only on games unique to each dataset, subsampling to equalize state counts. QF-exclusive games (anonymous games the classifier identified as high-quality from event features alone) consistently outperform LI-exclusive games (logged-in games the classifier rejected) on log loss, AUC-ROC, accuracy, and symmetry deviation at every scale tested. Prior results established that logged-in games outperform random unfiltered games, so login status is already a quality proxy; this result shows the classifier captures a genuine quality signal beyond login status: it finds good training data that the login heuristic misses and correctly rejects logged-in games that happen to be low quality.
Non-exclusive comparison: QF outperforms LI at all scales tested (5K-20K). Since both datasets are similar in size, this comparison is less confounded and provides additional evidence that quality-filtered data produces better models.
Limitation: The exclusive comparison doesn't measure the marginal value of quality filtering on top of login filtering (i.e., would adding QF-exclusive games to the logged-in pool improve the model?). That remains a future experiment.
Full results: model_experiments/data_quality_report.md
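A minimal sketch of the exclusive-set construction and equal-states subsampling described above (field names and the state-budget mechanism are assumptions about the win-probability training code):

```python
import random

def exclusive_sets(qf_ids: set, li_ids: set) -> tuple:
    """Games unique to each dataset; the ~108K overlapping games are
    dropped from the exclusive comparison."""
    return qf_ids - li_ids, li_ids - qf_ids

def subsample_to_state_budget(games: list, state_budget: int, seed: int = 0) -> list:
    """Shuffle and keep games until the running total of per-game state
    counts reaches the budget, so both sides train on equal state counts."""
    rng = random.Random(seed)
    shuffled = list(games)
    rng.shuffle(shuffled)
    picked, total_states = [], 0
    for game in shuffled:
        if total_states >= state_budget:
            break
        picked.append(game)
        total_states += game["n_states"]  # illustrative field name
    return picked
```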
Key Methodological Lesson
The original eval setup (scoring unfiltered data that overlapped with training) produced artificially optimistic filtering numbers. With num_leaves=255, the old setup showed Unf@95%=7.3% — but with held-out eval data the true number was ~22%. This was the single biggest correction in the experiment series: always evaluate on data the model has never seen.