Training the LightGBM Win Predictor Model

This document describes the complete pipeline for training the LightGBM win predictor model from Killer Queen game event data.

Overview

The KQuity project builds a binary classification model that predicts match outcomes (Blue wins vs Gold wins) based on real-time game state extracted from Killer Queen arcade cabinet event logs.

Model: LightGBM Gradient Boosting Decision Trees
Task: Binary Classification (Blue win = 1, Gold win = 0)
Current Best Accuracy: ~70.4%


Prerequisites

Software Dependencies

pip install lightgbm numpy scikit-learn pandas

Data Requirements

The training pipeline expects validated game event CSV files in:

/home/rrenaud/KQuity/validated_all_gameevent_partitioned/

These files follow the naming pattern gameevents_000.csv through gameevents_091.csv, with each partition containing roughly 1,000 games' worth of events.

CSV Schema (Game Events):

id, timestamp, event_type, values, game_id
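For a first look at a partition, here is a minimal pandas sketch (it assumes the CSVs carry a header row matching the schema above; adjust if they do not):

import pandas as pd

# Quick inspection of one partition; assumes a header row with the columns listed above
events = pd.read_csv('validated_all_gameevent_partitioned/gameevents_000.csv')
print(events.columns.tolist())                     # id, timestamp, event_type, values, game_id
print(events['event_type'].value_counts().head())  # most common event types
print('games in this partition:', events['game_id'].nunique())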

Step 1: Data Validation (Optional - Already Done)

If working with raw game data, validate it first using preprocess.py:

from preprocess import validate_game_data

validate_game_data(
    input_file='raw_gameevents.csv',
    output_file='validated_gameevents.csv'
)

Validation checks:

  • Events occur after September 2022
  • Games replay correctly through the state engine
  • Victory conditions match actual game states (economic, military, or snail wins)


Step 2: Configure the Experiment

In preprocess.py, set the experiment name (around line 23):

expt_name = 'your_experiment_name'

This creates the output directory: model_experiments/your_experiment_name/

Key Configuration Options

Worker Ordering (line ~610 in vectorize_team()):

# Sort workers by power (strongest first) - RECOMMENDED
workers = sorted(workers, key=lambda w: -w.power())

State Sampling (line ~672 in materialize_game_state_matrix()):

# Drop probability for training data balance
# Higher values = fewer samples but faster training
drop_prob = 0.9  # Drop 90% of states

Minimum Game Time (line ~670):

# Only sample states after game has been running
if game_time < 5:  # Skip first 5 seconds
    continue

Step 3: Materialize Feature Matrices

Convert raw event CSVs to numpy arrays for efficient training:

from preprocess import materialize_game_state_matrix

# Process training files (e.g., files 0-79)
for i in range(80):
    filename = f'validated_all_gameevent_partitioned/gameevents_{i:03d}.csv'
    materialize_game_state_matrix(filename, drop_prob=0.9)

# Process test files (e.g., files 80-91) with no dropping
for i in range(80, 92):
    filename = f'validated_all_gameevent_partitioned/gameevents_{i:03d}.csv'
    materialize_game_state_matrix(filename, drop_prob=0.0)

This creates two files per input CSV:

  • gameevents_XXX.csv_states.npy - Feature matrix (N samples × 49 features)
  • gameevents_XXX.csv_labels.npy - Label vector (N samples, binary)
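As a quick sanity check on the materialized output (a minimal sketch; the paths assume the experiment directory layout from Step 2):

import numpy as np

# Load one materialized partition and confirm shapes and label balance
expt_dir = 'model_experiments/your_experiment_name'
states = np.load(f'{expt_dir}/gameevents_000.csv_states.npy')
labels = np.load(f'{expt_dir}/gameevents_000.csv_labels.npy')
print(states.shape, labels.shape)        # (N, feature_dim) and (N,)
print('Blue win fraction:', labels.mean())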


Step 4: Feature Vector Structure

Each game state is encoded as a ~49-dimensional feature vector:

Feature Range   Count   Description
0-19            20      Blue team state
20-39           20      Gold team state
40-44           5       Maiden control states
45-48           4       Map one-hot encoding
49              1       Normalized berries available
50-51           2       Snail position and velocity

Team State Features (20 per team)

[0] eggs (queen health, 2 at start, -1 = dead)
[1] food_deposited (count toward economic win)
[2] vanilla_warriors (workers with wings only)
[3] speed_warriors (workers with wings + speed)
[4-7] worker_1: has_bot, has_food, has_speed, has_wings
[8-11] worker_2: has_bot, has_food, has_speed, has_wings
[12-15] worker_3: has_bot, has_food, has_speed, has_wings
[16-19] worker_4: has_bot, has_food, has_speed, has_wings
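Purely as an illustration of this layout, here is a sketch that slices a single feature vector into its named pieces (the boundaries are read off the table and feature list above, not from preprocess.py):

# Illustrative only: split one feature vector according to the documented layout
def split_feature_vector(vec):
    return {
        'blue_team':  vec[0:20],    # eggs, food, warrior counts, 4 workers x 4 flags
        'gold_team':  vec[20:40],   # same layout for Gold
        'maidens':    vec[40:45],   # 0 = neutral, 1 = Blue, -1 = Gold (see below)
        'map_onehot': vec[45:49],   # which map is being played
        'berries':    vec[49],      # normalized berries available
        'snail':      vec[50:52],   # position, velocity
    }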

Maiden Features

Encoded as: 0 (neutral), 1 (Blue control), -1 (Gold control)

Snail Features

  • Position: Normalized to [-0.5, 0.5] from center
  • Velocity: Normalized by max snail speed
  • Multiplied by symmetry factor to handle gold_on_left orientation
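A rough sketch of that normalization (a hypothetical helper for illustration only; the actual constants and symmetry handling live in preprocess.py and map_structure.py):

# Hypothetical illustration of the snail features described above
def snail_features(snail_x, snail_vx, map_width, max_snail_speed, gold_on_left):
    symmetry = -1.0 if gold_on_left else 1.0            # flip sign to normalize orientation
    position = symmetry * (snail_x / map_width - 0.5)   # track center -> 0, edges -> +/-0.5
    velocity = symmetry * (snail_vx / max_snail_speed)  # roughly in [-1, 1]
    return position, velocity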

Step 5: Train the Model

Use the Train_LightGBM.ipynb notebook or run directly:

import lightgbm as lgb
import numpy as np
from sklearn.metrics import log_loss, accuracy_score, classification_report

# Load training data
def load_vectors(base_path, file_range):
    states_list, labels_list = [], []
    for i in file_range:
        states = np.load(f'{base_path}/gameevents_{i:03d}.csv_states.npy')
        labels = np.load(f'{base_path}/gameevents_{i:03d}.csv_labels.npy')
        states_list.append(states)
        labels_list.append(labels)
    return np.vstack(states_list), np.concatenate(labels_list)

expt_dir = 'model_experiments/your_experiment_name'
train_X, train_y = load_vectors(expt_dir, range(0, 80))
test_X, test_y = load_vectors(expt_dir, range(80, 92))

# Configure LightGBM
params = {
    'num_leaves': 100,
    'objective': 'binary',
    'metric': 'binary_logloss',
    'boosting': 'gbdt',
    'verbose': -1
}

# Train
train_data = lgb.Dataset(train_X, train_y)
model = lgb.train(params, train_data, num_boost_round=100)

# Save model
model.save_model(f'{expt_dir}/model.mdl')
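If you would rather tune num_boost_round than fix it at 100, one option (standard LightGBM usage, not part of the notebook) is to hold out a few of the training partitions as a validation split and let early stopping pick the round count:

# Optional: choose num_boost_round via early stopping on held-out partitions
# (the 72/80 split below is an arbitrary example)
tr_X, tr_y = load_vectors(expt_dir, range(0, 72))
val_X, val_y = load_vectors(expt_dir, range(72, 80))
train_data = lgb.Dataset(tr_X, tr_y)
valid_data = lgb.Dataset(val_X, val_y, reference=train_data)
model = lgb.train(params, train_data, num_boost_round=500,
                  valid_sets=[valid_data],
                  callbacks=[lgb.early_stopping(stopping_rounds=20),
                             lgb.log_evaluation(period=50)])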

Step 6: Evaluate the Model

# Predictions
predictions = model.predict(test_X)

# Metrics
print(f"Log Loss: {log_loss(test_y, predictions):.4f}")
print(f"Accuracy: {accuracy_score(test_y, predictions > 0.5):.4f}")
print(classification_report(test_y, predictions > 0.5,
                           target_names=['Gold Wins', 'Blue Wins']))

Monotonicity Validation (Egg Inversion Test)

A well-trained model should predict higher Blue win probability when Blue has more eggs. The "egg inversion" metric measures violations:

def compute_egg_inversions(model, test_X):
    """Measure % of cases where adding Blue eggs DECREASES predicted win prob."""
    original_preds = model.predict(test_X)

    # Boost Blue eggs (index 0) by 2
    modified_X = test_X.copy()
    modified_X[:, 0] += 2
    modified_preds = model.predict(modified_X)

    inversions = (modified_preds < original_preds).mean()
    return inversions

print(f"Egg Inversions: {compute_egg_inversions(model, test_X):.4f}")
# Goal: < 0.02 (less than 2% inversions)

Hyperparameter Tuning

Experiments that have been tried:

Experiment     num_leaves   num_trees   Drop Rate   Accuracy   Log Loss
baseline       100          100         0%          ~68%       ~0.59
drop_90        100          100         90%         ~70%       ~0.56
more_leaves    200          100         90%         ~70%       ~0.56
power_sorted   100          100         90%         70.4%      0.556

Recommended Settings:

  • num_leaves: 100
  • num_boost_round: 100
  • Training drop rate: 90%
  • Test drop rate: 0%
  • Worker sorting: by power (strongest first)
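To explore further settings yourself, a simple sweep along the lines of the experiments above (a sketch reusing the arrays and params loaded in Step 5; the candidate values are arbitrary):

# Sweep num_leaves and compare metrics; reuses train_X/train_y/test_X/test_y and params
for num_leaves in [50, 100, 200]:
    sweep_params = dict(params, num_leaves=num_leaves)
    booster = lgb.train(sweep_params, lgb.Dataset(train_X, train_y), num_boost_round=100)
    preds = booster.predict(test_X)
    print(num_leaves,
          f'logloss={log_loss(test_y, preds):.4f}',
          f'acc={accuracy_score(test_y, preds > 0.5):.4f}')

Strictly speaking, hyperparameter choices should be validated on a split separate from the final test files; the loop above mirrors the experiment table only for convenience.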


Full Training Pipeline Summary

# 1. Ensure validated data exists
ls validated_all_gameevent_partitioned/gameevents_*.csv

# 2. Edit preprocess.py to set experiment name and parameters

# 3. Run materialization (generates .npy files)
python -c "
from preprocess import materialize_game_state_matrix
for i in range(92):
    drop = 0.9 if i < 80 else 0.0
    materialize_game_state_matrix(
        f'validated_all_gameevent_partitioned/gameevents_{i:03d}.csv',
        drop_prob=drop
    )
"

# 4. Train in notebook or script
jupyter notebook Train_LightGBM.ipynb

Using the Trained Model

import lightgbm as lgb
from preprocess import GameState, vectorize_game_state

# Load model
model = lgb.Booster(model_file='model_experiments/your_experiment/model.mdl')

# Create game state from events (or real-time)
# map_info: per-map metadata (see map_structure.py); events: the parsed events for one game
game_state = GameState(map_info)
for event in events:
    event.modify_game_state(game_state)

# Predict
feature_vector = vectorize_game_state(game_state)
win_probability = model.predict([feature_vector])[0]
print(f"Blue win probability: {win_probability:.2%}")

Troubleshooting

Common Issues

  1. Low accuracy: Ensure workers are sorted by power and that the 90% drop rate is applied to the training data
  2. High egg inversions: Model may be overfitting to noise; reduce num_leaves
  3. Memory errors: Process fewer files at once, or increase drop rate
  4. Validation failures: Check that game events are properly formatted and ordered

Data Quality Checks

from preprocess import is_valid_game, iterate_events_by_game_and_normalize_time

# Count valid vs invalid games
valid, invalid = 0, 0
for game_id, events in iterate_events_by_game_and_normalize_time('gameevents.csv'):
    if is_valid_game(events):
        valid += 1
    else:
        invalid += 1
print(f"Valid: {valid}, Invalid: {invalid}")

File Reference

File                                    Purpose
preprocess.py                           Event parsing, game state tracking, feature vectorization
constants.py                            Game enums (teams, victory conditions, maps)
map_structure.py                        Map metadata (berry/maiden positions)
map_structure_info.json                 Hardcoded map coordinates
Train_LightGBM.ipynb                    Interactive training notebook
validated_all_gameevent_partitioned/    Input data directory
model_experiments/                      Output models and feature matrices