fast_materialize.py Optimization Experiment Log

Benchmark: 3000-game test suite, 233K output rows x 52 features. Machine: Linux 6.8.0-90-generic, Python 3.11.14, numpy (system). All timings: 5-run mean (with warmup), time.perf_counter().
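For context, a minimal sketch of a timing harness matching this methodology (the helper name `bench` and its signature are assumptions, not the project's actual harness):

```python
import time
import statistics

def bench(fn, *args, runs=5, warmup=1):
    """Call fn(*args) `warmup` times untimed, then `runs` times timed.

    Returns (mean, min) of per-run wall-clock seconds via time.perf_counter().
    """
    for _ in range(warmup):
        fn(*args)
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - t0)
    return statistics.mean(times), min(times)

# Hypothetical usage:
# mean_s, min_s = bench(fast_materialize, games)
# print(f"mean {mean_s:.2f}s  min {min_s:.2f}s")
```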

Baseline (before any changes)

| Metric | Value |
| --- | --- |
| Mean time | 2.07s |
| Min time | 1.97s |
| Output dtype | float64 |
| Label dtype | int64 |
| States memory | 94,875 KB |
| Labels memory | 1,825 KB |
| Total output memory | 96,700 KB |

Experiment 1: All 7 changes at once (PARTIALLY REVERTED)

Applied all 7 planned optimizations simultaneously:

1. Timestamps as floats (epoch seconds via .timestamp())
2. Direct numpy indexed writes (buf[idx, col+N] = val)
3. float32 output buffer
4. int8 label buffer
5. Skip RNG when drop_prob=0
6. Pre-split values_str once per event
7. np.empty instead of np.zeros
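For concreteness, a hedged sketch of what changes 1 and 2 might look like in the row-writing path; the buffer shape, variable names, and timestamp format are illustrative, not the actual fast_materialize.py code:

```python
from datetime import datetime
import numpy as np

# Illustrative stand-ins for the real buffers/inputs.
N_FEATURES = 52
buf = np.empty((1, N_FEATURES), dtype=np.float32)
feature_values = [0.0] * N_FEATURES
event_ts = "2024-01-01T12:00:05+00:00"

# Change 1: keep the timestamp as an epoch float.
ts = datetime.fromisoformat(event_ts).timestamp()  # extra tz/epoch conversion per parse

# Change 2: one indexed assignment per feature (~52 numpy index ops per output row).
for col, val in enumerate(feature_values):
    buf[0, col] = val
```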

| Metric | Value |
| --- | --- |
| Mean time | 2.63s (+27% SLOWER) |
| Min time | 2.43s |
| Output dtype | float32 |
| Label dtype | int8 |
| States memory | 47,437 KB |
| Labels memory | 228 KB |
| Total output memory | 47,665 KB (-51%) |

Result: memory halved, but wall-clock time regressed by 27%.

Root cause analysis

  • Change 2 (direct indexed writes) hurt performance. Each buf[idx, col] = val is a separate Python->C boundary crossing with numpy indexing overhead. The old approach builds a 52-element Python list (cheap in CPython) and does a single bulk buf[idx] = list assignment, which numpy converts in one C loop. ~52 individual indexed writes per row x 233K rows = ~12M extra numpy index ops.

  • Change 1 (timestamp floats) hurt performance. fromisoformat() is fast C code, but chaining .timestamp() adds timezone/epoch conversion overhead per parse. The old (dt - gamestart_dt).total_seconds() uses fast C-level datetime subtraction. Net effect: slower parsing, no gain on the subtraction side.
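For contrast, a sketch of the faster patterns that Experiment 2 restores (names and the timestamp format are again illustrative, not the project's exact code):

```python
from datetime import datetime
import numpy as np

N_FEATURES = 52
buf = np.empty((1, N_FEATURES), dtype=np.float32)
row = [0.0] * N_FEATURES  # plain Python list; cheap to build in CPython

event_ts = "2024-01-01T12:00:05"
gamestart_dt = datetime.fromisoformat("2024-01-01T12:00:00")

# Restored change-1 path: C-level datetime subtraction, no epoch conversion.
t = (datetime.fromisoformat(event_ts) - gamestart_dt).total_seconds()

# Restored change-2 path: one bulk assignment; numpy converts the list in a single C loop.
buf[0] = row
```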

Experiment 2: Keep only beneficial changes (FINAL)

Reverted changes 1 and 2, kept changes 3-7 (the kept changes are sketched below):

- ~~1. Timestamps as floats~~ (reverted — .timestamp() adds overhead)
- ~~2. Direct numpy indexed writes~~ (reverted — bulk list assignment is faster)
- 3. float32 output buffer
- 4. int8 label buffer
- 5. Skip RNG when drop_prob=0
- 6. Pre-split values_str once per event
- 7. np.empty instead of np.zeros
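A minimal sketch of how kept changes 3, 4, 5, and 7 fit together; the row count, shapes, and the `drop_prob` handling shown here are assumptions about the surrounding loop, not the project's exact code:

```python
import numpy as np

n_rows, n_features = 233_000, 52

# Changes 3, 4, 7: smaller dtypes, and np.empty so the buffers are not zeroed first.
states = np.empty((n_rows, n_features), dtype=np.float32)
labels = np.empty(n_rows, dtype=np.int8)

# Change 5: only touch the RNG when row-dropping is actually enabled.
drop_prob = 0.0
rng = np.random.default_rng(0)
for idx in range(n_rows):
    if drop_prob > 0.0 and rng.random() < drop_prob:
        continue  # drop this row
    # ... build the 52-element row list and bulk-assign it into states[idx] ...
```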

| Metric | Value |
| --- | --- |
| Mean time | 1.63s (21% faster) |
| Min time | 1.62s |
| Output dtype | float32 |
| Label dtype | int8 |
| States memory | 47,437 KB |
| Labels memory | 228 KB |
| Total output memory | 47,665 KB (-51%) |

Result: 21% faster AND memory halved. Best of both worlds.

Breakdown of kept changes

| Change | Effect |
| --- | --- |
| float32 output | Halves buffer memory; bulk-writing 52 floats into the smaller buffer is slightly faster |
| int8 labels | 8x smaller label buffer; negligible time impact |
| Skip RNG when drop_prob=0 | Saves ~233K rng.random() calls in benchmark/test mode |
| Pre-split values_str | Eliminates a redundant [1:-1].split(',') per event branch |
| np.empty vs np.zeros | Avoids zeroing memory that will be overwritten; minor |
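To illustrate the pre-split change, a small example; the bracketed, comma-separated payload format is inferred from the `[1:-1].split(',')` expression above, and the branch structure is hypothetical:

```python
values_str = "[1.0,2.5,3.25]"  # hypothetical serialized event payload

# Before: each branch that needs the values re-strips and re-splits the string.
a = values_str[1:-1].split(',')  # branch A
b = values_str[1:-1].split(',')  # branch B repeats the same work

# After: split once per event and let every branch reuse the parts.
parts = values_str[1:-1].split(',')
a = parts
b = parts
```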

Key takeaway

In CPython, bulk assignment (buf[idx] = python_list) beats element-wise indexed writes (buf[idx, N] = val) for numpy arrays. The per-element approach has too much Python->C overhead. Building a temporary Python list is essentially free compared to the numpy indexing machinery invoked 52 times per row.
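A standalone micro-benchmark along these lines reproduces the effect (not the project's benchmark; the 52-column shape mirrors the feature count, and absolute numbers will vary by machine):

```python
import timeit
import numpy as np

buf = np.empty((1, 52), dtype=np.float32)
row = [float(i) for i in range(52)]

def bulk():
    buf[0] = row  # one conversion of the whole list inside numpy

def per_element():
    for col, val in enumerate(row):
        buf[0, col] = val  # 52 separate indexed writes

print("bulk        :", timeit.timeit(bulk, number=100_000))
print("per element :", timeit.timeit(per_element, number=100_000))
```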