Sports Prediction Model

A machine learning pipeline that predicts AFL match outcomes using historical data and engineered form features.

The model is trained on 24 seasons (2000–2023) and evaluated on the held-out 2024 season, with a focus on leakage-safe feature engineering and time-aware validation.

Platform: Jupyter Notebooks
Language: Python
Libraries: pandas, scikit-learn

Key Features

  • Full ML pipeline: ingestion, cleaning, feature engineering, training, and evaluation
  • Wide-to-long transformation (two rows per match: one per team)
  • Rolling form features (points for/against, win %) built with strict shift(1) to prevent leakage
  • Hyperparameter tuning via GridSearchCV with TimeSeriesSplit for temporally correct validation
  • Robustness checked across the 2022, 2023, and 2024 seasons (accuracy varied by only ~0.5 percentage points)
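
The wide-to-long step and the shifted rolling features can be sketched as follows. This is a minimal illustration, not the project's actual code: the toy data, the input column names (home_team, away_pts, and so on), and the helper names to_long and rolling_prior are all assumptions. Only the feature names pf_roll5, pa_roll5, and winpct_roll3 come from the write-up.

```python
import pandas as pd

# Hypothetical wide-format results: one row per match (column names are assumptions).
matches = pd.DataFrame({
    "date": pd.to_datetime(["2024-03-01", "2024-03-08", "2024-03-15", "2024-03-22"]),
    "home_team": ["A", "B", "A", "B"],
    "away_team": ["B", "A", "B", "A"],
    "home_pts": [90, 70, 100, 60],
    "away_pts": [80, 75, 85, 95],
})

def to_long(df):
    """Return two rows per match, one from each team's point of view."""
    home = df.rename(columns={"home_team": "team", "away_team": "opponent",
                              "home_pts": "pf", "away_pts": "pa"}).assign(is_home=1)
    away = df.rename(columns={"away_team": "team", "home_team": "opponent",
                              "away_pts": "pf", "home_pts": "pa"}).assign(is_home=0)
    long = pd.concat([home, away], ignore_index=True)
    long["win"] = (long["pf"] > long["pa"]).astype(int)  # draws count as 0
    return long.sort_values(["team", "date"]).reset_index(drop=True)

def rolling_prior(series, window):
    """Mean of up to `window` previous values; shift(1) excludes the current match."""
    return series.shift(1).rolling(window, min_periods=1).mean()

long = to_long(matches)
long["pf_roll5"] = long.groupby("team")["pf"].transform(lambda s: rolling_prior(s, 5))
long["pa_roll5"] = long.groupby("team")["pa"].transform(lambda s: rolling_prior(s, 5))
long["winpct_roll3"] = long.groupby("team")["win"].transform(lambda s: rolling_prior(s, 3))

print(long[["team", "date", "pf", "pf_roll5", "winpct_roll3"]])
```

Each team's first match has no prior games, so its rolling features are NaN. Whether to drop or impute those rows is a separate choice that this sketch leaves open.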

What I Learned

  • End-to-end ML structure: Keep raw data, processed data, and models separate; use notebooks for experimentation and save stable outputs
  • Time-based evaluation: Train on the past, test on the future; TimeSeriesSplit keeps tuning folds from learning from future seasons
  • Beyond accuracy: Precision matters more than raw accuracy for tipping: "when I predict a win, how often am I right?"
  • Iterating systematically: Baseline → features → tuning → robustness check. Each step measured, not guessed
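
As an illustration of time-aware tuning, the sketch below runs GridSearchCV with TimeSeriesSplit on synthetic data. The data, the parameter grid, and the choice of precision as the tuning score are assumptions made for the example, not the project's settings. It also assumes rows are already sorted chronologically, because TimeSeriesSplit splits by row position rather than by season.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic stand-in for chronologically ordered feature rows (an assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

tscv = TimeSeriesSplit(n_splits=4)

# In every fold, all training rows come before all validation rows.
folds_are_ordered = all(train.max() < val.min() for train, val in tscv.split(X))

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [25, 50], "min_samples_leaf": [1, 5]},  # illustrative grid
    cv=tscv,
    scoring="precision",  # "when I predict a win, how often am I right?"
)
search.fit(X, y)
print(folds_are_ordered, search.best_params_)
```

With real match data in long format, the two rows of one match sit next to each other in time, so a positional split could in principle separate them. Splitting on season or round boundaries would avoid that.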

Challenges & Solutions

  • Data leakage: Applied shift(1) before every rolling window so that only past matches inform each prediction
  • Wide-to-long transformation: Restructured one-row-per-match into one-row-per-team to make per-team features possible
  • Extra-time games: Two matches that went to extra time had missing data; these were patched manually rather than dropped, since the results were valid
  • Off-season rest gaps: Days since last game ballooned over summer; capped at 30 days so the feature reflects in-season rest only
  • Category code consistency: Team/venue codes risk misalignment when new data is encoded; documented for future API integration
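
The rest-gap cap can be sketched as below. The column names and the rest_days feature name are hypothetical; only the 30-day cap comes from the write-up.

```python
import pandas as pd

# Hypothetical per-team game log spanning an off-season (names are assumptions).
games = pd.DataFrame({
    "team": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime(
        ["2023-08-20", "2023-08-27", "2024-03-16", "2023-08-26", "2024-03-17"]
    ),
})
games = games.sort_values(["team", "date"]).reset_index(drop=True)

# Days since each team's previous game; a team's first game has no previous one (NaN).
raw_rest = games.groupby("team")["date"].diff().dt.days

# Cap at 30 so an off-season gap of ~200 days does not dominate the feature.
games["rest_days"] = raw_rest.clip(upper=30)

print(games)
```

Because the difference is taken within each team, it only looks backwards and does not use any information from the match being predicted.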

Design Decisions

  • Random Forest: captures non-linear interactions between form, venue, and rest without heavy preprocessing
  • Binary target (win/loss): draws folded into loss to simplify the task and mirror tipping use cases
  • Predictor set: pf_roll5, pa_roll5, winpct_roll3 performed best on held-out 2024
  • Modular notebooks: exploration, features, training, and iteration kept separate for reproducibility
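
A minimal sketch of the binary target and the season-based holdout, using made-up rows and assumed column names (season, pf, pa):

```python
import pandas as pd

# Toy long-format rows; the values and column names are illustrative only.
rows = pd.DataFrame({
    "season": [2022, 2023, 2023, 2024, 2024],
    "pf": [90, 80, 75, 101, 66],
    "pa": [80, 80, 90, 99, 66],
})

# Strictly more points than the opponent is a win (1); draws and losses are both 0.
rows["win"] = (rows["pf"] > rows["pa"]).astype(int)

# Time-based holdout: fit on seasons before 2024, evaluate on 2024 only.
train = rows[rows["season"] < 2024]
test = rows[rows["season"] == 2024]

print(rows["win"].tolist(), len(train), len(test))
```

The two drawn rows (80–80 and 66–66) are labelled 0, which is what folding draws into loss means here.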

Results

  • Baseline: ~52% accuracy (home/venue/opponent only)
  • Enhanced model (v2): ~59.5% accuracy, ~58.5% precision on 2024
  • Context: pro AFL tipsters average ~65%; the model is competitive for a historical-only statistical approach
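
To show how the two reported metrics differ, here is a small worked example with invented labels. These are not the project's predictions.

```python
from sklearn.metrics import accuracy_score, precision_score

# Invented labels for eight matches: 1 = win, 0 = loss or draw.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 1, 0]

# Accuracy: share of all predictions that are correct (5 of 8 here).
accuracy = accuracy_score(y_true, y_pred)

# Precision: share of predicted wins that really were wins (3 of 5 here).
precision = precision_score(y_true, y_pred)

print(accuracy, precision)
```

Accuracy counts correct loss predictions too. Precision only looks at the matches where the model tipped a win.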

Future Roadmap

  • Integrate Champion Data API for live fixtures and season predictions
  • Add ladder position differential and travel distance as features
  • Refactor notebook logic into src/ modules (data_loader.py, features.py, model.py)