Sports Prediction Model

A machine learning pipeline that predicts AFL match outcomes using historical data and engineered form features.

Built an end-to-end pipeline to predict AFL win/loss outcomes. Trained on 24 seasons (2000–2023) and evaluated on 2024, with a focus on leakage-safe feature engineering and time-aware validation.

Platform: Jupyter Notebooks

Language: Python

Libraries: pandas, scikit-learn


Key Features

Full ML pipeline: ingestion, cleaning, feature engineering, training, and evaluation
Wide-to-long transformation (two rows per match: one per team)
Rolling form features (points for/against, win %) built with a strict shift(1) before each rolling window, so a match never sees its own result
Hyperparameter tuning via GridSearchCV with TimeSeriesSplit for temporally correct validation
Robustness tested across 2022, 2023, and 2024 (accuracy stable within ~0.5 percentage points)
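The wide-to-long step and the shifted rolling features can be sketched as below. This is a minimal illustration, not the project's actual notebook code: the column names (match_id, home_team, home_score, and so on), the placeholder teams, the made-up scores, and min_periods=1 are all assumptions.

```python
import pandas as pd

# Hypothetical wide-format results: one row per match (scores are made up).
matches = pd.DataFrame({
    "match_id": [1, 2, 3, 4],
    "date": pd.to_datetime(["2023-03-16", "2023-03-24", "2023-03-31", "2023-04-07"]),
    "home_team": ["Team A", "Team B", "Team A", "Team B"],
    "away_team": ["Team B", "Team A", "Team B", "Team A"],
    "home_score": [90, 80, 70, 95],
    "away_score": [70, 62, 91, 100],
})


def to_long(wide: pd.DataFrame) -> pd.DataFrame:
    """Return two rows per match, one from each team's point of view."""
    home = wide.rename(columns={
        "home_team": "team", "away_team": "opponent",
        "home_score": "pf", "away_score": "pa",
    }).assign(is_home=1)
    away = wide.rename(columns={
        "away_team": "team", "home_team": "opponent",
        "away_score": "pf", "home_score": "pa",
    }).assign(is_home=0)
    long = pd.concat([home, away], ignore_index=True)
    # Chronological order within each team is required for rolling features.
    return long.sort_values(["team", "date"]).reset_index(drop=True)


def add_form(long: pd.DataFrame) -> pd.DataFrame:
    """Add rolling form columns that only use a team's earlier matches."""
    out = long.copy()
    out["win"] = (out["pf"] > out["pa"]).astype(int)
    for col, name, window in [
        ("pf", "pf_roll5", 5),
        ("pa", "pa_roll5", 5),
        ("win", "winpct_roll3", 3),
    ]:
        # shift(1) removes the current match from the window before averaging.
        out[name] = out.groupby("team")[col].transform(
            lambda s, w=window: s.shift(1).rolling(w, min_periods=1).mean()
        )
    return out


long_df = add_form(to_long(matches))
print(long_df[["team", "date", "pf", "pf_roll5", "winpct_roll3"]])
```

Each team's first match has no history, so its rolling columns are NaN; how those rows are handled (dropped or imputed) is left open here.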

What I Learned

End-to-end ML structure: Separate raw data, processed data, and models — use notebooks for experimentation, save stable outputs
Time-based evaluation: Train on past, test on future — use TimeSeriesSplit so tuning doesn't learn from future seasons
Beyond accuracy: For tipping, precision matters more than raw accuracy because it answers "when I predict a win, how often am I right?"
Iterating systematically: Baseline → features → tuning → robustness check. Each step measured, not guessed
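As a rough sketch of the time-aware tuning idea, the snippet below runs GridSearchCV over a Random Forest with TimeSeriesSplit folds. The data is synthetic, the parameter grid is illustrative, and scoring="precision" is an assumption rather than the confirmed project setting. TimeSeriesSplit splits by row position, so the rows must already be sorted oldest to newest.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic stand-in for the real feature table, assumed sorted by date.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "pf_roll5": rng.normal(85, 12, n),
    "pa_roll5": rng.normal(85, 12, n),
    "winpct_roll3": rng.uniform(0, 1, n),
})
# Toy target loosely tied to form so the search has some signal to find.
y = ((X["pf_roll5"] - X["pa_roll5"] + rng.normal(0, 15, n)) > 0).astype(int)

# Each fold trains on an earlier block of rows and validates on the next block,
# so no fold is ever scored on data older than its training rows.
cv = TimeSeriesSplit(n_splits=4)

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, 6]},  # illustrative
    cv=cv,
    scoring="precision",  # assumption: tune for "when I predict a win, am I right?"
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

A plain KFold with shuffling would mix future matches into the training folds, which is exactly what this setup avoids.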

Challenges & Solutions

Data leakage: Applied shift(1) before every rolling window so that only past matches inform each prediction
Wide-to-long transformation: Restructured one-row-per-match into one-row-per-team to make per-team features possible
Extra-time games: Two matches had missing data due to extra time — manually patched rather than dropping valid data
Off-season rest gaps: Days since last game ballooned over summer — capped at 30 days to reflect in-season rest only
Category code consistency: Team/venue codes risk misalignment with new data — documented for future API integration
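Two of the fixes above are easy to show in code: the 30-day rest cap and stable category codes. The sketch below uses made-up dates and placeholder team names. Pinning an explicit category list is one possible way to keep codes aligned with new data, not necessarily what the project does.

```python
import pandas as pd

# Hypothetical long-format rows (one per team per match), sorted by team and date.
games = pd.DataFrame({
    "team": ["Team A", "Team A", "Team A", "Team B", "Team B"],
    "date": pd.to_datetime(
        ["2023-09-02", "2024-03-15", "2024-03-22", "2024-03-16", "2024-03-24"]
    ),
}).sort_values(["team", "date"]).reset_index(drop=True)

REST_CAP_DAYS = 30  # cap described in the write-up

# Days since each team's previous match; the first match per team has no history.
days_since_last = games.groupby("team")["date"].diff().dt.days
# Clip so the off-season gap does not dominate the feature.
games["rest_days"] = days_since_last.clip(upper=REST_CAP_DAYS)

# Assumed fix: a fixed category list keeps integer codes stable even when a
# later dataset is missing a team or lists teams in a different order.
TEAM_CATEGORIES = ["Team A", "Team B", "Team C"]
games["team_code"] = pd.Categorical(games["team"], categories=TEAM_CATEGORIES).codes

print(games)
```

With a fixed list, any team name not in TEAM_CATEGORIES is encoded as -1, which makes misaligned inputs easy to spot.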

Design Decisions

Random Forest: captures non-linear interactions between form, venue, and rest without heavy preprocessing
Binary target (win/loss): draws folded into loss to simplify the task and mirror tipping use cases
Predictor set: pf_roll5, pa_roll5, winpct_roll3 performed best on held-out 2024
Modular notebooks: exploration, features, training, and iteration kept separate for reproducibility
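The binary target and the season-based holdout can be sketched as follows. The column names (season, pf, pa) and the scores are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical per-team rows; the second and last rows are draws.
long_df = pd.DataFrame({
    "season": [2022, 2023, 2023, 2024, 2024],
    "pf": [90, 75, 80, 101, 66],
    "pa": [70, 75, 95, 88, 66],
})

# Strictly greater than: a draw (pf == pa) is folded into the loss class.
long_df["win"] = (long_df["pf"] > long_df["pa"]).astype(int)

# Time-based holdout: fit on seasons up to 2023, evaluate on 2024 only.
train = long_df[long_df["season"] <= 2023]
test = long_df[long_df["season"] == 2024]

print(long_df)
```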

Results

Baseline: ~52% accuracy (home/venue/opponent only)
Enhanced model (v2): ~59.5% accuracy, ~58.5% precision on 2024
Context: pro AFL tipsters average ~65%, so ~59.5% is competitive for a historical-only statistical approach
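For reference, the two reported metrics can be computed with scikit-learn as in this small example. The labels are toy values, not the project's 2024 predictions.

```python
from sklearn.metrics import accuracy_score, precision_score

# Toy labels: 1 = win, 0 = loss (draws counted as losses).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]

# Accuracy: share of all predictions that were correct.
accuracy = accuracy_score(y_true, y_pred)
# Precision: of the predicted wins, the share that really were wins.
precision = precision_score(y_true, y_pred)

print(f"accuracy={accuracy:.3f}, precision={precision:.3f}")
```

Here 5 of 8 predictions are correct (accuracy 0.625), while 3 of the 5 predicted wins are real wins (precision 0.6).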

Future Roadmap

Integrate Champion Data API for live fixtures and season predictions
Add ladder position differential and travel distance as features
Refactor notebook logic into src/ modules (data_loader.py, features.py, model.py)