Sports Prediction Model
A machine learning pipeline that predicts AFL match outcomes using historical data and engineered form features.
Built an end-to-end pipeline to predict AFL win/loss outcomes. Trained on 24 seasons (2000–2023) and evaluated on 2024, with a focus on leakage-safe feature engineering and time-aware validation.
Key Features
- Full ML pipeline: ingestion, cleaning, feature engineering, training, and evaluation
- Wide-to-long transformation (two rows per match: one per team)
- Rolling form features (points for/against, win %) built with strict shift(1) to prevent leakage
- Hyperparameter tuning via GridSearchCV with TimeSeriesSplit for temporally correct validation
- Robustness tested across 2022, 2023, and 2024 (accuracy stable within ~0.5%)
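The wide-to-long reshape and the shift(1) rolling features above can be sketched roughly as follows. This is a minimal illustration with pandas, not the project's actual code: the input column names (match_id, home_team, away_score, and so on), the helper names to_long and add_form, and the min_periods=1 setting are all assumptions. Only the feature names pf_roll5, pa_roll5, and winpct_roll3 come from the write-up.

```python
import pandas as pd

# Hypothetical wide-format input: one row per match (column names are assumptions).
matches = pd.DataFrame({
    "match_id": [1, 2, 3],
    "date": pd.to_datetime(["2024-03-10", "2024-03-17", "2024-03-24"]),
    "home_team": ["A", "B", "A"],
    "away_team": ["B", "A", "B"],
    "home_score": [90, 70, 100],
    "away_score": [80, 85, 60],
})

def to_long(df):
    """Return two rows per match, one from each team's point of view."""
    home = df.rename(columns={"home_team": "team", "away_team": "opponent",
                              "home_score": "pf", "away_score": "pa"}).assign(is_home=1)
    away = df.rename(columns={"away_team": "team", "home_team": "opponent",
                              "away_score": "pf", "home_score": "pa"}).assign(is_home=0)
    long = pd.concat([home, away], ignore_index=True)
    long["win"] = (long["pf"] > long["pa"]).astype(int)  # a draw is recorded as 0
    return long.sort_values(["team", "date"]).reset_index(drop=True)

def add_form(long, specs=(("pf", "pf_roll5", 5),
                          ("pa", "pa_roll5", 5),
                          ("win", "winpct_roll3", 3))):
    """Add rolling means computed from each team's previous matches only."""
    out = long.copy()
    for src, dst, window in specs:
        # shift(1) drops the current match from its own window, which is
        # what keeps the result of a game out of that game's features.
        out[dst] = out.groupby("team")[src].transform(
            lambda s, w=window: s.shift(1).rolling(w, min_periods=1).mean()
        )
    return out

long_df = add_form(to_long(matches))
print(long_df[["team", "date", "pf", "pf_roll5", "winpct_roll3"]])
```

Each team's first row has no history, so its rolling features are NaN. Team A's second row sees only the 90 points from its first game, not the 85 it scored in the match being predicted.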
What I Learned
- End-to-end ML structure: Separate raw data, processed data, and models — use notebooks for experimentation, save stable outputs
- Time-based evaluation: Train on past, test on future — use TimeSeriesSplit so tuning doesn't learn from future seasons
- Beyond accuracy: Precision matters more than raw accuracy — "when I predict a win, how often am I right?"
- Iterating systematically: Baseline → features → tuning → robustness check. Each step measured, not guessed
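As a rough illustration of the time-based evaluation point, the sketch below tunes a Random Forest with GridSearchCV over a TimeSeriesSplit. The data is synthetic, and the parameter grid and the choice of precision as the tuning metric are assumptions made for the example, not the settings used in the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(0)
# Synthetic stand-in for the feature matrix. Rows must already be in
# chronological order, because TimeSeriesSplit splits by row position.
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

cv = TimeSeriesSplit(n_splits=4)
# Each fold trains on an earlier block and validates on the block after it.
folds = [(train.max(), test.min()) for train, test in cv.split(X)]

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, 5]},  # hypothetical grid
    cv=cv,
    scoring="precision",  # assumed metric: "when I predict a win, how often am I right?"
)
search.fit(X, y)
print(search.best_params_)
```

Because every validation block comes strictly after its training block, no hyperparameter choice is scored on matches that precede the data it was fitted on.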
Challenges & Solutions
- Data leakage: Applied shift(1) before every rolling window so that only past matches inform each prediction
- Wide-to-long transformation: Restructured one-row-per-match into one-row-per-team to make per-team features possible
- Extra-time games: Two matches had missing data due to extra time — manually patched rather than dropping valid data
- Off-season rest gaps: Days since last game ballooned over summer — capped at 30 days to reflect in-season rest only
- Category code consistency: Team/venue codes risk misalignment with new data — documented for future API integration
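The rest-gap cap described above might look something like this in pandas. The column names and sample dates are hypothetical, and leaving a team's first game as NaN is an assumption, since the write-up does not say how that case is handled.

```python
import pandas as pd

# Hypothetical long-format schedule for one team spanning an off-season.
games = pd.DataFrame({
    "team": ["A", "A", "A"],
    "date": pd.to_datetime(["2023-09-02", "2024-03-16", "2024-03-23"]),
})
games = games.sort_values(["team", "date"])

# Days since each team's previous game; the first game has no predecessor.
raw_gap = games.groupby("team")["date"].diff().dt.days

# Cap at 30 so a summer break does not dwarf normal in-season rest.
games["rest_days"] = raw_gap.clip(upper=30)
print(games)
```

Here the 196-day gap over summer becomes 30, while the ordinary 7-day turnaround is left untouched.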
Design Decisions
- Random Forest: captures non-linear interactions between form, venue, and rest without heavy preprocessing
- Binary target (win/loss): draws folded into loss to simplify the task and mirror tipping use cases
- Predictor set: pf_roll5, pa_roll5, winpct_roll3 performed best on held-out 2024
- Modular notebooks: exploration, features, training, and iteration kept separate for reproducibility
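To make the target and predictor choices concrete, here is a small sketch that folds draws into the loss class and fits a Random Forest on the three named predictors, training on seasons up to 2023 and holding out 2024. The data is randomly generated, the margin and season columns are assumed names, and n_estimators=200 is an arbitrary choice, so any score it produces is meaningless. It only shows the shape of the split.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

FEATURES = ["pf_roll5", "pa_roll5", "winpct_roll3"]

rng = np.random.default_rng(42)
n = 600
# Synthetic rows standing in for the processed one-row-per-team table.
df = pd.DataFrame({
    "season": rng.integers(2000, 2025, size=n),  # seasons 2000..2024
    "pf_roll5": rng.normal(85, 12, size=n),
    "pa_roll5": rng.normal(85, 12, size=n),
    "winpct_roll3": rng.uniform(0, 1, size=n),
    "margin": rng.integers(-60, 61, size=n),     # assumed column: points for minus points against
})

# Binary target: only a positive margin is a win, so a draw (margin == 0) counts as a loss.
df["win"] = (df["margin"] > 0).astype(int)

# Season-based holdout: fit on the past, evaluate on the final season.
train = df[df["season"] <= 2023]
test = df[df["season"] == 2024]

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(train[FEATURES], train["win"])
preds = model.predict(test[FEATURES])
```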
Results
- Baseline: ~52% accuracy (home/venue/opponent only)
- Enhanced model (v2): ~59.5% accuracy, ~58.5% precision on 2024
- Context: pro AFL tipsters average ~65%, so ~59.5% is competitive for a historical-only statistical approach
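For reference, the two metrics reported above can be computed as in the sketch below. The labels and predictions are made-up toy values, not the 2024 results, and the helper name accuracy_and_precision is hypothetical.

```python
def accuracy_and_precision(y_true, y_pred):
    """Accuracy over all predictions, and precision over predicted wins (label 1)."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(1 for t, p in pairs if t == p)
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    false_pos = sum(1 for t, p in pairs if t == 0 and p == 1)
    accuracy = correct / len(pairs)
    # Precision is undefined when no wins are predicted; return 0.0 in that case.
    predicted_wins = true_pos + false_pos
    precision = true_pos / predicted_wins if predicted_wins else 0.0
    return accuracy, precision

# Toy example: 1 = win, 0 = loss or draw.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 1, 0]
acc, prec = accuracy_and_precision(y_true, y_pred)
print(f"accuracy={acc:.3f} precision={prec:.3f}")
```

In this toy case 5 of 8 predictions are correct, while 3 of the 5 predicted wins are real wins, so accuracy and precision differ even on the same predictions.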
Future Roadmap
- Integrate Champion Data API for live fixtures and season predictions
- Add ladder position differential and travel distance as features
- Refactor notebook logic into src/modules (data_loader.py, features.py, model.py)