2026.5.31

Bangalore Traffic Demand Prediction

Project Whitepaper — Plain-Language Edition

Version: 1.0
Best public score: 91.38 (submission_ensemble.csv)
Metric: R², reported as score = 100 × R²
Audience: Teammates, reviewers, and anyone new to ML

Table of contents

  1. Executive summary
  2. The problem in everyday terms
  3. The data
  4. How success is measured
  5. Our mental model
  6. Feature engineering
  7. Models we used
  8. Validation strategy
  9. Experiment journey
  10. Final solution architecture
  11. Project file map
  12. Glossary
  13. Limitations and outlook
  14. Elevator pitch

1. Executive summary

This project predicts how busy each road cell will be at a given time — similar to forecasting ride or delivery demand across a city grid.

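We are given one full day of historical traffic data (day 48) plus the first two hours of the next day (day 49), and must predict demand for the rest of day 49's morning and early afternoon, at times for which we have never seen labels. The best honest solution combines: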

  1. Smart data preparation — location, time, and “what happened recently”
  2. Two complementary machine-learning models blended together
  3. Simple rules where patterns are obvious (e.g. daytime often mirrors the previous day)

We reached 91.38 out of 100 on the public leaderboard. Scores of exactly 100 from many teams are likely from overfitting the public test set; the final evaluation is expected to use hidden data and a submitted Python script.


2. The problem in everyday terms

Imagine Flipkart needs to know:

“At 10:30 AM, in this small map square, how many deliveries or rides will we need?”

Each row in the dataset answers:

At this location (geohash), on this day, at this time — how high was demand?

  • Demand is a number between 0 and 1 (normalized “busyness”, not a raw count).
  • Location is a geohash — a short code for a map cell.
  • Time is in 15-minute steps (0:0, 0:15, 0:30, …).
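As a small illustration of the 15-minute grid, the sketch below turns a timestamp string into a slot number from 0 to 95. It assumes timestamps are stored as "H:M" strings like the examples above; the function name `slot_index` is hypothetical and not taken from the project code.

```python
def slot_index(timestamp: str) -> int:
    """Convert an 'H:M' string such as '2:15' into a slot number 0-95.

    Assumption: minutes are always one of 0, 15, 30, 45.
    """
    hour_text, minute_text = timestamp.split(":")
    hour, minute = int(hour_text), int(minute_text)
    if not (0 <= hour < 24) or minute not in (0, 15, 30, 45):
        raise ValueError(f"unexpected timestamp: {timestamp!r}")
    return hour * 4 + minute // 15


print(slot_index("0:0"))    # 0
print(slot_index("2:15"))   # 9
print(slot_index("13:45"))  # 55
```

A single integer slot makes it easy to sort rows in time order and to look up "the slot before this one".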

The twist that makes this hard

| What we have in training | What we must predict in test |
|---|---|
| Day 48: full 24 hours (~69k rows) | |
| Day 49: only 00:00–02:00 (~7.8k rows) | Day 49: 02:15–13:45 (~42k rows) |

We must predict daytime on day 49 using mostly day 48 plus a tiny slice of day 49 night. We never see day-49 labels for 10 AM, 11 AM, etc. during training.


3. The data

3.1 Columns explained

| Column | Plain meaning |
|---|---|
| geohash | Which map cell (location) |
| day | 48 or 49 |
| timestamp | Time of day (15-min slots) |
| demand | Target: how busy it was (0–1). Missing in test. |
| RoadType | Highway, Street, Residential, etc. |
| NumberofLanes | Road width |
| LargeVehicles | Allowed or not |
| Landmarks | Landmark nearby (Yes/No) |
| Temperature | Recorded temperature |
| Weather | Sunny, Rainy, Foggy, Snowy |

3.2 Scale

| Split | Rows | Unique locations |
|---|---|---|
| Train | 77,299 | ~1,249 |
| Test | 41,778 | ~1,190 |

3.3 Data flow (high level)


4. How success is measured

Metric: R² (R-squared)

Think of it as: “How much of the true pattern did we capture?”

| Score | Meaning |
|---|---|
| 100 | Perfect predictions |
| 91.38 | Our best honest result; very strong |
| 60 | Poor; we got this when we used a wrong rule (scaling everything up) |

R² punishes systematic mistakes heavily. If you predict everything 60% too high, the score collapses — even if the general shape looks right.
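To make the effect concrete, here is a minimal sketch that computes the score (100 × R²) by hand on made-up numbers. The five demand values are invented for illustration only, and `r2_score_100` is a hypothetical helper name; the size of the collapse on real data depends on the data.

```python
import numpy as np


def r2_score_100(y_true, y_pred):
    """Return 100 * R^2, where R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # squared prediction errors
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # spread around the mean
    return 100.0 * (1.0 - ss_res / ss_tot)


# Invented toy demand values, not project data.
y_true = np.array([0.10, 0.20, 0.30, 0.40, 0.50])

# Small random-looking errors: the score stays high.
close = y_true + np.array([0.01, -0.01, 0.01, -0.01, 0.0])

# Right shape, but every value is 60% too high: the score collapses.
too_high = 1.6 * y_true

print(round(r2_score_100(y_true, close), 1))     # about 99.6
print(round(r2_score_100(y_true, too_high), 1))  # about -98.0
```

The second prediction follows the true shape perfectly, yet its score is far worse than the first because every error points in the same direction and the squared errors add up.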


5. Our mental model

We treated this as three different sub-problems, not one uniform rule:

Why this split matters

Replacing only the hour-2 predictions with a plain day-48 copy dropped the score from 90.96 to 79.55 (about −11.4 points). Hour 2 is small in row count but huge in score impact because night demand is high and spiky.


6. Feature engineering

Feature engineering = creating new columns that help the model. We did not feed raw timestamps alone.

6.1 Feature categories

6.2 Lag features (memory of the past)

| Feature | Plain meaning |
|---|---|
| d48_same_slot | Demand at this cell at this time on day 48 |
| lag_1, lag_2, lag_3 | Demand 15, 30, 45 minutes earlier on the same day at the same cell |
| roll_mean_3 | Average of last 3 slots |
| roll_std_3 | Variability of last 3 slots |
| d49_last_known | Last known day-49 demand for that cell (from night training data) |
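The lag and rolling features in the table above can be sketched with pandas as follows. This is an illustration under stated assumptions, not the project's `src/temporal_features.py`: it assumes a numeric `slot` column (0 to 95) has already been derived from the timestamp, and that each cell has no gaps between consecutive slots, so that "one row back" means "15 minutes earlier". The function name `add_lag_features` is hypothetical.

```python
import pandas as pd


def add_lag_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add lag_1..lag_3, roll_mean_3 and roll_std_3 per cell and day.

    Assumes columns: geohash, day, slot (0-95), demand.
    """
    out = df.sort_values(["geohash", "day", "slot"]).copy()
    grouped = out.groupby(["geohash", "day"])["demand"]

    # shift(k) looks k rows back inside each (cell, day) group,
    # so a lag never crosses into another cell or another day.
    for k in (1, 2, 3):
        out[f"lag_{k}"] = grouped.shift(k)

    # Rolling statistics use only the three previous slots, never the
    # current demand, so the target does not leak into its own features.
    lags = out[["lag_1", "lag_2", "lag_3"]]
    out["roll_mean_3"] = lags.mean(axis=1, skipna=False)
    out["roll_std_3"] = lags.std(axis=1, skipna=False)
    return out


# Invented toy data for one cell.
toy = pd.DataFrame({
    "geohash": ["a"] * 5,
    "day": [48] * 5,
    "slot": [0, 1, 2, 3, 4],
    "demand": [0.10, 0.20, 0.30, 0.40, 0.50],
})
features = add_lag_features(toy)
print(features[["slot", "demand", "lag_1", "roll_mean_3"]])
```

The first rows of each group have missing lags because there is no earlier slot to look at; tree models such as XGBoost and CatBoost can generally handle such missing values directly.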

6.3 Key discovery — 15-minute persistence

Demand at consecutive 15-minute slots has a correlation of about 0.95. The test period starts at 2:15, and the training labels for day 49 end at 2:00, so “use the 2:00 value” is a powerful rule for hour 2.
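A minimal sketch of the persistence idea: for each cell, take the demand at its latest labelled day-49 slot and attach it to the test rows. The column name `slot` and the function name `last_known_day49` are assumptions for illustration, and the numbers are invented.

```python
import pandas as pd


def last_known_day49(train: pd.DataFrame) -> pd.Series:
    """Per geohash, return the demand at the latest labelled day-49 slot."""
    night = train[train["day"] == 49].sort_values("slot")
    return night.groupby("geohash")["demand"].last().rename("d49_last_known")


# Invented toy rows: cell "a" has two night labels, cell "b" has one.
train = pd.DataFrame({
    "geohash": ["a", "a", "b", "b"],
    "day": [49, 49, 49, 48],
    "slot": [7, 8, 8, 8],
    "demand": [0.40, 0.50, 0.20, 0.90],
})
test = pd.DataFrame({"geohash": ["a", "b", "c"], "slot": [9, 9, 9]})

# Cells with no day-49 night label (here "c") get NaN and need a fallback.
test["d49_last_known"] = test["geohash"].map(last_known_day49(train))
print(test)
```

Cell "b" also has a day-48 value of 0.90 at the same slot, but the function ignores it: only day-49 rows count as "last known".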

6.4 Cyclical time encoding

Raw hour treats 23:45 and 0:00 as far apart. Sin/cos encoding tells the model they are neighbors:

hour_sin = sin(2π × hour / 24)
hour_cos = cos(2π × hour / 24)

Same idea for 15-minute slots across the day.
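The sketch below applies the same encoding to 15-minute slots (period 96) and checks that 23:45 and 00:00 end up close together while 00:00 and 12:00 end up far apart. The helper name `cyclical_encode` is hypothetical.

```python
import numpy as np


def cyclical_encode(value, period):
    """Map a periodic value onto the unit circle as (sin, cos)."""
    angle = 2.0 * np.pi * np.asarray(value, dtype=float) / period
    return np.sin(angle), np.cos(angle)


slots = np.array([0, 48, 95])  # 00:00, 12:00, 23:45
slot_sin, slot_cos = cyclical_encode(slots, period=96)
points = np.column_stack([slot_sin, slot_cos])

# Straight-line distance between encoded points.
gap_midnight = np.linalg.norm(points[2] - points[0])  # 23:45 vs 00:00
gap_noon = np.linalg.norm(points[1] - points[0])      # 12:00 vs 00:00

print(gap_midnight < gap_noon)  # True
```

Both the sine and the cosine are needed: with only one of them, two different times of day would share the same value.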

6.5 What did NOT help

| Idea | Result |
|---|---|
| Scale all predictions × 1.64 (night ratio) | Score 60.94; destroyed accuracy |
| Temperature as main driver | Correlation ≈ 0.003; essentially noise |
| 9-model mega-ensemble | Marginal gain over simpler 2-model blend |

7. Models we used

A model learns patterns from past data and predicts future values. We used gradient boosted decision trees — many small decision trees combined, ideal for spreadsheet-like data.

7.1 Model comparison (simple view)

Quadrant positions are illustrative scores relative to our experiments, not exact measurements.

7.2 CatBoost

What: Tree model that handles categories (geohash, road type) natively
Why: Location and road type are categorical; CatBoost excels here
Output: submission_catboost.csv

7.3 XGBoost + target encoding

What: Tree model with numeric features including encoded location averages
Why: Makes different errors than CatBoost → good for blending
Target encoding: Replace geohash with “average demand for this area” (fit without leakage)
Output: submission_full.csv
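One common way to fit a target encoding "without leakage" is out-of-fold encoding: each row's encoded value is computed only from rows in other folds, so a row never sees its own target. The sketch below illustrates that general technique. The random fold scheme, the function name `oof_target_encode`, and the toy numbers are assumptions; the project may have used a different scheme (for example, a time-ordered one).

```python
import numpy as np
import pandas as pd


def oof_target_encode(train, column, target, n_splits=5, seed=0):
    """Out-of-fold mean encoding of `column` using `target`."""
    rng = np.random.default_rng(seed)
    # Assign every row to one of n_splits folds at random.
    fold_of_row = rng.permutation(len(train)) % n_splits
    encoded = np.empty(len(train), dtype=float)

    for fold in range(n_splits):
        apply_mask = fold_of_row == fold
        fit_part = train[~apply_mask]  # rows outside this fold
        means = fit_part.groupby(column)[target].mean()
        fallback = fit_part[target].mean()  # for categories unseen in fit_part
        mapped = train.loc[apply_mask, column].map(means).fillna(fallback)
        encoded[apply_mask] = mapped.to_numpy()

    return pd.Series(encoded, index=train.index, name=f"{column}_te")


# Invented toy data: cell "z" appears once with an extreme demand of 1.0.
toy = pd.DataFrame({
    "geohash": ["a"] * 6 + ["z"],
    "demand": [0.2] * 6 + [1.0],
})
encoded = oof_target_encode(toy, "geohash", "demand", n_splits=3)
print(encoded)
```

In the toy data, cell "z" is encoded with the fallback mean of the other rows rather than with its own 1.0, which is exactly the leak that a naive per-category average would introduce.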

7.4 Ensemble (our best public score)

final_prediction = 0.6 × XGBoost + 0.4 × CatBoost

Score: 91.38 — when one model is slightly wrong, the other often compensates.
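The blend itself is a weighted average of the two prediction arrays. The sketch below shows it on invented numbers; the function name `blend` is hypothetical, and the final clip to the 0–1 range is an added assumption based on demand being normalized, not something the formula above states.

```python
import numpy as np


def blend(pred_xgb, pred_cat, w_xgb=0.6, w_cat=0.4):
    """Weighted average of two prediction arrays, kept inside [0, 1]."""
    if abs(w_xgb + w_cat - 1.0) > 1e-9:
        raise ValueError("weights should sum to 1")
    mixed = (w_xgb * np.asarray(pred_xgb, dtype=float)
             + w_cat * np.asarray(pred_cat, dtype=float))
    # Assumption: demand is normalized, so clip to the valid range.
    return np.clip(mixed, 0.0, 1.0)


# Invented predictions for two rows.
final_prediction = blend([0.50, 0.20], [0.40, 0.30])
print(final_prediction)  # approximately [0.46 0.24]
```

Both inputs must list the same test rows in the same order; blending two submission files that are sorted differently would silently mix unrelated predictions.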

7.5 Why not neural networks / RNNs?

| Reason | Explanation |
|---|---|
| Data shape | Tabular rows, not long sequences like video or text |
| Lags work | Hand-built 15-min memory captured most time signal |
| Data size | ~77k rows; trees + good features beat deep learning complexity |
| Interpretability | Easier to debug rules (hour 2 vs daytime) |

8. Validation strategy

Bad approach: Randomly shuffle all rows → 80% train / 20% test.

Problem: Model sees future and past mixed together. Local scores look too good and mislead you.

Good approach (what we built):

| Validation type | Used in | Trust level |
|---|---|---|
| Random KFold | Early train_final.py | ⚠️ Misleading for time data |
| Temporal night CV | validate.py, train_temporal.py | ✅ Matches real test structure |
| Public leaderboard | Submissions | ⚠️ Some teams may overfit public answers |

On honest temporal validation, our improved pipeline reached a score of about 87 (100 × R²) on held-out night data, which is a realistic signal.
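A time-based split can be sketched as follows: everything up to a cutoff is used for fitting, and the labelled day-49 slots after the cutoff are held out, mimicking "train on the past, predict the future". The cutoff value, the `slot` column, and the function name `temporal_night_split` are assumptions for illustration; the actual logic in validate.py may differ.

```python
import pandas as pd


def temporal_night_split(train: pd.DataFrame, cutoff_slot: int):
    """Hold out labelled day-49 rows whose slot is after cutoff_slot."""
    is_holdout = (train["day"] == 49) & (train["slot"] > cutoff_slot)
    return train[~is_holdout], train[is_holdout]


# Invented toy data for one cell: all of day 48, plus day 49 up to 02:00.
toy = pd.DataFrame({
    "day": [48] * 96 + [49] * 9,
    "slot": list(range(96)) + list(range(9)),
})
toy["demand"] = 0.5

# Fit on day 48 and day 49 up to 01:00 (slot 4); validate on 01:15-02:00.
fit_part, holdout = temporal_night_split(toy, cutoff_slot=4)
print(len(fit_part), len(holdout))  # 101 4
```

Any lag or "last known" feature should be rebuilt from `fit_part` only, otherwise the held-out labels leak back into the features and the local score looks better than it should.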


9. Experiment journey

9.1 Score timeline

9.2 Full submission log

| File | Strategy | Score | Status |
|---|---|---|---|
| submission_ensemble.csv | 60% XGB + 40% CatBoost | 91.38 | ✅ Best |
| submission_hybrid.csv | Day-48 copy + hour-2 ML blend | 90.96 | |
| probe_C_global_ratio.csv | Scale all by night ratio | 60.94 | |
| probe_F_model_hard.csv | Night-calibrated model on hard rows | 80.09 | |
| probe_I_pure_copy.csv | Pure day-48 copy everywhere | 79.55 | |

9.3 Confirmed truths

  1. Daytime (hours 3–13) ≈ same demand level as day 48 at same location and time (~89% exact match).
  2. Hour 2 must use day-49 night data — not plain day-48 copy.
  3. Demand is smooth over 15 minutes — recent history beats yesterday’s shape at night.
  4. Public 100s likely overfit; final ranking may use hidden test data.
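Truths 1 and 2 suggest a simple rule-based baseline: copy day 48 for daytime, and use the last known day-49 night value for hour 2. The sketch below is a simplified illustration of that routing, not the project's train_hybrid.py (which blends in an ML model for hour 2). The column names match the feature table in section 6.2, while `slot` and the function name `rule_based_prediction` are assumptions.

```python
import numpy as np
import pandas as pd


def rule_based_prediction(test: pd.DataFrame) -> pd.Series:
    """Route each test row to a rule based on its hour of day."""
    hour = test["slot"] // 4  # four 15-minute slots per hour

    # Hour 2: prefer the last known day-49 night value; fall back to
    # the day-48 value when a cell has no night label.
    night_value = test["d49_last_known"].fillna(test["d48_same_slot"])

    # Every other hour: copy day 48 at the same cell and slot.
    pred = np.where(hour == 2, night_value, test["d48_same_slot"])
    return pd.Series(pred, index=test.index, name="demand").clip(0.0, 1.0)


# Invented toy rows: 02:15, 10:00, and 02:30 for a cell without night data.
toy = pd.DataFrame({
    "slot": [9, 40, 10],
    "d48_same_slot": [0.30, 0.25, 0.10],
    "d49_last_known": [0.60, 0.60, np.nan],
})
print(rule_based_prediction(toy).tolist())  # [0.6, 0.25, 0.1]
```

A baseline like this is useful as a sanity check: a trained model that cannot beat it on honest validation is probably not learning anything beyond these two rules.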

10. Final solution architecture

Recommended scripts to run

```bash
# Main pipeline (temporal CV + lags + CatBoost + XGB + ensemble)
python scripts/train/train_temporal.py

# CV only: check honest score without submitting
python scripts/train/train_temporal.py --cv-only

# Re-blend with proven 91.38 weights
python scripts/blend/blend_ensemble.py --w-xgb 0.6 --w-cat 0.4
```

11. Project file map

| File | Purpose |
|---|---|
| data/train.csv, test.csv | Raw data |
| src/features.py | Original feature pipeline |
| src/temporal_features.py | Lags, rolling stats, cyclical time |
| validate.py | Honest time-based validation |
| scripts/train/train_temporal.py | Recommended main training pipeline |
| train_hybrid.py | Formula + ML hybrid (90.96) |
| blend_ensemble.py | Blend two submission CSVs |
| EXPERIMENT_LOG.md | Detailed score comparison |
| notebooks/traffic_demand_analysis.ipynb | Optional EDA (not required for scoring) |

12. Glossary

| Term | Simple definition |
|---|---|
| Machine learning (ML) | Computer learns patterns from examples instead of hand-written rules |
| Model | The learned program that makes predictions |
| Feature | One input signal the model uses (e.g. “hour”, “lag_1”) |
| Training | Showing the model past data so it learns |
| Test / submission | Predictions for unseen rows, uploaded for scoring |
| R² score | Accuracy metric; 100 = perfect |
| Geohash | Short code for a map grid cell |
| Lag | Value from an earlier time step |
| Target encoding | Replace a category with its average demand (carefully, to avoid cheating) |
| Ensemble | Combining multiple models’ predictions |
| Overfitting | Memorizing public answers instead of learning real rules |
| Cross-validation (CV) | Testing on held-out data during development |
| CatBoost / XGBoost | Popular tree-based ML libraries for tabular data |

13. Limitations and outlook

  1. We cannot fully validate daytime day-49 locally — no labels for those hours in training.
  2. ~9 points below public “100” may be unbridgeable without overfitting public test.csv.
  3. Final ranking may favor models that generalize on hidden data over public leaderboard 100s.
  4. Temperature and weather appear to be noise in this synthetic dataset.

14. Elevator pitch

We predict traffic demand on a city grid by combining where (geohash), when (time and recent 15-minute history), and what happened yesterday at the same place and time. Daytime is mostly stable and matches day 48; the tricky 2 AM hour needs fresh day-49 night data. Two tree-based models (CatBoost and XGBoost) learn these patterns and are blended for a 91.38 score. We validate honestly by simulating the real test — train on the past, predict unseen future times — so the solution is built to survive a hidden final test, not just chase a public leaderboard.

RISHITA · RESEARCH & ENGINEERING