Time-Series Forecasting in Production: ARIMA, Prophet, and LSTM

Forecasting demos look magical. Forecasting in production is a grind of messy data, drifting distributions, and stakeholders who want a single number plus a confidence level they can trust. Having built forecasting systems for returnable-packaging demand and daily planning, here's what actually mattered.

The three workhorses

Most business forecasting problems can be served well by one of three approaches. The trick is matching the model to the signal, not chasing the fanciest architecture.

| Model | Best for | Watch out for |
|---|---|---|
| ARIMA | Series that are stationary (or become so after differencing) with clear autocorrelation | Manual (p,d,q) tuning; struggles with multiple seasonalities |
| Prophet | Strong seasonality + holidays, fast iteration | Can over-smooth sharp regime changes |
| LSTM | Long, non-linear dependencies and many related series | Data-hungry; easy to overfit; heavier to serve |

In practice, a well-tuned Prophet or ARIMA baseline is shockingly hard to beat, and it's the honest yardstick every deep-learning model should be measured against.

Start with a baseline you can defend

Before any model, I compute a naive baseline — last value, or seasonal naive (value from the same period last cycle):

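import numpy as np
import pandas as pd

def seasonal_naive(series: pd.Series, season_length: int) -> pd.Series:
    # Forecast each point with the value observed one full season earlier.
    return series.shift(season_length)

# y_true: your observed pd.Series; season_length=7 assumes daily data with a weekly cycle.
y_naive = seasonal_naive(y_true, season_length=7)
# Skip the warm-up rows (no forecast yet) and zero actuals (MAPE is undefined there).
valid = y_naive.notna() & (y_true != 0)

# If your LSTM can't beat this, the LSTM is not the answer.
mape_baseline = np.mean(np.abs((y_true[valid] - y_naive[valid]) / y_true[valid])) * 100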

If a model can't beat seasonal naive, it isn't earning its complexity. This single check has saved me from shipping more than one over-engineered pipeline.

Where the accuracy actually came from

Across projects, improvements in forecast accuracy came far more from data and framing than from swapping models:

Aligning the forecast horizon to the decision. Forecasting daily when the decision is weekly just adds noise.
Cleaning the target. Promotions, stockouts, and one-off events distort history — flag them or model them explicitly.
Adding the right regressors. Calendar effects, holidays, and known future events (Prophet handles these cleanly).
Reconciling hierarchies. SKU-level forecasts should sum to the category forecast; reconciliation prevents contradictory numbers.
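
The first two points are mostly pandas plumbing. Here's a minimal sketch, assuming daily demand with a boolean stockout flag; the column names, the numbers, and the choice to fill flagged days by linear interpolation are all illustrative assumptions, not what any particular project used.

```python
import pandas as pd

# Hypothetical daily demand with one flagged stockout day (all values made up).
days = pd.date_range("2024-01-01", periods=14, freq="D")  # starts on a Monday
daily = pd.DataFrame(
    {
        "demand": [10, 12, 11, 0, 13, 9, 8, 11, 12, 10, 14, 13, 9, 7],
        "stockout": [False] * 3 + [True] + [False] * 10,
    },
    index=days,
)

# Cleaning the target: a stockout day records zero sales, not zero demand,
# so treat it as missing and fill it from neighbouring days.
cleaned = daily["demand"].astype(float).mask(daily["stockout"])
cleaned = cleaned.interpolate(method="linear")

# Aligning the horizon: if the decision is weekly, forecast weekly totals.
# "W-SUN" produces bins that end on Sunday, i.e. Monday-to-Sunday weeks.
weekly = cleaned.resample("W-SUN").sum()
print(weekly)
```

Interpolation is only one way to handle flagged days. Keeping the flag as a regressor, or dropping those rows entirely, can be a better fit depending on the model.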

On the returnable-packaging work, this framing discipline is what took forecast accuracy up ~12% — the model family barely changed.
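
To make the reconciliation point concrete, here's a rough sketch of the two simplest approaches: bottom-up (the category is defined as the sum of its SKUs) and proportional top-down (SKUs are rescaled to match the category number). The SKU names and figures are hypothetical, and more sophisticated reconciliation methods exist that this sketch doesn't attempt.

```python
import pandas as pd

# Hypothetical SKU-level forecasts and an independently produced category forecast.
sku_forecast = pd.DataFrame(
    {
        "category": ["crates", "crates", "pallets"],
        "sku": ["crate_small", "crate_large", "pallet_std"],
        "forecast": [120.0, 80.0, 300.0],
    }
)
category_forecast = pd.Series({"crates": 220.0, "pallets": 300.0})

# Bottom-up: the category number is simply the sum of its SKU forecasts.
bottom_up = sku_forecast.groupby("category")["forecast"].sum()

# Proportional top-down: keep the category number and rescale each SKU
# so the SKUs in a category add up to it.
scale = category_forecast / bottom_up
sku_forecast["reconciled"] = (
    sku_forecast["forecast"] * sku_forecast["category"].map(scale)
)

print(bottom_up)
print(sku_forecast)
```

Which direction to trust is a judgment call: bottom-up preserves SKU-level signal, while top-down leans on the usually more stable aggregate.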

Serving and monitoring

A forecast that isn't monitored silently rots. The essentials:

Backtest with rolling origins, never a single train/test split. You want the error distribution across time, not one lucky number.
Track error by segment. Aggregate MAPE hides the SKUs that are quietly on fire.
Watch for drift. When recent residuals start trending in one direction, a retrain trigger should fire automatically.
Ship intervals, not just points. Planners act on the risk, so prediction intervals are often more valuable than the mean.
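
Rolling-origin backtesting and intervals fit together nicely, because the backtest residuals give you an empirical error distribution for free. Below is a minimal sketch using seasonal naive as the model; the function name `rolling_origin_errors`, the synthetic data, and the 80% level are assumptions chosen for illustration. Pooling residuals across all horizon steps is also a simplification, since errors usually grow with the horizon.

```python
import numpy as np

def rolling_origin_errors(y, season_length, horizon, n_folds):
    """Seasonal-naive errors (actual minus forecast) for several forecast origins.

    Returns an array of shape (n_folds, horizon).
    """
    y = np.asarray(y, dtype=float)
    first_origin = len(y) - n_folds * horizon
    if first_origin < season_length:
        raise ValueError("not enough history for the requested folds")
    errors = []
    for fold in range(n_folds):
        # Each fold moves the forecast origin forward by one horizon.
        origin = first_origin + fold * horizon
        train, test = y[:origin], y[origin:origin + horizon]
        # Seasonal naive: repeat the last observed season across the horizon.
        forecast = np.resize(train[-season_length:], horizon)
        errors.append(test - forecast)
    return np.array(errors)

# Synthetic daily series with a weekly cycle (made-up data).
rng = np.random.default_rng(0)
t = np.arange(140)
y = 100 + 10 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 2, size=t.size)

errors = rolling_origin_errors(y, season_length=7, horizon=7, n_folds=8)
mae_per_fold = np.abs(errors).mean(axis=1)  # error distribution across time

# Empirical 80% interval: shift the point forecast by residual quantiles.
lo, hi = np.quantile(errors, [0.1, 0.9])
point = np.resize(y[-7:], 7)
interval = np.stack([point + lo, point + hi], axis=1)

print(mae_per_fold.round(2))
print(interval.round(1))
```

Looking at `mae_per_fold` rather than a single average is the whole point: one bad fold tells you more about production risk than a good mean does.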

Lessons learned

The model is ~20% of the work; data quality, horizon design, and monitoring are the other 80%.
A strong classical baseline keeps everyone honest — including you.
Deep learning wins when you have many related series and non-linear structure, not by default.
Communicate uncertainty. A number without an interval is a false promise.

Good forecasting is less about the algorithm and more about respecting the messiness of the real world the numbers come from.