Exponential smoothing was proposed in the 1950s, and has motivated some of the most successful forecasting models.
Forecasts are weighted averages, with weights decaying exponentially as the observations get older.
Simple Exponential Smoothing
The simplest method is simple exponential smoothing (SES). It’s suitable for forecasting data with no clear trend or seasonal pattern.
# Algerian economy
global_economy %>%
  filter(Country == 'Algeria') %>%
  autoplot(Exports) +
  labs(
    x = 'Year',
    y = 'Exports (% of GDP)',
    title = 'Algerian Economy - Exports'
  )
Using the naive method, all forecasts for the future are equal to the last observed value.
\[ \hat{y}_{T + h|T} = y_T \]
for \(h = 1,2,\ldots\).
So the last observation is the most important. It can be thought of as a weighted average, with all the weight on the last observation.
With the average method, all future forecasts are equal to the average of all observed data:
\[ \hat{y}_{T + h|T} = \frac{1}{T} \sum_{t=1}^T y_t \]
So all observations are equally important: this is a weighted average with the same weight for every observation.
We often want something between these two extremes. We can attach more weight to more recent observations than to those in the past.
This is what is used in exponential smoothing. Weights are attached to observations, with the weights decreasing exponentially as observations come from further in the past:
\[ \hat{y}_{T+1|T} = \alpha y_T + \alpha(1 - \alpha)y_{T-1} + \alpha(1 - \alpha)^2 y_{T-2} + \ldots \]
where \(0 \le \alpha \le 1\) is the smoothing parameter. The one-step forecast is a weighted average of all the observations in the series, with the decrease in weight controlled by the parameter \(\alpha\).
The sum of the weights should be approximately 1 for a reasonable sample size.
If \(\alpha\) is close to 0, more weight is given to observations from the more distant past. If \(\alpha\) is close to 1, more weight is given to more recent observations.
For the extreme case where \(\alpha = 1\), \(\hat{y}_{T+1|T} = y_T\) and the forecasts are naive forecasts.
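To see the decay concretely, here is a small base R sketch (with a made-up \(\alpha\)) that prints the first few weights and confirms they sum to approximately 1:
# Exponentially decaying weights: alpha * (1 - alpha)^j for j = 0, 1, 2, ...
alpha <- 0.8
w <- alpha * (1 - alpha)^(0:19)
round(w[1:5], 4) # 0.8000 0.1600 0.0320 0.0064 0.0013
sum(w) # very close to 1 for any reasonable number of observations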
Weighted Average Form
The forecast at time \(T + 1\) is equal to the weighted average between the most recent observation \(y_T\) and the previous forecast \(\hat{y}_{T|T-1}\):
\[ \hat{y}_{T+1|T} = \alpha y_T + (1 - \alpha)\hat{y}_{T|T-1} \]
The process starts somewhere, so we let the first fitted value at time 1 be denoted by \(\ell_0\) (which is estimated). Then:
\[ \hat{y}_{2|1} = \alpha y_1 + (1 - \alpha)\ell_0 \\ \hat{y}_{3|2} = \alpha y_2 + (1 - \alpha)\hat{y}_{2|1} \\ \hat{y}_{4|3} = \alpha y_3 + (1 - \alpha)\hat{y}_{3|2} \\ \vdots \\ \hat{y}_{T+1|T} = \alpha y_T + (1 - \alpha)\hat{y}_{T|T-1} \] Substituting each equation into the previous one, then simplifying, we get:
\[ \hat{y}_{T+1|T} = \sum_{j=0}^{T-1} \alpha(1 - \alpha)^j y_{T-j} + (1 - \alpha)^T \ell_0 \]
The last term becomes tiny for large \(T\).
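As a sanity check, here is a short base R sketch (made-up series and parameters) verifying that the recursive form and the expanded weighted sum produce the same one-step forecast:
# Recursive form of simple exponential smoothing
y <- c(10, 12, 11, 13, 12)
alpha <- 0.6
ell0 <- 10
level <- ell0
for (t in seq_along(y)) level <- alpha * y[t] + (1 - alpha) * level
# Expanded weighted sum, including the (1 - alpha)^T * ell0 term
T_len <- length(y)
expanded <- sum(alpha * (1 - alpha)^(0:(T_len - 1)) * rev(y)) +
  (1 - alpha)^T_len * ell0
c(recursive = level, expanded = expanded) # the two agree exactly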
Component Form
An alternative representation is the component form. For simple exponential smoothing the only component included is the level, \(\ell_t\) (other methods include trend and seasonal components).
Component forms comprise a forecast equation and a smoothing equation for each of the components included.
Simple exponential smoothing is given by:
\[ \text{Forecast equation: } \hat{y}_{t+h|t} = \ell_t \\ \text{Smoothing equation: } \ell_t = \alpha y_t + (1 - \alpha)\ell_{t-1} \] where \(\ell_t\) is the level (or the smoothed value) of the series at time \(t\). Setting \(h=1\) gives the fitted values, while setting \(t = T\) gives the true forecasts beyond the training data.
The forecast equation shows the forecast value at time \(t+1\) is the estimated level at time \(t\).
The level equation gives the estimated level of the series at each period \(t\).
The component form of simple exponential smoothing is not particularly useful by itself, but it will be the easiest form to use once other components are added.
Flat Forecasts
Simple exponential smoothing has a “flat” forecast function:
\[ \hat{y}_{T+h|T} = \hat{y}_{T+1|T} = \ell_T, \qquad h = 2,3,\ldots \]
That is, all forecasts take the same value. These forecasts will only be suitable if the time series has no trend or seasonal component.
Optimisation
Two quantities need to be chosen: the smoothing parameter \(\alpha\) and the initial level \(\ell_0\). In some cases this choice is made subjectively, based on subject matter expertise.
The unknown parameters can instead be estimated by minimising the sum of squared residuals (or sum of squared errors, SSE):
\[ SSE = \sum_{t=1}^T(y_t - \hat{y}_{t|t-1})^2 = \sum_{t=1}^T e_t^2 \] Unlike in regression, this involves a non-linear minimisation problem, so an optimisation tool is required to solve it.
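To illustrate what the optimiser is doing, here is a hand-rolled sketch (made-up data and a hypothetical helper function) that minimises the SSE over \(\alpha\) and \(\ell_0\) with optim(); in practice ETS() does this for us:
# SSE as a function of alpha and ell_0 for simple exponential smoothing
ses_sse <- function(par, y) {
  alpha <- par[1]
  level <- par[2] # ell_0
  sse <- 0
  for (t in seq_along(y)) {
    e <- y[t] - level # one-step error: the forecast is the current level
    sse <- sse + e^2
    level <- alpha * y[t] + (1 - alpha) * level
  }
  sse
}
y <- c(22, 25, 24, 28, 26, 30) # made-up series
optim(par = c(0.5, y[1]), fn = ses_sse, y = y,
      method = 'L-BFGS-B', lower = c(1e-4, -Inf), upper = c(1 - 1e-4, Inf))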
Example
Note that in the ETS() specifications below:
- 'A' is additive
- 'M' is multiplicative
- 'N' is none
algeria_economy <-
  global_economy %>%
  filter(Country == 'Algeria')
fit <-
  algeria_economy %>%
  model(
    ets = ETS(Exports ~ error('A') + trend('N') + season('N'), opt_crit = 'mse')
  )
fc <-
  fit %>%
  forecast(h = 5)
fc %>%
  autoplot(algeria_economy) +
  geom_line(aes(y = .fitted, colour = "Fitted"), data = augment(fit))
The forecasts are plotted, as well as the one-step-ahead fitted values. The estimated \(\alpha\) is large (0.84), which is reflected in the large adjustment that occurs in the estimated level \(\ell_t\) at each time.
Methods With Trend
Holt’s Linear Trend Method
This is an extension of simple exponential smoothing to allow forecasting of data with a trend:
\[ \text{Forecast equation: } \qquad \hat{y}_{t+h|t} = \ell_t + hb_t \\ \text{Level equation: } \qquad \ell_t = \alpha y_t + (1-\alpha)(\ell_{t-1} + b_{t-1}) \\ \text{Trend equation: } \qquad b_t = \beta^* (\ell_t - \ell_{t-1}) + (1 - \beta^*)b_{t-1} \]
where \(\ell_t\) denotes an estimate of the level of the series at time \(t\), \(b_t\) denotes an estimate of the trend at time \(t\), \(\alpha\) is the smoothing parameter for the level, and \(\beta^*\) is the smoothing parameter for the trend (\(0 \le \beta^* \le 1\)).
The forecast function is no longer flat but trending. The \(h\)-step ahead forecast is the last estimated level plus \(h\) times the last estimated trend value, so the forecasts are a linear function of \(h\).
# Pull out Australia and create scaled population variable
aus_economy <-
  global_economy %>%
  filter(Country == 'Australia') %>%
  mutate(Pop = Population / 1e6)
# Fit ETS model
aus_econ_fit <-
  aus_economy %>%
  model(
    ets = ETS(Pop ~ error('A') + trend('A') + season('N'))
  )
# Forecast
aus_econ_fc <-
  aus_econ_fit %>%
  forecast(h = 6)
tidy(aus_econ_fit)
## # A tibble: 4 x 4
## Country .model term estimate
## <fct> <chr> <chr> <dbl>
## 1 Australia ets alpha 1.00
## 2 Australia ets beta 0.327
## 3 Australia ets l 10.1
## 4 Australia ets b 0.222
aus_econ_fc %>%
  autoplot(aus_economy) +
  labs(
    x = 'Year',
    y = 'Population (Millions)',
    title = 'Australia - Population',
    subtitle = 'ETS Forecast Model'
  )
The smoothing coefficient for the level is 1, showing that the level changes rapidly in order to capture the highly trended series. The smoothing coefficient for the slope is 0.33. This is relatively large, which suggests the trend also changes often, even if those changes are slight.
Dampened Trend Methods
The forecasts of Holt’s linear method are linear, and have been shown to over-forecast at longer forecast horizons. A parameter was introduced to dampen the trend to a flat line over time.
This dampening parameter \(\phi\) satisfies \(0 < \phi < 1\):
\[ \hat{y}_{t+h|t} = \ell_t + (\phi + \phi^2 + \ldots + \phi^h)b_t \\ \ell_t = \alpha y_t + (1 - \alpha)(\ell_{t-1} + \phi b_{t-1}) \\ b_t = \beta^*(\ell_t - \ell_{t-1}) + (1 - \beta^*)\phi b_{t-1} \]
If \(\phi = 1\), this method is identical to Holt’s method. For other values, it dampens the trend so that it approaches a constant some time in the future. In practice, \(\phi\) is rarely less than 0.8.
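A quick base R sketch (made-up values) shows how the dampening works: the trend contribution \((\phi + \phi^2 + \ldots + \phi^h)b_T\) converges to \(\frac{\phi}{1-\phi}b_T\) as \(h\) grows, so the forecasts level off:
# Dampened trend forecast function with made-up ell_T, b_T and phi
phi <- 0.9
ell_T <- 100
b_T <- 2
h <- 1:20
trend_part <- cumsum(phi^h) * b_T # (phi + phi^2 + ... + phi^h) * b_T
plot(h, ell_T + trend_part, type = 'b', ylab = 'Forecast')
The fable comparison below fits both the regular and dampened versions of Holt’s method to the Australian population data.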
aus_economy %>%
  model(
    `Holt's` = ETS(Pop ~ error('A') + trend('A') + season('N')),
    `Dampened Holt's` = ETS(Pop ~ error('A') + trend('Ad', phi = .98) + season('N'))
  ) %>%
  forecast(h = 20) %>%
  autoplot(aus_economy, level = NULL) +
  labs(
    title = 'Australian Population',
    subtitle = 'Forecasting',
    x = 'Year',
    y = 'Population (Millions)',
    colour = 'Forecast'
  )
Example: Internet Usage
net_usage <- as_tsibble(WWWusage)
net_usage %>%
  autoplot(value) +
  labs(
    x = 'Minute',
    y = 'Users',
    title = 'Internet Usage'
  )
We will use the stretch_tsibble() function to do time series cross-validation.
net_usage %>%
  stretch_tsibble(.init = 10) %>%
  model(
    ses = ETS(value ~ error('A') + trend('N') + season('N')),
    holt = ETS(value ~ error('A') + trend('A') + season('N')),
    dampened = ETS(value ~ error('A') + trend('Ad') + season('N'))
  ) %>%
  forecast(h = 1) %>%
  accuracy(net_usage)
## # A tibble: 3 x 9
## .model .type ME RMSE MAE MPE MAPE MASE ACF1
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 dampened Test 0.288 3.69 3.00 0.347 2.26 0.663 0.336
## 2 holt Test 0.0610 3.87 3.17 0.244 2.38 0.701 0.296
## 3 ses Test 1.46 6.05 4.81 0.904 3.55 1.06 0.803
The dampened Holt’s method performs best on both root mean squared error (RMSE) and mean absolute error (MAE).
net_usage %>%
  model(
    dampened = ETS(value ~ error('A') + trend('Ad') + season('N'))
  ) %>%
  forecast(h = 10) %>%
  autoplot(net_usage) +
  labs(
    x = 'Minute',
    y = 'Users',
    title = 'Internet Usage',
    subtitle = "Forecast Using Dampened Holt's Method"
  )
In this example both metrics (MAE and RMSE) suggested the dampened Holt’s method. However, different metrics will often suggest different forecasting methods, so the task at hand needs to be taken into account.
Methods With Seasonality
The Holt-Winters seasonal method comprises the forecast equation and three smoothing equations for the components \(\ell_t\), \(b_t\) and \(s_t\), with smoothing parameters \(\alpha\), \(\beta^*\) and \(\gamma\) respectively. We use \(m\) to denote the period of the seasonality, i.e. the number of seasons in a year (\(m = 4\) for quarterly data).
There are two variations:
- Additive, used when the seasonal variations are roughly constant through the series.
  - The seasonal component is expressed in absolute terms, in the scale of the observed series.
  - Within a year, the seasonal component adds up to approximately zero.
- Multiplicative, used when the seasonal variations change proportionally to the level of the series.
  - The seasonal component is expressed in relative terms (percentages).
  - Within each year, the seasonal component adds up to approximately \(m\).
Holt-Winters Additive Method
\[ \hat{y}_{t+h|t} = \ell_t + h b_t + s_{t + h - m(k+1)} \\ \ell_t = \alpha(y_t - s_{t-m}) + (1-\alpha)(\ell_{t-1} + b_{t-1}) \\ b_t = \beta^*(\ell_t - \ell_{t-1}) + (1 - \beta^*)b_{t-1} \\ s_t = \gamma(y_t - \ell_{t-1} - b_{t-1}) + (1-\gamma)s_{t-m} \] where \(k\) is the integer part of \((h-1)/m\). This ensures that the estimates of the seasonal indices used for forecasting come from the final year of the sample.
- The level equation shows a weighted average between the seasonally adjusted observation and the non-seasonal forecast.
- The trend equation is identical to Holt’s linear method.
- The seasonal equation shows a weighted average between the current seasonal index and the seasonal index of the same season last year (\(m\) periods ago).
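Below is a minimal sketch of one period of these recursions; the function name and inputs are purely illustrative, and beta here stands for \(\beta^*\):
# One Holt-Winters additive update for quarterly data (m = 4)
hw_add_update <- function(y_t, level, trend, seasonal, alpha, beta, gamma) {
  s_tm <- seasonal[1] # s_{t-m}, the oldest stored seasonal index
  new_level <- alpha * (y_t - s_tm) + (1 - alpha) * (level + trend)
  new_trend <- beta * (new_level - level) + (1 - beta) * trend
  new_season <- gamma * (y_t - level - trend) + (1 - gamma) * s_tm
  list(level = new_level, trend = new_trend,
       seasonal = c(seasonal[-1], new_season)) # rotate the seasonal buffer
}
hw_add_update(y_t = 110, level = 100, trend = 1, seasonal = c(5, -2, -6, 3),
              alpha = 0.3, beta = 0.1, gamma = 0.2)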
Holt-Winters Multiplicative Method
\[ \hat{y}_{t+h|t} = (\ell_t + h b_t)s_{t+h - m(k+1)} \\ \ell_t = \alpha \frac{y_t}{s_{t-m}} + (1-\alpha)(\ell_{t-1} + b_{t-1}) \\ b_t = \beta^* (\ell_t - \ell_{t-1}) + (1-\beta^*)b_{t-1} \\ s_t = \gamma \frac{y_t}{\ell_{t-1} + b_{t-1}} + (1 - \gamma)s_{t-m} \]
Example: Domestic Overnight Trips
In the example below, both an additive and multiplicative seasonality are used to forecast quarterly visitor nights in Australia.
au_holidays <-
  tourism %>%
  filter(Purpose == 'Holiday') %>%
  summarise(Trips = sum(Trips))
au_holidays %>%
  model(
    additive = ETS(Trips ~ error('A') + trend('A') + season('A')),
    multiplicative = ETS(Trips ~ error('M') + trend('A') + season('M'))
  ) %>%
  forecast(h = '5 years') %>%
  autoplot(filter_index(au_holidays, '2010-01' ~ .), level = NULL) +
  labs(
    title = 'Australia - Overnight Trips',
    subtitle = 'Holt-Winters Additive and Multiplicative Models',
    x = 'Quarter',
    y = 'Overnight Trips (millions)',
    colour = 'Model'
  )
Taxonomy
With the different combinations of trend and seasonal components, there are nine exponential smoothing methods available. If we define (trend, seasonality), here are the named versions:
- \((N,N)\) - simple exponential smoothing.
- \((A,N)\) - Holt’s linear method.
- \((A_d,N)\) - Additive damped trend method.
- \((A,A)\) - Additive Holt-Winters’ method.
- \((A,M)\) - Multiplicative Holt-Winters’ method.
- \((A_d,M)\) - Holt-Winters’ damped method.
Innovations State Space Models
The exponential smoothing methods we’ve looked at so far generate point forecasts. The statistical models underlying them generate the same point forecasts, but can also produce prediction (or forecast) intervals.
A statistical model is a stochastic (or random) data generating process that can produce an entire forecast distribution. Each model consists of a measurement equation that describes the observed data, and some state equations that describe how the unobserved components or states (level, trend, seasonal) change over time.
These are referred to as state space models.
For each method there exist two models:
- One with additive errors
- One with multiplicative errors.
To distinguish between a model with additive errors and one with multiplicative errors, a third letter is added to the classification. Each state space model is labelled ETS(.,.,.). The possibilities for each component are:
- Error \(\in \{A, M\}\)
- Trend \(\in \{N, A, A_d\}\)
- Seasonal \(\in \{N, A, M\}\)
ETS(A,N,N) - Simple Exponential, Additive Errors
Recall that the forecast equation is \(\hat{y}_{t+1|t} = \ell_t\), and the smoothing equation is \(\ell_t = \alpha y_t + (1 - \alpha)\ell_{t - 1}\).
If this is re-arranged for the level, we get the ‘error correction’ form:
\[ \ell_t = \ell_{t-1} + \alpha(y_t - \ell_{t-1}) \\ = \ell_{t-1} + \alpha e_t \]
The training data errors lead to the adjustment of the estimated level throughout the smoothing process, for \(t = 1, \ldots, T\). The closer \(\alpha\) is to one, the “rougher” the estimate of the level, as larger adjustments take place.
We can also write \(y_t = \ell_{t-1} + e_t\), so that each observation can be represented by the previous level plus an error.
To make this into an innovations state space model, all we need to do is specify the probability distribution for \(e_t\). For a model with additive errors, we assume that the residuals \(e_t\) are normally distributed white noise with mean 0 and variance \(\sigma^2\):
\[ e_t = \epsilon_t \sim NID(0, \sigma^2) \]
NID stands for normally and independently distributed. The equations of the model can be written as:
\[ y_t = \ell_{t-1} + \epsilon_t \\ \ell_t = \ell_{t-1} + \alpha \epsilon_t \]
The first equation is referred to as the measurement equation, and the second as the state (or transition) equation. These two equations, combined with a statistical distribution for the errors, form a fully specified statistical model: an innovations state space model underlying simple exponential smoothing.
The term “innovations” comes from the fact that all equations use the same random error process \(\epsilon_t\). It’s also known as a “single source of error” model.
The measurement equation shows the relationship between the observations and the unobserved states. The state equation shows the evolution of the state through time. The influence of the \(\alpha\) smoothing parameter is the same as other methods. If \(\alpha = 0\), the level of the series does not change over time. If \(\alpha = 1\), the model reduces to a random walk.
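A short simulation sketch (hypothetical function and parameters) makes the two extremes easy to see:
# Simulating ETS(A,N,N): alpha = 1 gives a random walk,
# alpha = 0 gives a constant level plus noise
simulate_ann <- function(n, alpha, ell0, sigma) {
  eps <- rnorm(n, mean = 0, sd = sigma)
  level <- ell0
  y <- numeric(n)
  for (t in 1:n) {
    y[t] <- level + eps[t] # measurement equation
    level <- level + alpha * eps[t] # state (transition) equation
  }
  y
}
set.seed(123)
plot.ts(simulate_ann(100, alpha = 1, ell0 = 0, sigma = 1))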
ETS(M,N,N) - Simple Exponential, Multiplicative Errors
In a similar way, we can specify models with multiplicative errors by writing the one-step-ahead training errors as relative errors:
\[ \epsilon_t = \frac{ y_t - \hat{y}_{t|t-1} }{ \hat{y}_{t|t-1} } \]
Substituting \(\hat{y}_{t|t-1} = \ell_{t-1}\) gives \(y_t = \ell_{t-1} + \ell_{t-1}\epsilon_t\) and \(e_t = y_t - \hat{y}_{t|t-1} = \ell_{t-1}\epsilon_t\).
The state space model is then:
\[ y_t = \ell_{t-1}(1 + \epsilon_t) \\ \ell_t = \ell_{t-1}(1 + \alpha \epsilon_t) \]
ETS(A,A,N) - Holt’s Linear Method, Additive Errors
For this model we assume the one-step ahead training errors are \(\epsilon_t = y_t - \ell_{t-1} - b_{t-1} \sim NID(0, \sigma^2)\).
Substituting this into the error correction equations for Holt’s linear method we obtain:
\[ y_t = \ell_{t-1} + b_{t-1} + \epsilon_t \\ \ell_t = \ell_{t-1} + b_{t-1} + \alpha\epsilon_t \\ b_t = b_{t-1} + \beta\epsilon_t \]
where, for simplicity, \(\beta = \alpha\beta^*\).
Estimation and Model Selection
Estimating ETS Models
An alternative to estimating the parameters by minimising the sum of squares is to maximise the ‘likelihood’, where the likelihood is the probability of the data arising from the specified model.
For an additive error model, maximising the likelihood gives the same results as minimising the sum of squared errors. However, different results will be obtained for multiplicative error models.
We will estimate the smoothing parameters \(\alpha, \beta, \gamma\) and \(\phi\), and the initial states \(\ell_0, b_0, s_0, s_{-1}, \ldots, s_{-m+1}\), by maximising the likelihood.
Model Selection
Information criteria can be used for model selection: the AIC, AIC\(_{\text{c}}\), and BIC can be used to determine which of the ETS models is most appropriate for a given time series.
Akaike’s Information Criterion is defined as:
\[ AIC = -2\log(L) + 2k \] where \(L\) is the likelihood of the model and \(k\) is the number of parameters and initial states that have been estimated (including the residual variance).
The AIC corrected for small sample bias (AIC\(_{\text{c}}\)) is:
\[ AIC_c = AIC + \frac{2k(k+1)}{T - k - 1} \] and the Bayesian Information Criterion is:
\[ BIC = AIC + k[\log(T) - 2] \]
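The three criteria differ only in their penalty terms, as a small sketch (hypothetical numbers) makes clear:
# AIC, AICc and BIC from a log-likelihood, k parameters and T observations
info_criteria <- function(logL, k, T) {
  aic <- -2 * logL + 2 * k
  c(AIC = aic,
    AICc = aic + (2 * k * (k + 1)) / (T - k - 1),
    BIC = aic + k * (log(T) - 2))
}
info_criteria(logL = -660, k = 7, T = 80) # made-up values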
Three of the combinations can cause difficulties:
- ETS(A,N,M)
- ETS(A,A,M)
- ETS(A,A\(_{\text{d}}\),M)
This is due to division by values potentially close to zero in the state equations. These combinations are normally not considered when selecting a model.
Models with multiplicative errors are useful when the data are strictly positive, but are not numerically stable when the data contain zeros or negative values.
Example
The ETS statistical framework is used to forecast Australian holiday tourism over the period 2016 - 2019. The ETS() function selects a model by minimising the AIC\(_{\text{c}}\).
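The fitting code isn’t shown in these notes; here is a sketch that would produce a report like the one below, using the au_holidays data from earlier and letting ETS() choose the model automatically:
# With no components specified, ETS() searches over models and minimises the AICc
fit <-
  au_holidays %>%
  model(ETS(Trips))
report(fit)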
## Series: Trips
## Model: ETS(M,N,M)
## Smoothing parameters:
## alpha = 0.3578226
## gamma = 0.0009685565
##
## Initial states:
## l s1 s2 s3 s4
## 9666.501 0.9430367 0.9268433 0.968352 1.161768
##
## sigma^2: 0.0022
##
## AIC AICc BIC
## 1331.372 1332.928 1348.046
With ETS(), the default estimation method is maximum likelihood, not minimised sum of squares.
Here are the states over time:
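The component plot isn’t reproduced here; assuming the fitted mable is stored as fit, as in the sketch above, it can be produced with:
# Plot the estimated level and seasonal states over time
components(fit) %>%
  autoplot()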
Here are point forecasts and prediction intervals:
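Again assuming the fit object from above, a sketch of the forecast plot (the horizon is a guess to match the 2016 - 2019 window mentioned earlier):
# Forecast and plot with the default 80% and 95% prediction intervals
fit %>%
  forecast(h = '4 years') %>%
  autoplot(au_holidays)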
Because of the multiplicative errors, the residuals are not equivalent to the one-step training errors. The residuals are \(\hat{\epsilon}_t\), while the one-step training errors are \(y_t - \hat{y}_{t|t-1}\). Both can be obtained using the residuals() function: type = 'innovation' returns the innovation residuals \(\hat{\epsilon}_t\), while type = 'response' returns the one-step training errors.
Forecasting with ETS Models
Point forecasts are obtained by iterating the equations for \(t = T+1, \ldots, T + h\), while setting \(\epsilon_t = 0\) for all \(t > T\).
For example, for the ETS(M,A,N) model, \(y_{T+1} = (\ell_T + b_T)(1 + \epsilon_{T+1})\). Setting \(\epsilon_{T+1} = 0\) gives:
\[ \hat{y}_{T+1|T} = \ell_T + b_T \]
ETS point forecasts are equal to the medians of the forecast distributions. For models with only additive components, the forecast distributions are normal, so the mean equals the median. With multiplicative errors or multiplicative seasonality, the point forecasts will not be equal to the means of the forecast distributions.
Prediction Intervals
A big advantage of these models is that prediction intervals can be generated.
For most ETS models, a prediction interval can be written as:
\[ \hat{y}_{T+h|T} \pm c \sigma_h \]
where \(c\) depends on the coverage probability (e.g. \(c = 1.96\) for 95% coverage), and \(\sigma_h^2\) is the forecast variance. Formulas for the forecast variance can be complicated.
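For example, for the ETS(A,N,N) model the forecast variance is \(\sigma_h^2 = \sigma^2[1 + \alpha^2(h-1)]\), so an interval can be computed directly (made-up parameter values):
# 95% prediction intervals for ETS(A,N,N) at horizons 1 to 5
alpha <- 0.8
sigma2 <- 4
ell_T <- 50
h <- 1:5
sigma_h <- sqrt(sigma2 * (1 + alpha^2 * (h - 1)))
cbind(h, lower = ell_T - 1.96 * sigma_h, upper = ell_T + 1.96 * sigma_h)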
For a few ETS models, there are no known formulas for prediction intervals. In these cases, the forecast() function uses simulated future sample paths and computes prediction intervals from the percentiles of these simulated paths.