Chapter 7 - Time Series Regression Models

Greg Foletta

2020-07-16

The basic concept is forecasting the time series of interest \(y\) by assuming it has a linear relationship with other time series \(x\).

Simple Linear Regression

Standard model:

\[ y_t = \beta_0 + \beta_1 x_t + \epsilon_t \]

We can show a scatter plot and an estimated linear fit:

The fit can be calculated using the TSLM (time series linear model) function:
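
A minimal sketch of how this might be done, assuming the us_change dataset from the fpp3 package; tidy() extracts a coefficient table like the one shown below:

``` r
library(fpp3)

us_change %>%
  model(lm = TSLM(Consumption ~ Income)) %>%
  tidy()
```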

# A tibble: 2 x 6
  .model term        estimate std.error statistic  p.value
  <chr>  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 lm     (Intercept)    0.545    0.0540     10.1  1.63e-19
2 lm     Income         0.272    0.0467      5.82 2.40e- 8

The slope coefficient is 0.27, meaning that for every 1% increase in income, consumption goes up by 0.27%.

The intercept is the predicted value when \(x = 0\): when there is no change in income, there is still an average increase in consumption of 0.54%.

Multiple Linear Regression

When there are two or more predictor variables, the model is a multiple regression. It is of the form:

\[ y_t = \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + \ldots + \beta_p x_{p,t} + \epsilon_t \]

The coefficients measure the effect of each predictor after taking into account the effects of all other predictors - they measure the marginal effects of the predictor variables.

Below we look at three further predictors - industrial production, personal savings, and unemployment - that, together with income, may be able to predict US consumption expenditure:

We can look at a scatterplot of the five variables:


Linear Model Assumptions

When fitting a linear model, the following assumptions are made about the errors (\(\epsilon_t\)):

  • They have mean zero, otherwise the forecasts will be biased.
  • They are not autocorrelated.
  • They are unrelated to the predictor variables.

It is also useful that they are normally distributed with a constant variance. This allows us to produce prediction intervals.

Least Squares Estimation

The least squares principle is a way of choosing the coefficients by minimising the sum of the squared errors. The coefficients \(\beta_0, \ldots, \beta_p\) are chosen to minimise:

\[ \sum_{t=1}^T \epsilon_t^2 = \sum_{t=1}^T (y_t - \beta_0 - \beta_1 x_{1,t} - \ldots - \beta_p x_{p,t})^2 \]

The TSLM() function fits a linear regression to time series data, similar to lm().
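
A sketch of how this fit might be produced, again assuming the us_change data; the model object (named fit_consMR here, for reuse in the residual diagnostics later) is summarised with report():

``` r
fit_consMR <- us_change %>%
  model(TSLM(Consumption ~ Income + Production + Unemployment + Savings))

fit_consMR %>% report()
```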

Series: Consumption 
Model: TSLM 

Residuals:
     Min       1Q   Median       3Q      Max 
-0.90555 -0.15821 -0.03608  0.13618  1.15471 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.253105   0.034470   7.343 5.71e-12 ***
Income        0.740583   0.040115  18.461  < 2e-16 ***
Production    0.047173   0.023142   2.038   0.0429 *  
Unemployment -0.174685   0.095511  -1.829   0.0689 .  
Savings      -0.052890   0.002924 -18.088  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3102 on 193 degrees of freedom
Multiple R-squared: 0.7683, Adjusted R-squared: 0.7635
F-statistic:   160 on 4 and 193 DF, p-value: < 2.22e-16

Goodness-of-Fit

R-Squared

\(R^2\) is the most common way to summarise how well a linear regression model fits the data.

\[ R^2 = \frac{ \sum(\hat{y}_t - \bar{y})^2 }{ \sum(y_t - \bar{y})^2 }\]

It is the proportion of the variation in the forecast variable that is accounted for (or explained) by the model.

\(R^2\) will never decrease when adding in additional predictors, which can lead to over-fitting.

Standard Error of the Regression

Another measure is the standard deviation of the residuals, or residual standard error.

\[ \hat{\sigma}_e = \sqrt{ \frac{1}{T - p - 1} \sum_{t=1}^T e^2_t } \]

Where \(p\) is the number of predictors in the model. We divide by \(T - p - 1\) because we have estimated \(p + 1\) parameters (the intercept plus the \(p\) predictor coefficients).

The standard error is related to the size of the average error that the model produces. We compare this to the sample mean of \(y\) or the standard deviation of \(y\) to gain perspective on the accuracy of the model.

Evaluating the Regression Model

The residuals are defined as \(e_t = y_t - \hat{y}_t\). They have some useful properties:

\[ \sum_{t=1}^T e_t = 0 \text{ and } \sum_{t=1}^T x_{p,t} e_t = 0 \text{ for all } p \]

As a result, the mean of the residuals is zero, and the correlation between the residuals and the observations of each predictor variable is also zero.

ACF Residual Plot

With time series data, it’s very likely that the value of a variable in the current time period will be similar to its value in the previous period. Thus it’s common to find autocorrelation in the residuals when fitting a regression model to time series data, meaning one of our assumptions is violated.

What this means is that there’s some information that hasn’t been accounted for, and the forecasts aren’t as efficient as they could be. The forecasts are still unbiased, and not wrong, but they will have larger prediction intervals than they need to.

The gg_tsresiduals() function gives the three key diagnostic plots needed:
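
For example, applying it to the fit_consMR object from the earlier sketch:

``` r
fit_consMR %>% gg_tsresiduals()
```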

We see some change in variation over time, but otherwise the residual plot looks reasonable. This heteroscedasticity makes the prediction intervals inaccurate. The residuals look “normal-ish”, however they are skewed, which will also affect the prediction intervals.

Residual Versus Fitted

A plot of the residuals versus the fitted values should also show no pattern. If there is a pattern, the errors may be heteroscedastic, whereby the variance of the residuals is not constant.

From the plot below we see that the errors appear to be random, suggesting that the errors are homoscedastic.
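
A sketch of how such a plot can be drawn, using augment() on the fit_consMR object from earlier to extract the fitted values and residuals:

``` r
augment(fit_consMR) %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point() +
  labs(x = "Fitted values", y = "Residuals")
```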

Outliers and Influential Observations

Observations with extreme values are outliers, and observations that have a large influence on the coefficients of a regression model are called influential observations; these impart leverage. Usually, but not necessarily, outliers impart leverage.

There is no rigid mathematical definition of an outlier - it’s ultimately subjective and depends on the context of the model. Observations can be flagged based on interquartile ranges - e.g. if \(Q_1\) and \(Q_3\) are the lower and upper quartiles, an outlier could be anything outside the range \([Q_1 - k(Q_3 - Q_1), Q_3 + k(Q_3 - Q_1)]\) for some non-negative constant \(k\). John Tukey proposed \(k = 1.5\).

Leverage is usually denoted \(h_i\) for the \(i\)th observation, and is given by the corresponding diagonal element of the [hat matrix](https://stats.stackexchange.com/questions/208242/hat-matrix-and-leverages-in-classical-multiple-regression), \(\bf{H}_{ii}\).

Spurious Regression

Generally, time series data are ‘non-stationary’, which means that the values of the series do not fluctuate around a constant mean or with constant variance.

For example, let’s regress production of cement against production of electricity:
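
A sketch of this regression, assuming the quarterly aus_production data from fpp3:

``` r
aus_production %>%
  model(TSLM(Cement ~ Electricity)) %>%
  report()
```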

Series: Cement 
Model: TSLM 

Residuals:
   Min     1Q Median     3Q    Max 
-466.7 -122.5   13.2  112.3  413.5 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 6.320e+02  2.119e+01   29.83   <2e-16 ***
Electricity 2.688e-02  6.174e-04   43.54   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 161.3 on 216 degrees of freedom
Multiple R-squared: 0.8977, Adjusted R-squared: 0.8973
F-statistic:  1896 on 1 and 216 DF, p-value: < 2.22e-16

We see that we get an \(R^2\) of around 0.9, indicating that roughly 90% of the variation in the Cement response can be accounted for by the Electricity predictor. Of course the production of cement has nothing directly to do with the production of electricity - a classic case of ‘correlation, not causation’. What we have is a confounding variable or variables - likely gross domestic product or population - which has a direct impact on both of these series.

Useful Predictors

Trend

It’s common for time series data to be trending. A linear trend can be modelled by using \(x_{1,t} = t\) as a predictor, where \(t = 1,\ldots,T\). A trend variable can be specified in the TSLM() function using the trend() special.

Dummy Variables

Dummy variables are used within a linear regression to model categorical variables. They can also be used to account for an outlier. If there are more than two categories, multiple dummy variables can be used. You use one fewer dummy variable than there are categories, as the intercept captures the remaining category when all the dummy variables are set to zero.

TSLM() will automatically handle this if the season() special is included.

Below we fit a linear model to Australian electricity demand.
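
A sketch of how such a model might be fitted with the trend() and season() specials, assuming the aus_production data restricted to recent quarters (the exact filter used for the output below is an assumption):

``` r
aus_production %>%
  filter(year(Quarter) >= 1992) %>%
  model(TSLM(Electricity ~ trend() + season())) %>%
  report()
```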

## Series: Electricity 
## Model: TSLM 
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3244.85  -490.35    11.47   575.44  3571.99 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   39071.445    338.395 115.461  < 2e-16 ***
## trend()         285.960      6.034  47.390  < 2e-16 ***
## season()year2   103.040    359.676   0.286    0.775    
## season()year3  2264.275    364.586   6.211 3.47e-08 ***
## season()year4  -585.185    364.636  -1.605    0.113    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1108 on 69 degrees of freedom
## Multiple R-squared: 0.9709,  Adjusted R-squared: 0.9692
## F-statistic: 576.1 on 4 and 69 DF, p-value: < 2.22e-16

So we can see from the trend term that, for every quarter, ~286 gigawatt hours (GWh) more electricity is produced. Relative to the first quarter, the third quarter produces on average ~2264 GWh more electricity - the largest seasonal effect - while the second and fourth quarter effects are not statistically significant.

Intervention Variables

It can be necessary to model interventions that may affect the response variable - e.g. competitor activity, advertising expenditure, industrial action, etc.

When the intervention only lasts for one period (e.g. one quarter), a ‘spike’ variable is used which takes the value one in the period of the intervention and zero elsewhere.

If the intervention changes the values suddenly and permanently, a ‘step’ variable is used, taking the value 0 before the intervention and 1 afterwards.

Piecewise linear models can be used where the trend changes at an inflection point.

Trading Days

The number of trading days in a month can vary substantially, and can be added in as a predictor.

Distributed Lags

It can be useful to include advertising expenditure, however as this can last beyond the actual campaign, lagged values of the expenditure need to be included:

\[ x_1 = \text{advertising for previous month} \\ x_2 = \text{advertising for previous two months} \\ \vdots \\ x_m = \text{advertising for previous } m \text{ months} \]

The coefficients should decrease as the lag increases.

Easter

Easter differs from most holidays because it is not held on the same date each year, and its effects can last several days. With monthly data, if Easter falls in March then the dummy variable takes value 1 in March. If Easter falls across two months, the dummy variable is split proportionally between them.

Fourier Series

An alternative to dummy variables, especially for long seasonal periods, is to use Fourier terms. With a Fourier series, a set of sine and cosine functions can approximate any periodic function.

If \(m\) is the seasonal period, then the first few Fourier terms are:

\[ x_{1,t} = \sin\bigg(\frac{2 \pi t}{m}\bigg), x_{2,t} = \cos\bigg(\frac{2 \pi t}{m}\bigg), \\ x_{3,t} = \sin\bigg(\frac{4 \pi t}{m}\bigg), x_{4,t} = \cos\bigg(\frac{4 \pi t}{m}\bigg), \\ x_{5,t} = \sin\bigg(\frac{6 \pi t}{m}\bigg), x_{6,t} = \cos\bigg(\frac{6 \pi t}{m}\bigg) \]

If there is monthly seasonality and we use the first 11 of these predictor variables, we will get the same forecasts as if we had used 11 dummy variables. The benefit is that, for long seasonal periods, fewer Fourier predictors are often needed than dummy variables. This is especially useful for weekly data, where \(m \approx 52\).

These terms can be produced using the fourier() special. The \(K\) argument specifies how many pairs of sin and cos terms to include. The maximum allowed is \(K = m/2\).

If only the first two are used, the seasonal pattern is a simple sine wave. A regression with Fourier terms is often called a harmonic regression.
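
As an illustration, a harmonic regression sketch on quarterly beer production (assuming the aus_production data; with \(m = 4\), \(K = 2\) uses both available pairs):

``` r
aus_production %>%
  filter(year(Quarter) >= 1992) %>%
  model(TSLM(Beer ~ trend() + fourier(K = 2))) %>%
  report()
```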

Selecting Predictors

Common approaches that are not recommended are:

  • Plotting the response against a particular predictor and visually determining if there is a relationship.
  • Performing multiple linear regression on all predictors and discarding them based on p-values.
    • Statistical significance does not always indicate predictive value.
    • P-values are misleading with correlated predictors.

Instead, the following measures can be used:

Adjusted R\(^2\)

  • \(R^2\) is not a good measure of the predictive ability of a model.
  • It does not allow for degrees of freedom: adding any variable will increase its value.
  • Maximising \(R^2\) is equivalent to minimising the sum of squared errors (SSE).

An alternative is the adjusted R\(^2\), which is equivalent to minimising the standard error \(\hat{\sigma}_e\), and doesn’t automatically increase with each added predictor:

\[ \bar{R}^2 = 1 - (1 - R^2) \frac{T - 1}{T - k - 1} \] Where \(T\) is the number of observations and \(k\) is the number of predictors.

Cross-Validation

  • Can use classical leave-one-out cross validation (LOOCV); a code sketch follows this list:
  1. Remove observation \(t\).
  2. Fit the model using the remaining data.
  3. Predict the value of \(t\) and compute the error \(e_t^* = y_t - \hat{y}_t\).
  4. Repeat steps 1 to 3 for \(t = 1, \ldots, T\).
  5. Compute the MSE from \(e_1^*, \ldots, e_T^*\).
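
A naive base-R sketch of these steps for a linear model, assuming a hypothetical data frame df with response y and predictors x1 and x2 (in practice the CV statistic can be computed from a single fit - see the hat-matrix section later):

``` r
loocv_mse <- function(df) {
  errors <- vapply(seq_len(nrow(df)), function(t) {
    fit <- lm(y ~ x1 + x2, data = df[-t, ])      # fit without observation t
    df$y[t] - predict(fit, newdata = df[t, ])    # e_t* = y_t - yhat_t
  }, numeric(1))
  mean(errors^2)                                 # MSE of the CV errors
}
```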

Akaike’s Information Criterion

\[ AIC = T log\bigg(\frac{SSE}{T}\bigg) + 2(k + 2) \]

The \(k + 2\) part of the equation occurs because there are that many parameters in the model: the \(k\) predictors, the intercept, and the variance of the residuals.

The model with the minimum AIC is often the best model for forecasting.

Corrected AIC

For small values of \(T\), the AIC tends to select too many predictors. A bias-corrected version has been developed:

\[ AIC_c = AIC + \frac{ 2(k+2)(k+3)}{T - k - 3} \]

Schwarz’s Bayesian Information Criterion

Usually abbreviated to BIC, SBIC or SC, this value should be minimised. BIC penalises the number of parameters more heavily than AIC. For large values of \(T\), it is similar to leave-\(v\)-out cross validation, where \(v = T(1 - \frac{1}{log(T) - 1})\)

\[ BIC = T log\bigg(\frac{SSE}{T}\bigg) + (k+2)log(T) \]

Which One?

  • \(\bar{R}^2\) is widely used, but has a tendency to select too many predictors.
  • BIC is liked because if there is a true underlying model, BIC will select it given enough data.
    • However there is rarely a true underlying model.
    • Selecting that model may not necessarily give the best forecasts.

Hyndman and Athanasopoulos recommend one of \(AIC_c\), \(AIC\) or \(CV\) be used.
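
In fable, these measures can be extracted from a fitted model with glance(); for example, reusing the fit_consMR object from the earlier sketch:

``` r
glance(fit_consMR) %>%
  select(adj_r_squared, CV, AIC, AICc, BIC)
```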

Best Subset

Where possible, all regression models should be fitted and a model selected on the above criteria.

Stepwise

If there are a large number of predictors, fitting every possible combination of them is not possible - e.g. 40 predictors gives \(2^{40}\) models!

Backwards stepwise regression is a good approach:

  1. Start with all predictors.
  2. Remove one predictor at a time and model.
  3. Keep the model if it improves the measure of predictive accuracy.
  4. Iterate until no further improvement.

If the number of potential predictors is too large (e.g. more than the number of observations), backwards stepwise regression won’t work, and forward stepwise regression can be used instead.

Inference

Beware that any procedure involving selecting predictors first will invalidate the assumptions behind the p-values.

Forecasting with Regression

There are different types of forecasts that can be produced:

  • Ex-ante - made using only the information available in advance. These are genuine forecasts.
  • Ex-post - made using later information on the predictors. E.g. an ex-post consumption forecast may use the actual observations of the predictors once these have been observed. These are not genuine forecasts, but are useful for studying the behaviour of forecasting models.

Ex-post models can assume knowledge of the predictor variables, but should not assume knowledge of the response.

Comparing ex-ante and ex-post forecasts can help to separate out the sources of forecast uncertainty. This will show whether forecast errors are due to poor forecasts of the predictor, or due to a poor forecasting model.

Normally we cannot use actual future values of the predictors as they are not known in advance. However, special predictors that are based on calendar variables or are deterministic functions of time are known in advance; in these cases there is no difference between ex-ante and ex-post forecasts.

Scenario Based Forecasting

With scenario based forecasting, the forecaster assumes possible scenarios for the predictor variables of interest. For example, they may look at an economic forecast under different assumed changes in the employment rate.

Building a Predictive Regression Model

A challenge with regression models is that to generate ex-ante forecasts, the model requires the future values of each predictor. In many cases, generating forecasts for the predictor variables can be the most challenging part!

An alternative is to use lagged values of the predictors. The predictor set is formed from values that are observed \(h\) time periods prior to observing \(y\). When the estimated model is projected into the future - beyond the end of the sample at time \(T\) - all predictor values are available.

Prediction Intervals

For a simple univariate regression, assuming the errors are normally distributed, an approximate 95% prediction interval is given by:

\[ \hat{y} \pm 1.96 \hat{\sigma}_e \sqrt{ 1 + \frac{1}{T} + \frac{(x - \bar{x})^2}{(T-1)s^2_x}} \]

Where \(s_x\) is the standard deviation of the observed \(x\) values and \(\hat{\sigma}_e\) is the standard error of the regression.

Matrix Formulation

Our regression model is:

\[ y_t = \beta_0 + \beta_1 x_{1,t} + \ldots + \beta_p x_{p,t} + \epsilon_t \]

We can write this in matrix form where \(\boldsymbol{y} = (y_1, \ldots, y_T)^T\), \(\boldsymbol{\epsilon} = (\epsilon_1, \ldots, \epsilon_T)^T\), \(\boldsymbol{\beta} = (\beta_0, \ldots, \beta_p)^T\), and

\[ \boldsymbol{X} = \left[ \begin{matrix} 1 & x_{1,1} & x_{2,1} & \dots & x_{p,1}\\ 1 & x_{1,2} & x_{2,2} & \dots & x_{p,2}\\ \vdots& \vdots& \vdots&& \vdots\\ 1 & x_{1,T}& x_{2,T}& \dots & x_{p,T} \end{matrix}\right] \]

Giving us:

\[ \boldsymbol{y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \]

Least Squares Estimation

Least squares estimation is performed by minimising the expression:

\[ \epsilon^T\epsilon = (y - X\beta)^T(y-X\beta) \]

This is minimised when \(\beta\) takes the value:

\[ \hat{\beta} = (X^TX)^{-1} X^Ty \]
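
The output below compares this computed directly against lm(); a sketch with hypothetical data (not the original values used in these notes):

``` r
set.seed(1)
X <- cbind(x0 = 1, x1 = rnorm(50), x2 = rnorm(50))   # design matrix with intercept column
y <- drop(X %*% c(-74, 2.3, 7.7)) + rnorm(50)        # hypothetical response

# beta_hat = (X'X)^{-1} X'y via the normal equation
solve(t(X) %*% X) %*% t(X) %*% y

# The same coefficients from lm() for comparison
coef(lm(y ~ x1 + x2, data = data.frame(y = y, x1 = X[, "x1"], x2 = X[, "x2"])))
```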

##          [,1]
## x0 -73.989100
## x1   2.292800
## x2   7.718088
## (Intercept)          x1          x2 
##  -73.989100    2.292800    7.718088

This is known as the normal equation.

The variance is estimated using:

\[ \hat{\sigma}^2_e = \frac{1}{T - k -1} (y - X\hat{\beta})^T (y - X\hat{\beta})\]

Fitted Values and Cross Validation

The normal equation shows that the fitted values can be calculated using:

\[ \hat{y} = X\hat{\beta} = X(X^TX)^{-1}X^Ty = Hy \]

\(H\) is known as the ‘hat matrix’, because it ‘puts the hat on \(y\)’.

The diagonal values of \(H\) are denoted \(h_1, \ldots, h_T\). The cross-validation statistic can then be computed as:

\[ CV = \frac{1}{T} \sum_{t = 1}^T \big( e_t / (1 - h_t) \big)^2 \]

Where \(e_t\) is the \(t\)th residual. We see that it is not necessary to fit \(T\) separate models to compute the CV statistic.
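
A base-R sketch of this calculation, assuming an already-fitted lm object called fit:

``` r
h  <- hatvalues(fit)           # h_t: diagonal elements of the hat matrix H
e  <- residuals(fit)           # e_t: ordinary residuals
CV <- mean((e / (1 - h))^2)    # (1/T) * sum of (e_t / (1 - h_t))^2
```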

Forecasts

Let \(x^*\) be a row vector containing the values of the predictors (with a leading 1 for the intercept). The forecast is:

\[ \hat{y} = x^*\hat{\beta} = x^* (X^TX)^{-1} X^T y\]

Nonlinear Regression

The simplest way of modelling a nonlinear relationship is to transform the response and/or one or more of the predictors before estimating a regression model. This provides a non-linear form, however it is still linear in the parameters (\(\beta\)).

You could have a log-log model, \(log(y) = \beta_0 + \beta_1 log(x) + \epsilon\), in which \(\beta_1\) is the average percentage change in \(y\) resulting from a 1% change in \(x\).

The general model is:

\[ y = f(x) + \epsilon \] Where \(f\) is a nonlinear function. For standard linear regression, \(f(x) = \beta_0 + \beta_1 x\).

A simple specification is to make \(f\) piecewise linear - this introduces points where the slope can change:

\[ x_{2,t} = (x - c)_{+} = \Bigg\{ \begin{array}{ll} 0 & x < c \\ (x - c) & x \ge c \end{array} \]

\((x - c)_{+}\) means the value of \(x-c\) if it is positive, and zero otherwise. The slope therefore changes at the point \(c\), which is called a ‘knot’.

Piecewise linear relationships are a special case of regression splines. In general:

\[ x_1 = x \\ x_2 = (x - c_1)_{+} \\ \ldots \\ x_k = (x - c_k)_{+} \]
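
A small sketch of constructing these basis columns by hand, for a hypothetical predictor vector x and knot locations:

``` r
spline_basis <- function(x, knots) {
  pieces <- sapply(knots, function(c) pmax(x - c, 0))   # (x - c_j)_+ for each knot
  colnames(pieces) <- paste0("knot_", knots)
  cbind(x = x, pieces)
}

spline_basis(1:10, knots = c(3, 7))
```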

Forecasting with a Nonlinear Trend

To fit a non-linear trend, you may use quadratic or other higher order polynomials:

\[ x_{1,t} = t, \text{ } x_{2,t} = t^2, \text{ } \ldots \]

However, these are not recommended for forecasting: when extrapolated, the resulting forecasts are often unrealistic.

Piecewise specification is a better approach:

\[ x_{1,t} = t \\ x_{2,t} = (t - \tau)_{+} = \Bigg\{ \begin{array}{ll} 0 & t < \tau \\ (t - \tau) & t \ge \tau \end{array} \]

In this model, if the coefficients of \(x_{1,t}\) and \(x_{2,t}\) are \(\beta_1\) and \(\beta_2\), then \(\beta_1\) is the slope before time \(\tau\), and the slope of the line after \(\tau\) is given by \(\beta_1 + \beta_2\).
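
In TSLM(), a piecewise trend like this can be specified by giving knot locations to the trend() special; a sketch, assuming a hypothetical annual tsibble my_ts with response y and hypothetical knot years:

``` r
my_ts %>%
  model(piecewise = TSLM(y ~ trend(knots = c(1980, 2000)))) %>%
  report()
```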

Correlation, Causation, and Forecasting

Correlation is not causation! A variable \(x\) may be useful for forecasting \(y\), but that does not mean it is causing it. \(y\) may be causing \(x\), or the relationship may be more complicated.

A confounding variable influences the response and at least one predictor.

Correlations are useful for forecasting, even when there is no causal relationship, or when the causality runs in the other direction, or when there is confounding.

However, models will be better if the causal mechanism can be determined.

Forecasting with Correlated Predictors

When two or more predictors are correlated, it’s challenging to separate out their individual effects. It’s not really a problem for forecasting, as the forecasts can still be determined without the need to separate the effects. It does become a problem with scenario forecasting as the scenarios should take into account the relationships between predictors.

Multicollinearity and Forecasting

Multicollinearity occurs when similar information is provided by two or more of the predictor variables in a multiple regression. It can occur when the two predictors are highly correlated, or when a linear combination of predictors is highly correlated with another linear combination of predictors.

An example is the dummy variable trap. Consider quarterly data with dummy variables \(d_1, d_2, d_3\) and \(d_4\). Since \(d_4 = 1 - d_1 - d_2 - d_3\), there is a perfect relationship between \(d_4\) and \(d_1 + d_2 + d_3\).

If there is perfect correlation, it is not possible to estimate the regression model. If there is high correlation, the estimation of the regression coefficients is very difficult.
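
A minimal illustration of perfect collinearity (hypothetical data, not the code that produced the output below): with x3 an exact linear combination of x1 and x2, lm() cannot estimate its coefficient and reports NA.

``` r
set.seed(1)
d <- data.frame(x1 = rnorm(10), x2 = rnorm(10))
d$x3 <- d$x1 + d$x2                        # exact linear combination of x1 and x2
d$y  <- 2 + d$x1 - d$x2 + rnorm(10)

summary(lm(y ~ x1 + x2 + x3, data = d))    # x3 coefficient is NA (not estimable)
```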

## Warning in summary.lm(.): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = .)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -6.110e-16 -5.959e-17  6.149e-17  1.932e-16  4.224e-16 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)  0.000e+00  2.898e-16  0.000e+00    1.000    
## x1           1.000e+00  3.764e-16  2.657e+15   <2e-16 ***
## x2          -6.269e-16  1.152e-15 -5.440e-01    0.606    
## x3           1.128e-16  2.048e-16  5.510e-01    0.602    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.053e-16 on 6 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.674e+32 on 3 and 6 DF,  p-value: < 2.2e-16