The basic concept is forecasting the time series of interest \(y\) on the assumption that it has a linear relationship with other time series \(x\).
Simple Linear Regression
Standard model:
\[ y_t = \beta_0 + \beta_1 x_t + \epsilon_t \]
us_change %>%
ggplot(aes(x = Quarter)) +
geom_line(aes(y = Consumption), colour = 'red') +
geom_line(aes(y = Income), colour = 'blue') +
labs(
x = 'Year',
y = '% Change',
title = 'US Personal Income and Consumption'
)
We can show a scatter plot and an estimated linear fit:
us_change %>%
ggplot(aes(Income, Consumption)) +
geom_point() +
geom_smooth(method = 'lm') +
labs(
x = 'Income (Qtr % Change)',
y = 'Consumption (Qtr % Change)',
title = 'US Income vs Consumption'
)
## `geom_smooth()` using formula 'y ~ x'
The fit can be calculated using the TSLM (time series linear model) function:
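The coefficient table below presumably comes from fitting the model and tidying it; a sketch of that call, assuming the fable/fabletools interface used elsewhere in this post:
us_change %>%
  model(lm = TSLM(Consumption ~ Income)) %>%
  tidy()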
# A tibble: 2 x 6
.model term estimate std.error statistic p.value
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 lm (Intercept) 0.545 0.0540 10.1 1.63e-19
2 lm Income 0.272 0.0467 5.82 2.40e- 8
The slope coefficient is 0.27, meaning that for every 1% change in income, consumption increases by 0.27%.
The intercept is the predicted value when \(x = 0\): when there is no change in income, consumption still increases by 0.54%.
Multiple Linear Regression
When there are two or more predictor variables, the model is a multiple regression. It is of the form:
\[ y_t = \beta_0 + \beta_1 x_{1,t} + \beta_2 x_{2,t} + \ldots + \beta_p x_{p,t} + \epsilon_t \]
The coefficients measure the effect of each predictor after taking into account the effects of all other predictors - they measure the marginal effects of the predictor variables.
Below we look at three predictors that may be able to predict US consumption expenditure:
us_change %>%
pivot_longer(
c(Production, Savings, Unemployment),
names_to = 'Measure',
values_to = 'Percentage'
) %>%
ggplot() +
geom_line(aes(Quarter, Percentage, colour = Measure)) +
facet_grid(rows = vars(Measure), scales = 'free') +
labs(
x = 'Quarter',
y = '% Change',
title = 'US Personal Fiscal Data',
subtitle = 'Consumption, Savings, Unemployment'
)
We can look at a scatterplot of the five variables:
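The pairs plot was presumably produced with GGally; a sketch, assuming the quarter index is the first column of us_change:
us_change %>%
  GGally::ggpairs(columns = 2:6)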
Linear Model Assumptions
When fitting a linear model, the following assumptions are made about the errors (\(\epsilon_t\)):
- They have mean zero, otherwise the forecasts will be biased.
- They are not correlated, otherwise the forecasts will be inefficient.
- They are unrelated to the predictor variables.
It is also useful that they are normally distributed with a constant variance. This allows us to produce prediction intervals.
Least Squares Estimation
The least squares principle is a way of choosing the coefficients by minimising the sum of the squared errors. The coefficients \(\beta_0, \ldots, \beta_p\) are chosen to minimise:
\[ \sum_{t=1}^T \epsilon_t^2 = \sum_{t=1}^T (y_t - \beta_0 - \beta_1 x_{1,t} - \ldots - \beta_p x_{p,t})^2 \]
The TSLM() function fits a linear regression to time series data, similar to lm().
us_consumption_mdl <-
us_change %>%
model(
lm = TSLM(Consumption ~ Income + Production + Unemployment + Savings)
)
us_consumption_mdl %>%
report()
Series: Consumption
Model: TSLM
Residuals:
Min 1Q Median 3Q Max
-0.90555 -0.15821 -0.03608 0.13618 1.15471
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.253105 0.034470 7.343 5.71e-12 ***
Income 0.740583 0.040115 18.461 < 2e-16 ***
Production 0.047173 0.023142 2.038 0.0429 *
Unemployment -0.174685 0.095511 -1.829 0.0689 .
Savings -0.052890 0.002924 -18.088 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3102 on 193 degrees of freedom
Multiple R-squared: 0.7683, Adjusted R-squared: 0.7635
F-statistic: 160 on 4 and 193 DF, p-value: < 2.22e-16
Fitted Values
Predictions for a data point can be calculated by simply plugging the \(x_{p,t}\) values into the formula with the coefficients.
us_consumption_mdl %>%
augment() %>%
rename(Fitted = .fitted) %>%
pivot_longer(c(Consumption, Fitted), names_to = 'Measure', values_to = 'Value') %>%
ggplot() +
geom_line(aes(Quarter, Value, colour = Measure)) +
labs(
x = 'Quarter',
y = '% Change',
title = 'US Consumption',
subtitle = 'Actual versus Model Fitted'
)
Goodness-of-Fit
R-Squared
\(R^2\) is the most common way to summarise how well a linear regression model fits the data.
\[ R^2 = \frac{ \sum(\hat{y}_t - \bar{y})^2 }{ \sum(y_t - \bar{y})^2 }\]
It is the proportion of the variation in the forecast variable that is accounted for (explained) by the model.
\(R^2\) will never decrease when adding in additional predictors, which can lead to over-fitting.
Standard Error of the Regression
Another measure is the standard deviation of the residuals, or residual standard error.
\[ \hat{\sigma}_e = \sqrt{ \frac{1}{T - p - 1} \sum_{t=1}^T e^2_t } \]
Where \(p\) is the number of predictors in the model. We divide by \(T - p - 1\) because we have estimated \(p + 1\) parameters: the intercept plus the \(p\) coefficients.
The standard error is related to the size of the average error that the model produces. We compare this to the sample mean of \(y\) or the standard deviation of \(y\) to gain perspective on the accuracy of the model.
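As a sketch, the reported residual standard error (0.3102) can be recovered from the residuals of the four-predictor model fitted above:
resids <-
  us_consumption_mdl %>%
  augment() %>%
  pull(.resid)
sqrt(sum(resids^2) / (length(resids) - 4 - 1))  # T - p - 1 with p = 4 predictors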
Evaluating the Regression Model
The residuals are defined as \(e_t = y_t - \hat{y}_t\). They also have other useful properties:
\[ \sum_{t=1}^T e_t = 0 \text{ and } \sum_{t=1}^T x_{p,t} e_t = 0 \text{ for all } p \]
As a result, the mean of the residuals is zero, and the correlation between the residuals and the observations for each predictor variable is also zero.
ACF Residual Plot
With time series data, it is very likely that the value of a variable in the current time period will be similar to its value in the previous period. Thus it’s common to find autocorrelation in the residuals when fitting a regression model to time series data, which means one of our assumptions is violated.
What this means is that there is some information that hasn’t been accounted for, and the forecasts aren’t as efficient as they could be. The forecasts are still unbiased, and not wrong, but they will have larger prediction intervals than they need to.
The gg_tsresiduals() function gives the three key diagnostic plots needed:
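A sketch of the call (the diagnostic plots themselves are not reproduced here):
us_consumption_mdl %>%
  gg_tsresiduals()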
We see some change in variation over time, but otherwise the residual plot looks reasonable. The heteroscedasticity makes the prediction intervals inaccurate. The residuals look “normal-ish”, however they are skewed, which will also affect the prediction intervals.
Residual Predictor Plots
We expect that the residuals should be randomly scattered without showing any systematic patterns. We can check this by plotting them against the predictors.
In the plots below, we don’t see any noticeable patterns between the residuals and the predictor values.
us_change %>%
left_join(residuals(us_consumption_mdl), by = 'Quarter') %>%
pivot_longer(
c(Income, Production, Savings, Unemployment),
names_to = 'Economic_Measure',
values_to = 'Economic_Value'
) %>%
ggplot() +
geom_point(aes(Economic_Value, .resid)) +
facet_wrap(vars(Economic_Measure), scales = 'free') +
labs(
x = 'Economic Measure %',
y = 'Model Residuals',
title = 'US Consumption',
subtitle = 'Linear Model - Predictors versus Residuals'
)
Residual Versus Fitted
A plot of the residuals versus the fitted values should also show no pattern. If there is a pattern, the errors may be heteroscedastic, whereby the variance of the residuals is not constant.
From the plot below we see that the errors appear to be random, suggesting that the errors are homoscedastic.
us_consumption_mdl %>%
augment() %>%
ggplot() +
geom_point(aes(.fitted, .resid)) +
labs(
x = 'Fitted Consumption',
y = 'Residuals',
title = 'US Consumption Model',
subtitle = 'Linear Model - Fitted versus Residuals'
)
Outliers and Influential Observations
Observations with extreme values are outliers, and observations that have a large influence on the coefficients of a regression model are called influential observations; these are said to have leverage. Usually, but not necessarily, outliers have leverage.
There is no rigid mathematical definition of an outlier - it’s ultimately subjective and depends on the context of the model. Observations can be flagged based on the interquartile range - e.g. if \(Q_1\) and \(Q_3\) are the lower and upper quartiles, an outlier could be anything outside the range \([Q_1 - k(Q_3 - Q_1), Q_3 + k(Q_3 - Q_1)]\) for some non-negative constant \(k\). John Tukey proposed \(k = 1.5\).
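As a small illustration of Tukey’s rule applied to the consumption series (a sketch, not part of the original analysis):
q <- quantile(us_change$Consumption, c(0.25, 0.75))
k <- 1.5
c(lower = q[1] - k * (q[2] - q[1]),
  upper = q[2] + k * (q[2] - q[1]))  # observations outside this range are flagged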
Leverage is usually denoted \(h_i\) for the \(i\)th observation, and is the \(i\)th diagonal element \(\boldsymbol{H}_{ii}\) of the [hat matrix](https://stats.stackexchange.com/questions/208242/hat-matrix-and-leverages-in-classical-multiple-regression).
Spurious Regression
Generally, time series data are ‘non-stationary’, which means that the values of the series do not fluctuate around a constant mean or with constant variance.
For example, let’s regress the production of cement against the production of electricity:
aus_production %>%
ggplot(aes(Electricity, Cement)) +
geom_point() +
geom_smooth(method = 'lm', formula = 'y ~ x') +
labs(
title = 'Spurious Regression',
subtitle = 'Electricity vs Cement Production'
)
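The regression report below presumably comes from fitting a TSLM of Cement on Electricity; a sketch of the call:
aus_production %>%
  model(TSLM(Cement ~ Electricity)) %>%
  report()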
Series: Cement
Model: TSLM
Residuals:
Min 1Q Median 3Q Max
-466.7 -122.5 13.2 112.3 413.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.320e+02 2.119e+01 29.83 <2e-16 ***
Electricity 2.688e-02 6.174e-04 43.54 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 161.3 on 216 degrees of freedom
Multiple R-squared: 0.8977, Adjusted R-squared: 0.8973
F-statistic: 1896 on 1 and 216 DF, p-value: < 2.22e-16
We get an \(R^2\) of about 0.9, indicating that roughly 90% of the variation in the Cement response can be accounted for by the Electricity predictor. Of course the production of cement has nothing to do with the production of electricity - a classic case of ‘correlation, not causation’. What we have is a confounding variable (or variables), likely gross domestic product or population, which has a direct impact on both of these variables.
Useful Predictors
Trend
It’s common for time series data to be trending. A linear trend can be modelled by using \(x_{1,t} = t\) as a predictor, where \(t = 1,\ldots,T\). A trend variable can be specified in the TSLM() function using the trend() special.
Dummy Variables
Dummy variables are used within a linear regression to model categorical variables. They can also be used to account for an outlier. If there are more than two categories, multiple dummy variables can be used. You use one fewer variable than there are categories, as the intercept captures the omitted category when all the dummy variables are set to zero.
The TSLM() function will automatically handle this if the season() special is included.
Below we fit a linear model to Australian electricity demand.
aus_electricity_lms <-
aus_production %>%
filter_index('1992' ~ .) %>%
model(
`Trend LM` = TSLM(Electricity ~ trend()),
`Trend + Season LM` = TSLM(Electricity ~ trend() + season())
)
aus_electricity_lms %>%
select(`Trend + Season LM`) %>%
report()
## Series: Electricity
## Model: TSLM
##
## Residuals:
## Min 1Q Median 3Q Max
## -3244.85 -490.35 11.47 575.44 3571.99
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39071.445 338.395 115.461 < 2e-16 ***
## trend() 285.960 6.034 47.390 < 2e-16 ***
## season()year2 103.040 359.676 0.286 0.775
## season()year3 2264.275 364.586 6.211 3.47e-08 ***
## season()year4 -585.185 364.636 -1.605 0.113
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1108 on 69 degrees of freedom
## Multiple R-squared: 0.9709, Adjusted R-squared: 0.9692
## F-statistic: 576.1 on 4 and 69 DF, p-value: < 2.22e-16
So we can see that, on average, ~286 gigawatt hours more electricity is produced each quarter (the trend coefficient). Relative to the first quarter, the third quarter has ~2264 GWh more electricity produced - which lines up with the Australian winter months - while the second and fourth quarter effects are not statistically significant.
aus_electricity_lms %>%
augment() %>%
ggplot(aes(x = Quarter)) +
geom_line(aes(y = Electricity, colour = 'Real Values')) +
geom_line(aes(y = .fitted, colour = .model)) +
labs(
colour = 'Data Source',
title = 'Australian Electricity Demand',
subtitle = 'Linear Regressions',
x = 'Year',
y = 'Gigawatt Hours Produced'
)
aus_electricity_lms %>%
select(`Trend + Season LM`) %>%
augment() %>%
ggplot() +
geom_point(aes(Electricity, .fitted, colour = as.factor(quarter(Quarter)))) +
geom_abline(slope = 1, intercept = 0) +
scale_colour_brewer(palette="Dark2") +
labs(
title = 'Australian Electricity Demand',
subtitle = 'Linear Regressions - Response vs. Fitted',
x = 'Electricity Demand (GWh)',
y = 'Linear Regression Fitted Values',
colour = 'Quarter'
)
Intervention Variables
It can be necessary to model interventions that may affect the response variable - e.g. competitor activity, advertising expenditure, industrial action, etc.
When the effect only lasts for one period, a ‘spike’ variable is used, which takes the value one in the period of the intervention and zero otherwise.
If the intervention changes the values suddenly and permanently, a ‘step’ variable is used, taking the value 0 before the intervention and 1 after.
Piecewise linear models can be used where the trend changes at an inflection point.
Trading Days
The number of trading days in a month can vary substantially, and can be added in as a predictor.
Distributed Lags
It can be useful to include advertising expenditure, however as this can last beyond the actual campaign, lagged values of the expenditure need to be included:
\[ x_1 = \text{advertising for previous month,} \\ x_2 = \text{advertising for previous two months,} \\ \vdots \\ x_m = \text{advertising for previous } m \text{ months.} \]
The coefficients should decrease as the lag increases.
Easter
Easter differs from most holidays because it is not held on the same date each year, and its effects can last several days. With monthly data, if Easter falls in March then the dummy variable takes value 1 in March. If Easter spans March and April, the dummy variable is split proportionally between the two months.
Fourier Series
An alternative to dummy variables, especially for long seasonal periods, is to use Fourier terms. With a Fourier series, a set of sine and cosine functions can approximate any periodic function.
If \(m\) is the seasonal period, then the first few Fourier terms are:
\[ x_{1,t} = \sin\bigg(\frac{2 \pi t}{m}\bigg), \quad x_{2,t} = \cos\bigg(\frac{2 \pi t}{m}\bigg), \\ x_{3,t} = \sin\bigg(\frac{4 \pi t}{m}\bigg), \quad x_{4,t} = \cos\bigg(\frac{4 \pi t}{m}\bigg), \\ x_{5,t} = \sin\bigg(\frac{6 \pi t}{m}\bigg), \quad x_{6,t} = \cos\bigg(\frac{6 \pi t}{m}\bigg) \]
If there is monthly seasonality and we use the first 11 of these predictor variables, we will get the same forecasts as using 11 dummy variables. The benefit is that often fewer Fourier terms are required than dummy variables, which is especially useful for long seasonal periods such as weekly data, where \(m \approx 52\).
These terms can be produced using the fourier() special. The K argument specifies how many pairs of sin and cos terms to include. The maximum allowed is \(K = m/2\).
If only the first pair is used (\(K = 1\)), the seasonal pattern is a simple sine wave. A regression with Fourier terms is often called a harmonic regression.
aus_production %>%
model(TSLM(Beer ~ trend() + fourier(K = 2))) %>%
augment() %>%
ggplot() +
geom_line(aes(Quarter, Beer, colour = 'Beer')) +
geom_line(aes(Quarter, .fitted, colour = 'Fitted')) +
labs(
colour = 'Data Source',
title = 'Australian Beer Production',
subtitle = 'Linear Regression with Trend & Fourier Terms',
x = 'Year',
y = 'Megalitres'
)
Selecting Predictors
Common approaches that are not recommended are:
- Plotting the response against a particular predictor and visually determining if there is a relationship.
- Performing a multiple linear regression on all predictors and discarding those with insignificant p-values.
- Statistical significance does not always indicate predictive value.
- P-values are misleading with correlated predictors.
Instead, the following measures can be used:
Adjusted R\(^2\)
- \(R^2\) is not a good measure of the predictive ability of a model.
- It does not allow for degrees of freedom: adding any variable will increase its value.
- Maximising \(R^2\) is equivalent to minimising the sum of squared errors (SSE).
An alternative is the adjusted \(R^2\), which is equivalent to minimising the standard error \(\hat{\sigma}_e\) and does not necessarily increase with each added predictor:
\[ \bar{R}^2 = 1 - (1 - R^2) \frac{T - 1}{T - k - 1} \] Where \(T\) is the number of observations and \(k\) is the number of predictors.
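As a quick arithmetic check (a sketch using the consumption model report above, with \(R^2 = 0.7683\), \(k = 4\) and \(T = 198\) inferred from the 193 residual degrees of freedom):
1 - (1 - 0.7683) * (198 - 1) / (198 - 4 - 1)  # ~0.7635, matching the reported adjusted R-squared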
Cross-Validation
- Can use classical leave-one-out cross validation (LOOCV):
- Remove observation \(t\).
- Fit the model using the remaining data.
- Predict the value of \(t\) and compute the error \(e_t^* = y_t - \hat{y}_t\).
- Repeat steps 1-3 for \(t = 1, \ldots, T\).
- Compute the MSE from \(e_1^*, \ldots, e_T^*\).
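A minimal LOOCV sketch for the consumption model, assuming stats::lm() with the same formula gives the same fit as TSLM() and treating the tsibble as a plain data frame:
df <- as.data.frame(us_change)
loo_errors <- vapply(seq_len(nrow(df)), function(t) {
  fit <- lm(Consumption ~ Income + Production + Unemployment + Savings,
            data = df[-t, ])
  df$Consumption[t] - predict(fit, newdata = df[t, , drop = FALSE])
}, numeric(1))
mean(loo_errors^2)  # the CV statistic: MSE of the leave-one-out errors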
Akaike’s Information Criterion
\[ AIC = T \log\bigg(\frac{SSE}{T}\bigg) + 2(k + 2) \]
The \(k + 2\) part of the equation occurs because there are that many parameters in the model: the \(k\) predictors, the intercept, and the variance of the residuals.
The model with the minimum AIC is often the best model for forecasting.
Corrected AIC
For small values of \(T\), the AIC tends to select too many predictors. A bias-corrected version has been developed:
\[ AIC_c = AIC + \frac{ 2(k+2)(k+3)}{T - k - 3} \]
Schwarz’s Bayesian Information Criterion
Usually abbreviated to BIC, SBIC or SC, this value should be minimised. BIC penalises the number of parameters more heavily than AIC. For large values of \(T\), minimising BIC is similar to leave-\(v\)-out cross-validation, where \(v = T\big(1 - \frac{1}{\log(T) - 1}\big)\).
\[ BIC = T \log\bigg(\frac{SSE}{T}\bigg) + (k+2)\log(T) \]
Which One?
- \(\bar{R}^2\) is widely used, but has a tendency to select too many predictors.
- BIC is liked because if there is a true underlying model, BIC will select it given enough data.
- However there is rarely a true underlying model.
- Selecting that model may not necessarily give the best forecasts.
Hyndman and Athanasopoulos recommend one of \(AIC_c\), \(AIC\) or \(CV\) be used.
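These measures can be extracted from a fitted mable with glance(); a sketch for the consumption model (column names assumed from fabletools):
us_consumption_mdl %>%
  glance() %>%
  select(.model, adj_r_squared, CV, AIC, AICc, BIC)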
Best Subset
Where possible, all possible regression models (i.e. every subset of the predictors) should be fitted, and the best model selected based on the above criteria.
Stepwise
If there are a large number of predictors, fitting every possible combination of them is not feasible - e.g. 40 predictors gives \(2^{40}\) models!
Backwards stepwise regression is a good approach:
- Start with all predictors.
- Remove one predictor at a time and model.
- Keep the model if it improves the measure of predictive accuracy.
- Iterate until no further improvement.
If the number of predictors is large, this won’t work, and forward stepwise regression can be used.
Inference
Beware that any procedure involving selecting predictors first will invalidate the assumptions behind the p-values.
Forecasting with Regression
There are different types of forecasts that can be produced:
- Ex-ante - made using only the information available in advance. These are genuine forecasts.
- Ex-post - made using later information on the predictors. E.g. an ex-post consumption forecast may use the actual observations of the predictors once these have been observed. These are not genuine forecasts, but are useful for studying the behaviour of forecasting models.
Ex-post forecasts can assume knowledge of the predictor variables, but should not assume knowledge of the response.
Comparing ex-ante and ex-post forecasts can help to separate out the sources of forecast uncertainty. This will show whether forecast errors are due to poor forecasts of the predictor, or due to a poor forecasting model.
Normally we cannot use actual future values of the predictors as they are not known in advance. However, special predictors that are based on calendar variables or deterministic functions of time can be used. In this case there is no difference between ex-ante and ex-post forecasts.
recent_production <-
aus_production %>%
filter_index('1992' ~ .)
recent_production %>%
model(TSLM(Beer ~ trend() + season())) %>%
forecast() %>%
autoplot(recent_production)
Scenario Based Forecasting
With scenario-based forecasting, the forecaster assumes possible scenarios for the predictor variables of interest. For example, they may produce forecasts under different assumed changes in the employment rate.
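A sketch of scenario forecasts for the consumption model, using the scenarios() and new_data() helpers (the scenario values below are made up for illustration):
future_scenarios <- scenarios(
  Increase = new_data(us_change, 4) %>%
    mutate(Income = 1, Production = 0, Unemployment = 0, Savings = 0.5),
  Decrease = new_data(us_change, 4) %>%
    mutate(Income = -1, Production = 0, Unemployment = 0, Savings = -0.5),
  names_to = 'Scenario'
)
us_consumption_mdl %>%
  forecast(new_data = future_scenarios) %>%
  autoplot(us_change)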
Building a Predictive Regression Model
A challenge with regression models is that to generate ex-ante forecasts, the model requires the future values of each predictor. In many cases generating forecasts for the predictor variables can be the most challenging part!
An alternative is to use lagged values of the predictors. The predictor set is formed by values of the predictors that are observed \(h\) time periods prior to observing \(y\). When the estimated model is projected into the future - beyond the end of the sample at \(T\) - all predictor values are available.
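A sketch of the idea using lagged income to predict consumption (the lag lengths and variable choice are illustrative only):
us_change %>%
  mutate(
    Income_lag1 = lag(Income, 1),
    Income_lag2 = lag(Income, 2)
  ) %>%
  filter(!is.na(Income_lag2)) %>%  # drop the first quarters with missing lags
  model(TSLM(Consumption ~ Income_lag1 + Income_lag2))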
Prediction Intervals
For a simple univariate regression, assuming the errors are normally distributed, an approximate 95% prediction interval is given by:
\[ \hat{y} \pm 1.96 \hat{\sigma}_e \sqrt{ 1 + \frac{1}{T} + \frac{(x - \bar{x})^2}{(T-1)s^2_x}} \]
Where \(s_x\) is the standard deviation of the observed \(x\) values and \(\hat{\sigma}_e\) is the standard error of the regression.
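A sketch of this calculation for the simple income-only regression, refit here with lm() (the assumed income growth of 1% is just an illustrative value):
simple_fit <- lm(Consumption ~ Income, data = us_change)
x_star <- 1                                   # assume income grows 1% next quarter
n_obs <- nrow(us_change)
sigma_e <- summary(simple_fit)$sigma
y_hat <- unname(predict(simple_fit, newdata = data.frame(Income = x_star)))
margin <- 1.96 * sigma_e *
  sqrt(1 + 1 / n_obs +
         (x_star - mean(us_change$Income))^2 / ((n_obs - 1) * var(us_change$Income)))
c(lower = y_hat - margin, upper = y_hat + margin)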
Matrix Formulation
Our regression model is:
\[ y_t = \beta_0 + \beta_1 x_{1,t} + \ldots + \beta_p x_{p,t} + \epsilon_t \]
We can write this in matrix form where \(\boldsymbol{y} = (y_1, \ldots, y_T)^T\), \(\boldsymbol{\epsilon} = (\epsilon_1, \ldots, \epsilon_T)^T\), \(\boldsymbol{\beta} = (\beta_0, \ldots, \beta_p)^T\), and
\[ \boldsymbol{X} = \left[ \begin{matrix} 1 & x_{1,1} & x_{2,1} & \dots & x_{p,1}\\ 1 & x_{1,2} & x_{2,2} & \dots & x_{p,2}\\ \vdots & \vdots & \vdots & & \vdots\\ 1 & x_{1,T} & x_{2,T} & \dots & x_{p,T} \end{matrix}\right] \]
Giving us:
\[ \boldsymbol{y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \]
Least Squares Estimation
Least squares estimation is performed by minimising the expression:
\[ \epsilon^T\epsilon = (y - X\beta)^T(y-X\beta) \]
This is minimised when \(\beta\) takes on the expression:
\[ \hat{\beta} = (X^TX)^{-1} X^Ty \]
set.seed(1)
y <- seq(10, 100, by = 20)
data <- tibble(
x0 = rep(1, 5),
x1 = 1:5 + rnorm(5),
x2 = seq(11, 20, 2) + rnorm(5)
)
X <-
data %>%
as.matrix()
# Normal equation
beta <- solve(t(X) %*% X) %*% t(X) %*% y
print(beta)
## [,1]
## x0 -73.989100
## x1 2.292800
## x2 7.718088
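The second set of coefficients below presumably comes from checking the result against R's built-in lm(); a sketch of that check:
coef(lm(y ~ x1 + x2, data = data))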
## (Intercept) x1 x2
## -73.989100 2.292800 7.718088
This is known as the normal equation.
The variance is estimated using:
\[ \hat{\sigma}^2_e = \frac{1}{T - k -1} (y - X\hat{\beta})^T (y - X\hat{\beta})\]
Fitted Values and Cross Validation
The normal equation shows that the fitted values can be calculated using:
\[ \hat{y} = X\hat{\beta} = X(X^TX)^{-1}X^Ty = Hy \]
The matrix \(H\) is known as the ‘hat matrix’, because it ‘puts the hat on \(y\)’.
The diagonal values of \(H\) are denoted by \(h_1, \ldots, h_T\). The cross-validation statistic can be computed by:
\[ CV = \frac{1}{T} \sum_{t = 1}^T \big( e_t / (1 - h_t) \big)^2 \]
Where \(e_t\) is the \(t\)th residual. We see that it is not necessary to fit \(T\) separate models to compute the CV statistic.
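Continuing the small matrix example above, a sketch of computing the leverages and the CV statistic directly:
H <- X %*% solve(t(X) %*% X) %*% t(X)  # the hat matrix
h <- diag(H)                           # leverages h_t
e <- y - X %*% beta                    # residuals
mean((e / (1 - h))^2)                  # the CV statistic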
Forecasts
Let \(x^*\) be a row vector containing the values of the predictors (with a 1 in the first position for the intercept). The forecast is:
\[ \hat{y} = x^*\hat{\beta} = x^* (X^TX)^{-1} X^T y \]
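Again using the toy example, a sketch of a point forecast for hypothetical new predictor values (the numbers in x_star are made up):
x_star <- c(1, 6, 23)   # 1 for the intercept, then new x1 and x2 values
drop(x_star %*% beta)   # point forecast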
Nonlinear Regression
The simplest way of modelling a nonlinear relationship is to transform the response and/or one or more of the predictors before estimating a regression model. This provides a non-linear form, however it is still linear in the parameters (\(\beta\)).
You could have a log-log model \(\log(y) = \beta_0 + \beta_1 \log(x) + \epsilon\), in which \(\beta_1\) is the average percentage change in \(y\) resulting from a 1% change in \(x\).
The general model is:
\[ y = f(x) + \epsilon \] where \(f\) is a nonlinear function. For standard linear regression, \(f(x) = \beta_0 + \beta_1 x\).
A simple specification is to make \(f\) piecewise linear - this introduces points where the slope can change:
\[ x_{2,t} = (x - c)_{+} = \begin{cases} 0 & x < c \\ x - c & x \ge c \end{cases} \]
\((x - c)_{+}\) means the value of \(x-c\) if it is positive, and zero otherwise. The bend in the slope therefore occurs at the knot \(c\).
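A sketch of constructing such a term by hand (hypothetical x values and knot c1):
x <- 0:10
c1 <- 6
pmax(x - c1, 0)   # (x - c)+ : zero below the knot, rising linearly above it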
Piecewise linear relationships are a special case of regression splines. In general:
\[ x_1 = x, \quad x_2 = (x - c_1)_{+}, \quad \ldots, \quad x_k = (x - c_{k-1})_{+} \]
where \(c_1, \ldots, c_{k-1}\) are the knots.
Forecasting with a Nonlinear Trend
To fit a non-linear trend, you may use quadratic or other higher order polynomials:
\[ x_{1,t} = t, \text{ } x_{2,t} = t^2, \text{ } \ldots \]
However these are not recommended for forecasting. When extrapolated, the resulting forecasts are often unrealistic.
Piecewise specification is a better approach:
\[ x_{1,t} = t, \quad x_{2,t} = (t - \tau)_{+} = \begin{cases} 0 & t < \tau \\ t - \tau & t \ge \tau \end{cases} \]
In this model, if the coefficients of \(x_{1,t}\) and \(x_{2,t}\) are \(\beta_1\) and \(\beta_2\), then \(\beta_1\) is the slope before time \(\tau\), and the slope of the line after \(\tau\) is given by \(\beta_1 + \beta_2\).
Example
boston_men_mdl <-
boston_marathon %>%
filter(Event == "Men's open division") %>%
mutate(Seconds = as.numeric(Time)) %>%
model(lm = TSLM(Seconds ~ trend()))
boston_men_mdl %>%
augment() %>%
ggplot() +
geom_line(aes(Year, Seconds)) +
geom_line(aes(Year, .fitted), colour = 'Blue') +
labs(
title = 'Boston Marathon Times - Men',
subtitle = 'Linear Fit',
x = 'Year',
y = 'Seconds'
)
boston_men_mdl %>%
augment() %>%
autoplot(.resid) +
labs(
title = 'Boston Marathon - Men',
subtitle = 'Linear Regression Residuals',
x = 'Year',
y = 'Residual (Seconds)'
)
We see that the plot shows an obvious non-linear pattern, with some heteroscedasticity as well.
Instead we will fit an exponential trend (equivalent to a log-linear model) and a piecewise linear trend. A caution: subjective identification of the knot points can lead to over-fitting.
boston_marathon %>%
filter(Event == "Men's open division") %>%
mutate(Seconds = as.numeric(Time)) %>%
model(
exp = TSLM(log(Seconds) ~ trend()),
pw = TSLM(Seconds ~ trend(knots = c(1940, 1980)))
) %>%
augment() %>%
ggplot() +
geom_line(aes(Year, Seconds)) +
geom_line(aes(Year, .fitted, colour = .model), size = 1.5) +
labs(
title = 'Boston Marathon - Men',
subtitle = 'Exponential and Piecewise Linear Models',
x = 'Year',
y = 'Seconds'
)
Correlation, Causation, and Forecasting
Correlation is not causation! A variable \(x\) may be useful for forecasting \(y\), but that does not mean it is causing it. \(y\) may be causing \(x\), or the relationship may be more complicated.
A confounding variable influences the response and at least one predictor.
Correlations are useful for forecasting, even when there is no causal relationship, or when the causality runs in the other direction, or when there is confounding.
However, models will be better if the causal mechanism can be determined.
Multicollinearity and Forecasting
Multicollinearity occurs when similar information is provided by two or more of the predictor variables in a multiple regression. It can occur when the two predictors are highly correlated, or when a linear combination of predictors is highly correlated with another linear combination of predictors.
An example is the dummy variable trap. Consider having quarterly dummy variables \(d_1, d_2, d_3\) and \(d_4\). Since \(d_4 = 1 - d_1 - d_2 - d_3\), there is a perfect relationship between \(d_4\) and \(d_1 + d_2 + d_3\).
If there is perfect correlation, it is not possible to estimate the regression model. If there is high correlation, the estimation of the regression coefficients is very difficult.
tibble(
x1 = 1:10,
x2 = 2 * x1 + rnorm(10),
x3 = 5 * x2 + rnorm(10),
y = 1:10
) %>%
lm(y ~ x1 + x2 + x3, data = .) %>%
summary()
## Warning in summary.lm(.): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.110e-16 -5.959e-17 6.149e-17 1.932e-16 4.224e-16
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.000e+00 2.898e-16 0.000e+00 1.000
## x1 1.000e+00 3.764e-16 2.657e+15 <2e-16 ***
## x2 -6.269e-16 1.152e-15 -5.440e-01 0.606
## x3 1.128e-16 2.048e-16 5.510e-01 0.602
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.053e-16 on 6 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.674e+32 on 3 and 6 DF, p-value: < 2.2e-16