library(tidyverse)
library(forecast)
library(feasts)
library(fma)
library(magrittr)
library(fpp3)
library(tsibble)
library(lubridate)
The feasts package includes functions for computing features and statistics from time series. We’ve already seen some: autocorrelations, as well as the Guerrero estimate of the Box-Cox transformation parameter.
Simple Statistics
Any numerical summary computed from a time series is a feature. This could be the mean, minimum, maximum, etc.
The features() function helps compute these features. The following computes the mean of each series:
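A call along the following lines produces the output below (a sketch; the original code chunk isn’t shown here). Leaving the feature function unnamed is presumably what produces the unnamed ...4 column:
tourism %>%
  features(Trips, mean)  # unnamed feature, hence the unnamed `...4` column below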
## # A tibble: 304 x 4
## Region State Purpose ...4
## <chr> <chr> <chr> <dbl>
## 1 Adelaide South Australia Business 156.
## 2 Adelaide South Australia Holiday 157.
## 3 Adelaide South Australia Other 56.6
## 4 Adelaide South Australia Visiting 205.
## 5 Adelaide Hills South Australia Business 2.66
## 6 Adelaide Hills South Australia Holiday 10.5
## 7 Adelaide Hills South Australia Other 1.40
## 8 Adelaide Hills South Australia Visiting 14.2
## 9 Alice Springs Northern Territory Business 14.6
## 10 Alice Springs Northern Territory Holiday 31.9
## # … with 294 more rows
Passing a named list of functions names the resulting columns; here the rows are also sorted by the mean:
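Something along these lines produces the named, sorted output below (a sketch; the arrange() call is assumed from the row ordering):
tourism %>%
  features(Trips, list(Mean = mean)) %>%  # named list -> named column
  arrange(Mean)                           # sort ascending by the mean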
## # A tibble: 304 x 4
## Region State Purpose Mean
## <chr> <chr> <chr> <dbl>
## 1 Kangaroo Island South Australia Other 0.340
## 2 MacDonnell Northern Territory Other 0.449
## 3 Wilderness West Tasmania Other 0.478
## 4 Barkly Northern Territory Other 0.632
## 5 Clare Valley South Australia Other 0.898
## 6 Barossa South Australia Other 1.02
## 7 Kakadu Arnhem Northern Territory Other 1.04
## 8 Lasseter Northern Territory Other 1.14
## 9 Wimmera Victoria Other 1.15
## 10 MacDonnell Northern Territory Visiting 1.18
## # … with 294 more rows
Let’s look at quantiles. Any additional arguments to features() are passed on to the function being called; here, 0% is the minimum and 100% is the maximum.
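A sketch of such a call, explicitly passing probs through to quantile() (the default probs would give these same five quantiles):
tourism %>%
  features(Trips, quantile, probs = seq(0, 1, by = 0.25))  # probs is passed on to quantile()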
## # A tibble: 304 x 8
## Region State Purpose `0%` `25%` `50%` `75%` `100%`
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Adelaide South Austral… Busine… 68.7 134. 153. 177. 242.
## 2 Adelaide South Austral… Holiday 108. 135. 154. 172. 224.
## 3 Adelaide South Austral… Other 25.9 43.9 53.8 62.5 107.
## 4 Adelaide South Austral… Visiti… 137. 179. 206. 229. 270.
## 5 Adelaide Hi… South Austral… Busine… 0 0 1.26 3.92 28.6
## 6 Adelaide Hi… South Austral… Holiday 0 5.77 8.52 14.1 35.8
## 7 Adelaide Hi… South Austral… Other 0 0 0.908 2.09 8.95
## 8 Adelaide Hi… South Austral… Visiti… 0.778 8.91 12.2 16.8 81.1
## 9 Alice Sprin… Northern Terr… Busine… 1.01 9.13 13.3 18.5 34.1
## 10 Alice Sprin… Northern Terr… Holiday 2.81 16.9 31.5 44.8 76.5
## # … with 294 more rows
ACF Features
All the autocorrelations of a series can be considered features. The autocorrelations themselves can also be summarised to produce new features. For example, the sum of squares of the first ten autocorrelation coefficients is a useful summary of how much autocorrelation there is in a series, regardless of lag.
We can also compute autocorrelations of transformations of time series. It is useful to look at changes in the series between periods by ‘differencing’ the data.
Another approach is to compute seasonal differences in the data - i.e. the difference between consecutive Januaries to look at how the series is changing between years.
We recall that the autocorrelation coefficient at lag \(k\) is \(\text{Cor}(y_t, y_{t-k})\).
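As a concrete illustration (a sketch; the choice of series is arbitrary), the autocorrelations of a differenced series can be computed directly with ACF() and difference():
tourism %>%
  filter(Region == 'Adelaide', Purpose == 'Holiday') %>%  # arbitrary single series
  ACF(difference(Trips), lag_max = 10)                    # ACF of the first differences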
The feat_acf() function computes a selection of autocorrelations. It returns six features (seven for seasonal data):
- The first autocorrelation coefficient from the original data.
- The sum of squares of the first ten autocorrelation coefficients from the original data.
- The first autocorrelation coefficient from the differenced data.
- The sum of squares of the first ten autocorrelation coefficients from the differenced data.
- The same as above for twice-differenced data.
- For seasonal data, the autocorrelation coefficient at the first seasonal lag is also returned.
Application to the Australian tourism data:
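A call along these lines produces the output below (a sketch; the original chunk isn’t shown):
tourism %>%
  features(Trips, feat_acf)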
## # A tibble: 304 x 10
## Region State Purpose acf1 acf10 diff1_acf1 diff1_acf10 diff2_acf1
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Adela… Sout… Busine… 0.0333 0.131 -0.520 0.463 -0.676
## 2 Adela… Sout… Holiday 0.0456 0.372 -0.343 0.614 -0.487
## 3 Adela… Sout… Other 0.517 1.15 -0.409 0.383 -0.675
## 4 Adela… Sout… Visiti… 0.0684 0.294 -0.394 0.452 -0.518
## 5 Adela… Sout… Busine… 0.0709 0.134 -0.580 0.415 -0.750
## 6 Adela… Sout… Holiday 0.131 0.313 -0.536 0.500 -0.716
## 7 Adela… Sout… Other 0.261 0.330 -0.253 0.317 -0.457
## 8 Adela… Sout… Visiti… 0.139 0.117 -0.472 0.239 -0.626
## 9 Alice… Nort… Busine… 0.217 0.367 -0.500 0.381 -0.658
## 10 Alice… Nort… Holiday -0.00660 2.11 -0.153 2.11 -0.274
## # … with 294 more rows, and 2 more variables: diff2_acf10 <dbl>,
## # season_acf1 <dbl>
STL Features
STL decompositions are the basis for several more features. A time series decomposition can be used to measure the strength of trend and seasonality in a time series.
We recall the additive decomposition \(y_t = T_t + S_t + R_t\). For strongly trended data, the seasonally adjusted data should have much more variation than the remainder, so \(\text{Var}(R_t)/\text{Var}(T_t + R_t)\) should be relatively small. But for data with little or no trend, the two variances should be approximately the same.
The strength of the trend is defined as:
\[ F_T = \max\Bigg(0,\; 1 - \frac{\text{Var}(R_t)}{\text{Var}(T_t + R_t)} \Bigg)\]
Because the variance of the remainder may occasionally be larger than the variance of the seasonally adjusted data, the minimum possible value of \(F_T\) is set to 0.
The strength of the seasonality is defined similarly, but with respect to the detrended data rather than the seasonally adjusted data.
\[ F_S = \max\Bigg(0,\; 1 - \frac{\text{Var}(R_t)}{\text{Var}(S_t + R_t)} \Bigg)\]
A series with \(F_S\) close to zero has almost no seasonality, while a series with strong seasonality will have \(F_S\) close to 1.
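To make the definitions concrete, the two strengths can be computed by hand from an STL decomposition of a single series. This is only a sketch: the series is chosen arbitrarily, and it assumes the quarterly STL components are named trend, season_year and remainder.
tourism %>%
  filter(Region == 'Adelaide', Purpose == 'Holiday') %>%  # arbitrary single series
  model(STL(Trips)) %>%                                   # additive STL decomposition
  components() %>%
  as_tibble() %>%
  summarise(
    # F_T: remainder variation relative to the seasonally adjusted data
    F_T = max(0, 1 - var(remainder) / var(trend + remainder)),
    # F_S: remainder variation relative to the detrended data
    F_S = max(0, 1 - var(remainder) / var(season_year + remainder))
  )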
With STL, the timing of the peaks and troughs is also useful. This tells us which month or quarter contains the largest seasonal component.
These features are available in the feat_stl() function.
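The output below shows the first rows of a call along these lines (a sketch; the head() is assumed, since only six rows are displayed):
tourism %>%
  features(Trips, feat_stl) %>%
  head()  # only the first six rows are shown below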
## # A tibble: 6 x 12
## Region State Purpose trend_strength seasonal_streng… seasonal_peak_y…
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 Adela… Sout… Busine… 0.451 0.380 3
## 2 Adela… Sout… Holiday 0.541 0.601 1
## 3 Adela… Sout… Other 0.743 0.189 2
## 4 Adela… Sout… Visiti… 0.433 0.446 1
## 5 Adela… Sout… Busine… 0.453 0.140 3
## 6 Adela… Sout… Holiday 0.512 0.244 2
## # … with 6 more variables: seasonal_trough_year <dbl>, spikiness <dbl>,
## # linearity <dbl>, curvature <dbl>, stl_e_acf1 <dbl>, stl_e_acf10 <dbl>
We can use these features to identify which series are most strongly trended and which are most seasonal:
tourism %>%
features(Trips, feat_stl) %>%
ggplot() +
geom_point(aes(trend_strength, seasonal_strength_year, colour = Purpose)) +
facet_wrap(vars(State)) +
labs(
x = 'Trend Strength',
y = 'Seasonal Strength',
title = 'Australian Tourism',
subtitle = 'Trend vs Seasonal Strength per State'
)
From this we see that the holiday series are the most seasonal, and the strongest trends seem to be in Western Australia.
The most seasonal series can be identified and plotted:
tourism %>%
features(Trips, feat_stl) %>%
filter(seasonal_strength_year == max(seasonal_strength_year)) %>%
left_join(tourism, by = c('State', 'Region', 'Purpose')) %>%
ggplot() +
geom_line(aes(Quarter, Trips)) +
facet_grid(vars(State, Region, Purpose))
This shows holiday trips to the most popular ski region of Australia.
The feat_stl() function returns other features as well:
- spikiness measures the prevalence of spikes in the remainder component \(R_t\) of the STL decomposition.
- linearity measures the linearity of the trend component of the STL decomposition. It’s based on the coefficient of a linear regression applied to the trend component.
- curvature measures the curvature of the trend component. It’s based on the coefficient from an orthogonal quadratic regression applied to the trend component.
- stl_e_acf1 is the first autocorrelation coefficient of the remainder series.
- stl_e_acf10 is the sum of squares of the first ten autocorrelation coefficients of the remainder series.
Other Features
The following is a reference for other features calculated by the feasts package (an example of computing a couple of them directly follows the list):
- coef_hurst is the Hurst coefficient, a measure of ‘long memory’. A series with long memory will have significant autocorrelations at many lags.
- spectral_entropy computes the Shannon spectral entropy, a measure of how easy the series is to forecast. A series with strong trend and seasonality will have entropy close to 0.
- bp_stat is the Box-Pierce statistic, which tests whether a time series is white noise; bp_pvalue gives the p-value from the test.
- lb_stat is the Ljung-Box statistic, which is similar to the above.
- diff1_pacf5 is the sum of squares of the first five partial autocorrelations from the differenced data.
- diff2_pacf5 is the same, but for the twice-differenced data.
- season_pacf is the partial autocorrelation at the first seasonal lag.
- kpss_stat is the Kwiatkowski-Phillips-Schmidt-Shin statistic for testing whether a series is stationary; kpss_pvalue is the p-value from the test.
- pp_stat gives the Phillips-Perron statistic, which tests whether the series is non-stationary.
- ndiffs gives the number of differences required to reach a stationary series, based on the KPSS test.
- nsdiffs gives the number of seasonal differences required to make the series stationary.
- var_tiled_mean gives the variance of the tiled means, which are the means of consecutive non-overlapping blocks of observations. The default tile length is 10 for non-seasonal data, or the length of the seasonal period otherwise. Sometimes called the ‘stability’ feature.
- var_tiled_var is the same, but uses the tiled variances. Sometimes called the ‘lumpiness’ feature.
- shift_level_max finds the largest mean shift between two consecutive sliding windows of the time series. Useful for finding sudden jumps or drops in a time series.
- shift_level_index gives the index at which the largest mean shift occurs.
- shift_var_max finds the largest variance shift between two consecutive sliding windows. Useful for finding sudden changes in volatility.
- shift_var_index gives the index at which the largest variance shift occurs.
- shift_kl_max finds the largest distributional shift (based on the Kullback-Leibler divergence) between two consecutive sliding windows of the time series. Useful for finding sudden changes in the distribution of a time series.
- shift_kl_index gives the index at which the largest distributional shift occurs.
- n_crossing_points is the number of times a time series crosses the median.
- n_flat_spots is the number of sections of the data where the series is relatively unchanging.
- stat_arch_lm returns the statistic based on Engle’s Lagrange Multiplier test for autoregressive conditional heteroscedasticity (ARCH).
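Any of these features can be computed individually by passing the corresponding feature function(s) to features(). For example (a sketch; coef_hurst and feat_spectral, which returns spectral_entropy, are combined here):
tourism %>%
  features(Trips, list(coef_hurst, feat_spectral))  # Hurst coefficient and spectral entropy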
Exploring Tourism Data
There are 39 features for every combination of the three key variables of the tsibble (Region, State and Purpose).
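The tourism_features table isn’t constructed in this excerpt. A sketch of how it can be built, using feature_set() to collect every feature exported by the feasts package:
tourism_features <- tourism %>%
  features(Trips, feature_set(pkgs = 'feasts'))  # all features provided by feasts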
We’ll create a scatterplot matrix of groups of features, starting with seasonality:
library(glue)
tourism_features %>%
select_at(vars(contains('season'), Purpose)) %>%
mutate(
seasonal_peak_year = glue("Q{seasonal_peak_year + 1}"),
seasonal_trough_year = glue("Q{seasonal_trough_year + 1}")
) %>%
GGally::ggpairs(aes(colour = Purpose))
What do we learn from this?
- The three measures related to seasonality are all positively correlated.
- The bottom left and top right show that the most strongly seasonal series are related to holidays.
- The bar plots in the bottom row show that seasonal peaks for Business travel occur most often in Q4, and least often in Q2.
It can be difficult to interpret this mess of lines and colour. Dimension reduction via principal components analysis (PCA) gives the linear combinations of the features that explain the most variation.
library(broom)
tourism_pca <-
tourism_features %>%
select(-c(State, Region, Purpose)) %>%
prcomp(scale = TRUE) %>%
augment(tourism_features)
tourism_pca %>%
ggplot() +
geom_point(aes(.fittedPC1, .fittedPC2, colour = Purpose)) +
theme(aspect.ratio = 1) +
labs(
x = 'Principal Component 1',
y = 'Principal Component 2',
title = 'Australian Tourism - Seasonal Features',
subtitle = 'Principal Components Analysis'
)
From this we see that the holiday series behave quite differently from the rest. Almost all of the holiday series are in the top half of the plot, so the second principal component is distinguishing between holidays and the other purposes of travel.
We can also see anomalous series, which appear far away from the rest.
tourism_pca %>%
filter(.fittedPC1 > 12) %>%
select(Region, State, Purpose, .fittedPC1, .fittedPC2) %>%
left_join(tourism, by = c('State', 'Region', 'Purpose')) %>%
mutate(Series = glue('{State}', '{Region}', '{Purpose}', .sep = "\n\n")) %>%
ggplot() +
geom_line(aes(Quarter, Trips)) +
facet_grid(vars(Series), scales = 'free') +
labs(
title = 'Australian Tourism',
subtitle = 'Outliers in Principal Component Space'
)
What do we see? Melbourne holidays are different because there is no seasonality. Western Australian business is interesting because of the lack of seasonality and a large jump in the last few years.
Exercises
- Write a function to compute the mean and standard deviation of a time series, and apply it to the PBS data. Plot the series with the highest mean, and the series with the lowest standard deviation.
- Use GGally::ggpairs() to look at the relationships between the STL-based features for the tourism data. Which is the peak quarter for holidays in each state?
- Use a feature-based approach to look for outlying series in the PBS data. What is unusual about the series you identify as “outliers”?