Tsibble Objects
A time series can be thought of as a list of numbers, along with some information about what times those numbers were recorded (the index). A tsibble
object can be used for this.
Index
Consider observations over a number of years:
## ── Attaching packages ───────────────────────────────────────────────────────────────────────── fpp3 0.3 ──
## ✓ lubridate 1.7.4 ✓ feasts 0.1.4
## ✓ tsibbledata 0.2.0 ✓ fable 0.2.1
## ── Conflicts ──────────────────────────────────────────────────────────────────────────── fpp3_conflicts ──
## x lubridate::date() masks base::date()
## x magrittr::extract() masks tidyr::extract()
## x dplyr::filter() masks stats::filter()
## x kableExtra::group_rows() masks dplyr::group_rows()
## x lubridate::interval() masks tsibble::interval()
## x dplyr::lag() masks stats::lag()
## x lubridate::new_interval() masks tsibble::new_interval()
## x magrittr::set_names() masks purrr::set_names()
y <- tsibble(
Year = 2000:2008,
Observation = c(23, 342, 66, 122, 5445, 44, 77, 104, 23),
index = Year
)
Tsibbles extend the tibble by adding in temporal structure. For observations ocurring more than once a year, a time class function needs to be used.
z <- tsibble(
Month = yearmonth(c('2019 Jan', '2019 Feb', '2019 Mar')),
Obs = seq(40, 50, length.out = 3),
index = Month
)
z
## # A tsibble: 3 x 2 [1M]
## Month Obs
## <mth> <dbl>
## 1 2019 Jan 40
## 2 2019 Feb 45
## 3 2019 Mar 50
Other time class functions:
- Annual:
start_year:end_year
- Quarterly:
yearquarter()
- Monthly:
yearmonth()
- Weekly:
yearweek()
- Daily:
as_date()
,ymd()
-Sub-daily:as_datetime()
Key Variables
A tsibble allows multiple time series to be stored in a single object. Consider men’s and women’s track races at the Olympics:
## # A tsibble: 312 x 4 [4Y]
## # Key: Length, Sex [14]
## Year Length Sex Time
## <int> <int> <chr> <dbl>
## 1 1896 100 men 12
## 2 1900 100 men 11
## 3 1904 100 men 11
## 4 1908 100 men 10.8
## 5 1912 100 men 10.8
## 6 1916 100 men NA
## 7 1920 100 men 10.8
## 8 1924 100 men 10.6
## 9 1928 100 men 10.8
## 10 1932 100 men 10.3
## # … with 302 more rows
The [4Y] informs us that the interval of observation is every 4 years. Below this is the key structure. This tells us that there are 14 different time series in the tsibble.
The 14 time series are uniquely identified by Length
and Sex
.
Functions
The usual functions apply:
filter()
to filter out values.select()
to select variables.- Note the index variable is returned even if it is not selected.
- There cannot be duplicate rows for each index.
summarise()
to sumamrise data across keys.mutate()
to create new variables.
Reading in a CSV
prison <-
readr::read_csv('https://OTexts.com/fpp3/extrafiles/prison_population.csv') %>%
mutate(Quarter = yearquarter(date)) %>%
select(-date) %>%
as_tibble(
index = Quarter,
key = c(state, gender, legal, indigenous)
)
## Parsed with column specification:
## cols(
## date = col_date(format = ""),
## state = col_character(),
## gender = col_character(),
## legal = col_character(),
## indigenous = col_character(),
## count = col_double()
## )
## # A tibble: 3,072 x 6
## state gender legal indigenous count Quarter
## <chr> <chr> <chr> <chr> <dbl> <qtr>
## 1 ACT Female Remanded ATSI 0 2005 Q1
## 2 ACT Female Remanded Non-ATSI 2 2005 Q1
## 3 ACT Female Sentenced ATSI 0 2005 Q1
## 4 ACT Female Sentenced Non-ATSI 5 2005 Q1
## 5 ACT Male Remanded ATSI 7 2005 Q1
## 6 ACT Male Remanded Non-ATSI 58 2005 Q1
## 7 ACT Male Sentenced ATSI 5 2005 Q1
## 8 ACT Male Sentenced Non-ATSI 101 2005 Q1
## 9 NSW Female Remanded ATSI 51 2005 Q1
## 10 NSW Female Remanded Non-ATSI 131 2005 Q1
## # … with 3,062 more rows
Seasonal Period
Some graphics and some models will use the seasonal period of the data. This is the number of observations before the seasonal pattern repeats. In most cases this is detected using the time idnex variable.
For quarterly monthly and weekly data, there is only one seasonal period - the number of observations within a year.
If the data is observed more than once per week, then there is often more than one seasonal pattern in the data. For example, data with daily observations may have weekly and yearly seasonal patterns.
Similarly data every minute may have hourly, daily, weekly and annual seasonality.
More complicated seasonal patterns can be specified using the period()
function.
## [1] "20M 10S"
Time Plots
For time series data, the obvious graph is a time plot. The observations are plotted against the time of observation.
ansett %>%
filter(Airports == 'MEL-SYD' & Class == 'Economy') %>%
ggplot() +
geom_line(aes(Week, Passengers)) +
labs(
title = 'Ansett Economy Class Passengers',
subtitle = 'Melbourne - Sydney'
)
There are some interesting features:
- Period in 1989 when no passengers were carried due to an industrial dispute.
- Period of reduced load on economy when some enconomy seats were replaced by business class seats.
- Large dips in the load around the start of the year due to holiday effects.
- Long term fluctuation.
- Some periods of missing observations.
Any model will need to take all these features into account in order to effectively forecast the passenger load into the future.
A simpler series:
fpp::a10 %>%
as_tsibble() %>%
autoplot() +
labs(
title = 'Antidiabetic Drig Sales',
x = 'Year',
y = '$ million'
)
In this time series there is a clear increasing trend. There is also a strong seasonal pattern that increases in size every year. The seasonal drop is due to a government subsidisation scheme which makes it cost-effective for people to stockpile the drugs at the end of the year.
Any forecasts would need to capture both the long term trend, as well as the seasonal pattern.
Time Series Patterns
When we describe a time series, there are a number of terms that are used that need to be defined:
Trend - a trend exists when there is a long term increase or decrease in the data, not necessarily linear.
Seasonal - a seasonal pattern occurs when a time series is affected by seasonal factors such as the time of year or the day of the week. The seasonality is a known and fixed frequency.
Cyclic - A cycle occurs when the data exhibit rises and falls that are not of a fixed frequency. These are often related to econmomic conditions and are often relatred to the ‘business cycle’. The duration of these fluctuations is usually at least 2 years.
Don’t confuse cyclic with seasonal; the latter is related to the calendar and fixed frequency, whereas the former is of changing frequency.
Seasonal Plots
A seasonal plot is similar to a time plot except the data are plotten against the individual ‘seasons’ rather than dates.
Note: the ggplot2 code for ggseasonplot()
can be found here.
fpp::a10 %>%
as_tsibble() %>%
gg_season(labels = 'both') +
labs(
x = 'Month',
y = '$ (millions)',
title = 'Antidiabetic Drug Sales',
subtitle = 'Seasonal Plot'
)
In this it’s clear that there’s a large jump in sales each January - these are probably December sales not registered until the next month.
Multiple Seasonal Periods
Where the data has more than one seasonal pattern, the period
argument can be sed to select which seasonal plot is required:
vic_elec %>%
gg_season(Demand, period = 'day') +
labs(
title = 'Victorian Energy Demand',
subtitle = 'Daily Seasonal Plot'
)
Seasonal Subseries Plots
An alternate plot that emphasises the seasonal pattern is where the data for each season is collected into separate time series plots:
fpp::a10 %>%
as_tsibble() %>%
gg_subseries(value) +
labs(
x = 'Month',
y = '$ million',
title = 'Seasonal Subseries - Antidiabetic Drug Sales'
)
The horizontal lines indicate the means for each month. This form of plot is very useful in identifying the underlying seasonal patterns.
Scatterplots
Sometimes we need to identify the relationships between time series. Below we see electricity demand and temperature.
library(fpp2)
library(tsibble)
elecdemand %>%
as_tsibble() %>%
filter(key != 'WorkDay') %>%
ggplot() +
geom_line(aes(index, value)) +
facet_grid(rows = vars(key), scales = 'free') +
labs(
x = 'Date',
y = '',
title = 'Electricity Demand and Temperature'
)
We can study the relationship between demand and temperature by plotting one series against another:
elecdemand %>%
as_tsibble() %>%
spread(key, value) %>%
ggplot() +
geom_point(aes(Temperature, Demand), alpha = .1, stroke = .1)
Correlation
The correlation coefficient is between \(x\) and \(y\) is:
\[ r = \frac{ \sum(x_t - \bar{x})(y_y - \bar{y}) }{ \sqrt{\sum (x_t - \bar{x})^2} \sqrt{ \sum (y_t - \bar{y})^2} } \]
It’s between 1 and -1, where positive is showing a positive relationship, and negative showing a negative relationship.
It measures only the strength of the linear relationship and can sometimes be misleading.
Scatterplot Matrices
When there are several potential predictor variables, it’s useful to plot one against the other. Consider the series below of visitor numbers to give regions of NSW:
visnights[,1:5] %>%
autoplot(facets = TRUE) +
labs(
x = 'Date',
y = '# Visitors (millions)',
title = 'Visitors to the NSW Regions'
)
To see the relationships between the time series, it can be arranged into a scatterplot matrix:
We see the scatterplots of the two variables, and their correlations. The centre plots are the density plots for the particular variable.
This plot gives us a quick view of the relationships between all pairs of variables.
Lag Plots
The figure below shows the quarterly Australian beer production, where the horizontal axis shows lagged values of the time series.
Each grah shows \(y_t\) plotted against \(y_{t - k}\) for different values of \(k\).
What we can see is that there is a positive relationship at lags 4 and 8, reflecting a strong seasonality. The negative relationship seen for lags 2 and 6 occurs because peaks (in Q4) are plotted against troughs (in Q2).
Autocorrelation
Autocorrelation measures the linear relationships between lagged values of the time series.
There are several autocorrelation coefficients corresponding to each panel in the graph above. The autocorrelation coefficients are plotted to show the autocorrelation function or ACF.
## # A tsibble: 9 x 2 [1Q]
## lag acf
## <lag> <dbl>
## 1 1Q -0.102
## 2 2Q -0.657
## 3 3Q -0.0603
## 4 4Q 0.869
## 5 5Q -0.0892
## 6 6Q -0.635
## 7 7Q -0.0542
## 8 8Q 0.832
## 9 9Q -0.108
These can be plotted in a correlogram:
aus_production %>%
filter(year(Quarter) >= 1992) %>%
ACF(Beer) %>%
autoplot() +
labs(
x = 'Autocorrelation',
y = 'Lag (Quarter)',
title = 'Australian Beer Production - Autocorrelation'
)
In this graph \(r_4\) is higher than for the other lags. This is due to the seasonal pattern in the data. The peaks tend to be four quarters apart.
The dashed blue lines indicate whether the correlations are significantly different from zero.
Trend and Seasonality in ACF Plots
When data have a trend, the autocorrelations for small lags tend to be large and positive. This is because observatiosn nearby in time are also nearby in size. So an ACF of tended time series tends to have positive values that slowly decrease as the lags increase.
When data are seasonal, the autocorrelatons will be larger for the seasonal lags (at multiples of the seasonal frequency) than for other lags.
When data are both trended and seasonal, you see the combination of these effects:
fpp::a10 %>%
as_tsibble() %>%
ACF(value, lag_max = 48) %>%
autoplot() +
labs(
x = 'Year',
y = 'Correlation',
title = 'Anti-diabetic Drug Sales',
subtitle = 'Autocorrelation'
)
The slow decrease is due to the trend, while the dips are due to the seasonality.
White Noise
Time series that show no autocorrelation are called white noise:
set.seed(1)
y <- tsibble(sample = 1:50, wn = rnorm(50), index = sample)
y %>%
ggplot() +
geom_line(aes(sample, wn)) +
labs(
title = 'White Noise',
x = 'Sample',
y = 'Value'
)
## [[1]]
##
## $title
## [1] "White Noise Autocorrelation"
##
## attr(,"class")
## [1] "labels"
For white noise, we expect autocorrelation to be close to 0, but not exactly as there is some random variation. For a white noise series, we expect 95% of the spikes in the ACF to lie within \(\pm 2 / \sqrt{T}\), where \(T\) is the length of the time series. These are the blue lines in the ACF plot. If one or more large spikes, or more than 5%, are outside the lines, it’s probably not white noise.
Exercises
- Look at the
gold
,woolyrnq
andgas
represent.
Gold is the daily morning prices of gold in USD, gas is Australia’s monthly gas production, and woolyrnq is the quarterly producton (in tonnes) of woolen yarn in Australia.
- Use
autoplot()
to plot each of these in separate plots.
- What is the frequency in each of the series?
## $Gold
## [1] 1
##
## $Gas
## [1] 12
##
## $Wool
## [1] 4
Gold is every 1 day, gas 1/12 of a year, or a month, and wool is 1/4 of a year.
- Spot the outlier in the
gold
series.
## [1] 770
2). Download the tute1.csv from the book website, open it in Excel (or some other spreadsheet application), and review its contents. You should find four columns of information. Columns B through D each contain a quarterly series, labelled Sales, AdBudget and GDP. Sales contains the quarterly sales for a small company over the period 1981-2005. AdBudget is the advertising budget and GDP is the gross domestic product. All series have been adjusted for in inflation.
- Read in the data
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_character(),
## Sales = col_double(),
## AdBudget = col_double(),
## GDP = col_double()
## )
- Convert to a time series.
- Construct time series plots of each of the three series.
q2ts %>%
as_tsibble() %>%
ggplot() +
geom_line(aes(index, value)) +
facet_grid(rows = vars(key), scales = 'free') +
labs(
x = 'Quarter',
y = 'Dollars',
title = 'Budget, GDP and Sales'
)
- Download some monthly Australian retail data from the book website. These represent retail sales in various categories for different Australian states,and are stored in a MS-Excel file.
- Read in the data.
library(readxl)
download.file(
'https://otexts.com/fpp2/extrafiles/retail.xlsx',
tmp <- tempfile()
)
q3 <- read_excel(tmp, skip = 1)
- Select one of the timer series.
- Explore the time series, Can you spot any seasonality, cyclicity and trend? What do you learn about the series?
q3ts %>%
as_tsibble() %>%
ggplot() +
geom_line(aes(index, value)) +
labs(x = 'Month', y = 'Sales', title = 'Australian Reatil Data')
It appears that there is a 6 month seasonal trend in the data, with retail sales jumping in Jun (likely due to the end of the financial year) and December (likely due to Chrismas).
There is a trend upwards up until 2008 when the ‘Great Financial Crisis’ hits:
- Create time plots of the following time series:
bicoal
,chicken
.
bicoal %>%
autoplot() +
labs(
x = 'Year',
y = 'Production',
title = 'Annual bituminous coal production in the USA: 1920–1968.'
)
chicken %>%
autoplot() +
labs(
x = 'Year',
y = 'USD (constant)',
title = 'Price of chicken in US (constant dollars): 1924–1993.'
)
- Use
ggseasonplot()
andggsubseriesplot()
functions to explore the seasonal patterns in the following time series:writing
,fancy
.
We first look at writing
. This data set is the industry sales for printing and writing paper.
We see a huge dip every year in August, and a small peak in March. The trend is upwards, but we see that 1975 went against the upward
The fancy
dataset is sales for a souvenir shop in Queensland:
In this plot the seasonal cycle is very obvious, ocurring during the November and December holiday season. The trend is upwards, with only 1990 going against this trend. This is likely due to the recession in Australia during that time.
- Use the following graphics functions:
autoplot()
,ggseasonplot()
,ggsubseriesplot()
,gglagplot()
,ggAcf()
and explore features from thehsales
series.
- Can you spot any seasonality, cyclicity and trend? -What do you learn about the series?
hsales %>%
autoplot() +
labs(
x = 'Month',
y = 'Sales',
title = 'Monthly sales of new one-family houses sold in the USA since 1973'
)
We can see a seasonality of 12 months via the lag and ACF plots, with the peak coming in Mar/April/May.
Skipped
The following time plots and ACF plots correspond to four di erent time series. Your task is to match each time plot in the rst row with one of the ACF plots in the second row.
- 1 -> B
- 2 -> A
- 3 -> D
- 4 -> C
- The
pigs
data shows the monthly total number of pigs slaughtered in Victoria, Australia, from Jan 1980 to Aug 1995. Usemypigs <- window(pigs, start=1990)
to select the data starting from 1990. Useautoplot()
andggAcf() for
mypigs` series and compare these to white noise plots.
We can see that the autocorrelation is predominantly within the blue bars, which is very similar to what occurs with ‘white noise’.
- dj contains 292 consecutive trading days of the Dow Jones Index. Do the changes in the Dow Jones Index look like white noise?
The changes in the Dow Jones index do look like white noise.