Principal components analysis (PCA)
Digression: Factor augmented VARs (FA-VARs) based on PCA
zoo: For working with time series data. The package includes important functions such as computing rolling correlations, rolling standard deviations and aggregation (e.g., converting data from daily to monthly frequency).
vars: Includes methods and tools specifically for VAR analysis.
factoextra: Easy-to-use functions to extract and visualise the output of PCA.
corrplot: Provides a visual exploratory tool on correlation matrix.
# Important packages described above
library(zoo)
library(vars)
library(factoextra)
library(corrplot)
options(scipen = 9) # Avoid scientific notation
In this lab, we will use a monthly dataset for the US ranging from
January 1980 to December 2019. The data includes year-on-year (y-o-y)
consumer price inflation (CPIAUCSL
), the Federal Funds rate
(FEDFUNDS
), and y-o-y growth rates for seven different
industrial production measures: INDPRO
,
IPMANSICS
, IPCONGD
, IPMINE
,
IPDCONGD
, IPBUSEQ
, IPMAT
. The
time series were directly transformed and downloaded as a
.csv
file from FRED.
# Allocate the macroeconomic variables to the object "usmacro"
if("usmonthly.csv" %in% list.files()){
usmacro <- read.table("usmonthly.csv", sep=",",header=T)
}else{ # file.choose() allows to choose the file interactively
usmacro <- read.table(file.choose(), sep=",",header=T)
}
# Select the following macroeconomic variables
var.slct <- c("INDPRO", "IPMANSICS", "IPCONGD", "IPMINE", "IPDCONGD", "IPBUSEQ", "IPMAT",
"CPIAUCSL", "FEDFUNDS")
# Specify "usmacro" as time series object
usmacro <- ts(usmacro[,var.slct], end = c(2019, 12), frequency = 12)
colnames(usmacro) <- c("IP1", "IP2", "IP3", "IP4", "IP5", "IP6", "IP7",
"CPI", "FFR") # Use shorter labels for variables
We may wish to subset our time series.
# Subset time series and define a new end date
usmacro <- window(usmacro, # Select the ts object
end = c(2006,12) # Define new end date of time series
)
In the first two parts, we focus solely on the seven different
industrial production measures. Let’s plot them over time with
ts.plot()
.
# Time series plot of industrial production measures
IP.series <- usmacro[,1:7]
ts.plot(IP.series,
main = "Different Industrial Production Measures", # Title of plot
ylab = "Y-o-y growth rate", # Label of y-axis
xlab = "Month", # Label of x-axis
col = c(gray(seq(0.7,0.1,length.out = 7))), # Define colours
lty = "solid", # Define line types
lwd = 2 # Define line width
)
abline(h = 0) # Horizontal line at zero
Among the seven different industrial production growth rates, we can
eyeball a substantial degree of comovement. It seems
that, particularly around recessions, the comovement is exceptionally
strong. Let’s take a closer look at the actual correlations. The
corrplot()
function provides a nice visualisation of these
correlations.
# Obtain raw correlation coefficients
cor <- cor(IP.series)
# Nice correlation plot for the industrial production measures
corrplot::corrplot(cor)
How can we interpret this correlation plot?
We can use the function prcomp()
to run principal
components analysis (PCA), extract the principal components (which can
be used as estimates of the factors), loadings matrix, and other
series-specific quantities.
# Obtain principal components
PCA <- prcomp(IP.series, scale = TRUE)
# Extract the principal component and specify them to a time series
IP.PCs <- ts(PCA$x, start = start(IP.series), frequency = 12)
# Extract the loadings
IP.load <- PCA$rotation
# Extract the results for variables
PCA.var <- get_pca_var(PCA)
# Obtain summary for each component
summary(PCA)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.2726 1.1351 0.5736 0.35667 0.27267 0.11652 0.05306
## Proportion of Variance 0.7378 0.1840 0.0470 0.01817 0.01062 0.00194 0.00040
## Cumulative Proportion 0.7378 0.9219 0.9689 0.98704 0.99766 0.99960 1.00000
How can we interpret the output we get from
summary(PCA)
?
Standard deviation captures the idea that the variances of the PCs are as large as possible subject to a constraint, Proportion of Variance refers to the fraction of total variability in the data explained by each factor/PC, and Cumulative Proportion refers to the total variability in the data explained by the factors/PCs considered.
We can visualise the variation explained by each principal
component using a scree plot. A scree plot is an informal way
of selecting the number of factors by plotting the variation explained
by each component (from largest to smallest) against the number of
components. To do this, we use the function fviz_eig()
from
the factoextra
package.
fviz_eig(PCA, addlabels = TRUE, ylim = c(0, 100))
Based on the scree plot, how many factors (principal components) would you select?
We can also plot the first principal component alongside all the industrial production series.
# Time series plot together with first principal component
ts.plot(ts.intersect(IP.series, IP.PCs[,1]),
main = "Different Industrial Production Measures", # Title of plot
ylab = "Y-o-y growth rate", # Label of y-axis
xlab = "Month", # Label of x-axis
col = c(gray(seq(0.7, 0.1,length.out = 7)), "red"), # Define colours
lty = "solid", # Define line types
lwd = c(rep(1.5, 7), 2) # Define line width
)
abline(h = 0) # Horizontal line at zero
In addition, we can also use the corrplot()
function to
look at the size/sign of the loadings. We only need to
account for the fact that, unlike correlations, the loadings are not
bounded between 0 and 1, and are on a different scale (the real line).
This can be done via the option is.corr = FALSE
.
corrplot::corrplot(IP.load, is.corr = FALSE)
We observe that the first factor loads equally on all the industrial
production series, except for IP4
. We can also visualise
how much series-specific variation is explained by each principal
component (which tells a similar story). While the first factor
(principal component) strongly loads on all series except
IP4
, the second factor (principal component) explains a
large portion of the variation in IP4
.
corrplot::corrplot(PCA.var$cos2, is.corr=FALSE)
In the following, for simplicity, we assume a single factor. We therefore extract the first principal component and augment the other macroeconomic variables with it.
# Extract the first principal component
IP.PC1 <- IP.PCs[,1]
# Augment other macro variables with this first principal component
favar.data <- ts.intersect(IP.PC1, usmacro[,c("CPI", "FFR")])
colnames(favar.data) <- c("IP.PC1", "CPI", "FFR")
# Ordering of variables
var.order <- c("IP.PC1", # real activity factor: slow-moving (1st variable)
"CPI", # price index: slow-moving (2nd variable)
"FFR" # policy rate: fast-moving (3rd variable)
)
favar.data <- favar.data[,var.order] # Reorder columns in dataset
Estimation and impulse response analysis for FA-VARs is no different than from conventional VARs.
VARs are the generalisation of univariate autoregressive (AR) models to the case in which there is more than one endogenous variable. A VAR is simply a collection of autoregressive distributed lag (ARDL) models. In a VAR model, we regress several time series variables on the lags of these variables.
In R, there are many methods and tools available for
VAR analysis. One of these useful packages is the vars
package. We can now use to the command VAR()
to run a VAR
model. Here we use twelve lags and include an intercept for our
FA-VAR.
# Estimate a FA-VAR(12) model, including an intercept
fa.var <- vars::VAR(favar.data, # Available data (3 endogenous variables)
p = 12, # Consider 12 lags (monthly data)
type = "const") # Include an intercept
summary(fa.var)
How can we interpret this output? How would we interpret the coefficients?
What can we learn from this VAR model about each equation and time series?
For model selection, we can compute the Akaike Information Criterion (AIC) which you have seen previously. With monthly time series these days, it is almost a convention to work with p = 12 lags, accounting for the dynamics of a full year. However, we could also use any information criterion like the AIC to select the lag length.
Analysing the estimation output above can be a tedious task. We will therefore use impulse response functions (IRFs) to analyse the dynamic response over time of endogenous variables to a monetary policy shock, assuming that all other structural shocks are held at zero. IRFs measure how much future realisations are expected to differ in response to an on-off shock at time t with respect to the response when no such shock is imposed. IRFs of VARs are commonly used to quantify the dynamic response of certain endogenous variables after isolating a structural economic shock (keeping everything else constant). Intuitively, it is the marginal response to a specific shock.
In the following analysis, let’s focus on Cholesky
identification to pin down a monetary policy
shock. The interest rate, which is associated with a monetary
policy shock, is ordered last. This specific ordering
of variables implies that a monetary policy shock has no
contemporaneous impact on the industrial production factor
\(f_t\) (IP.PC1
) and
consumer price inflation (CPI
). We will impose a
positive one standard deviation shock and consider a
maximum horizon of 60 months (i.e., 5 years).
# Obtain impulse responses
nhor <- 60 # Maximum horizon of impulse responses
irf.favar <- irf(fa.var, # VAR object
impulse = "FFR", # Impulse variable (in this case a "monetary policy shock")
response = var.order, # Response variables
n.ahead = nhor,
ci = 0.67, # Confidence interval (67% corresponds to +/- one standard deviation)
runs = 1000) # Number of bootstrapping runs to obtain confidence intervals
To plot impulse responses, we could simply use plot()
,
but these default plots typically do not look nice.
# Plot impulse responses
irf.mp <- data.frame(irf.favar$irf$FFR, irf.favar$Lower$FFR, irf.favar$Upper$FFR)
colnames(irf.mp) <- c(paste0(var.order, ".m"),
paste0(var.order, ".l"),
paste0(var.order, ".u"))
par(mfrow=c(1,3)) # Divides the plotting window into three separate panels (3 columns)
# 1.) IRF of main IP factor to MP shock
plot(x = 0:nhor,
y = irf.mp$IP.PC1.m, type = "l", main = "Main IP factor to MP shock",
xlab = '', ylab = '',
ylim = c(min(irf.mp$IP.PC1.l), max(irf.mp$IP.PC1.u))*1.1)
polygon(x = c(0:nhor, rev(0:nhor)),
y = c(irf.mp$IP.PC1.u, rev(irf.mp$IP.PC1.l)),
col = 'lightgray',
lty = 0)
lines(x = 0:nhor, y = irf.mp$IP.PC1.m, lty=1, col = "black", lwd = 2) # Solid line for median response
lines(x = 0:nhor, y = irf.mp$IP.PC1.l, lty=2, col = "black", lwd = 1) # Dashed lines for bounds
lines(x = 0:nhor, y = irf.mp$IP.PC1.u, lty=2, col = "black", lwd = 1)
abline(h=0) # Horizontal zero line
# 2.) IRF of inflation to MP shock
plot(x = 0:nhor,
y = irf.mp$CPI.m, type = "l", main = "Inflation to MP shock",
xlab = '', ylab = '',
ylim = c(min(irf.mp$CPI.l), max(irf.mp$CPI.u))*1.1)
polygon(x = c(0:nhor, rev(0:nhor)),
y = c(irf.mp$CPI.u, rev(irf.mp$CPI.l)),
col = 'lightgray',
lty = 0)
lines(x = 0:nhor, y = irf.mp$CPI.m, lty=1, col = "black", lwd = 2) # Solid line for median response
lines(x = 0:nhor, y = irf.mp$CPI.l, lty=2, col = "black", lwd = 1) # Dashed lines for bounds
lines(x = 0:nhor, y = irf.mp$CPI.u, lty=2, col = "black", lwd = 1)
abline(h=0) # Horizontal zero line
# 3.) IRF of FFR to MP shock
plot(x = 0:nhor,
y = irf.mp$FFR.m, type = "l", main = "FFR to MP shock",
xlab = '', ylab = '',
ylim = c(min(irf.mp$FFR.l), max(irf.mp$FFR.u))*1.1)
polygon(x = c(0:nhor, rev(0:nhor)),
y = c(irf.mp$FFR.u, rev(irf.mp$FFR.l)),
col = 'lightgray',
lty = 0)
lines(x = 0:nhor, y = irf.mp$FFR.m, lty=1, col = "black", lwd = 2) # Solid line for median response
lines(x = 0:nhor, y = irf.mp$FFR.l, lty=2, col = "black", lwd = 1) # Dashed lines for bounds
lines(x = 0:nhor, y = irf.mp$FFR.u, lty=2, col = "black", lwd = 1)
abline(h=0) # Horizontal zero line
How can we interpret the impulse responses from above?
The FA-VAR is a powerful tool that also allows for obtaining series-specific impulse responses for each industrial production measure. We use the impulse response of the industrial production factor and exploit the relationship between each series and this factor. In other words, this amounts to scaling the single common factor by series-specific factor loadings.
# Impulse responses to each component (point estimate only)
# Duplicate the response from the common component for each series
irf.IPs <- matrix(rep(irf.mp$IP.PC1.m, each = ncol(IP.series)), ncol(IP.series), nhor+1)
rownames(irf.IPs) <- colnames(IP.series)
# Scale them with the associated loading of the 1st component
irf.IPs <- t(irf.IPs*IP.load[,1])
# Define them as a time series
irf.ip <- ts(irf.IPs, start = 0)
# Plot impulse responses of each IP series
par(mfrow = c(1,1))
ts.plot(irf.IPs,
main = "IP series-specific IRFs to an MP shock", # Title of plot
ylab = "", xlab = "",
col = c(gray(seq(0.7, 0.1,length.out = 7))), # Define colours
lty = "solid", # Define line types
lwd = 1.5 # Define line width
)
abline(h = 0)