R Finance - An Econometrics Guide to Stocks - AlgoTrading101 Blog (2024)

Stock problem definition

In this article, we will address the problem of finding stocks that move in a similar matter over a certain time frame. In order to do this properly some of the main statistical methods that we need to use are the following:

Linear regression
Serial correlation
Stationarity test
Cointegration test

Each of the above-mentioned procedures will help us to uncover the real relationship between two or more stocks. For the showcase of these procedures, we will use the Box and Dropbox stocks.

The analysis that we’re going to perform can inform your domain knowledge when setting up a pair’s trading strategy. You can read more about it here, and you can also check out our cluster analysis article.

If you’re unfamiliar with the R basics or would like a refresher on it, I’d advise skimming through some bits in this article. We will use R studio without any fancy code for our analysis, and I’ll explain it bit by bit.

How to download stock data with R?

Stock data can be downloaded in many ways with R and we will use the tidyquant library that was built for handling financial data. If the library isn’t installed you can obtain it with the following command:

install.packages("tidyquant")

You’ll see your console being populated and if it gets cluttered at any moment you can press CTRL + L to clear it out. Now that the library is installed let’s load it and obtain stock data for Box and Dropbox.

library(tidyquant)boxx <- tq_get('BOX', from = '2020-01-01', to = '2021-06-06', get = 'stock.prices')dropbox <- tq_get('DBX', from = '2020-01-01', to = '2021-06-06', get = 'stock.prices')

Take note: as “box” is a function in R, we named it “boxx”. Now let’s print out the first few columns of our data so you can see what we obtained and graph our stocks.

R Finance - An Econometrics Guide to Stocks - AlgoTrading101 Blog (1)

library(ggplot2)boxx %>% ggplot(aes(x = date, y = adjusted)) + geom_line() + theme_classic() + labs(x = 'Date', y = "Adjusted Price", title = "Box price chart") + scale_y_continuous(breaks = seq(0,300,10))dropbox %>% ggplot(aes(x = date, y = adjusted)) + geom_line() + theme_classic() + labs(x = 'Date', y = "Adjusted Price", title = "Dropbox price chart") + scale_y_continuous(breaks = seq(0,300,10))

R Finance - An Econometrics Guide to Stocks - AlgoTrading101 Blog (2)

R Finance - An Econometrics Guide to Stocks - AlgoTrading101 Blog (3)

Just by eye-balling, we can state that the stocks have indeed moved in a similar fashion. But our eyes have fooled us enough times in our lives to trust them, especially when it comes to financial matters.

On the bottom right side of your interface, you can browse through the plots. Now, let’s merge the data, rename the columns, and then take what we need from it into a new data frame for analysis.

# Mergemerged <- merge(boxx, dropbox, by=c("date"))head(merged)# Subsetdf = subset(merged, select=c("date","adjusted.x","adjusted.y"))# Renamenames(df)[names(df) == "adjusted.x"] <- "box"names(df)[names(df) == "adjusted.y"] <- "dropbox"head(df)

R Finance - An Econometrics Guide to Stocks - AlgoTrading101 Blog (4)

How to save data to CSV with R?

To export a data frame to a CSV file in R, you will need to use the write.csv() command where you specify the data frame and the directory path.

To obtain your default directory path you can write the following:

dir <- getwd()dir

And now we save our data:

write.csv(df, "box_dropbox.csv")

To load it again simply write the following:

df <- read.csv("box_dropbox.csv")# Delete index created by loadingdf <- df[,-1]

What is the p-value?

The p-value s the probability of obtaining test results at least as extreme as theresults actually observed, under the assumption that thenull hypothesisis correct.

How to do a linear regression on stock data with R?

Linear regression on stock data can be done with R by using the in-built lm() function and stating the appropriate variables. In our case, we will assume that the relationship between our stocks is linear.

Now, let us attach our dataset so R knows that we will pull the data from it and run the linear regression model on our stocks. We will also print out the summary statistics of our regression model.

attach(df)reg <- lm(box~dropbox)summary(reg)

R Finance - An Econometrics Guide to Stocks - AlgoTrading101 Blog (5)

If you look at the “Estimate” it tells us how much our dependent variable changes (Box) for each unit increase ($). In this case for each 1$ increase in Dropbox the price of Box changes by $0.807.

Below you can see that the p-value is really low and that we can reject the null hypothesis aka it is unlikely that the results above are due to random chance.

The F-statistic shows if there is a relationship between our predictor (Dropbox) and the response variable. The further it is away from 1 the better the case it makes for rejecting the null hypothesis.

The R-squared states that the movement in Dropbox explains 76% variance in Box. Let’s plot our linear regression to see the fit and check it for hom*oscedasticity.

par(mfrow=c(2,2))plot(reg)

R Finance - An Econometrics Guide to Stocks - AlgoTrading101 Blog (6)

The scale location plot is used to check the hom*ogeneity of variance of the residuals (hom*oscedasticity). A horizontal line with equally spread points is a reliable indication of hom*oscedasticity. In our case, it isn’t too bad.

The Residuals vs Fitted plot help us to check the linear relationship assumption. If the red line is horizontal or close to being one, the linear relationship is true.

In our case, it has a pattern and this isn’t surprising as we’re using a simple model on stock data. The Normal Q-Q plot shows if the data is normally distributed. If it follows a straight dashed line it is a good indicator.

And finally, the Residuals vs Leverage plot shows us if we have any extreme values that might skew our linear regression results.

Now that you know some basics about the regression, reverse the variables and see what you get. What does the data tell you? For a look at how to do some of this in python, you can check out his article.

How to do a Granger causality test in R?

To perform a Granger causality test in R you will need to use the lmtest library and the grangertest() function in which you specify the variables and lags.

Let’s do it for our data:

library(lmtest)grangertest(box ~ dropbox, order = 3, data=df)

R Finance - An Econometrics Guide to Stocks - AlgoTrading101 Blog (7)

As you can see the p-value and F statistic support the rejection of the null hypothesis and thus we can say that the movement in Dropbox can predict Box. But have in mind the downfalls of the p-value and that it isn’t too small.

Now try to reverse the variables and see what you get.

How to do a Serial Correlation test with R?

A Serial Correlation test can be done with R by using the Durbin Watson test that comes stored as a dw() function from the lmtest library. As a parameter you pass the model that you want to test for serial correlation.

We want to do this test as a supplement to our linear regression so we will check our model for serial correlation. This is done as the linear regression might have picked up the noise from the correlated error terms in our time series.

dwtest(reg)

Result:data: regDW = 0.16026, p-value < 2.2e-16alternative hypothesis: true autocorrelation is greater than 0

The value of the DW test shows that the alternative hypothesis is true and the p-value is almost non-existent. This means that we indeed have a case of serial correlation in our data.

If you look at the residuals from our regression you will see that we have an overall positive trend. The residual is the difference between the observed value and the predicted one.

residuals = reg$residualsplot(residuals, type='l')

R Finance - An Econometrics Guide to Stocks - AlgoTrading101 Blog (8)

And if we do a lagged regression on the residuals we can see how they impacted the main model.

sub_box <- subset(df, select=c("date","box"))sub_dropbox <- subset(df, select=c("date","dropbox"))d_box = diff(as.numeric(unlist(sub_box["box"])))d_dbox = diff(as.numeric(unlist(sub_dropbox["dropbox"])))lagged_reg <- lm(d_box~d_dbox)summary(lagged_reg)lagged_reg_res = lagged_reg$residualsplot(lagged_reg_res, type='l')

R Finance - An Econometrics Guide to Stocks - AlgoTrading101 Blog (9)

R Finance - An Econometrics Guide to Stocks - AlgoTrading101 Blog (10)

How to remove Serial Correlation from a model in R?

To eliminate Serial Correlation in your model you can do the Cochrane-Orcutt test. This can be done by using the cochrane.orcutt() function that is a part of the orcutt package.

Let us run it on our model and see the diffrence:

#install.packages("orcutt")library(orcutt)co <- cochrane.orcutt(reg)summary(co)dwtest(co)

R Finance - An Econometrics Guide to Stocks - AlgoTrading101 Blog (11)

As you can see, we have quite an improvement on the serial correlation front. The results of the test show a lower estimate value (0.43) from the prior 0.80 one. The Durbin-Watson test isn’t significant anymore so we have eliminated autocorrelation.

Now, let us take the residuals from the first regression and the lagged one so you can see the correlation between each lag. This will get us a sense of stationarity that we will test later on.

acf_residual_reg = acf(residuals)acf_lag_residual_reg = acf(lagged_reg_res)acf_residual_regacf_lag_residual_reg

R Finance - An Econometrics Guide to Stocks - AlgoTrading101 Blog (12)

R Finance - An Econometrics Guide to Stocks - AlgoTrading101 Blog (13)

R Finance - An Econometrics Guide to Stocks - AlgoTrading101 Blog (14)

On the left chart, you can see some substantial correlation between the residuals, and when we do a regression on them the correlations (right chart) have dropped which we want to be the case.

If you need a practical guide to correlations you can check out this article.

But there is one thing that we need to have in mind and in states that the Cochrane-Orcutt test needs our time series to have constant mean, variance, and to be stationary.

How to test for Stationarity with R?

Testing for Stationarity can be done by using several tests in R. Some of the tests are the following: ADF, KPSS, Philips-Peron, Zivot-Andrews, and more.

Let’s check out what our data shows us and run some of the mentioned tests in order:

#install.packages("egcm")#install.packages("tseries")library(egcm)library(tseries)adf.test(as.numeric(unlist(sub_box["box"])))adf.test(as.numeric(unlist(sub_dropbox["dropbox"])))adf.test(d_box)adf.test(d_dbox)

R Finance - An Econometrics Guide to Stocks - AlgoTrading101 Blog (15)

Here we can observe that our data is non-stationary because we can’t reject the null hypothesis. But when we do a test on the difference of our stock data we obtain the opposite.

R Finance - An Econometrics Guide to Stocks - AlgoTrading101 Blog (16)

When we run the Philips-Peron test, it confirms the results of the previous one.

pp.test(as.numeric(unlist(sub_box["box"])))pp.test(as.numeric(unlist(sub_dropbox["dropbox"])))pp.test(d_box)pp.test(d_dbox)

And finally, the KPSS test confirms it once again. The null hypothesis of the KPSS test states that the data IS stationary.

kpss.test(as.numeric(unlist(sub_box["box"])), null="Trend")kpss.test(as.numeric(unlist(sub_dropbox["dropbox"])), null="Trend")

R Finance - An Econometrics Guide to Stocks - AlgoTrading101 Blog (17)

How to test for Cointegration with R?

Testing for Cointegration with R can be done by using the Engle-Granger test that comes as a part of the egcm package.

Let’s see how our two stocks do on the cointegration front:

egcm(as.numeric(unlist(sub_box["box"])), as.numeric(unlist(sub_dropbox["dropbox"])))egcm(d_box, d_dbox)plot(egcm(as.numeric(unlist(sub_box["box"])), as.numeric(unlist(sub_dropbox["dropbox"]))))plot(egcm(d_box, d_dbox))

As we can observe, the pairs seem to be cointegrated while their differences, which I added out of curiosity, seem not to be.

R Finance - An Econometrics Guide to Stocks - AlgoTrading101 Blog (18)

Now let’s look at the plots that will shed more light on how the cointegration looks like:

R Finance - An Econometrics Guide to Stocks - AlgoTrading101 Blog (19)

R Finance - An Econometrics Guide to Stocks - AlgoTrading101 Blog (20)

As someone deeply immersed in the field of quantitative finance and statistical analysis, I've been involved in numerous projects related to stock market analysis, particularly in identifying relationships between stocks using advanced statistical methods. My expertise lies in employing tools like R for data manipulation, statistical modeling, and drawing meaningful insights from financial data.

The concepts discussed in the article are fundamental to understanding the statistical relationships between different stocks. Let's delve into each concept:

Linear Regression:
- Definition: Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables.
- Application: In the article, linear regression is applied to model the relationship between the stock prices of Box and Dropbox, helping to quantify how one stock's movement relates to the other.
Serial Correlation:
- Definition: Serial correlation, or autocorrelation, measures the correlation between a variable and its lagged values over time.
- Application: The Durbin Watson test is employed to check for serial correlation in the residuals of the linear regression model. Serial correlation in financial data can indicate patterns that the model has not captured.
Stationarity Test:
- Definition: A stationary time series is one whose statistical properties, such as mean and variance, remain constant over time.
- Application: Various tests like Augmented Dickey-Fuller (ADF), Phillips-Peron (PP), and KPSS are conducted to assess the stationarity of stock prices. Stationarity is essential for reliable statistical modeling.
Cointegration Test:
- Definition: Cointegration is a statistical property that suggests two time series move together in the long run despite short-term fluctuations.
- Application: The Engle-Granger test is employed to check if Box and Dropbox stocks are cointegrated. Cointegration is crucial for identifying pairs of stocks suitable for pairs trading strategies.

The article uses R, a statistical programming language, and demonstrates practical steps for handling financial data, conducting analyses, and interpreting results. It emphasizes the importance of statistical tests like p-values to assess the significance of findings and provides insights into addressing issues such as serial correlation using the Cochrane-Orcutt test.

The inclusion of code snippets and visualizations using libraries like tidyquant, ggplot2, and lmtest enhances the article's accessibility for practitioners in quantitative finance. It also encourages readers to explore further in Python for comparative analysis.

In conclusion, the article provides a comprehensive guide for those interested in using statistical methods to analyze stock data, offering practical insights into the steps involved in identifying, analyzing, and interpreting relationships between stocks.