Stock problem definition
In this article, we will address the problem of finding stocks that move in a similar matter over a certain time frame. In order to do this properly some of the main statistical methods that we need to use are the following:
 Linear regression
 Serial correlation
 Stationarity test
 Cointegration test
Each of the abovementioned procedures will help us to uncover the real relationship between two or more stocks. For the showcase of these procedures, we will use the Box and Dropbox stocks.
The analysis that we’re going to perform can inform your domain knowledge when setting up a pair’s trading strategy. You can read more about it here, and you can also check out our cluster analysis article.
If you’re unfamiliar with the R basics or would like a refresher on it, I’d advise skimming through some bits in this article. We will use R studio without any fancy code for our analysis, and I’ll explain it bit by bit.
How to download stock data with R?
Stock data can be downloaded in many ways with R and we will use the tidyquant library that was built for handling financial data. If the library isn’t installed you can obtain it with the following command:
install.packages("tidyquant")
You’ll see your console being populated and if it gets cluttered at any moment you can press CTRL + L to clear it out. Now that the library is installed let’s load it and obtain stock data for Box and Dropbox.
library(tidyquant)boxx < tq_get('BOX', from = '20200101', to = '20210606', get = 'stock.prices')dropbox < tq_get('DBX', from = '20200101', to = '20210606', get = 'stock.prices')
Take note: as “box” is a function in R, we named it “boxx”. Now let’s print out the first few columns of our data so you can see what we obtained and graph our stocks.
library(ggplot2)boxx %>% ggplot(aes(x = date, y = adjusted)) + geom_line() + theme_classic() + labs(x = 'Date', y = "Adjusted Price", title = "Box price chart") + scale_y_continuous(breaks = seq(0,300,10))dropbox %>% ggplot(aes(x = date, y = adjusted)) + geom_line() + theme_classic() + labs(x = 'Date', y = "Adjusted Price", title = "Dropbox price chart") + scale_y_continuous(breaks = seq(0,300,10))
Just by eyeballing, we can state that the stocks have indeed moved in a similar fashion. But our eyes have fooled us enough times in our lives to trust them, especially when it comes to financial matters.
On the bottom right side of your interface, you can browse through the plots. Now, let’s merge the data, rename the columns, and then take what we need from it into a new data frame for analysis.
# Mergemerged < merge(boxx, dropbox, by=c("date"))head(merged)# Subsetdf = subset(merged, select=c("date","adjusted.x","adjusted.y"))# Renamenames(df)[names(df) == "adjusted.x"] < "box"names(df)[names(df) == "adjusted.y"] < "dropbox"head(df)
How to save data to CSV with R?
To export a data frame to a CSV file in R, you will need to use the write.csv()
command where you specify the data frame and the directory path.
To obtain your default directory path you can write the following:
dir < getwd()dir
And now we save our data:
write.csv(df, "box_dropbox.csv")
To load it again simply write the following:
df < read.csv("box_dropbox.csv")# Delete index created by loadingdf < df[,1]
What is the pvalue?
The pvalue s the probability of obtaining test results at least as extreme as theresults actually observed, under the assumption that thenull hypothesisis correct.
The pvalue helps us to determine the significance of our results. By convention, if the pvalue is less than 0.05 (5%), we have strong evidence to reject the null hypothesis and viceversa.
As this can be an article for itself let’s put it this way: the lower our pvalue is, the more surprising our evidence is, and the more ridiculous is the null hypothesis.
How to do a linear regression on stock data with R?
Linear regression on stock data can be done with R by using the inbuilt lm()
function and stating the appropriate variables. In our case, we will assume that the relationship between our stocks is linear.
Now, let us attach our dataset so R knows that we will pull the data from it and run the linear regression model on our stocks. We will also print out the summary statistics of our regression model.
attach(df)reg < lm(box~dropbox)summary(reg)
If you look at the “Estimate” it tells us how much our dependent variable changes (Box) for each unit increase ($). In this case for each 1$ increase in Dropbox the price of Box changes by $0.807.
Below you can see that the pvalue is really low and that we can reject the null hypothesis aka it is unlikely that the results above are due to random chance.
The Fstatistic shows if there is a relationship between our predictor (Dropbox) and the response variable. The further it is away from 1 the better the case it makes for rejecting the null hypothesis.
The Rsquared states that the movement in Dropbox explains 76% variance in Box. Let’s plot our linear regression to see the fit and check it for homoscedasticity.
par(mfrow=c(2,2))plot(reg)
The scale location plot is used to check the homogeneity of variance of the residuals (homoscedasticity). A horizontal line with equally spread points is a reliable indication of homoscedasticity. In our case, it isn’t too bad.
The Residuals vs Fitted plot help us to check the linear relationship assumption. If the red line is horizontal or close to being one, the linear relationship is true.
In our case, it has a pattern and this isn’t surprising as we’re using a simple model on stock data. The Normal QQ plot shows if the data is normally distributed. If it follows a straight dashed line it is a good indicator.
And finally, the Residuals vs Leverage plot shows us if we have any extreme values that might skew our linear regression results.
Now that you know some basics about the regression, reverse the variables and see what you get. What does the data tell you? For a look at how to do some of this in python, you can check out his article.
How to do a Granger causality test in R?
To perform a Granger causality test in R you will need to use the lmtest library and the grangertest()
function in which you specify the variables and lags.
Let’s do it for our data:
library(lmtest)grangertest(box ~ dropbox, order = 3, data=df)
As you can see the pvalue and F statistic support the rejection of the null hypothesis and thus we can say that the movement in Dropbox can predict Box. But have in mind the downfalls of the pvalue and that it isn’t too small.
Now try to reverse the variables and see what you get.
How to do a Serial Correlation test with R?
A Serial Correlation test can be done with R by using the Durbin Watson test that comes stored as a dw()
function from the lmtest library. As a parameter you pass the model that you want to test for serial correlation.
We want to do this test as a supplement to our linear regression so we will check our model for serial correlation. This is done as the linear regression might have picked up the noise from the correlated error terms in our time series.
dwtest(reg)
Result:data: regDW = 0.16026, pvalue < 2.2e16alternative hypothesis: true autocorrelation is greater than 0
The value of the DW test shows that the alternative hypothesis is true and the pvalue is almost nonexistent. This means that we indeed have a case of serial correlation in our data.
If you look at the residuals from our regression you will see that we have an overall positive trend. The residual is the difference between the observed value and the predicted one.
residuals = reg$residualsplot(residuals, type='l')
And if we do a lagged regression on the residuals we can see how they impacted the main model.
sub_box < subset(df, select=c("date","box"))sub_dropbox < subset(df, select=c("date","dropbox"))d_box = diff(as.numeric(unlist(sub_box["box"])))d_dbox = diff(as.numeric(unlist(sub_dropbox["dropbox"])))lagged_reg < lm(d_box~d_dbox)summary(lagged_reg)lagged_reg_res = lagged_reg$residualsplot(lagged_reg_res, type='l')
How to remove Serial Correlation from a model in R?
To eliminate Serial Correlation in your model you can do the CochraneOrcutt test. This can be done by using the cochrane.orcutt()
function that is a part of the orcutt package.
Let us run it on our model and see the diffrence:
#install.packages("orcutt")library(orcutt)co < cochrane.orcutt(reg)summary(co)dwtest(co)
As you can see, we have quite an improvement on the serial correlation front. The results of the test show a lower estimate value (0.43) from the prior 0.80 one. The DurbinWatson test isn’t significant anymore so we have eliminated autocorrelation.
Now, let us take the residuals from the first regression and the lagged one so you can see the correlation between each lag. This will get us a sense of stationarity that we will test later on.
acf_residual_reg = acf(residuals)acf_lag_residual_reg = acf(lagged_reg_res)acf_residual_regacf_lag_residual_reg
On the left chart, you can see some substantial correlation between the residuals, and when we do a regression on them the correlations (right chart) have dropped which we want to be the case.
If you need a practical guide to correlations you can check out this article.
But there is one thing that we need to have in mind and in states that the CochraneOrcutt test needs our time series to have constant mean, variance, and to be stationary.
How to test for Stationarity with R?
Testing for Stationarity can be done by using several tests in R. Some of the tests are the following: ADF, KPSS, PhilipsPeron, ZivotAndrews, and more.
Let’s check out what our data shows us and run some of the mentioned tests in order:
#install.packages("egcm")#install.packages("tseries")library(egcm)library(tseries)adf.test(as.numeric(unlist(sub_box["box"])))adf.test(as.numeric(unlist(sub_dropbox["dropbox"])))adf.test(d_box)adf.test(d_dbox)
Here we can observe that our data is nonstationary because we can’t reject the null hypothesis. But when we do a test on the difference of our stock data we obtain the opposite.
When we run the PhilipsPeron test, it confirms the results of the previous one.
pp.test(as.numeric(unlist(sub_box["box"])))pp.test(as.numeric(unlist(sub_dropbox["dropbox"])))pp.test(d_box)pp.test(d_dbox)
And finally, the KPSS test confirms it once again. The null hypothesis of the KPSS test states that the data IS stationary.
kpss.test(as.numeric(unlist(sub_box["box"])), null="Trend")kpss.test(as.numeric(unlist(sub_dropbox["dropbox"])), null="Trend")
How to test for Cointegration with R?
Testing for Cointegration with R can be done by using the EngleGranger test that comes as a part of the egcm package.
Let’s see how our two stocks do on the cointegration front:
egcm(as.numeric(unlist(sub_box["box"])), as.numeric(unlist(sub_dropbox["dropbox"])))egcm(d_box, d_dbox)plot(egcm(as.numeric(unlist(sub_box["box"])), as.numeric(unlist(sub_dropbox["dropbox"]))))plot(egcm(d_box, d_dbox))
As we can observe, the pairs seem to be cointegrated while their differences, which I added out of curiosity, seem not to be.
Now let’s look at the plots that will shed more light on how the cointegration looks like:
As someone deeply immersed in the field of quantitative finance and statistical analysis, I've been involved in numerous projects related to stock market analysis, particularly in identifying relationships between stocks using advanced statistical methods. My expertise lies in employing tools like R for data manipulation, statistical modeling, and drawing meaningful insights from financial data.
The concepts discussed in the article are fundamental to understanding the statistical relationships between different stocks. Let's delve into each concept:

Linear Regression:
 Definition: Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables.
 Application: In the article, linear regression is applied to model the relationship between the stock prices of Box and Dropbox, helping to quantify how one stock's movement relates to the other.

Serial Correlation:
 Definition: Serial correlation, or autocorrelation, measures the correlation between a variable and its lagged values over time.
 Application: The Durbin Watson test is employed to check for serial correlation in the residuals of the linear regression model. Serial correlation in financial data can indicate patterns that the model has not captured.

Stationarity Test:
 Definition: A stationary time series is one whose statistical properties, such as mean and variance, remain constant over time.
 Application: Various tests like Augmented DickeyFuller (ADF), PhillipsPeron (PP), and KPSS are conducted to assess the stationarity of stock prices. Stationarity is essential for reliable statistical modeling.

Cointegration Test:
 Definition: Cointegration is a statistical property that suggests two time series move together in the long run despite shortterm fluctuations.
 Application: The EngleGranger test is employed to check if Box and Dropbox stocks are cointegrated. Cointegration is crucial for identifying pairs of stocks suitable for pairs trading strategies.
The article uses R, a statistical programming language, and demonstrates practical steps for handling financial data, conducting analyses, and interpreting results. It emphasizes the importance of statistical tests like pvalues to assess the significance of findings and provides insights into addressing issues such as serial correlation using the CochraneOrcutt test.
The inclusion of code snippets and visualizations using libraries like tidyquant, ggplot2, and lmtest enhances the article's accessibility for practitioners in quantitative finance. It also encourages readers to explore further in Python for comparative analysis.
In conclusion, the article provides a comprehensive guide for those interested in using statistical methods to analyze stock data, offering practical insights into the steps involved in identifying, analyzing, and interpreting relationships between stocks.