Part of this post is also available as a video tutorial on YouTube.
Time series data
Time series data is ubiquitous in finance research, but it presents challenges in OLS regressions because it can violate the assumptions of OLS. The main challenge is that the error term can be correlated across time periods and can be heteroskedastic, meaning that its variance is not constant. This leads to incorrect standard errors and invalid hypothesis tests. Time series data models have the following form:
\[
y_{t} = x_{t}\beta + \epsilon_{t},
\]
where \(y_{t}\) is the dependent variable at time \(t\in [1,T]\), \(x_{t}\) is a vector of independent variables, \(\beta\) is a vector of coefficients, and \(\epsilon_{t}\) is the error term. If the model includes an intercept, then the first element of \(x_{t}\) is 1 and the first element of \(\beta\) is \(\alpha\).
With the observations stacked over time, we can write the model in matrix form:
\[
\mathbf{y} = \mathbf{X}\beta + \mathbf{\epsilon},
\]
where \(\mathbf{y}\) is a \(T \times 1\) vector of dependent variables, \(\mathbf{X}\) is a \(T \times K\) matrix of independent variables, \(\beta\) is a \(K \times 1\) vector of coefficients, and \(\mathbf{\epsilon}\) is a \(T \times 1\) vector of error terms. The number of observations is \(T\).
More on matrix form representation
The observations are stacked in their temporal order. For example, the vector \(\mathbf{y}\) of elements \(y_{t}\) is defined as:
\[
\mathbf{y} = \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{T} \end{bmatrix}.
\]
If the model includes an intercept, then the first column of \(\mathbf{X}\) is a column of ones (i.e. all the \(x_{t}^{(1)}\equiv 1\)) and the first element of \(\beta\) is \(\alpha\):
\[
\mathbf{X} = \begin{bmatrix} 1 & x_{1}^{(2)} & \cdots & x_{1}^{(K)} \\ 1 & x_{2}^{(2)} & \cdots & x_{2}^{(K)} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{T}^{(2)} & \cdots & x_{T}^{(K)} \end{bmatrix}, \qquad \beta = \begin{bmatrix} \alpha \\ \beta^{(2)} \\ \vdots \\ \beta^{(K)} \end{bmatrix}.
\]
The goal is then to estimate the coefficients \(\beta\) and their associated standard errors. The standard errors are important because they allow us to perform statistical inference. In the context of financial anomalies, or more generally trading strategies, we can test whether the \(\alpha\) is statistically different from zero.
Test data
In this post, I explain how to estimate standard errors in time series regressions with Python and the statsmodels library. I do so using the anomalies data set available on Open Source Asset Pricing, which contains portfolio returns for the anomalies discussed in Chen and Zimmermann (2022). More specifically, I use the daily returns of the 12-month momentum anomaly from the PredictorVW.zip file available at this link.
```python
import pandas as pd

df = pd.read_csv("Mom12m_ret.csv", parse_dates=["date"], index_col=["date"])
df
```
```
              port01    port02    port03    port04    port05    port06    port07    port08    port09    port10    portLS
date
1927-01-03 -0.312789 -1.315867 -0.361545 -0.557918 -1.252321 -0.310728 -0.767765 -0.796250 -0.649846 -1.267020 -0.954230
1927-01-04  1.845338  0.659064  0.828047  0.102919  0.213863 -0.011713  0.127618  0.102198  0.659057  0.874677 -0.970662
1927-01-05  1.012461  0.724596  0.581067 -0.152049  0.079294  0.288886  0.283217 -0.080991  0.110790  0.048784 -0.963677
1927-01-06 -0.902011  0.289840 -0.462931 -1.100983 -0.223404 -0.010662 -0.121442 -0.095102 -0.142538 -0.014550  0.887461
1927-01-07 -0.332090  0.286702  0.062731  0.221905  0.152737  0.093068  0.456891  0.672505  0.225237  0.209930  0.542021
...              ...       ...       ...       ...       ...       ...       ...       ...       ...       ...       ...
2022-12-23 -1.492861 -0.940321 -0.087377  0.254381  0.580743  0.502287  0.482743  0.588520  0.295648  1.433918  2.926779
2022-12-27 -3.014401 -3.667398 -1.638701 -0.776100 -0.984944 -0.153435 -0.000316  0.146955 -0.746434  0.281178  3.295579
2022-12-28 -0.475730 -1.287840 -0.931057 -1.367397 -1.121403 -1.162237 -1.097983 -0.922127 -1.433636 -1.478748 -1.003018
2022-12-29  6.153138  4.639327  3.960907  3.063374  2.489143  2.013535  1.798595  1.207053  1.708395  0.742981 -5.410157
2022-12-30  0.607938  0.718418  0.117090 -0.167288 -0.194521 -0.461471 -0.395566 -0.365650 -0.181250 -0.002331 -0.610269

[25249 rows x 11 columns]
```
I complement the momentum strategy with the daily Fama-French 5 factors available from Ken French’s data library.
```python
ff5_df = pd.read_csv(
    "https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/F-F_Research_Data_5_Factors_2x3_daily_CSV.zip",
    skiprows=2,
    parse_dates=[0],
    index_col=[0],
)
# Clean column names for formula api
ff5_df.columns = [col.replace("-", "") for col in ff5_df.columns]
ff5_df.index.name = "date"
ff5_df
```
```
            MktRF   SMB   HML   RMW   CMA     RF
date
1963-07-01  -0.67  0.02 -0.35  0.03  0.13  0.012
1963-07-02   0.79 -0.28  0.28 -0.08 -0.21  0.012
1963-07-03   0.63 -0.18 -0.10  0.13 -0.25  0.012
1963-07-05   0.40  0.09 -0.28  0.07 -0.30  0.012
1963-07-08  -0.63  0.07 -0.20 -0.27  0.06  0.012
...           ...   ...   ...   ...   ...    ...
2023-12-22   0.21  0.61  0.09 -0.64  0.19  0.021
2023-12-26   0.48  0.81  0.46 -0.34 -0.15  0.021
2023-12-27   0.16  0.16  0.12 -0.31 -0.14  0.021
2023-12-28  -0.01 -0.38  0.03 -0.32  0.15  0.021
2023-12-29  -0.43 -1.13 -0.37  0.67 -0.07  0.021

[15229 rows x 6 columns]
```
Finally, I join the two data sets on their date index to create a single data set containing both the momentum portfolio returns and the factor returns.
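A minimal sketch of such a join on the shared date index; the miniature DataFrames below are hypothetical stand-ins for the real data sets so the snippet runs on its own:

```python
import pandas as pd

# Hypothetical miniature versions of the two data sets, indexed by date
mom = pd.DataFrame(
    {"portLS": [0.5, -0.2]},
    index=pd.to_datetime(["2022-12-29", "2022-12-30"]),
)
ff5 = pd.DataFrame(
    {"MktRF": [0.1, 0.3, -0.4], "SMB": [0.0, 0.2, 0.1]},
    index=pd.to_datetime(["2022-12-28", "2022-12-29", "2022-12-30"]),
)
mom.index.name = ff5.index.name = "date"

# An inner join on the date index keeps only days present in both data sets
df = mom.join(ff5, how="inner")
print(df.columns.tolist())  # ['portLS', 'MktRF', 'SMB']
print(len(df))              # 2
```

An inner join drops the dates before the factor data starts, which is why the regressions below use fewer observations than the momentum data set contains.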
The statsmodels library is a Python package that provides many functions for estimating statistical models and conducting statistical tests.
Installation
You can install the library with your package manager of choice:
poetry:
poetry add statsmodels
pip:
pip install statsmodels
conda:
conda install statsmodels
Usage
The library is easy to use for anyone familiar with the models used. However, reference to the documentation is a must to understand how to use all the available options. In this post, I will use the linear regression model estimated by ordinary least squares (OLS). statsmodels provides two APIs: the usual array-based API and the formula API, which allows you to specify the model using a formula string similar to R. I will use the formula API in this post. The two APIs are imported as follows:
```python
import statsmodels.api as sm
import statsmodels.formula.api as smf
```
You define a model by passing a string with the formula and a pandas DataFrame with the data to the from_formula method of the model class. For example, to create our OLS model that regresses the momentum strategy on the Fama-French 5 factors, we use the following code:
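The model definition matches the one in the cheat sheet at the end of this post; the synthetic df below is only a hypothetical stand-in for the joined data set so the snippet runs on its own:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the joined data set (the real df comes from the join above)
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.standard_normal((100, 6)),
    columns=["portLS", "MktRF", "SMB", "HML", "RMW", "CMA"],
)

# Regress the long-short momentum portfolio on the Fama-French 5 factors;
# the formula API adds the intercept automatically
mod = smf.ols("portLS ~ MktRF + SMB + HML + RMW + CMA", data=df)
print(mod.exog_names)  # ['Intercept', 'MktRF', 'SMB', 'HML', 'RMW', 'CMA']
```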
The first model I estimate is a vanilla OLS model. In our example with the momentum strategy, the \(\beta\) vector is \([ \alpha\ \beta_{MktRF}\ \beta_{SMB}\ \beta_{HML}\ \beta_{RMW}\ \beta_{CMA}]'\).
The OLS estimator is the most common in finance research. It is also the most efficient estimator when the errors are homoskedastic and uncorrelated across time. The OLS estimator is defined as:
\[
\hat{\beta}_{OLS} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y},
\]
where \(\mathbf{X}\) is the matrix of exogenous variables and \(\mathbf{y}\) is the vector of dependent variables. The covariance matrix \(\Sigma\) of the estimator is defined as:
\[
\Sigma_{OLS} = \sigma_{\mathbf{\epsilon}}^{2}\,(\mathbf{X}'\mathbf{X})^{-1},
\]
where \(\sigma_{\mathbf{\epsilon}}^{2}\) is usually estimated with the residuals \(\mathbf{\hat{e}}=( \mathbf{y}-\mathbf{X}\hat{\beta}_{OLS})\) from the OLS regression:
\[
\hat{\sigma}_{\mathbf{\epsilon}}^{2} = \frac{\mathbf{\hat{e}}'\mathbf{\hat{e}}}{T-K}.
\]
To estimate the OLS model with statsmodels, we call the fit() method. The fit() method returns a regression results object, which contains the estimated coefficients, standard errors, and other statistics. The results object has a summary() method that displays the results in a table. Depending on the parameters passed to the fit() method, we can get z-statistics or t-statistics. I always use the use_t=True option to make sure I get t-statistics in the summary table.
```python
mod.fit(use_t=True).summary()
```
```
                            OLS Regression Results
==============================================================================
Dep. Variable:                 portLS   R-squared:                       0.128
Model:                            OLS   Adj. R-squared:                  0.128
Method:                 Least Squares   F-statistic:                     439.1
Date:                Thu, 01 Feb 2024   Prob (F-statistic):               0.00
Time:                        05:38:44   Log-Likelihood:                -26910.
No. Observations:               14979   AIC:                         5.383e+04
Df Residuals:                   14973   BIC:                         5.388e+04
Df Model:                           5
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.0928      0.012      7.767      0.000       0.069       0.116
MktRF         -0.0635      0.013     -4.980      0.000      -0.089      -0.039
SMB           -0.2919      0.023    -12.510      0.000      -0.338      -0.246
HML           -0.9909      0.025    -39.510      0.000      -1.040      -0.942
RMW            0.3269      0.032     10.179      0.000       0.264       0.390
CMA            0.6359      0.040     15.712      0.000       0.557       0.715
==============================================================================
Omnibus:                     2945.727   Durbin-Watson:                   1.724
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            43704.781
Skew:                          -0.523   Prob(JB):                         0.00
Kurtosis:                      11.302   Cond. No.                         3.84
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
```
White standard errors
With time series data, the errors are often heteroskedastic and correlated across time. In this case, the OLS standard errors are incorrect.
A common solution to the problem of heteroskedasticity, i.e. that the variance of the error term is not constant, is to use heteroskedasticity-robust standard errors. The most common heteroskedasticity-robust standard errors estimator is the White (1980) estimator, which is defined as:
\[
\Sigma_{HC0} = (\mathbf{X}'\mathbf{X})^{-1}\,\mathbf{X}'\hat{\Omega}_0\mathbf{X}\,(\mathbf{X}'\mathbf{X})^{-1},
\]
where \(\hat{\Omega}_0\) is a diagonal matrix with the squared residuals \(\mathbf{\hat{e}}^{2}\) on the diagonal, i.e., with the square of the j-th residual \(\hat{e}_{j}^{2}\) at position \((j,j)\) and \(0\) elsewhere. The matrix \(\hat{\Omega}_0\) is defined as:
\[
\hat{\Omega}_0 = \begin{bmatrix} \hat{e}_{1}^{2} & 0 & \cdots & 0 \\ 0 & \hat{e}_{2}^{2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \hat{e}_{T}^{2} \end{bmatrix}.
\]
This is referred to as the HC0 estimator in statsmodels. MacKinnon and White (1985) introduced three new estimators, HC1, HC2, and HC3, which are also available in statsmodels. The HC1 estimator is the HC0 estimator scaled by a factor of \(T/(T-K)\) as a degrees-of-freedom correction, where \(T\) is the number of observations and \(K\) is the number of independent variables.
The OLS model we created can also be used to estimate the model with White standard errors. The following code estimates the OLS coefficients with White (HC0) standard errors:
```python
mod.fit(cov_type="HC0", use_t=True).summary()
```
```
                            OLS Regression Results
==============================================================================
Dep. Variable:                 portLS   R-squared:                       0.128
Model:                            OLS   Adj. R-squared:                  0.128
Method:                 Least Squares   F-statistic:                     94.22
Date:                Thu, 01 Feb 2024   Prob (F-statistic):           4.93e-98
Time:                        05:38:28   Log-Likelihood:                -26910.
No. Observations:               14979   AIC:                         5.383e+04
Df Residuals:                   14973   BIC:                         5.388e+04
Df Model:                           5
Covariance Type:                  HC0
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.0928      0.012      7.675      0.000       0.069       0.117
MktRF         -0.0635      0.020     -3.168      0.002      -0.103      -0.024
SMB           -0.2919      0.038     -7.741      0.000      -0.366      -0.218
HML           -0.9909      0.056    -17.570      0.000      -1.101      -0.880
RMW            0.3269      0.069      4.724      0.000       0.191       0.463
CMA            0.6359      0.081      7.820      0.000       0.477       0.795
==============================================================================
Omnibus:                     2945.727   Durbin-Watson:                   1.724
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            43704.781
Skew:                          -0.523   Prob(JB):                         0.00
Kurtosis:                      11.302   Cond. No.                         3.84
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC0)
```
The estimated coefficients are the same as with the OLS standard errors, but the standard errors are different. In turn, all statistical measures that depend on the standard errors, such as t-statistics and p-values, are different.
Newey-West standard errors
The White estimator takes into account only heteroskedasticity, but not autocorrelation. For that, we can use a heteroskedasticity and autocorrelation robust (HAR) standard errors estimator.
In empirical finance research, it is common in practice to use Newey-West standard errors (introduced in Newey and West 1987) to account for both heteroskedasticity and autocorrelation. The Newey-West estimator is defined as:
\[
\Sigma_{HAC} = (\mathbf{X}'\mathbf{X})^{-1}\,\hat{S}_{HAR}\,(\mathbf{X}'\mathbf{X})^{-1},
\qquad
\hat{S}_{HAR} = \sum_{j=-l}^{l}\omega(j,l)\sum_{i=|j|+1}^{T}\hat{e}_{i}\hat{e}_{i-|j|}\,x_{i}'x_{i-|j|},
\]
where \(\hat{e}_{i}\) is the \(i\)-th element of the vector \(\mathbf{\hat{e}}\). The matrix \(\hat{S}_{HAR}\) is a weighted sum of the autocovariances of the residuals \(\mathbf{\hat{e}}\), where the kernel \(\omega(j,l)\) provides the weight for lag \(j\) with a bandwidth (maximum lag) of \(l\). The Newey-West estimator uses a Bartlett kernel, which is defined as:
\[
\omega(j,l) = 1 - \frac{|j|}{l+1}.
\]
To estimate the model with Newey-West standard errors, we pass two additional arguments to the fit() method:

- cov_type="HAC": the heteroskedasticity and autocorrelation robust covariance estimator.
- cov_kwds={"kernel": "bartlett", "maxlags": 5}: a dictionary with the kernel and the number of lags to use (maxlags). Use the uniform kernel for the truncated uniform kernel of Hansen and Hodrick (1980).
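Putting the two options together (the same call appears in the cheat sheet below; the synthetic df is a hypothetical stand-in for the joined data set so the snippet runs on its own):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the joined data set (the real df comes from the join above)
rng = np.random.default_rng(2)
df = pd.DataFrame(
    rng.standard_normal((200, 6)),
    columns=["portLS", "MktRF", "SMB", "HML", "RMW", "CMA"],
)
mod = smf.ols("portLS ~ MktRF + SMB + HML + RMW + CMA", data=df)

# Newey-West (Bartlett kernel) standard errors with 5 lags
res_nw = mod.fit(
    cov_type="HAC",
    cov_kwds={"kernel": "bartlett", "maxlags": 5},
    use_t=True,
)
res_nw.summary()
```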
The notes below the summary table confirm the covariance estimator used: [1] Standard Errors are heteroscedasticity and autocorrelation robust (HAC) using 5 lags and without small sample correction.
Cheat sheet
The following code snippet summarizes the standard errors estimators in statsmodels and the corresponding fit() calls:
```python
# Import the formula API
import statsmodels.formula.api as smf

# Define OLS model
mod = smf.ols("portLS ~ MktRF + SMB + HML + RMW + CMA", data=df)

# Estimate OLS model with OLS standard errors
res = mod.fit(use_t=True)

# Estimate OLS model with White standard errors
res = mod.fit(cov_type="HC0", use_t=True)

# Estimate OLS model with Newey-West standard errors
res = mod.fit(
    cov_type="HAC", cov_kwds={"kernel": "bartlett", "maxlags": 5}, use_t=True
)

# Estimate OLS model with Hansen-Hodrick standard errors
res = mod.fit(
    cov_type="HAC", cov_kwds={"kernel": "uniform", "maxlags": 5}, use_t=True
)
```
Learn more
To learn more about the statsmodels library, I recommend reading the documentation, especially the section on linear regression.
Using widely-used software tools like statsmodels is a good way to make your research more reproducible and transparent. However, like any tool, it is important to understand the underlying econometric theory. If you are interested in learning more about the standard errors discussed in this post, I recommend reading White (1980), Newey and West (1987), MacKinnon and White (1985), and Hansen and Hodrick (1980).
Hansen, Lars Peter, and Robert J Hodrick. 1980. “Forward Exchange Rates as Optimal Predictors of Future Spot Rates: An Econometric Analysis.” Journal of Political Economy 88 (5): 829–53. https://www.journals.uchicago.edu/doi/abs/10.1086/260910.
MacKinnon, James G, and Halbert White. 1985. “Some Heteroskedasticity-Consistent Covariance Matrix Estimators with Improved Finite Sample Properties.” Journal of Econometrics 29 (3): 305–25. https://doi.org/10.1016/0304-4076(85)90158-7.
Newey, Whitney K, and Kenneth D West. 1987. “A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix.” Econometrica 55 (3): 703–8. https://www.jstor.org/stable/1913610.
White, Halbert. 1980. “A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity.” Econometrica, 817–38. https://doi.org/10.2307/1912934.