Estimating standard errors in time series data with Python and statsmodels

Python · Econometrics · Statsmodels

Author: Vincent Grégoire, PhD, CFA

Published: January 31, 2024

In this post, I show how to estimate standard errors in OLS regressions of time series data with Python and the statsmodels library.

More specifically, I show how to estimate OLS models with:

  • standard (nonrobust) OLS standard errors,
  • White (1980) heteroskedasticity-robust standard errors, and
  • Newey-West (1987) heteroskedasticity and autocorrelation robust standard errors, along with the Hansen and Hodrick (1980) variant.

If you just want the code examples with no explanations, jump to the cheat sheet at the end of the post.

For panel data models, see my previous post on estimating standard errors in panel data models with Python and linearmodels.

Video tutorial

Part of this post is also available as a video tutorial on YouTube.

Time series data

Time series data is ubiquitous in finance research, but it can present challenges in OLS regressions because it can violate the assumptions of OLS. The main challenge is that the error term can be correlated across time periods and can be heteroskedastic, meaning that its variance is not constant. This can lead to biased standard errors and invalid hypothesis tests. Time series data models have the following form:

\[ y_{t} = x_{t}\beta + \epsilon_{t}, \]

where \(y_{t}\) is the dependent variable at time \(t\in [1,T]\), \(x_{t}\) is a vector of independent variables, \(\beta\) is a vector of coefficients, and \(\epsilon_{t}\) is the error term. If the model includes an intercept, then the first element of \(x_{t}\) is 1 and the first element of \(\beta\) is \(\alpha\).

When we turn to the data, we write the model in matrix form:

\[ \mathbf{y} = \mathbf{X}\beta + \mathbf{\epsilon}, \]

where \(\mathbf{y}\) is a \(T \times 1\) vector of dependent variables, \(\mathbf{X}\) is a \(T \times K\) matrix of independent variables, \(\beta\) is a \(K \times 1\) vector of coefficients, and \(\mathbf{\epsilon}\) is a \(T \times 1\) vector of error terms. The number of observations is \(T\).

The observations are then stacked in their temporal order. For example, the vector \(\mathbf{y}\) of elements \(y_{t}\) is defined as:

\[ \mathbf{y} = \left[\begin{array}{c} y_{1}\\ y_{2}\\ \vdots \\ y_{T-1} \\ y_{T} \end{array} \right]. \]

The matrix \(\mathbf{X}\) formed with vectors \(x_{t}\) is defined as:

\[ \mathbf{X} = \left[\begin{array}{ccccc} x_{1}^{(1)} & x_{1}^{(2)} & \cdots & x_{1}^{(K-1)} & x_{1}^{(K)} \\ x_{2}^{(1)} & x_{2}^{(2)} & \cdots & x_{2}^{(K-1)} & x_{2}^{(K)} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ x_{T-1}^{(1)} & x_{T-1}^{(2)} & \cdots & x_{T-1}^{(K-1)} & x_{T-1}^{(K)}\\ x_{T}^{(1)} & x_{T}^{(2)} & \cdots & x_{T}^{(K-1)} & x_{T}^{(K)} \end{array} \right], \]

where \(x_{t}^{(j)}\) is the \(j\)-th element of the vector \(x_{t}\).

Finally, the \(\beta\) vector is defined as usual:

\[ \beta = \left[\begin{array}{c}\beta_1 \\ \beta_2 \\ \vdots \\ \beta_{K-1} \\ \beta_{K} \end{array} \right]. \]

If the model includes an intercept, then the first column of \(\mathbf{X}\) is a column of ones (i.e. all the \(x_{t}^{(1)}\equiv 1\)) and the first element of \(\beta\) is \(\alpha\):

\[ \beta = \left[\begin{array}{c}\alpha \\ \beta_1 \\ \vdots \\ \beta_{K-2} \\ \beta_{K-1} \end{array} \right]. \]

The goal is then to estimate the coefficients \(\beta\) and their associated standard errors. The standard errors are important because they allow us to perform statistical inference. In the context of financial anomalies, or more generally trading strategies, we can test whether the \(\alpha\) is statistically different from zero.

Test data

In this post, I explain how to estimate standard errors in time series regressions with Python and the statsmodels library. I do so using the anomalies data set available on Open Source Asset Pricing, which contains portfolio returns for the anomalies discussed in Chen and Zimmermann (2022). More specifically, I use daily returns of the 12-month momentum anomaly in the PredictorVW.zip file available at this link.

import pandas as pd

df = pd.read_csv("Mom12m_ret.csv", parse_dates=["date"], index_col=["date"])
df
port01 port02 port03 port04 port05 port06 port07 port08 port09 port10 portLS
date
1927-01-03 -0.312789 -1.315867 -0.361545 -0.557918 -1.252321 -0.310728 -0.767765 -0.796250 -0.649846 -1.267020 -0.954230
1927-01-04 1.845338 0.659064 0.828047 0.102919 0.213863 -0.011713 0.127618 0.102198 0.659057 0.874677 -0.970662
1927-01-05 1.012461 0.724596 0.581067 -0.152049 0.079294 0.288886 0.283217 -0.080991 0.110790 0.048784 -0.963677
1927-01-06 -0.902011 0.289840 -0.462931 -1.100983 -0.223404 -0.010662 -0.121442 -0.095102 -0.142538 -0.014550 0.887461
1927-01-07 -0.332090 0.286702 0.062731 0.221905 0.152737 0.093068 0.456891 0.672505 0.225237 0.209930 0.542021
... ... ... ... ... ... ... ... ... ... ... ...
2022-12-23 -1.492861 -0.940321 -0.087377 0.254381 0.580743 0.502287 0.482743 0.588520 0.295648 1.433918 2.926779
2022-12-27 -3.014401 -3.667398 -1.638701 -0.776100 -0.984944 -0.153435 -0.000316 0.146955 -0.746434 0.281178 3.295579
2022-12-28 -0.475730 -1.287840 -0.931057 -1.367397 -1.121403 -1.162237 -1.097983 -0.922127 -1.433636 -1.478748 -1.003018
2022-12-29 6.153138 4.639327 3.960907 3.063374 2.489143 2.013535 1.798595 1.207053 1.708395 0.742981 -5.410157
2022-12-30 0.607938 0.718418 0.117090 -0.167288 -0.194521 -0.461471 -0.395566 -0.365650 -0.181250 -0.002331 -0.610269

25249 rows × 11 columns

I complement the momentum strategy with the daily Fama-French 5 factors available from Ken French’s data library.

ff5_df = pd.read_csv(
    "https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/F-F_Research_Data_5_Factors_2x3_daily_CSV.zip",
    skiprows=2,
    parse_dates=[0],
    index_col=[0],
)
# Clean column names for formula api
ff5_df.columns = [col.replace("-", "") for col in ff5_df.columns]
ff5_df.index.name = "date"
ff5_df
MktRF SMB HML RMW CMA RF
date
1963-07-01 -0.67 0.02 -0.35 0.03 0.13 0.012
1963-07-02 0.79 -0.28 0.28 -0.08 -0.21 0.012
1963-07-03 0.63 -0.18 -0.10 0.13 -0.25 0.012
1963-07-05 0.40 0.09 -0.28 0.07 -0.30 0.012
1963-07-08 -0.63 0.07 -0.20 -0.27 0.06 0.012
... ... ... ... ... ... ...
2023-12-22 0.21 0.61 0.09 -0.64 0.19 0.021
2023-12-26 0.48 0.81 0.46 -0.34 -0.15 0.021
2023-12-27 0.16 0.16 0.12 -0.31 -0.14 0.021
2023-12-28 -0.01 -0.38 0.03 -0.32 0.15 0.021
2023-12-29 -0.43 -1.13 -0.37 0.67 -0.07 0.021

15229 rows × 6 columns

Finally, I join the two data sets to create a single data set with the following columns:

df = df[["portLS"]].join(ff5_df[["MktRF", "SMB", "HML", "RMW", "CMA"]], how="inner")
df
portLS MktRF SMB HML RMW CMA
date
1963-07-01 0.127573 -0.67 0.02 -0.35 0.03 0.13
1963-07-02 0.680630 0.79 -0.28 0.28 -0.08 -0.21
1963-07-03 0.184325 0.63 -0.18 -0.10 0.13 -0.25
1963-07-05 0.304217 0.40 0.09 -0.28 0.07 -0.30
1963-07-08 -0.216363 -0.63 0.07 -0.20 -0.27 0.06
... ... ... ... ... ... ...
2022-12-23 2.926779 0.51 -0.34 1.16 0.86 0.46
2022-12-27 3.295579 -0.51 -0.44 1.42 1.16 1.22
2022-12-28 -1.003018 -1.23 -0.30 -0.29 -0.96 -0.04
2022-12-29 -5.410157 1.87 1.03 -1.07 -1.02 -0.82
2022-12-30 -0.610269 -0.22 0.14 -0.03 -0.54 0.00

14979 rows × 6 columns

statsmodels

The statsmodels library is a Python package that provides many functions for estimating statistical models and conducting statistical tests.

Installation

You can install the library with your package manager of choice:

poetry:

poetry add statsmodels

pip:

pip install statsmodels

conda:

conda install statsmodels

Usage

The library is easy to use for anyone familiar with the models involved, but you will need to refer to the documentation to understand all the available options. In this post, I use the linear regression model estimated by ordinary least squares (OLS). Statsmodels provides two APIs: their usual¹ API and the formula API, which lets you specify the model with a formula string similar to R. I use the formula API in this post. The two APIs are imported as follows:

import statsmodels.api as sm
import statsmodels.formula.api as smf

You define a model by passing a string with the formula and a pandas DataFrame with the data to the from_formula method of the model class (smf.ols is a shortcut for OLS.from_formula). For example, to create our OLS model that regresses the momentum strategy on the Fama-French 5 factors, we use the following code:

mod = smf.ols("portLS ~ MktRF + SMB + HML + RMW + CMA", data=df)
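For reference, the same model can also be defined with the array-based API; this is a minimal sketch (the variable names are mine) that assumes the df assembled above and produces identical estimates:

# Equivalent model with the array-based API: build the design matrix explicitly
X_df = sm.add_constant(df[["MktRF", "SMB", "HML", "RMW", "CMA"]])
mod_arrays = sm.OLS(df["portLS"], X_df)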

OLS coefficients and standard errors

The first model I estimate is a vanilla OLS model. In our example with the momentum strategy, the \(\beta\) vector is \([ \alpha\ \beta_{MktRF}\ \beta_{SMB}\ \beta_{HML}\ \beta_{RMW}\ \beta_{CMA}]'\).

The OLS estimator is the most common in finance research. It is also the most efficient estimator when the errors are homoskedastic and uncorrelated across time. The OLS estimator is defined as:

\[ \hat{\beta}_{OLS} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}, \]

where \(\mathbf{X}\) is the matrix of exogenous variables and \(\mathbf{y}\) is the vector of dependent variables. The covariance matrix \(\Sigma\) of the estimator is defined as:

\[ \Sigma_{OLS} =\sigma_{\mathbf{\epsilon}}^{2}\cdot\left(\mathbf{X}^{\prime}\mathbf{X}\right)^{-1}, \]

where \(\sigma_{\mathbf{\epsilon}}^{2}\) is usually estimated with the residuals \(\mathbf{\hat{e}}=( \mathbf{y}-\mathbf{X}\hat{\beta}_{OLS})\) from the OLS regression:

\[ \hat{\sigma}_{\mathbf{\epsilon}}^{2} =\frac{1}{T-K}\mathbf{\hat{e}}'\mathbf{\hat{e}}. \]

The estimate of the covariance matrix is then:

\[ \hat{\Sigma}_{OLS} =\frac{1}{T-K}\mathbf{\hat{e}}'\mathbf{\hat{e}}\cdot\left(\mathbf{X}^{\prime}\mathbf{X}\right)^{-1}, \]

\[ \hat{\Sigma}_{OLS} =\left[\begin{array}{ccccc} \hat{\sigma}_{\beta_1}^2 & \hat{\sigma}_{\beta_1,\beta_2} & \cdots & \hat{\sigma}_{\beta_1,\beta_{K-1}} & \hat{\sigma}_{\beta_1,\beta_K} \\ \hat{\sigma}_{\beta_2,\beta_1} & \hat{\sigma}_{\beta_2}^2 & \cdots & \hat{\sigma}_{\beta_2,\beta_{K-1}} & \hat{\sigma}_{\beta_2,\beta_K} \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ \hat{\sigma}_{\beta_{K-1},\beta_1} & \hat{\sigma}_{\beta_{K-1},\beta_2} & \cdots & \hat{\sigma}_{\beta_{K-1}}^2 & \hat{\sigma}_{\beta_{K-1},\beta_K} \\ \hat{\sigma}_{\beta_K,\beta_1} & \hat{\sigma}_{\beta_K,\beta_2} & \cdots & \hat{\sigma}_{\beta_K,\beta_{K-1}} & \hat{\sigma}_{\beta_K}^2 \end{array} \right]. \]
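To make these formulas concrete, here is a minimal numpy sketch, assuming the df assembled above, that reproduces the OLS coefficients and standard errors by hand (all variable names are mine):

import numpy as np

# Design matrix with an intercept column of ones
X = np.column_stack([np.ones(len(df)), df[["MktRF", "SMB", "HML", "RMW", "CMA"]].to_numpy()])
y = df["portLS"].to_numpy()
T, K = X.shape

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y          # (X'X)^{-1} X'y
resid = y - X @ beta_hat              # residuals e-hat
sigma2_hat = resid @ resid / (T - K)  # e'e / (T - K)
Sigma_ols = sigma2_hat * XtX_inv      # covariance matrix of the estimator
se_ols = np.sqrt(np.diag(Sigma_ols))  # standard errors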

To estimate the OLS model with statsmodels, we call the fit() method. The fit() method returns a regression results object, which contains the estimated coefficients, standard errors, and other statistics, and has a summary() method that displays a summary table. Depending on the parameters passed to the fit() method, we can get z-statistics or t-statistics. I always use the use_t=True option to make sure I get t-statistics in the summary table.

mod.fit(use_t=True).summary()
OLS Regression Results
Dep. Variable: portLS R-squared: 0.128
Model: OLS Adj. R-squared: 0.128
Method: Least Squares F-statistic: 439.1
Date: Thu, 01 Feb 2024 Prob (F-statistic): 0.00
Time: 05:38:44 Log-Likelihood: -26910.
No. Observations: 14979 AIC: 5.383e+04
Df Residuals: 14973 BIC: 5.388e+04
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 0.0928 0.012 7.767 0.000 0.069 0.116
MktRF -0.0635 0.013 -4.980 0.000 -0.089 -0.039
SMB -0.2919 0.023 -12.510 0.000 -0.338 -0.246
HML -0.9909 0.025 -39.510 0.000 -1.040 -0.942
RMW 0.3269 0.032 10.179 0.000 0.264 0.390
CMA 0.6359 0.040 15.712 0.000 0.557 0.715
Omnibus: 2945.727 Durbin-Watson: 1.724
Prob(Omnibus): 0.000 Jarque-Bera (JB): 43704.781
Skew: -0.523 Prob(JB): 0.00
Kurtosis: 11.302 Cond. No. 3.84


Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
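The summary table is convenient for inspection, but the results object also exposes everything programmatically. For example, to pull the alpha estimate and its inference statistics:

res = mod.fit(use_t=True)
res.params["Intercept"]   # alpha estimate
res.bse["Intercept"]      # its standard error
res.tvalues["Intercept"]  # t-statistic
res.pvalues["Intercept"]  # p-value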

White standard errors

With time series data, the errors are often heteroskedastic and correlated across time. In this case, the OLS standard errors are incorrect.

A common solution to the problem of heteroskedasticity, i.e. that the variance of the error term is not constant, is to use heteroskedasticity-robust standard errors. The most common heteroskedasticity-robust standard errors estimator is the White (1980) estimator, which is defined as

\[ \hat{\Sigma}_{White} = (\mathbf{X}' \mathbf{X})^{-1}\cdot\left[\mathbf{X}' \cdot \hat{\Omega}_0 \cdot \mathbf{X}\right] \cdot\left(\mathbf{X}' \mathbf{X}\right)^{-1}, \]

where \(\hat{\Omega}_0\) is a diagonal matrix with the squared residuals \(\mathbf{\hat{e}}^{2}\) on the diagonal, i.e., with the square of the \(t\)-th residual \(\hat{e}_{t}^{2}\) at position \((t,t)\) and \(0\) elsewhere. The matrix \(\hat{\Omega}_0\) is defined as:

\[ \hat{\Omega}_0= \left[\begin{array}{cccccccc} \hat{e}_{1}^{2} & 0 & \cdots & 0 & 0 \\ 0 & \hat{e}_{2}^{2} & \cdots & 0 & 0\\ \vdots & \vdots & \ddots & \vdots & \vdots\\ 0 & 0 & \cdots & \hat{e}_{T-1}^{2} & 0 \\ 0 & 0 & \cdots & 0 & \hat{e}_{T}^{2} \end{array} \right]. \]

This is referred to as the HC0 estimator in statsmodels. MacKinnon and White (1985) introduced three new estimators, HC1, HC2, and HC3, which are also available in statsmodels. The HC1 estimator is the same as the HC0 estimator scaled by a factor of \(T/(T-K)\) as a degrees-of-freedom correction, where \(T\) is the number of observations and \(K\) is the number of independent variables.
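Continuing the numpy sketch from the OLS section (reusing X, resid, XtX_inv, T, and K), the White covariance can be computed by hand without materializing the full \(T \times T\) matrix \(\hat{\Omega}_0\):

meat = X.T @ (resid[:, None] ** 2 * X)  # X' diag(e^2) X, via row scaling
Sigma_hc0 = XtX_inv @ meat @ XtX_inv    # sandwich estimator
se_hc0 = np.sqrt(np.diag(Sigma_hc0))

# HC1 applies the T/(T-K) degrees-of-freedom scaling to HC0
se_hc1 = np.sqrt(T / (T - K)) * se_hc0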

The ols model we created earlier can also be used to estimate the model with White standard errors. The following code estimates the OLS coefficients with White standard errors:

mod.fit(cov_type="HC0", use_t=True).summary()
OLS Regression Results
Dep. Variable: portLS R-squared: 0.128
Model: OLS Adj. R-squared: 0.128
Method: Least Squares F-statistic: 94.22
Date: Thu, 01 Feb 2024 Prob (F-statistic): 4.93e-98
Time: 05:38:28 Log-Likelihood: -26910.
No. Observations: 14979 AIC: 5.383e+04
Df Residuals: 14973 BIC: 5.388e+04
Df Model: 5
Covariance Type: HC0
coef std err t P>|t| [0.025 0.975]
Intercept 0.0928 0.012 7.675 0.000 0.069 0.117
MktRF -0.0635 0.020 -3.168 0.002 -0.103 -0.024
SMB -0.2919 0.038 -7.741 0.000 -0.366 -0.218
HML -0.9909 0.056 -17.570 0.000 -1.101 -0.880
RMW 0.3269 0.069 4.724 0.000 0.191 0.463
CMA 0.6359 0.081 7.820 0.000 0.477 0.795
Omnibus: 2945.727 Durbin-Watson: 1.724
Prob(Omnibus): 0.000 Jarque-Bera (JB): 43704.781
Skew: -0.523 Prob(JB): 0.00
Kurtosis: 11.302 Cond. No. 3.84


Notes:
[1] Standard Errors are heteroscedasticity robust (HC0)

The estimated coefficients are the same as with the OLS standard errors, but the standard errors are different. In turn, all statistical measures that depend on the standard errors, such as t-statistics and p-values, are different.
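A quick way to see the effect is to line up the standard errors side by side; this sketch assumes the mod defined above:

se_compare = pd.DataFrame({
    "OLS": mod.fit(use_t=True).bse,
    "HC0": mod.fit(cov_type="HC0", use_t=True).bse,
})
print(se_compare)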

Newey-West standard errors

The White estimator takes into account only heteroskedasticity, not autocorrelation. For that, we can use a heteroskedasticity and autocorrelation robust (HAR)² standard errors estimator.

In empirical finance research, it is common in practice to use Newey-West standard errors (introduced in Newey and West 1987) to account for both heteroskedasticity and autocorrelation. The Newey-West estimator is defined as:

\[ \hat{\Sigma}_{NW} = (\mathbf{X}' \mathbf{X})^{-1}\cdot \hat{S}_{HAR} \cdot\left(\mathbf{X}' \mathbf{X}\right)^{-1}, \]

where \(\hat{S}_{HAR}\) is a matrix that takes into account both heteroskedasticity and autocorrelation:

\[ \hat{S}_{HAR} = \left[\mathbf{X}' \cdot \hat{\Omega}_0 \cdot \mathbf{X}\right] + \sum_{j=1}^{l}\omega(j,l) \cdot\sum_{t=j+1}^{T}\left( x_{t}\hat{e}_{t}\hat{e}_{t-j}x_{t-j}^{\prime} + x_{t-j}\hat{e}_{t-j}\hat{e}_{t}x_{t}^{\prime}\right), \]

where \(\hat{e}_{t}\) is the \(t\)-th element of the vector \(\mathbf{\hat{e}}\). The matrix \(\hat{S}_{HAR}\) is a weighted sum of the autocovariances of the residuals \(\mathbf{\hat{e}}\), where the kernel \(\omega(j,l)\) provides the weight for lag \(j\) with a bandwidth (maximum lag) of \(l\). The Newey-West estimator uses a Bartlett kernel, which is defined as:

\[ \omega(j,l) = \left\{ \begin{array}{ll} 1 - \frac{|j|}{l+1} & \text{if } |j| \leq l \\ 0 & \text{if } |j| > l \end{array} \right. \]

The truncated uniform kernel used by Hansen and Hodrick (1980) is also available in statsmodels.
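For intuition, here is a hand-rolled sketch of the Bartlett-weighted sum, again reusing X, resid, and XtX_inv from the earlier numpy sketches (statsmodels implements the same logic internally, up to optional small-sample corrections):

maxlags = 5
scores = X * resid[:, None]               # row t is x_t' * e_t
S_har = scores.T @ scores                 # lag-0 term: X' diag(e^2) X
for j in range(1, maxlags + 1):
    w = 1 - j / (maxlags + 1)             # Bartlett kernel weight
    Gamma_j = scores[j:].T @ scores[:-j]  # sum_t x_t e_t e_{t-j} x_{t-j}'
    S_har += w * (Gamma_j + Gamma_j.T)
Sigma_nw = XtX_inv @ S_har @ XtX_inv
se_nw = np.sqrt(np.diag(Sigma_nw))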

To use Newey-West standard errors, you pass the following arguments to the fit() method:

  • cov_type="HAC": To use HAC standard errors. See the documentation on Kernel (HAC) for more details.
  • cov_kwds={"kernel": "bartlett", "maxlags": 5}: A dictionary with the kernel and the number of lags to use (maxlags). Use the uniform kernel for the truncated uniform kernel of Hansen and Hodrick (1980).

mod.fit(
    cov_type="HAC", cov_kwds={"kernel": "bartlett", "maxlags": 5}, use_t=True
).summary()
OLS Regression Results
Dep. Variable: portLS R-squared: 0.128
Model: OLS Adj. R-squared: 0.128
Method: Least Squares F-statistic: 58.10
Date: Thu, 01 Feb 2024 Prob (F-statistic): 4.28e-60
Time: 05:38:28 Log-Likelihood: -26910.
No. Observations: 14979 AIC: 5.383e+04
Df Residuals: 14973 BIC: 5.388e+04
Df Model: 5
Covariance Type: HAC
coef std err t P>|t| [0.025 0.975]
Intercept 0.0928 0.014 6.837 0.000 0.066 0.119
MktRF -0.0635 0.025 -2.532 0.011 -0.113 -0.014
SMB -0.2919 0.047 -6.275 0.000 -0.383 -0.201
HML -0.9909 0.073 -13.596 0.000 -1.134 -0.848
RMW 0.3269 0.100 3.261 0.001 0.130 0.523
CMA 0.6359 0.103 6.191 0.000 0.435 0.837
Omnibus: 2945.727 Durbin-Watson: 1.724
Prob(Omnibus): 0.000 Jarque-Bera (JB): 43704.781
Skew: -0.523 Prob(JB): 0.00
Kurtosis: 11.302 Cond. No. 3.84


Notes:
[1] Standard Errors are heteroscedasticity and autocorrelation robust (HAC) using 5 lags and without small sample correction

Cheat sheet

The following code snippets summarize the standard error estimators covered in this post and the corresponding fit() arguments:

# Import the formula API
import statsmodels.formula.api as smf

# Define OLS model
mod = smf.ols("portLS ~ MktRF + SMB + HML + RMW + CMA", data=df)

# Estimate OLS model with OLS standard errors
res = mod.fit(use_t=True)

# Estimate OLS model with White standard errors
res = mod.fit(cov_type="HC0", use_t=True)

# Estimate OLS model with Newey-West standard errors
res = mod.fit(
    cov_type="HAC", cov_kwds={"kernel": "bartlett", "maxlags": 5}, use_t=True
)

# Estimate OLS model with Hansen-Hodrick standard errors
res = mod.fit(
    cov_type="HAC", cov_kwds={"kernel": "uniform", "maxlags": 5}, use_t=True
)
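Whichever estimator you choose, the alpha test from the introduction is then a one-liner on the last res:

print(res.tvalues["Intercept"], res.pvalues["Intercept"])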

Learn more

To learn more about the statsmodels library, I recommend reading the documentation, especially the section on linear regression.

Using widely-used software tools like statsmodels is a good way to make your research more reproducible and transparent. However, like any tool, it is important to understand the underlying econometric theory. If you are interested in learning more about the standard errors discussed in this post, I recommend reading White (1980), Newey and West (1987), MacKinnon and White (1985), and Hansen and Hodrick (1980).

References

Chen, Andrew Y., and Tom Zimmermann. 2022. “Open Source Cross-Sectional Asset Pricing.” Critical Finance Review 11 (2): 207–64. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3604626.
Hansen, Lars Peter, and Robert J Hodrick. 1980. “Forward Exchange Rates as Optimal Predictors of Future Spot Rates: An Econometric Analysis.” Journal of Political Economy 88 (5): 829–53. https://www.journals.uchicago.edu/doi/abs/10.1086/260910.
MacKinnon, James G, and Halbert White. 1985. “Some Heteroskedasticity-Consistent Covariance Matrix Estimators with Improved Finite Sample Properties.” Journal of Econometrics 29 (3): 305–25. https://doi.org/10.1016/0304-4076(85)90158-7.
Newey, Whitney K, and Kenneth D West. 1987. “A Simple, Positive Semi-Definite, Heteroskedasticity and Autocorrelation Consistent Covariance Matrix.” Econometrica 55 (3): 703–8. https://www.jstor.org/stable/1913610.
White, Halbert. 1980. “A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity.” Econometrica, 817–38. https://doi.org/10.2307/1912934.

Footnotes

  1. This is how it is called in the documentation.

  2. Also called heteroskedasticity and autocorrelation consistent (HAC) standard errors.
