where \(y_{it}\) is the dependent variable for entity \(i\in [1,I]\) at time \(t\in [1,T]\), \(\alpha_i\) is an entity fixed effect, \(\alpha_t\) is a time fixed effect, \(x_{it}\) is a vector of independent variables, \(\beta\) is a vector of coefficients, and \(\epsilon_{it}\) is the error term. For the sake of simplicity, in the rest of this post, I assume that the entity is a firm and the time is a year. The error term can be correlated across firms and years and can be heteroskedastic, meaning that the variance of the error term is not constant.
When we get to the data, we write the model in matrix form:
where \(\mathbf{y}\) is a \(N \times 1\) vector of dependent variables, \(\mathbf{X}\) is a \(N \times K\) matrix of independent variables, \(\beta\) is a \(K \times 1\) vector of coefficients, and \(\mathbf{\epsilon}\) is a \(N \times 1\) vector of error terms. The number of observations is \(N\) and the number of independent variables is \(K\). For a balanced panel in which each firm is observed for the same number of years, \(N=I \times T\).
More on matrix form representation
If the model includes fixed effects, we assume that the vector \(\mathbf{y}\) and the matrix \(\mathbf{X}\) are demeaned by the relevant fixed effect.
It is then common to stack the observations for each entity. For example, if we have \(I\) firms and \(T\) years, the vector \(\mathbf{y}\) of elements \(y_{i,t}\) is defined as:
The goal is then to estimate the coefficients \(\beta\) and their associated standard errors. The standard errors are important because they allow us to perform statistical inference. For example, we can test whether the coefficients are statistically different from zero.
In his seminal paper, Petersen (2009) compares the performance of different standard errors estimators in panel data. In addition, he provides programming advice on how to estimate standard errors in panel data with Stata and other languages, but not for Python.
In this post, I explain how to estimate standard errors in panel data with Python and the linearmodels library. I do so by replicating the sample results from Petersen’s test data page in Python, and cover a few more standard errors estimators that are useful in finance research.
I use the test data set provided by Mitchell Petersen and available here. The data set contains 500 firms and 10 years.
The linearmodels library is a Python package written by Kevin Sheppard at the University of Oxford that extends the statsmodels library with functions commonly used in financial econometrics. Of note, the documentation is quite good and the library is actively maintained.
Installation
You can install the library with your package manager of choice:
poetry:
poetry add linearmodels
pip:
pip install linearmodels
conda (linearmodels is not available in the default conda channel, but it is available in the conda-forge channel):
conda install -c conda-forge linearmodels
Usage
The library is easy to use, and intuitive for anyone familiar with statsmodels. In this post, I will use two model types: PanelOLS and FamaMacBeth. As their names imply, the PanelOLS class is used to estimate panel data models, and the FamaMacBeth class is used to estimate Fama-MacBeth regressions. The library also provides implementations for random effects and first differences models, and for between estimation and pooled OLS estimators. In addition to panel data models, the library also provides implementations for instrumental variables models (including 2SLS, 3SLS, GMM, and SUR) and for estimating linear factor models.
A key particularity of linearmodels is that it uses the indexing capabilities of pandas to identify the panel data structure. In particular, the DataFrame that contains your panel data must have a MultiIndex with the entity dimension in the first level and the time dimension in the second level:
df = df.set_index(["firmid", "year"])df
x
y
firmid
year
1
1
-1.113973
2.251535
2
-0.080854
1.242346
3
-0.237607
-1.426376
4
-0.152486
-1.109394
5
-0.001426
0.914686
...
...
...
...
500
6
-0.077057
3.720502
7
0.218847
0.559121
8
-0.155530
-3.766785
9
-0.040172
0.903354
10
-0.001172
-0.529761
5000 rows × 2 columns
You define models by passing the dependent variable and the exogenous variables to the model class. Alternatively, following statsmodels, linearmodels leverages the patsy formula language to provide a more intuitive way to define models. For example, the following two lines of code are equivalent:
You then estimate models by calling the fit() method. The fit() method returns a linearmodels regression results object, which contains the estimated coefficients, standard errors, and other statistics. The linearmodels results object is similar to the statsmodels results object, but it contains additional methods and attributes that are useful for panel data models.
OLS coefficients and standard errors
The first model I estimate is a vanilla OLS model.
The OLS estimator is the most common estimator in finance research. It is also the most efficient estimator when the errors are homoskedastic and uncorrelated across entities and time.
where \(\mathbf{X}\) is the matrix of exogenous variables and \(\mathbf{y}\) is the vector of dependent variables. The covariance matrix \(\Sigma\) of the estimator is defined as:
where \(\sigma_{\mathbf{\epsilon}}^{2}\) is usually estimated with the residuals \(\mathbf{\hat{e}}=( \mathbf{y}-\mathbf{X}\hat{\beta}_{OLS})\) from the OLS regression:
Note that because the intercept is usually excluded by default from panel data models because the use of fixed-effects renders it redundant, we need to add it manually to the model by adding 1 to the formula.
White standard errors
The OLS estimator standard errors are consistent when the errors are homoskedastic and uncorrelated across entities and time. In this case, the OLS estimator is unbiased and consistent. However, this is not usually the case with financial data, where the errors are often heteroskedastic and correlated across entities and/or time. In this case, the OLS estimator is biased and inconsistent, and the standard errors are incorrect.
A common solution to this problem is to use heteroskedasticity-robust standard errors. The most common heteroskedasticity-robust standard errors estimator is the White estimator, which is defined as
where \(\hat{\Omega}_0\) is a diagonal matrix with the squared residuals \(\mathbf{\hat{e}}^{2}\) on the diagonal, i.e., with the square of the j-th residual \(\hat{e}_{j}^{2}\) at position \((j,j)\) and \(0\) elsewhere. The matrix \(\hat{\Omega}_0\) is defined as:
The PanelOLS class can be used to estimate OLS models with White standard errors. The following code estimates the OLS coefficients and White standard errors:
res = mod.fit(cov_type="robust")res.summary
PanelOLS Estimation Summary
Dep. Variable:
y
R-squared:
0.2078
Estimator:
PanelOLS
R-squared (Between):
0.2208
No. Observations:
5000
R-squared (Within):
0.1907
Date:
Fri, Jan 19 2024
R-squared (Overall):
0.2078
Time:
08:23:25
Log-likelihood
-1.057e+04
Cov. Estimator:
Robust
F-statistic:
1310.7
Entities:
500
P-value
0.0000
Avg Obs:
10.0000
Distribution:
F(1,4998)
Min Obs:
10.0000
Max Obs:
10.0000
F-statistic (robust):
1328.2
P-value
0.0000
Time periods:
10
Distribution:
F(1,4998)
Avg Obs:
500.00
Min Obs:
500.00
Max Obs:
500.00
Parameter Estimates
Parameter
Std. Err.
T-stat
P-value
Lower CI
Upper CI
Intercept
0.0297
0.0284
1.0465
0.2954
-0.0259
0.0853
x
1.0348
0.0284
36.444
0.0000
0.9792
1.0905
The estimated coefficients are the same as with the OLS standard errors, but the standard errors are different. In turn, all statistical measures that depend on the standard errors, such as t-statistics and p-values, are different.
Clustered standard errors
Beyond homoskedasticity, a common concern with panel models is that the errors are correlated across entities and/or time. In this case, the OLS estimator is biased and inconsistent, and the standard errors are incorrect. Clustering standard errors is a widely used adjustment for this problem.
The intuition behind clustering is that often the errors are correlated within groups, but not across groups. For example, in a panel data set with firms, the errors could correlated within the observations of a firm, but not across firms. In this case, we can cluster the standard errors by firm to obtain consistent standard errors. Implicitly we assume that the errors have the following structure:1
Firm 1
Firm 2
Firm 3
Firm 1
\(\epsilon_{11}^2\)
\(\epsilon_{11}\epsilon_{12}\)
\(\epsilon_{11}\epsilon_{13}\)
0
0
0
0
0
0
\(\epsilon_{12}\epsilon_{11}\)
\(\epsilon_{12}^2\)
\(\epsilon_{12}\epsilon_{13}\)
0
0
0
0
0
0
\(\epsilon_{13}\epsilon_{11}\)
\(\epsilon_{13}\epsilon_{12}\)
\(\epsilon_{13}^2\)
0
0
0
0
0
0
Firm 2
0
0
0
\(\epsilon_{21}^2\)
\(\epsilon_{21}\epsilon_{22}\)
\(\epsilon_{21}\epsilon_{23}\)
0
0
0
0
0
0
\(\epsilon_{22}\epsilon_{21}\)
\(\epsilon_{22}^2\)
\(\epsilon_{22}\epsilon_{23}\)
0
0
0
0
0
0
\(\epsilon_{23}\epsilon_{21}\)
\(\epsilon_{23}\epsilon_{22}\)
\(\epsilon_{23}^2\)
0
0
0
Firm 3
0
0
0
0
0
0
\(\epsilon_{31}^2\)
\(\epsilon_{31}\epsilon_{32}\)
\(\epsilon_{31}\epsilon_{33}\)
0
0
0
0
0
0
\(\epsilon_{32}\epsilon_{31}\)
\(\epsilon_{32}^2\)
\(\epsilon_{32}\epsilon_{33}\)
0
0
0
0
0
0
\(\epsilon_{33}\epsilon_{31}\)
\(\epsilon_{33}\epsilon_{32}\)
\(\epsilon_{33}^2\)
If you assume that the errors are correlated within the observations of a firm, and across firms within each time period, you can use standard errors clustered by firm and year. In this case, the errors have the following structure:
Firm 1
Firm 2
Firm 3
Firm 1
\(\epsilon_{11}^2\)
\(\epsilon_{11}\epsilon_{12}\)
\(\epsilon_{11}\epsilon_{13}\)
\(\epsilon_{11}\epsilon_{21}\)
0
0
\(\epsilon_{11}\epsilon_{31}\)
0
0
\(\epsilon_{12}\epsilon_{11}\)
\(\epsilon_{12}^2\)
\(\epsilon_{12}\epsilon_{13}\)
0
\(\epsilon_{12}\epsilon_{22}\)
0
0
\(\epsilon_{12}\epsilon_{32}\)
0
\(\epsilon_{13}\epsilon_{11}\)
\(\epsilon_{13}\epsilon_{12}\)
\(\epsilon_{13}^2\)
0
0
\(\epsilon_{13}\epsilon_{23}\)
0
0
\(\epsilon_{13}\epsilon_{33}\)
Firm 2
\(\epsilon_{21}\epsilon_{11}\)
0
0
\(\epsilon_{21}^2\)
\(\epsilon_{21}\epsilon_{22}\)
\(\epsilon_{21}\epsilon_{23}\)
\(\epsilon_{21}\epsilon_{31}\)
0
0
0
\(\epsilon_{22}\epsilon_{12}\)
0
\(\epsilon_{22}\epsilon_{21}\)
\(\epsilon_{22}^2\)
\(\epsilon_{22}\epsilon_{23}\)
0
\(\epsilon_{22}\epsilon_{32}\)
0
0
0
\(\epsilon_{23}\epsilon_{13}\)
\(\epsilon_{23}\epsilon_{21}\)
\(\epsilon_{23}\epsilon_{22}\)
\(\epsilon_{23}^2\)
0
0
\(\epsilon_{23}\epsilon_{33}\)
Firm 3
\(\epsilon_{31}\epsilon_{11}\)
0
0
\(\epsilon_{31}\epsilon_{21}\)
0
0
\(\epsilon_{31}^2\)
\(\epsilon_{31}\epsilon_{32}\)
\(\epsilon_{31}\epsilon_{33}\)
0
\(\epsilon_{32}\epsilon_{12}\)
0
0
\(\epsilon_{32}\epsilon_{22}\)
0
\(\epsilon_{32}\epsilon_{31}\)
\(\epsilon_{32}^2\)
\(\epsilon_{32}\epsilon_{33}\)
0
0
\(\epsilon_{33}\epsilon_{13}\)
0
0
\(\epsilon_{33}\epsilon_{23}\)
\(\epsilon_{33}\epsilon_{31}\)
\(\epsilon_{33}\epsilon_{32}\)
\(\epsilon_{33}^2\)
You can easily estimate clustered standard errors with the PanelOLS class by passing cov_type="clustered" and the entity_effects and time_effects arguments to fit() method. The following code estimates the OLS coefficients and standard errors clustered by firm, year, and firm and year:
# Year onlyres = mod.fit( cov_type="clustered", cluster_entity=False, cluster_time=True, group_debias=True)res.summary
PanelOLS Estimation Summary
Dep. Variable:
y
R-squared:
0.2078
Estimator:
PanelOLS
R-squared (Between):
0.2208
No. Observations:
5000
R-squared (Within):
0.1907
Date:
Fri, Jan 19 2024
R-squared (Overall):
0.2078
Time:
08:23:25
Log-likelihood
-1.057e+04
Cov. Estimator:
Clustered
F-statistic:
1310.7
Entities:
500
P-value
0.0000
Avg Obs:
10.0000
Distribution:
F(1,4998)
Min Obs:
10.0000
Max Obs:
10.0000
F-statistic (robust):
960.59
P-value
0.0000
Time periods:
10
Distribution:
F(1,4998)
Avg Obs:
500.00
Min Obs:
500.00
Max Obs:
500.00
Parameter Estimates
Parameter
Std. Err.
T-stat
P-value
Lower CI
Upper CI
Intercept
0.0297
0.0234
1.2691
0.2045
-0.0162
0.0755
x
1.0348
0.0334
30.993
0.0000
0.9694
1.1003
# Firm and yearres = mod.fit( cov_type="clustered", cluster_entity=True, cluster_time=True, group_debias=True)res.summary
PanelOLS Estimation Summary
Dep. Variable:
y
R-squared:
0.2078
Estimator:
PanelOLS
R-squared (Between):
0.2208
No. Observations:
5000
R-squared (Within):
0.1907
Date:
Fri, Jan 19 2024
R-squared (Overall):
0.2078
Time:
08:23:25
Log-likelihood
-1.057e+04
Cov. Estimator:
Clustered
F-statistic:
1310.7
Entities:
500
P-value
0.0000
Avg Obs:
10.0000
Distribution:
F(1,4998)
Min Obs:
10.0000
Max Obs:
10.0000
F-statistic (robust):
373.33
P-value
0.0000
Time periods:
10
Distribution:
F(1,4998)
Avg Obs:
500.00
Min Obs:
500.00
Max Obs:
500.00
Parameter Estimates
Parameter
Std. Err.
T-stat
P-value
Lower CI
Upper CI
Intercept
0.0297
0.0651
0.4562
0.6483
-0.0979
0.1572
x
1.0348
0.0536
19.322
0.0000
0.9298
1.1398
Difference with Stata and statsmodels
By default, Stata and statsmodels estimators adjust the degrees of freedom when estimating clustered standard errors with small clusters. linearmodels does not do that by default, that is what the group_debias parameter is for. In this case it will make a difference because our time dimension is small.
Note that there is a corresponding debiased parameter for the OLS and White estimator, but it is set to True by default so we do not need to set it explicitly.
Panel regression with fixed effects
It is common in finance research to control for entity fixed effects to capture unobserved heterogeneity across entities and for time fixed effects to capture unobserved heterogeneity across time. For example, in a panel of firms, we can control for firm fixed effects to capture unobserved firm-specific characteristics that are constant over time. These fixed effects correspond to the \(\alpha_i\) and \(\alpha_t\) in our panel regression:
The PanelOLS class can be used to estimate panel regression with fixed effects by adding EntityEffects or TimeEffects to the formula. In theory this is equivalent to adding a dummy variable for each entity or time period (i.e. adding C(firmid) or C(year)), but in practice it is much more efficient because linear models will use a efficient transformation to apply the effetc of the dummy variables without including them in the regression estimation. linearmodels will thus not provide coefficients for the fixed effects. Also, the intercept is usually excluded by default from panel data models because the use of fixed-effects renders it redundant, so linearmodels will not provide an intercept either.
The following code estimates the OLS coefficients and standard errors with firm fixed effects, year fixed effects, and firm and year fixed effects, and clustered standard errors along the same dimensions:
F-test for Poolability: 11.372 P-value: 0.0000 Distribution: F(499,4499)
Included effects: Entity
# Year onlymod_yfe = PanelOLS.from_formula("y ~ x + TimeEffects", df)res = mod_yfe.fit( cov_type="clustered", cluster_entity=False, cluster_time=True, group_debias=True)res.summary
PanelOLS Estimation Summary
Dep. Variable:
y
R-squared:
0.2077
Estimator:
PanelOLS
R-squared (Between):
0.2208
No. Observations:
5000
R-squared (Within):
0.1907
Date:
Fri, Jan 19 2024
R-squared (Overall):
0.2078
Time:
08:23:25
Log-likelihood
-1.057e+04
Cov. Estimator:
Clustered
F-statistic:
1307.5
Entities:
500
P-value
0.0000
Avg Obs:
10.0000
Distribution:
F(1,4989)
Min Obs:
10.0000
Max Obs:
10.0000
F-statistic (robust):
961.52
P-value
0.0000
Time periods:
10
Distribution:
F(1,4989)
Avg Obs:
500.00
Min Obs:
500.00
Max Obs:
500.00
Parameter Estimates
Parameter
Std. Err.
T-stat
P-value
Lower CI
Upper CI
x
1.0351
0.0334
31.008
0.0000
0.9696
1.1005
F-test for Poolability: 0.6799 P-value: 0.7279 Distribution: F(9,4989)
Included effects: Time
# Firm and yearmod_fyfe = PanelOLS.from_formula("y ~ x + EntityEffects + TimeEffects", df)res = mod_fyfe.fit( cov_type="clustered", cluster_entity=True, cluster_time=True, group_debias=True)res.summary
PanelOLS Estimation Summary
Dep. Variable:
y
R-squared:
0.1913
Estimator:
PanelOLS
R-squared (Between):
0.2187
No. Observations:
5000
R-squared (Within):
0.1916
Date:
Fri, Jan 19 2024
R-squared (Overall):
0.2070
Time:
08:23:25
Log-likelihood
-8525.9
Cov. Estimator:
Clustered
F-statistic:
1062.0
Entities:
500
P-value
0.0000
Avg Obs:
10.0000
Distribution:
F(1,4490)
Min Obs:
10.0000
Max Obs:
10.0000
F-statistic (robust):
972.96
P-value
0.0000
Time periods:
10
Distribution:
F(1,4490)
Avg Obs:
500.00
Min Obs:
500.00
Max Obs:
500.00
Parameter Estimates
Parameter
Std. Err.
T-stat
P-value
Lower CI
Upper CI
x
0.9700
0.0311
31.192
0.0000
0.9091
1.0310
F-test for Poolability: 11.203 P-value: 0.0000 Distribution: F(508,4490)
Included effects: Entity, Time
Driscoll-Kraay standard errors
The Driscoll-Kraay standard errors estimator is a heteroskedasticity-robust standard errors estimator that is robust to serial correlation in the errors. This is particularly important in datasets where the assumption that observations are independent across cross-sections may not hold. See Driscoll and Kraay (1998) for more details.
To use the Driscoll-Kraay standard errors estimator, you need to pass the following arguments to the fit() method:
cov_type="kernel": To use the Driscoll-Kraay HAC estimator.
kernel="bartlett": The kernel to use. Bartlett is the kernel used by the Newey-West estimator. linearmodels also supports the Parzen (parzen) and the Quadratic Spectral (qs) kernels.
bandwith=3: the number of lags to use, automatically computed if not specified.
res = mod.fit(cov_type="kernel", kernel="bartlett", bandwidth=3)res.summary
PanelOLS Estimation Summary
Dep. Variable:
y
R-squared:
0.2078
Estimator:
PanelOLS
R-squared (Between):
0.2208
No. Observations:
5000
R-squared (Within):
0.1907
Date:
Sat, Jan 20 2024
R-squared (Overall):
0.2078
Time:
07:01:02
Log-likelihood
-1.057e+04
Cov. Estimator:
Driscoll-Kraay
F-statistic:
1310.7
Entities:
500
P-value
0.0000
Avg Obs:
10.0000
Distribution:
F(1,4998)
Min Obs:
10.0000
Max Obs:
10.0000
F-statistic (robust):
1708.6
P-value
0.0000
Time periods:
10
Distribution:
F(1,4998)
Avg Obs:
500.00
Min Obs:
500.00
Max Obs:
500.00
Parameter Estimates
Parameter
Std. Err.
T-stat
P-value
Lower CI
Upper CI
Intercept
0.0297
0.0218
1.3622
0.1732
-0.0130
0.0724
x
1.0348
0.0250
41.335
0.0000
0.9858
1.0839
Fama-MacBeth coefficients and standard errors
The Fama-MacBeth regression, introduced in Fama and MacBeth (1973), is a robust method for estimating financial models in the context of panel data in empirical finance research that is effective in dealing with cross-sectional correlation and heteroskedasticity. This technique involves two key steps. For our model, we would first estimate the following cross-sectional regression for each time period \(t\):
linearmodels provides a FamaMacBeth class that can be used to estimate Fama-MacBeth regressions. The following code estimates the Fama-MacBeth coefficients and standard errors:
It is common in practice to use Newey-West standard errors (introduced in Newey and West 1987) instead of the standard errors above in Fama-MacBeth regressions to account for heteroskedasticity and autocorrelation in the residuals. For more details on the Newey-West estimator, see my post on the topic.
To use Newey-West standard errors, you can pass the following arguments to the fit() method:
kernel="bartlett": The Newey-West estimator uses the Bartlett kernel. linearmodels also supports the Parzen (parzen) and the Quadratic Spectral (qs) kernels.
bandwith=3: the number of lags to use, automatically computed if not specified.
The following table summarizes the standard errors estimators available in linearmodels and their corresponding methods:
from linearmodels import PanelOLS# Model with constantmod_cst = PanelOLS.from_formula("y ~ 1 + x1 + x2", data)# Non-robust errors (regular OLS)res = mod_cst.fit()# White standard errors (heteroskedasticity-robust)res = mod_cst.fit(cov_type="robust")# Clustered standard errors# By entity (firm)res = mod_cst.fit( cov_type="clustered", cluster_entity=True, cluster_time=False, group_debias=True)# By time (year)res = mod_cst.fit( cov_type="clustered", cluster_entity=False, cluster_time=True, group_debias=True)# By entity and timeres = mod_cst.fit( cov_type="clustered", cluster_entity=True, cluster_time=True, group_debias=True)# Fixed effects models# Model with entity (firm) fixed effectsmod_fe = PanelOLS.from_formula("y ~ x1 + x2 + EntityEffects", data)# Model with time (year) fixed effectsmod_fe = PanelOLS.from_formula("y ~ x1 + x2 + TimeEffects", data)# Model with entity and time fixed effectsmod_fe = PanelOLS.from_formula("y ~ x1 + x2 + EntityEffects+ TimeEffects", data)# To cluster standard errors by entity and time, the usage is the same as above,# but applied to the FE model.# By entity (firm)res = mod_fe.fit( cov_type="clustered", cluster_entity=True, cluster_time=False, group_debias=True)# By time (year)res = mod_fe.fit( cov_type="clustered", cluster_entity=False, cluster_time=True, group_debias=True)# By entity and timeres = mod_fe.fit( cov_type="clustered", cluster_entity=True, cluster_time=True, group_debias=True)# Driscoll-Kraay standard errorsres = mod_cst.fit(cov_type="kernel", kernel="bartlett", bandwidth=3)# Fama-MacBeth regressionsfrom linearmodels import FamaMacBethmod_fm = FamaMacBeth.from_formula("y ~ 1 + x1 + x2", data)# Non-adjusted standard errorsres = mod_fm.fit()# Newey-West standard errors with 3 lagsres = mod_fm.fit(cov_type="kernel", kernel="bartlett", bandwidth=3)
Using a powerful tool like linearmodels can help you save time and avoid errors when estimating panel data models. However, it is important to understand the underlying econometric theory. If you are interested in learning more about panel data models and their associated standard error, I recommend reading the Petersen (2009) paper or watching the talk he gave at the 2008 FMA Annual meetings.