Evaluation of regression coefficients
The t-test in regression analysis tests the null hypothesis (H0) that a regression coefficient of the fitted model is zero.
These results indicate that the test statistic, t, lies outside the confidence interval and that the p-value is close to zero, far below the significance level. Therefore, for the dollar index results above, the null hypothesis is rejected. This discussion can be generalized as follows:
The calculated regression coefficients are themselves random variables with a distribution, so they can be tested. As mentioned above, this distribution has the same shape as the distribution of the errors, so the variance of a regression coefficient can be calculated from the variance of the error. As shown in Equation 1, the regression coefficient equals the ratio of the estimate (response minus error) to the independent variable. (The bias (intercept) term is a constant determined automatically by each regression coefficient, so it can be ignored in the following calculation.)
$$\begin{align}\tag{1}&e=y-b_1x\\ &b_1=\frac{y-e}{x}\\ &e: \text{error}\end{align}$$
As shown in Equation 1, the variance of $b_1$ follows from the ratio of the variance of the estimate to the variation of the independent variable, and the variance of the estimate equals the variance of the error. Therefore, the variances of the regression coefficients can be expressed as follows:
$$\begin{align}\sigma^2_{b_1}&=\frac{\sigma^2_e}{\sum^n_{i=1}(x_i-\bar{x})^2}\\&=\frac{\sigma^2_e}{S_{xx}}\\ \sigma^2_{b_0}&=\frac{\sum^n_{i=1} x^2_i}{n\sum^n_{i=1}(x_i-\bar{x})^2}\sigma^2_e\\&=\frac{\sum^n_{i=1} x^2_i}{nS_{xx}} \sigma^2_e\end{align}$$
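The variance formulas above can be checked numerically. The following is a minimal sketch on synthetic data (the data, seed, and variable names are assumptions for illustration, not the SOXX data used later); it estimates $\sigma^2_e$ by SSE/(n-2) and plugs it into the two formulas.

```python
import numpy as np

# Synthetic data (an assumption for illustration; not the SOXX data)
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 10, n)
y = 1.5 + 2.0 * x + rng.normal(0, 1.0, n)

# Least-squares estimates of the slope b1 and the intercept b0
x_bar, y_bar = x.mean(), y.mean()
Sxx = np.sum((x - x_bar) ** 2)
b1 = np.sum((x - x_bar) * (y - y_bar)) / Sxx
b0 = y_bar - b1 * x_bar

# Unbiased estimate of the error variance: SSE / (n - 2)
resid = y - (b0 + b1 * x)
sigma2_e = np.sum(resid ** 2) / (n - 2)

# Variances of the coefficients from the formulas above
var_b1 = sigma2_e / Sxx
var_b0 = sigma2_e * np.sum(x ** 2) / (n * Sxx)
print(f"se(b1) = {np.sqrt(var_b1):.4f}, se(b0) = {np.sqrt(var_b0):.4f}")
```

Since $\sum x_i^2 = S_{xx} + n\bar{x}^2$, the intercept variance can equivalently be written as $\sigma^2_e(1/n + \bar{x}^2/S_{xx})$.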
In matrix form, the variance of the regression coefficients can be written as in Equation 2.
$$\begin{align}\tag{2} \text{var(b)}&=(X^TX)^{-1} \sigma^2_e\\&=\begin{bmatrix}\frac{\sum x^2_i}{nS_{xx}}&-\frac{\bar{x}}{S_{xx}}\\-\frac{\bar{x}}{S_{xx}}&\frac{1}{S_{xx}} \end{bmatrix}\sigma^2_e \\ &=\begin{bmatrix}c_{00}&c_{01}\\c_{10}&c_{11} \end{bmatrix}\sigma^2_e \end{align}$$
In Equation 2, $c_{00}\sigma^2_e$ and $c_{11}\sigma^2_e$ are the variances of $b_0$ and $b_1$, respectively. Because $\sigma^2_e$ is unknown, the standard deviation of each regression coefficient is calculated by using the mean squared error (MSE) of the error term as an unbiased estimate of $\sigma^2_e$.
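As a quick check of Equation 2, the sketch below builds the design matrix for a simple regression on synthetic x values (an assumption for illustration) and compares $(X^TX)^{-1}$ with the closed-form entries $c_{00}, c_{01}, c_{11}$.

```python
import numpy as np

# Synthetic independent variable (an assumption for illustration)
rng = np.random.default_rng(1)
n = 50
x = rng.normal(5.0, 2.0, n)

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones(n), x])
C = np.linalg.inv(X.T @ X)

# Closed-form entries from Equation 2
x_bar = x.mean()
Sxx = np.sum((x - x_bar) ** 2)
C_closed = np.array([
    [np.sum(x ** 2) / (n * Sxx), -x_bar / Sxx],
    [-x_bar / Sxx, 1.0 / Sxx],
])
print(np.allclose(C, C_closed))
```

Multiplying either matrix by an estimate of $\sigma^2_e$ gives the covariance matrix of $(b_0, b_1)$.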
Example 1)
The following is data on the Open and Close of the iShares Semiconductor ETF (SOXX). Build a regression model with Open as the independent variable and Close as the response variable.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
import FinanceDataReader as fdr
st=pd.Timestamp(2020, 5, 10)
et=pd.Timestamp(2021, 12, 24)
nd=fdr.DataReader('SOXX', st, et)[['Open','Close']]
nd.tail(3)
Date | Open | Close |
---|---|---|
2021-12-21 | 523.19 | 530.12 |
2021-12-22 | 527.39 | 535.63 |
2021-12-23 | 536.70 | 540.81 |
ind=nd.values[:-1,0].reshape(-1,1) #independent variable: Open
de=nd.values[1:,1].reshape(-1,1) #dependent variable: Close of the following row
new=nd.values[-1,0].reshape(-1,1) #latest Open, used later for estimation
new
array([[536.7]])
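#standardization of the independent and dependent variables
indScaler=preprocessing.StandardScaler().fit(ind)
deScaler=preprocessing.StandardScaler().fit(de)
indN=indScaler.transform(ind)
deN=deScaler.transform(de) #the dependent variable is transformed with its own scaler
newN=indScaler.transform(new) #new is an Open value, so it uses the independent-variable scaler
newN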
array([[1.81354151]])
#linear model
mod=LinearRegression().fit(indN, deN)
mod.coef_, mod.intercept_
(array([[0.99441754]]), array([0.00967765]))
The error due to the above model is calculated as follows.
pre=mod.predict(indN)
error=deN-pre
error[-3:]
array([[0.27241109], [0.15500268], [0.16754089]])
The standard error computed from these errors is tied to the variance of the model's regression coefficient, so it can be regarded as an approximation of the standard error of the regression coefficient. (With a standardized independent variable, $S_{xx}=n$, so the two values differ only in the degrees of freedom used.)
#standard error of error
se=pd.DataFrame(error).sem()
np.around(se, 4)
0    0.0054
dtype: float64
#standard error of regression coefficient $b_1$
sigmab=np.sqrt(error.var(ddof=2)/np.sum((indN-indN.mean())**2))
round(sigmab, 4)
0.0055
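The closeness of the two values above (0.0054 and 0.0055) can be traced to the standardization of the independent variable. The sketch below uses hypothetical values and hypothetical residuals (both are assumptions, not the SOXX data) and standardizes with the population standard deviation (ddof=0), which is what StandardScaler uses; in that case $S_{xx}=n$.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 410
x = rng.normal(500.0, 40.0, n)      # hypothetical price-like values
resid = rng.normal(0.0, 0.11, n)    # hypothetical residuals
resid = resid - resid.mean()        # centered, like residuals of a fit with an intercept

# Standardize with the population standard deviation (ddof=0)
z = (x - x.mean()) / x.std()
Sxx = np.sum((z - z.mean()) ** 2)   # equals n for standardized data

# Standard error of the mean error vs. standard error of the slope
sem = resid.std(ddof=1) / np.sqrt(n)
se_b1 = np.sqrt(np.sum(resid ** 2) / (n - 2) / Sxx)
print(f"Sxx = {Sxx:.1f}, sem = {sem:.5f}, se(b1) = {se_b1:.5f}")
```

The two standard errors differ only by the factor $\sqrt{(n-1)/(n-2)}$, which is negligible for a sample of this size.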
You can test the regression coefficient with a t-distribution with n-2 degrees of freedom by applying the variance of $b_1$. The null hypothesis and alternative hypothesis of this test are as follows:
$$\text{H0}: b_1=0, \quad \text{H1}: b_1 \neq 0$$
In this case, the t-test statistic is as follows:
$$t=\frac{b_1-0}{\sigma_{b_1}}$$
Calculate the test statistic, the confidence interval, and the p-value at the significance level of 0.05.
t=mod.coef_/sigmab
print(f'Statistics: {np.around(t, 3)}')
Statistics: [[182.452]]
df=len(pre)-2
df
408
#95% confidence interval of the regression coefficient
down, up=stats.t.interval(0.95, df, mod.coef_, sigmab)
print(f'lower: {np.around(down, 3)}, upper: {np.around(up, 3)}')
lower: [[0.984]], upper: [[1.005]]
#two-sided p-value of the t statistic (H1: b_1 != 0)
pval=2*stats.t.sf(np.abs(t), df)
print(f'p-value: {np.around(pval, 3)}')
p-value: [[0.]]
The confidence interval above does not include zero, and the p-value is effectively 0, well below the significance level of 0.05. Therefore, the null hypothesis $b_1=0$ is rejected. In other words, the slope of the fitted linear model is statistically significant.
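The whole t-test can be condensed into a few lines and cross-checked against scipy.stats.linregress. This is a minimal sketch on synthetic data (the data and names are assumptions for illustration), using the two-sided p-value that matches the alternative hypothesis $b_1 \neq 0$.

```python
import numpy as np
from scipy import stats

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(2)
n = 100
x = rng.normal(0, 1, n)
y = 0.5 * x + rng.normal(0, 1, n)

# Least-squares fit and standard error of the slope
Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
mse = np.sum((y - b0 - b1 * x) ** 2) / (n - 2)
se_b1 = np.sqrt(mse / Sxx)

# t statistic, two-sided p-value, and 95% confidence interval
df = n - 2
t_stat = (b1 - 0) / se_b1
p_two_sided = 2 * stats.t.sf(abs(t_stat), df)
lower, upper = stats.t.interval(0.95, df, loc=b1, scale=se_b1)

# Reference values from SciPy
ref = stats.linregress(x, y)
print(f"t = {t_stat:.3f}, p = {p_two_sided:.4g}, CI = ({lower:.3f}, {upper:.3f})")
print(f"linregress: stderr = {ref.stderr:.4f}, p = {ref.pvalue:.4g}")
```

Rejecting H0 when the p-value is below 0.05 is equivalent to the 95% confidence interval excluding zero.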
Evaluation of the model
Regression is probabilistic, so the model's estimates differ from the observations. The model is evaluated with the previously introduced analysis of variance (ANOVA), which assesses whether that difference is acceptable. ANOVA compares the variation (dispersion) arising from different sources (variables) to determine whether it could plausibly occur by chance.
As shown in Figure 1, the mean $\bar{y}$ serves as an unbiased estimator of the observation y. If the mean coincided with the prediction of the regression model, the regression analysis would have no meaning. That is, if the regression model is appropriate, there is a difference between the mean and the estimate, plus an error between the estimate and the observation. Therefore, the relationship among $\bar{y}$, $\hat{y}$, and y can be represented by Equation 3.
$$\begin{equation}\tag{3}y-\bar{y} = (\hat{y}-\bar{y})+(y-\hat{y})\end{equation}$$
#x, y (points on the regression line) and ymu (mean of the observations) are assumed to be defined beforehand
plt.figure(figsize=(7, 4))
plt.plot(x, y, label="regression line")
plt.hlines(ymu, 0, 4, linestyle="--", label="Mean Line")
plt.hlines(-1, 0, 4, linestyle="--", color="lightgray")
plt.scatter([2, 2, 2], [-1, 1, ymu], color="black")
plt.vlines(2, 1, ymu, color="red")
plt.vlines(2, -1, 1, color="blue")
plt.vlines(3.5, -1, ymu, color="black")
plt.text(1.3, 2.5, "SSReg", size=12, weight="bold", color="red")
plt.text(0.8, 2, "=|mean-predict|", weight="bold", color="red")
plt.text(1.3, -0.3, "SSE", size=12, weight="bold", color="blue")
plt.text(1, -0.8, "=|Y-predict|", weight="bold", color="blue")
plt.text(2.1, -1, 'Y', size=12, weight="bold")
plt.text(2.1, 1, 'predicted Y', size=12, weight="bold")
plt.text(2.1, ymu, 'mean Y', size=12, weight="bold")
plt.text(3.6, 1, 'SST', size=12, weight="bold", color="black")
plt.text(3.6, 0.5, '=|Y-mean Y|', size=12, weight="bold", color="black")
plt.xlim(0, 5)
plt.ylim(-2, 5)
plt.legend(loc="best")
plt.show()
Equation 4 is derived from Equation 3 by squaring both sides and summing over all observations; for a least-squares fit with an intercept, the cross-product term sums to zero.
$$\begin{align}\tag{4} \sum^n_{i=1}(y-\bar{y})^2&=\sum^n_{i=1}(\hat{y}-\bar{y})^2+\sum^n_{i=1}(y-\hat{y})^2\\ SST&=SSReg+SSE\\ \text{Total}&=\text{Explainable part}+\text{Unexplainable part}\end{align}$$
The left-hand side of Equation 4 is the total sum of squares (SST), and the right-hand side is the sum of the regression sum of squares (SSReg) and the error sum of squares (SSE). SSReg is the part explained by the fitted regression model, while SSE (the residual sum of squares) is the unexplained part. The smaller the unexplained share of the total sum of squares, the greater the explanatory power of the model. In other words, the larger SSReg is as a share of SST, the more accurate the model.
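The decomposition in Equation 4 can be verified numerically. The sketch below fits a line with np.polyfit on synthetic data (an assumption for illustration) and checks that SST equals SSReg plus SSE up to floating-point error.

```python
import numpy as np

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(3)
x = rng.uniform(0, 4, 60)
y = 1.0 + 0.8 * x + rng.normal(0, 0.5, 60)

# Least-squares line; np.polyfit returns the slope first, then the intercept
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)        # total sum of squares
ssreg = np.sum((y_hat - y.mean()) ** 2)  # explained by the regression
sse = np.sum((y - y_hat) ** 2)           # unexplained (residual)
print(f"SST = {sst:.4f}, SSReg + SSE = {ssreg + sse:.4f}")
```

The equality relies on the fit being a least-squares fit with an intercept; for other estimators the cross-product term need not vanish.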
Equation 5 computes the ratio of the explained variation to the total variation, called the coefficient of determination ($R^2$), which provides a basis for judging the explanatory power of the regression model.
$$\begin{align}\tag{5} R^2&=\frac{SSReg}{SST}\\&=\frac{SST-SSE}{SST}\\&=1-\frac{SSE}{SST}\\&=1-\frac{\sum^n_{i=1}(y-\hat{y})^2}{\sum^n_{i=1}(y-\bar{y})^2} \end{align}$$
In simple linear regression, the coefficient of determination equals the square of the correlation coefficient r. It takes values in [0, 1], and the closer it is to 1, the closer the model's estimates are to the observations. It can be obtained from the rsquared attribute of a model fitted with the statsmodels OLS() class or from the score() method of a model created with the LinearRegression() class in sklearn.linear_model.
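The relationship between $R^2$ and the correlation coefficient can be illustrated with a small helper. The function name r_squared and the synthetic data are assumptions for illustration; the helper simply evaluates $1-SSE/SST$ from Equation 5.

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination, 1 - SSE/SST (hypothetical helper)."""
    sse = np.sum((y - y_hat) ** 2)
    sst = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - sse / sst

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(4)
x = rng.normal(0, 1, 120)
y = 2.0 * x + rng.normal(0, 1.5, 120)

# Simple least-squares fit
b1, b0 = np.polyfit(x, y, 1)
r2 = r_squared(y, b0 + b1 * x)

# Pearson correlation coefficient between x and y
r = np.corrcoef(x, y)[0, 1]
print(f"R2 = {r2:.4f}, r^2 = {r ** 2:.4f}")
```

The identity $R^2=r^2$ holds for a single independent variable with an intercept; with several independent variables, $R^2$ is instead the squared correlation between y and $\hat{y}$.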
Example 2)
Calculate the coefficient of determination for the model in Example 1. That model uses the LinearRegression() class, so the coefficient of determination can be obtained with its score() method.
R2=mod.score(indN, deN)
print(f'R2 by LinearRegression() : {round(R2, 3)}')
R2 by LinearRegression() : 0.988
The statistic in Equation 6, the ratio of the explained mean square (MSReg) to the unexplained mean square (MSE), follows an F distribution, which allows the following null hypothesis to be tested:
$$\text{H0:} b_1 = 0, \; \text{H1:} b_1 \neq 0$$
$$\begin{equation}\tag{6}F=\frac{MSReg}{MSE}=\frac{SSReg/1}{SSE/(n-2)}\end{equation}$$
Calculate the sums of squares and the mean squares and perform the F test as follows:
#total sum of squares and its mean square
sst=np.sum((deN.mean()-deN)**2)
mst=sst/(len(deN)-1)
print(f'mst: {round(mst, 3)}')
mst: 1.003
#regression sum of squares (1 degree of freedom)
ssreg=np.sum((deN.mean()-pre)**2)
msreg=ssreg/1
print(f'msreg: {round(msreg, 3)}')
msreg: 405.435
#error sum of squares (n-2 degrees of freedom)
sse=np.sum((deN-pre)**2)
mse=sse/(len(deN)-2)
print(f'mse: {round(mse, 3)}')
mse: 0.012
#F statistic
fratio=msreg/mse
print(f'F statistics: {round(fratio, 3)}')
F statistics: 33288.607
#central 95% interval of the F(1, df) distribution
ci=stats.f.interval(0.95, 1, df)
print(f'Lower: {round(ci[0], 3)}, Upper: {round(ci[1], 3)}')
Lower: 0.001, Upper: 5.061
#upper-tail p-value
pval=stats.f.sf(msreg/mse, 1, df)
print(f'p-val: {round(pval, 3)}')
p-val: 0.0
The results above indicate that the null hypothesis is rejected, based on both the interval and the p-value at the significance level of 0.05. You can also apply the f_regression function, which returns the test statistic and the p-value.
from sklearn.feature_selection import f_regression
#F test of the independent variable against the response
re=f_regression(indN, deN.ravel())
print(f'statistics: {np.around(re[0], 3)}, p-value: {np.around(re[1], 3)}')
statistics: [33288.607], p-value: [0.]
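In simple linear regression the F test and the two-sided t-test of the slope are equivalent, since $F=t^2$. The sketch below illustrates this on synthetic data (the data and names are assumptions for illustration) and also computes the upper-tail critical value of the F distribution.

```python
import numpy as np
from scipy import stats

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(7)
n = 80
x = rng.normal(0, 1, n)
y = 0.3 * x + rng.normal(0, 1, n)

# Least-squares fit
Sxx = np.sum((x - x.mean()) ** 2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

# Mean squares and F statistic
msreg = np.sum((y_hat - y.mean()) ** 2) / 1
mse = np.sum((y - y_hat) ** 2) / (n - 2)
F = msreg / mse
p_F = stats.f.sf(F, 1, n - 2)

# t statistic of the slope and its two-sided p-value
t_stat = b1 / np.sqrt(mse / Sxx)
p_t = 2 * stats.t.sf(abs(t_stat), n - 2)

# Upper-tail critical value at the 0.05 significance level
f_crit = stats.f.ppf(0.95, 1, n - 2)
print(f"F = {F:.4f}, t^2 = {t_stat ** 2:.4f}, p_F = {p_F:.4g}, p_t = {p_t:.4g}")
print(f"reject H0: {F > f_crit}")
```

Because large values of F are the evidence against H0, comparing F with the upper-tail critical value is one common way to state the decision rule.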
Regression Estimation
Estimates from models created with statsmodels or with LinearRegression in sklearn are calculated with the .predict() method. However, this is an estimate, so it may differ from the actual value. It can be represented as in Equation 7.
Example 3)
Apply the model built in Example 1 to estimate the response for the independent variable new in that data.
pre=mod.predict(newN)
pre
array([[1.81309514]])
#return to original scale
realPre=deScaler.inverse_transform(pre)
realPre
array([[536.6642586]])
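The scale, predict, and inverse-scale round trip can be sketched with NumPy alone. This example uses synthetic "Open"-like and "Close"-like values (an assumption, not the SOXX data) and assumes each variable is standardized with its own mean and standard deviation; it then checks that the back-transformed estimate matches a fit on the raw data.

```python
import numpy as np

# Synthetic data (an assumption for illustration)
rng = np.random.default_rng(6)
x = rng.normal(500.0, 40.0, 300)     # hypothetical "Open"-like values
y = x + rng.normal(0.0, 5.0, 300)    # hypothetical "Close"-like values
x_new = 536.7

# Standardize each variable with its own mean and standard deviation
x_mu, x_sd = x.mean(), x.std()
y_mu, y_sd = y.mean(), y.std()
xz, yz = (x - x_mu) / x_sd, (y - y_mu) / y_sd

# Fit on standardized data, predict, then return to the original y scale
slope, intercept = np.polyfit(xz, yz, 1)
pred_z = intercept + slope * (x_new - x_mu) / x_sd   # scale x_new like x
pred = pred_z * y_sd + y_mu                          # undo the y scaling

# The same estimate from a fit on the raw data
raw_slope, raw_intercept = np.polyfit(x, y, 1)
pred_raw = raw_intercept + raw_slope * x_new
print(f"standardized route: {pred:.4f}, raw route: {pred_raw:.4f}")
```

When both variables are standardized with their own statistics, the fitted intercept is zero up to rounding and the slope equals the correlation coefficient.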
Because the independent and response variables used in regression are random variables, the error derived from them is also a random variable. Therefore, by the central limit theorem, the error is expected to be approximately normally distributed. This normality can be checked with the stats.probplot() function.
#probplot returns the slope, intercept, and correlation coefficient r of the fitted line
errorRe=stats.probplot(error.ravel(), plot=plt)
print(f'slope:{round(errorRe[1][0], 3)}, bias:{round(errorRe[1][1],3)}, r:{round(errorRe[1][2],3)}')
slope:0.11, bias:0.0, r:0.994
The above results show that the errors are consistent with normality. Therefore, you can calculate the 95% confidence interval of the error (significance level of 0.05).
mu=error.mean()
se=pd.DataFrame(error).sem(ddof=1)
print(f'mean: {np.around(mu,3)}, std.err.: {np.around(se.values,3)}')
mean: 0.0, std.err.: [0.005]
ci=stats.norm.interval(0.95, scale=se.values[0])
ci=pd.Series(ci, index=['Lower','Upper']) #keep the interval as a Series for later use
ci
Lower   -0.010669
Upper    0.010669
dtype: float64
Applying the above results to estimates of the new variable can indicate the upper and lower limits as follows:
pre
array([[1.81309514]])
preCi=pre+np.asarray(ci).ravel() #add the lower and upper limits to the estimate
preCi
array([[1.8024258 , 1.82376448]])
re=pd.DataFrame(deScaler.inverse_transform(preCi.reshape(-1,1)), index=["lower", "upper"])
np.around(re, 3)
| | 0 |
|---|---|
| lower | 535.810 |
| upper | 537.519 |