

Evaluation of regression coefficients, model & Estimation

Contents

Evaluation of regression coefficients
Evaluation of the model
Regression Estimation

Evaluation of regression coefficients

The t-test in regression analysis tests the following null hypothesis (H0) for the regression coefficients of the fitted model:

H0: the coefficient is not significantly different from zero

When the test statistic t falls outside the confidence interval and the p-value is close to zero, far below the significance level, the null hypothesis cannot be accepted. This discussion can be generalized as follows:

The calculated regression coefficients are random variables that form a distribution, so they can be tested. As mentioned above, this distribution has the same shape as the distribution of the errors, so the variance of a regression coefficient can be calculated from the variance of the error. As shown in Equation 1, the regression coefficient equals the ratio of the estimate (response minus error) to the independent variable. (The intercept term is a constant that is determined automatically by the other regression coefficients, so it can be ignored in the following calculation.)

$$\begin{align}\tag{1}&e=y-b_1x\\ &b_1=\frac{y-e}{x}\\ &e: \text{error}\end{align}$$

As shown in Equation 1, the variance of $b_1$ can be derived from the ratio of the variation of the estimate to that of the independent variable. The variation of the estimate equals the variation of the error. Therefore, the variances of the regression coefficients can be expressed as follows: $$\begin{align}\sigma^2_{b_1}&=\frac{\sigma^2_e}{\sum^n_{i=1}(x_i-\bar{x})^2}\\&=\frac{\sigma^2_e}{S_{xx}}\\ \sigma^2_{b_0}&=\frac{\sum^n_{i=1} x^2_i}{n\sum^n_{i=1}(x_i-\bar{x})^2}\sigma^2_e\\&=\frac{\sum^n_{i=1} x^2_i}{nS_{xx}} \sigma^2_e\end{align}$$
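
The following is a minimal numerical sketch of these formulas; the small arrays x and y are illustrative assumptions, not data used elsewhere in this post.

import numpy as np
#toy data, assumed for illustration only
x=np.array([1., 2., 3., 4., 5.])
y=np.array([2.1, 3.9, 6.2, 8.1, 9.9])
#least-squares coefficients of y = b0 + b1*x
b1=np.sum((x-x.mean())*(y-y.mean()))/np.sum((x-x.mean())**2)
b0=y.mean()-b1*x.mean()
e=y-(b0+b1*x) #errors
mse=np.sum(e**2)/(len(x)-2) #unbiased estimate of the error variance
Sxx=np.sum((x-x.mean())**2)
#standard errors of b1 and b0 from the formulas above
print(np.sqrt(mse/Sxx), np.sqrt(np.sum(x**2)/(len(x)*Sxx)*mse))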

In matrix form, the variance of the regression coefficients can be represented as shown in Equation 2.

$$\begin{align}\tag{2} \text{var(b)}&=(X^TX)^{-1} \sigma^2_e\\&=\begin{bmatrix}\frac{\sum x^2_i}{nS_{xx}}&-\frac{\bar{x}}{S_{xx}}\\-\frac{\bar{x}}{S_{xx}}&\frac{1}{S_{xx}} \end{bmatrix}\sigma^2_e \\ &=\begin{bmatrix}c_{00}&c_{01}\\c_{10}&c_{11} \end{bmatrix}\sigma^2_e \end{align}$$

In Equation 2, $c_{00}\sigma^2_e$ and $c_{11}\sigma^2_e$ are the variances of $b_0$ and $b_1$, respectively. The standard deviation of a regression coefficient calculated from this equation uses the mean squared error of the error term as an unbiased estimate of $\sigma^2_e$.
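
Continuing the same sketch, Equation 2 can be evaluated directly with numpy; the names x, e, and mse carry over from the block above.

#design matrix with an intercept column
X=np.column_stack([np.ones_like(x), x])
#covariance matrix of the coefficients: (X^T X)^{-1} * sigma^2_e
covb=np.linalg.inv(X.T@X)*mse
print(covb[0,0], covb[1,1]) #var(b0) and var(b1) match the formulas above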

Example 1)
  The following is data on the Open and Close prices of the iShares Semiconductor ETF (SOXX). Build a regression model with Open as the independent variable and the next day's Close as the response variable.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
import FinanceDataReader as fdr
st=pd.Timestamp(2020,5, 10)
et=pd.Timestamp(2021, 12, 24)
nd=fdr.DataReader('SOXX', st, et)[['Open','Close']]
nd.tail(3)
              Open   Close
Date
2021-12-21  523.19  530.12
2021-12-22  527.39  535.63
2021-12-23  536.70  540.81
ind=nd.values[:-1,0].reshape(-1,1) #independent: Open at day t
de=nd.values[1:,1].reshape(-1,1)   #dependent: Close at day t+1
new=nd.values[-1,0].reshape(-1,1)  #last Open, used for a new estimate below
new
array([[536.7]])
#standardization of independent and dependent variables
indScaler=preprocessing.StandardScaler().fit(ind)
deScaler=preprocessing.StandardScaler().fit(de)
indN=indScaler.transform(ind)
deN=deScaler.transform(de)
newN=indScaler.transform(new) #new is an Open value, so the Open scaler applies
newN
array([[1.81354151]])
#linear model
mod=LinearRegression().fit(indN, deN)
mod.coef_, mod.intercept_
(array([[0.99441754]]), array([0.00967765]))

The errors (residuals) of the model are calculated as follows.

pre=mod.predict(indN)
error=deN-pre
error[-3:]

array([[0.27241109],
       [0.15500268],
       [0.16754089]])

Since the standard error of the errors is related to the variance of the model's regression coefficient, it can be used as the basis for the standard error of the regression coefficient.
#standard error of error
se=pd.DataFrame(error).sem()
np.around(se, 4)
0    0.0054
dtype: float64
#standard error of regression coefficient $b_1$
sigmab=np.sqrt(error.var(ddof=2)/np.sum((indN-indN.mean())**2))
round(sigmab, 4)
0.0055

You can test the regression coefficient with a t distribution with n-2 degrees of freedom by applying the variance of $b_1$. The null and alternative hypotheses of this test are as follows:

$$\text{H0}: b_1=0, \quad \text{H1}: b_1 \neq 0$$

In this case, the t-test statistic is as follows:

$$\text{t statistic}=\frac{b_1-0}{\sigma_{b_1}}$$

Calculate the test statistic, the confidence interval, and the significance probability at the significance level of 0.05.

t=mod.coef_/sigmab
print(f'Statistics: {np.around(t, 3)}')
Statistics: [[182.452]]
df=len(pre)-2
df
408
#95% confidence interval of the regression coefficient
down, up=stats.t.interval(0.95, df, mod.coef_,sigmab)
print(f'lower: {np.around(down, 3)}, upper: {np.around(up, 3)}')
lower: [[0.984]], upper: [[1.005]]
#two-sided p-value of the t statistic
pval=stats.t.sf(t, df)*2
print(f'p-value: {np.around(pval, 3)}')
p-value: [[0.]]

The confidence interval above does not include zero, and the significance probability is 0, well below the significance level of 0.05. Therefore, the null hypothesis that $b_1=0$ is rejected. In other words, the fitted linear model is statistically significant.

Evaluation of the model

Regression is based on probability, so the model's estimates differ from the observations. To evaluate the model, the previously introduced ANOVA is applied to assess whether the level of difference is acceptable. Analysis of variance compares the variation (dispersion) arising between groups (variables) to determine whether it could plausibly occur by chance.

As shown in Figure 1, the mean $\bar{y}$ is used as an unbiased estimator of the observations y. If that mean matched the predictions of the regression model, the regression analysis would be meaningless. That is, if the regression model is appropriate, there is a difference between the mean and the estimate, and an error between the estimate and the observation. Therefore, the relationship between $\bar{y}$, $\hat{y}$, and y can be represented by Equation 3.

$$\begin{equation}\tag{3}y-\bar{y} = (\hat{y}-\bar{y})+(y-\hat{y})\end{equation}$$

#illustrative values for the figure (assumed, not from the data above)
x=np.linspace(0, 4, 50)
y=1.5*x-2   #regression line through the predicted point (2, 1)
ymu=3       #mean of Y
plt.figure(figsize=(7, 4))
plt.plot(x, y, label="regression line")
plt.hlines(ymu,0, 4, linestyle="--", label="Mean Line")
plt.hlines(-1,0, 4, linestyle="--", color="lightgray")
plt.scatter([2, 2, 2], [-1, 1, ymu], color="black")
plt.vlines(2, 1, ymu, color="red")
plt.vlines(2, -1, 1, color="blue")
plt.vlines(3.5, -1, ymu, color="black")
plt.text(1.3, 2.5, "SSReg", size=12, weight="bold", color="red")
plt.text(0.8, 2, "=|mean-predict|", weight="bold", color="red")
plt.text(1.3, -0.3, "SSE", size=12, weight="bold", color="blue")
plt.text(1, -0.8, "=|Y-predict|", weight="bold", color="blue")
plt.text(2.1, -1, 'Y', size=12, weight="bold")
plt.text(2.1, 1, 'predicted Y', size=12, weight="bold")
plt.text(2.1, ymu, 'mean Y', size=12, weight="bold")
plt.text(3.6, 1, 'SST', size=12, weight="bold", color="black")
plt.text(3.6, 0.5, '=|Y-mean Y|', size=12, weight="bold", color="black")
plt.xlim(0, 5)
plt.ylim(-2, 5)
plt.legend(loc="best")
plt.show()
Figure 1. Error analysis in regression model.

Equation 4 is derived from Equation 3.

$$\begin{align}\tag{4} \sum^n_{i=1}(y-\bar{y})^2&=\sum^n_{i=1}(\hat{y}-\bar{y})^2+\sum^n_{i=1}(y-\hat{y})^2\\ SST&=SSReg+SSE\\ \text{Total}&=\text{Explained part}+\text{Unexplained part}\end{align}$$

The left-hand side of Equation 4 is the total sum of squares (SST), and the right-hand side is the sum of the regression sum of squares (SSReg) and the error sum of squares (SSE). The regression sum of squares is the part that can be explained by the fitted regression model, while the residual sum of squares is the unexplained part. The smaller the unexplained portion of the total sum of squares, the greater the explanatory power of the model. In other words, as SSReg takes up a larger share of SST, the model becomes more accurate, as the quick check below illustrates.
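
As a quick sanity check, the decomposition in Equation 4 can be verified numerically with the arrays deN and pre from Example 1:

#SST on the left, SSReg + SSE on the right; True confirms the identity
print(np.isclose(np.sum((deN-deN.mean())**2),
                 np.sum((pre-deN.mean())**2)+np.sum((deN-pre)**2)))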

Equation 5 calculates the ratio of the variation that the model can explain to the total variation. This ratio, called the coefficient of determination ($R^2$), provides a basis for judging the explanatory power of the regression model. $$\begin{align}\tag{5} R^2&=\frac{SSReg}{SST}\\&=\frac{SST-SSE}{SST}\\&=1-\frac{SSE}{SST}\\&=1-\frac{\sum^n_{i=1}(y-\hat{y})^2}{\sum^n_{i=1}(y-\bar{y})^2} \end{align}$$

The coefficient of determination equals the square of the correlation coefficient r. Its value lies in [0, 1], and the closer it is to 1, the closer the model's estimates are to the observations. It can be calculated with the rsquared property of a model generated by statsmodels' OLS() and with the score() method of a model generated by the LinearRegression() class in sklearn.linear_model.

Example 2)
  Calculate the coefficient of determination for the model in Example 1. That model was built with the LinearRegression() class, so the coefficient of determination can be obtained with its score() method and cross-checked with statsmodels below.

R2=mod.score(indN, deN)
print(f'R2 by LinearRegression() : {round(R2, 3)}')
R2 by LinearRegression() : 0.988
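
As a minimal cross-check, assuming the statsmodels package is available, the rsquared property mentioned above should return the same value for the Example 1 arrays:

import statsmodels.api as sm
X=sm.add_constant(indN) #add an intercept column to the feature
olsMod=sm.OLS(deN, X).fit()
print(f'R2 by statsmodels OLS: {round(olsMod.rsquared, 3)}') #should match mod.score(indN, deN)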

Equation 6 follows the F distribution as the ratio of the explainable variation (MSReg) to the unexplainable variation (MSE), allowing a test of the following null hypothesis based on this distribution:

$$\text{H0:}\; b_1 = 0, \quad \text{H1:}\; b_1 \neq 0$$ $$\begin{equation}\tag{6}F=\frac{MSReg}{MSE}=\frac{SSReg/1}{SSE/(n-2)}\end{equation}$$

Calculate the sums of squares and the mean squares, and perform the F test as follows:

#total sum of squares and its mean square (df = n-1)
sst=np.sum((deN.mean()-deN)**2)
mst=sst/(len(deN)-1)
print(f'mst: {round(mst, 3)}')
mst: 1.003
#regression sum of squares and its mean square (df = 1)
ssreg=np.sum((deN.mean()-pre)**2)
msreg=ssreg/1
print(f'msreg: {round(msreg, 3)}')
msreg: 405.435
#error sum of squares and its mean square (df = n-2)
sse=np.sum((deN-pre)**2)
mse=sse/(len(deN)-2)
print(f'mse: {round(mse, 3)}')
mse: 0.012
#F statistic = MSReg/MSE
fratio=msreg/mse
print(f'F statistics: {round(fratio, 3)}')
F statistics: 33288.607
#95% interval of the F(1, n-2) distribution
ci=stats.f.interval(0.95, 1, df)
print(f'Lower: {round(ci[0], 3):}, Upper: {round(ci[1], 3)}')
Lower: 0.001, Upper: 5.061
#p-value of the F statistic
pval=stats.f.sf(fratio, 1, df)
print(f'p-val: {round(pval, 3)}')
p-val: 0.0

The results above indicate that the null hypothesis is rejected at the 0.05 significance level, based on both the confidence interval and the significance probability. You can also apply the f_regression() function to return the test statistic and significance probability.

from sklearn.feature_selection import f_regression
#pre is a linear function of indN, so the F statistic equals that of indN itself
re=f_regression(pre, deN.ravel())
print(f'statistics: {np.around(re[0], 3)}, p-value: {np.around(re[1], 3)}')
statistics: [33288.607], p-value: [0.]

Regression Estimation

Estimates from models created with statsmodels or sklearn's LinearRegression are calculated with the .predict() method. However, the result is an estimate, so it may differ from the actual value, as represented in Equation 7.

$$\begin{align}\tag{7}\hat{y}&=b_1x+b_0\\y&=b_1x+b_0+\text{error} \\ & =\hat{y}+\text{error} \end{align}$$

Example 3)
  Apply the model built in Example 1 to estimate the response for the new independent value new in that data.

pre=mod.predict(newN)
pre
array([[1.81309514]])
#return to original scale 
realPre=deScaler.inverse_transform(pre)
realPre
array([[536.6642586]])

Because the independent and response variables used in regression are random variables, the errors derived from them are also random variables. Therefore, by the central limit theorem, the errors can be expected to approximately follow a normal distribution. This normality can be checked with the stats.probplot() function.

errorRe=stats.probplot(error.ravel(), plot=plt)
print(f'slope:{round(errorRe[1][0], 3)}, bias:{round(errorRe[1][1],3)}, R2:{round(errorRe[1][2],3)}')
slope:0.11, bias:0.0, R2:0.994

The results above show that the errors are consistent with normality. Therefore, you can calculate a 95% confidence interval for the error.

mu=error.mean()
se=pd.DataFrame(error).sem(ddof=1)
print(f'mean: {np.around(mu,3)}, std.err.: {np.around(se.values,3)}')
mean: 0.0, std.err.: [0.005]
ci=stats.norm.interval(0.95, scale=se)
pd.Series([float(ci[0]), float(ci[1])], index=['Lower','Upper'])
Lower   -0.010669
Upper    0.010669
dtype: float64

Applying the above interval to the estimate for the new value gives the lower and upper limits as follows:

pre
array([[1.81309514]])
preCi=pre+np.array(ci).ravel() #add the interval bounds to the estimate
preCi
array([[1.8024258 , 1.82376448]])
re=pd.DataFrame(deScaler.inverse_transform(preCi.reshape(-1,1)), index=["lower", "upper"])
np.around(re, 3)
             0
lower  535.810
upper  537.519
