

Evaluation of regression coefficients, model & Estimation

Contents

Evaluation of regression coefficients
Evaluation of the model
Regression Estimation

Evaluation of regression coefficients

The t-test in regression analysis tests the following null hypothesis (H0) for the regression coefficients of the fitted model:

H0: the coefficient is not significantly different from zero

When the test statistic t falls outside the confidence interval and the p-value is close to zero, far below the significance level, the null hypothesis cannot be accepted. This discussion can be generalized as follows:

The calculated regression coefficients are random variables that form a distribution, so they can be tested. As mentioned above, this distribution has the same shape as the distribution of the errors, so the variance of a regression coefficient can be calculated from the variance of the error. As shown in Equation 1, the regression coefficient equals the ratio of the estimate (response minus error) to the independent variable. (The intercept term is a constant that is determined automatically by the other regression coefficients, so it can be ignored in the following calculation.)

$$\begin{align}\tag{1}&e=y-b_1x\\ &b_1=\frac{y-e}{x}\\ &e: \text{error}\end{align}$$

As shown in Equation 1, the variance of $b_1$ can be derived from the ratio of the variation of the estimate to that of the independent variable. The variation of the estimate equals the variation of the error. Therefore, the variances of the regression coefficients can be expressed as follows: $$\begin{align}\sigma^2_{b_1}&=\frac{\sigma^2_e}{\sum^n_{i=1}(x_i-\bar{x})^2}\\&=\frac{\sigma^2_e}{S_{xx}}\\ \sigma^2_{b_0}&=\frac{\sum^n_{i=1} x^2_i}{n\sum^n_{i=1}(x_i-\bar{x})^2}\sigma^2_e\\&=\frac{\sum^n_{i=1} x^2_i}{nS_{xx}} \sigma^2_e\end{align}$$
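
The following is a minimal numerical sketch of these formulas; the small arrays x and y are illustrative assumptions, not data used elsewhere in this post.

import numpy as np
#toy data, assumed for illustration only
x=np.array([1., 2., 3., 4., 5.])
y=np.array([2.1, 3.9, 6.2, 8.1, 9.9])
#least-squares coefficients of y = b0 + b1*x
b1=np.sum((x-x.mean())*(y-y.mean()))/np.sum((x-x.mean())**2)
b0=y.mean()-b1*x.mean()
e=y-(b0+b1*x) #errors
mse=np.sum(e**2)/(len(x)-2) #unbiased estimate of the error variance
Sxx=np.sum((x-x.mean())**2)
#standard errors of b1 and b0 from the formulas above
print(np.sqrt(mse/Sxx), np.sqrt(np.sum(x**2)/(len(x)*Sxx)*mse))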

In matrix form, the variance of the regression coefficients can be represented as shown in Equation 2.

$$\begin{align}\tag{2} \text{var(b)}&=(X^TX)^{-1} \sigma^2_e\\&=\begin{bmatrix}\frac{\sum x^2_i}{nS_{xx}}&-\frac{\bar{x}}{S_{xx}}\\-\frac{\bar{x}}{S_{xx}}&\frac{1}{S_{xx}} \end{bmatrix}\sigma^2_e \\ &=\begin{bmatrix}c_{00}&c_{01}\\c_{10}&c_{11} \end{bmatrix}\sigma^2_e \end{align}$$

In Equation 2, $c_{00}\sigma^2_e$ and $c_{11}\sigma^2_e$ are the variances of $b_0$ and $b_1$, respectively. The standard deviation of a regression coefficient calculated from this equation uses the mean squared error of the error term as an unbiased estimate of $\sigma^2_e$.
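
Continuing the same sketch, Equation 2 can be evaluated directly with numpy; the names x, e, and mse carry over from the block above.

#design matrix with an intercept column
X=np.column_stack([np.ones_like(x), x])
#covariance matrix of the coefficients: (X^T X)^{-1} * sigma^2_e
covb=np.linalg.inv(X.T@X)*mse
print(covb[0,0], covb[1,1]) #var(b0) and var(b1) match the formulas above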

Example 1)
  The following is data on the Open and Close prices of the iShares Semiconductor ETF (SOXX). Build a regression model with Open as the independent variable and the next day's Close as the response variable.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
import FinanceDataReader as fdr
st=pd.Timestamp(2020,5, 10)
et=pd.Timestamp(2021, 12, 24)
nd=fdr.DataReader('SOXX', st, et)[['Open','Close']]
nd.tail(3)
              Open   Close
Date
2021-12-21  523.19  530.12
2021-12-22  527.39  535.63
2021-12-23  536.70  540.81
ind=nd.values[:-1,0].reshape(-1,1) #independent: Open at day t
de=nd.values[1:,1].reshape(-1,1)   #dependent: Close at day t+1
new=nd.values[-1,0].reshape(-1,1)  #last Open, used for a new estimate below
new
array([[536.7]])
#standardization of independent and dependent variables
indScaler=preprocessing.StandardScaler().fit(ind)
deScaler=preprocessing.StandardScaler().fit(de)
indN=indScaler.transform(ind)
deN=deScaler.transform(de)
newN=indScaler.transform(new) #new is an Open value, so the Open scaler applies
newN
array([[1.81354151]])
#linear model
mod=LinearRegression().fit(indN, deN)
mod.coef_, mod.intercept_
(array([[0.99441754]]), array([0.00967765]))

The errors (residuals) of the model are calculated as follows.

pre=mod.predict(indN)
error=deN-pre
error[-3:]

array([[0.27241109],
       [0.15500268],
       [0.16754089]])

Since the standard error of the errors is related to the variance of the model's regression coefficient, it can be used as the basis for the standard error of the regression coefficient.
#standard error of error
se=pd.DataFrame(error).sem()
np.around(se, 4)
0    0.0054
dtype: float64
#standard error of regression coefficient $b_1$
sigmab=np.sqrt(error.var(ddof=2)/np.sum((indN-indN.mean())**2))
round(sigmab, 4)
0.0055

You can test the regression coefficient with a t distribution with n-2 degrees of freedom by applying the variance of $b_1$. The null and alternative hypotheses of this test are as follows:

$$\text{H0}: b_1=0, \quad \text{H1}: b_1 \neq 0$$

In this case, the t-test statistic is as follows:

$$\text{t statistic}=\frac{b_1-0}{\sigma_{b_1}}$$

Calculate the test statistic, the confidence interval, and the significance probability at the significance level of 0.05.

t=mod.coef_/sigmab
print(f'Statistics: {np.around(t, 3)}')
Statistics: [[182.452]]
df=len(pre)-2
df
408
#95% confidence interval of the regression coefficient
down, up=stats.t.interval(0.95, df, mod.coef_,sigmab)
print(f'lower: {np.around(down, 3)}, upper: {np.around(up, 3)}')
lower: [[0.984]], upper: [[1.005]]
#two-sided p-value of the t statistic
pval=stats.t.sf(t, df)*2
print(f'p-value: {np.around(pval, 3)}')
p-value: [[0.]]

The confidence interval above does not include zero, and the significance probability is 0, well below the significance level of 0.05. Therefore, the null hypothesis that $b_1=0$ is rejected. In other words, the fitted linear model is statistically significant.

Evaluation of the model

Regression is based on probability, so the model's estimates differ from the observations. To evaluate the model, the previously introduced ANOVA is applied to assess whether the level of difference is acceptable. Analysis of variance compares the variation (dispersion) arising between groups (variables) to determine whether it could plausibly occur by chance.

As shown in Figure 1, the mean $\bar{y}$ is used as an unbiased estimator of the observations y. If that mean matched the predictions of the regression model, the regression analysis would be meaningless. That is, if the regression model is appropriate, there is a difference between the mean and the estimate, and an error between the estimate and the observation. Therefore, the relationship between $\bar{y}$, $\hat{y}$, and y can be represented by Equation 3.

$$\begin{equation}\tag{3}y-\bar{y} = (\hat{y}-\bar{y})+(y-\hat{y})\end{equation}$$

#illustrative values for the figure (assumed, not from the data above)
x=np.linspace(0, 4, 50)
y=1.5*x-2   #regression line through the predicted point (2, 1)
ymu=3       #mean of Y
plt.figure(figsize=(7, 4))
plt.plot(x, y, label="regression line")
plt.hlines(ymu,0, 4, linestyle="--", label="Mean Line")
plt.hlines(-1,0, 4, linestyle="--", color="lightgray")
plt.scatter([2, 2, 2], [-1, 1, ymu], color="black")
plt.vlines(2, 1, ymu, color="red")
plt.vlines(2, -1, 1, color="blue")
plt.vlines(3.5, -1, ymu, color="black")
plt.text(1.3, 2.5, "SSReg", size=12, weight="bold", color="red")
plt.text(0.8, 2, "=|mean-predict|", weight="bold", color="red")
plt.text(1.3, -0.3, "SSE", size=12, weight="bold", color="blue")
plt.text(1, -0.8, "=|Y-predict|", weight="bold", color="blue")
plt.text(2.1, -1, 'Y', size=12, weight="bold")
plt.text(2.1, 1, 'predicted Y', size=12, weight="bold")
plt.text(2.1, ymu, 'mean Y', size=12, weight="bold")
plt.text(3.6, 1, 'SST', size=12, weight="bold", color="black")
plt.text(3.6, 0.5, '=|Y-mean Y|', size=12, weight="bold", color="black")
plt.xlim(0, 5)
plt.ylim(-2, 5)
plt.legend(loc="best")
plt.show()
Figure 1. Error analysis in regression model.

Equation 4 is derived from Equation 3.

$$\begin{align}\tag{4} \sum^n_{i=1}(y-\bar{y})^2&=\sum^n_{i=1}(\hat{y}-\bar{y})^2+\sum^n_{i=1}(y-\hat{y})^2\\ SST&=SSReg+SSE\\ \text{Total}&=\text{Explained part}+\text{Unexplained part}\end{align}$$

The left-hand side of Equation 4 is the total sum of squares (SST), and the right-hand side is the sum of the regression sum of squares (SSReg) and the error sum of squares (SSE). The regression sum of squares is the part that can be explained by the fitted regression model, while the residual sum of squares is the unexplained part. The smaller the unexplained portion of the total sum of squares, the greater the explanatory power of the model. In other words, as SSReg takes up a larger share of SST, the model becomes more accurate, as the quick check below illustrates.
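
As a quick sanity check, the decomposition in Equation 4 can be verified numerically with the arrays deN and pre from Example 1:

#SST on the left, SSReg + SSE on the right; True confirms the identity
print(np.isclose(np.sum((deN-deN.mean())**2),
                 np.sum((pre-deN.mean())**2)+np.sum((deN-pre)**2)))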

Equation 5 calculates the ratio of the variation that the model can explain to the total variation. This ratio, called the coefficient of determination ($R^2$), provides a basis for judging the explanatory power of the regression model. $$\begin{align}\tag{5} R^2&=\frac{SSReg}{SST}\\&=\frac{SST-SSE}{SST}\\&=1-\frac{SSE}{SST}\\&=1-\frac{\sum^n_{i=1}(y-\hat{y})^2}{\sum^n_{i=1}(y-\bar{y})^2} \end{align}$$

The coefficient of determination equals the square of the correlation coefficient r. Its value lies in [0, 1], and the closer it is to 1, the closer the model's estimates are to the observations. It can be calculated with the rsquared property of a model generated by statsmodels' OLS() and with the score() method of a model generated by the LinearRegression() class in sklearn.linear_model.

Example 2)
  Calculate the coefficient of determination for the model in Example 1. That model was built with the LinearRegression() class, so the coefficient of determination can be obtained with its score() method and cross-checked with statsmodels below.

R2=mod.score(indN, deN)
print(f'R2 by LinearRegression() : {round(R2, 3)}')
R2 by LinearRegression() : 0.988
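
As a minimal cross-check, assuming the statsmodels package is available, the rsquared property mentioned above should return the same value for the Example 1 arrays:

import statsmodels.api as sm
X=sm.add_constant(indN) #add an intercept column to the feature
olsMod=sm.OLS(deN, X).fit()
print(f'R2 by statsmodels OLS: {round(olsMod.rsquared, 3)}') #should match mod.score(indN, deN)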

Equation 6 follows the F distribution as the ratio of the explainable variation (MSReg) to the unexplainable variation (MSE), allowing a test of the following null hypothesis based on this distribution:

$$\text{H0:}\; b_1 = 0, \quad \text{H1:}\; b_1 \neq 0$$ $$\begin{equation}\tag{6}F=\frac{MSReg}{MSE}=\frac{SSReg/1}{SSE/(n-2)}\end{equation}$$

Calculate the sums of squares and the mean squares, and perform the F test as follows:

#total sum of squares and its mean square (df = n-1)
sst=np.sum((deN.mean()-deN)**2)
mst=sst/(len(deN)-1)
print(f'mst: {round(mst, 3)}')
mst: 1.003
#regression sum of squares and its mean square (df = 1)
ssreg=np.sum((deN.mean()-pre)**2)
msreg=ssreg/1
print(f'msreg: {round(msreg, 3)}')
msreg: 405.435
#error sum of squares and its mean square (df = n-2)
sse=np.sum((deN-pre)**2)
mse=sse/(len(deN)-2)
print(f'mse: {round(mse, 3)}')
mse: 0.012
#F statistic = MSReg/MSE
fratio=msreg/mse
print(f'F statistics: {round(fratio, 3)}')
F statistics: 33288.607
#95% interval of the F(1, n-2) distribution
ci=stats.f.interval(0.95, 1, df)
print(f'Lower: {round(ci[0], 3):}, Upper: {round(ci[1], 3)}')
Lower: 0.001, Upper: 5.061
#p-value of the F statistic
pval=stats.f.sf(fratio, 1, df)
print(f'p-val: {round(pval, 3)}')
p-val: 0.0

The results above indicate that the null hypothesis is rejected at the 0.05 significance level, based on both the confidence interval and the significance probability. You can also apply the f_regression() function to return the test statistic and significance probability.

from sklearn.feature_selection import f_regression
#pre is a linear function of indN, so the F statistic equals that of indN itself
re=f_regression(pre, deN.ravel())
print(f'statistics: {np.around(re[0], 3)}, p-value: {np.around(re[1], 3)}')
statistics: [33288.607], p-value: [0.]

Regression Estimation

Estimates from models created with statsmodels or sklearn's LinearRegression are calculated with the .predict() method. However, the result is an estimate, so it may differ from the actual value, as represented in Equation 7.

$$\begin{align}\tag{7}\hat{y}&=b_1x+b_0\\y&=b_1x+b_0+\text{error} \\ & =\hat{y}+\text{error} \end{align}$$

Example 3)
  Apply the model built in Example 1 to estimate the response for the new independent value new in that data.

pre=mod.predict(newN)
pre
array([[1.81309514]])
#return to original scale 
realPre=deScaler.inverse_transform(pre)
realPre
array([[536.6642586]])

Because the independent and response variables used in regression are random variables, the errors derived from them are also random variables. Therefore, by the central limit theorem, the errors can be expected to approximately follow a normal distribution. This normality can be checked with the stats.probplot() function.

errorRe=stats.probplot(error.ravel(), plot=plt)
print(f'slope:{round(errorRe[1][0], 3)}, bias:{round(errorRe[1][1],3)}, R2:{round(errorRe[1][2],3)}')
slope:0.11, bias:0.0, R2:0.994

The results above show that the errors are consistent with normality. Therefore, you can calculate a 95% confidence interval for the error.

mu=error.mean()
se=pd.DataFrame(error).sem(ddof=1)
print(f'mean: {np.around(mu,3)}, std.err.: {np.around(se.values,3)}')
mean: 0.0, std.err.: [0.005]
ci=stats.norm.interval(0.95, scale=se)
pd.Series([float(ci[0]), float(ci[1])], index=['Lower','Upper'])
Lower   -0.010669
Upper    0.010669
dtype: float64

Applying the above interval to the estimate for the new value gives the lower and upper limits as follows:

pre
array([[1.81309514]])
preCi=pre+np.array(ci).ravel() #add the interval bounds to the estimate
preCi
array([[1.8024258 , 1.82376448]])
re=pd.DataFrame(deScaler.inverse_transform(preCi.reshape(-1,1)), index=["lower", "upper"])
np.around(re, 3)
             0
lower  535.810
upper  537.519
