
Autocorrelation & Mean Squared Error

Contents

  • Residual (Error)
  • Autocorrelation analysis
  • Mean Squared Error

Residual (Error)

The generated regression model needs to be statistically tested, and the main object of that test is the error: the difference between the observed and estimated values, calculated by Equation 1.

$$\begin{align}\tag{1}\text{e}&=y-(b_0+b_1x)\\&=y-\hat{y} \end{align}$$

Errors in the regression model must satisfy the following assumptions:

  • The errors are random variables that follow a normal distribution.
    • Because the independent variables are random variables that follow a normal distribution, the error between the response and its estimate is also a normally distributed random variable. This means the error cannot be artificially adjusted.
  • Homoscedasticity of the error terms
    • Many regression lines are possible for the same data, as shown in Figure 1, so a probability distribution can be constructed for the regression coefficients. The mean of that distribution is the coefficient produced by the least squares method. Because the residuals vary together with the regression coefficients, the estimates from the least squares model become the means of the residual distributions, and their variances equal those of the coefficients. As a result, the variance of the residual distribution is the same for every prediction; this property is called homoscedasticity.
  • No autocorrelation between errors at different time points
    • This means there is no systematic relationship between the errors. For real data, however, especially time series, this assumption is hard to satisfy because successive observations are related. You can reduce this relationship by choosing different independent variables or by applying regularization methods such as the lasso.
import numpy as np
import pandas as pd
from scipy import stats
from sklearn import preprocessing
import matplotlib.pyplot as plt
import FinanceDataReader as fdr
from sklearn.linear_model import LinearRegression
x=np.linspace(-5, 5, 100).reshape(-1,1)
r=np.random.rand(100,1)*10   # uniform noise
y=1.2*x+r                    # underlying linear relationship plus noise
m=LinearRegression()
m.fit(x, y)
p=m.predict(x)
plt.figure(figsize=(7,4))
plt.scatter(x, y)
plt.plot(x, p, color="red", label="regression")
# draw several alternative lines with random slopes and intercepts
for i, j in zip(np.random.rand(5)*3.1, np.random.rand(5)*5.1):
    y1=i*x+j
    plt.plot(x, y1, '--')
plt.xlabel("x", size=12, weight="bold")
plt.ylabel("y", size=12, weight="bold")
plt.legend(loc="best")
plt.show()
Figure 1. Various possible regression lines for the same data; the red line is the least squares fit.
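The normality assumption listed above can be checked directly on the residuals. The following is a minimal sketch using the Shapiro-Wilk test from scipy.stats (already imported above); it reuses the y and p objects created for Figure 1, and a small p-value would indicate a departure from normality.

res=(y-p).flatten()
stat, pval=stats.shapiro(res)   # H0: the residuals follow a normal distribution
print(f'Shapiro-Wilk statistic: {round(stat, 3)}, p-value: {round(pval, 3)}')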

Autocorrelation analysis

Correlation represents the relationship between two variables, whereas autocorrelation measures the relationship between values of a single variable separated in time. In other words, if correlation describes the relationship between two columns (variables) of a data matrix, autocorrelation describes the relationship between values within a single column at different lags. The degree of autocorrelation is represented by the autocorrelation coefficient ($R_h$) in Equation 2.

$$\begin{align}\tag{2} R_h& =\frac{\text{Autocovariance}}{\text{Variance}}\\ &=\frac{\sum^{N-h}_{t=1} (x_t-\bar{x})(x_{t+h}-\bar{x})}{\sum^N_{t=1}(x_t-\bar{x})^2}\end{align}$$

The autocorrelation coefficient in Equation 2 can be calculated with pandas.Series.autocorr(lag=1). This function applies only to Series objects, that is, one-dimensional vectors consisting of a single column or row.
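For instance, on a small toy series the lag-1 autocorrelation can be computed as follows; the numbers are purely illustrative.

s=pd.Series([1.0, 2.0, 3.0, 2.5, 3.5, 4.0])
print(round(s.autocorr(lag=1), 3))   # Pearson correlation between s[t] and s[t+1]
0.74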

The following example uses the daily opening and closing prices of the NASDAQ composite index. A linear model is fit with the opening price as the independent variable and the following day's closing price as the dependent variable.

import FinanceDataReader as fdr
st=pd.Timestamp(2020,5, 10)
et=pd.Timestamp(2021, 12, 22)
nd=fdr.DataReader('IXIC', st, et)[['Open','Close']]
nd.tail(3)
Open Close
Date
2021-12-20 14933.0 14980.9
2021-12-21 15140.4 15341.1
2021-12-22 15319.2 15521.9
ind=nd.values[:-1,0].reshape(-1,1)   # today's opening price
de=nd.values[1:,1].reshape(-1,1)     # next day's closing price
new=nd.values[-1,0].reshape(-1,1)    # last opening price, kept for a new prediction
new
array([[15319.2]])
indScaler=preprocessing.StandardScaler().fit(ind)
deScaler=preprocessing.StandardScaler().fit(de)
indN=indScaler.transform(ind)
deN=deScaler.transform(de)           # scale the response with its own scaler
newN=indScaler.transform(new)        # the new value is an opening price
newN
array([[1.23623733]])
mod=LinearRegression().fit(indN, deN)
pre=mod.predict(indN)
error=deN-pre                        # residuals of the regression
mu_e=error.mean()
r=np.sum((error[:-1]-mu_e)*(error[1:]-mu_e))/(np.sum((error-mu_e)**2))
print(f'autocorr. coeff.: {round(r, 3)}')
autocorr. coeff.: 0.296
r_h=pd.Series(error.flatten()).autocorr(lag=1)
print(f'autocorr. coeff.: {round(r_h, 3)}')
autocorr. coeff.: 0.298

The autocorrelation coefficient lies in the range [-1, 1], and a value of 0 indicates no autocorrelation. (The two results above differ slightly because pandas.Series.autocorr computes the Pearson correlation between the series and its lagged copy, using separate means, whereas the manual calculation follows Equation 2 with a single overall mean.) The confidence interval of the autocorrelation coefficient is calculated as shown in Equation 3.

$$\begin{align}\tag{3}&\text{Confidence Interval}=\pm \frac{z_{1-\alpha/2}}{\sqrt{N}}\\ &\alpha:\text{significance level}\\ &N:\text{sample size} \end{align}$$

The following calculates this confidence interval at a significance level of 0.05.

ci=np.array(stats.norm.interval(0.95))/np.sqrt(len(deN))   # ±z_(1-α/2)/√N
print(f'Lower: {round(ci[0], 3)}, Upper: {round(ci[1], 3)}')
Lower: -0.097, Upper: 0.097

The null hypothesis of this test is that the autocorrelation coefficient equals 0. The coefficient computed above (≈0.298) lies outside the confidence interval, so the null hypothesis is rejected. In other words, autocorrelation does exist in the errors produced by the regression model.
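The same decision can be written explicitly in code by comparing the coefficient against the interval bounds computed above:

reject=abs(r_h) > ci[1]   # coefficient outside the interval -> reject H0 (no autocorrelation)
print(f'|r_h|: {round(abs(r_h), 3)}, upper bound: {round(ci[1], 3)}, reject H0: {reject}')
|r_h|: 0.298, upper bound: 0.097, reject H0: True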

Along with the test above, autocorrelation can also be judged from the statistic calculated by Equation 4. This method is called the Durbin-Watson test.

$$\begin{equation}\tag{4}\frac{\sum^T_{t=2}(\text{error}_t - \text{error}_{t-1})^2}{\sum^T_{t=1}\text{error}^2_t}\end{equation}$$

The test statistic above is approximately equal to $2(1-r_h)$ and lies in the range [0, 4]. If $r_h=0$, the statistic equals 2; values closer to 0 indicate positive autocorrelation and values closer to 4 indicate strong negative autocorrelation.

num=np.sum((error[:-1]-error[1:])**2)   # numerator of Equation 4
den=np.sum(error**2)                    # denominator of Equation 4
dw=num/den
print(f'Statistics: {round(dw, 3)}')
Statistics: 1.398
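This result is consistent with the approximation $2(1-r_h)$ mentioned above, since $2(1-0.298) \approx 1.404$ is close to the 1.398 computed from Equation 4:

print(f'2(1-r_h): {round(2*(1-r_h), 3)}')   # approximately 1.40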
You can also apply the durbin_watson() function of the statsmodels package.
from statsmodels.stats.stattools import durbin_watson
dw1=durbin_watson(error)   # same statistic as Equation 4
print(f'DW test statistics: {np.around(dw1, 3)}')
DW test statistics: [1.398]

Mean Squared Error

As mentioned in the estimation of the regression coefficients, the sum of all residuals produced by Equation 1 is zero or close to zero, so the sum of squared residuals is used to measure the size of the residuals in the model. As shown in Equation 5, the sum of squared residuals divided by the degrees of freedom is the Mean Squared Error (MSE).

$$\begin{align}\tag{5}\text{MSE}&=\frac{\text{SSE}}{\text{df}}\\ &=\frac{\sum^n_{i=1}(y_i-\hat{y}_i)^2}{\text{df}} \end{align}$$

The degrees of freedom in the denominator of Equation 5 equal the total number of observations (n) minus the number of estimated parameters. With one independent variable and one constant term, the degrees of freedom are n - p - 1 = n - 2.

Both the independent and response variables used in the model are random variables and can be assumed to follow a normal distribution by the central limit theorem. Therefore, the errors generated by the model can also be assumed to be normally distributed, with a mean (expected value) of zero. The MSE serves as an estimate of the variance of the error distribution, so the error distribution can be expressed as follows:

$$e \sim N(0, \text{MSE})$$

sse=np.sum((deN-pre)**2)   # sum of squared errors (SSE)
mse=sse/(len(deN)-2)
print(f'mse: {round(mse, 5)}')
mse: 0.01128
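The zero-mean assumption for the errors can be confirmed in the same way; for least squares with an intercept, the residual mean is zero up to floating-point error.

print(f'mean of errors: {round(error.mean(), 5)}')   # prints a value indistinguishable from 0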

MSE can also be calculated with the function sklearn.metrics.mean_squared_error(), but this function divides by the sample size instead of the degrees of freedom.

from sklearn.metrics import mean_squared_error
mse1=mean_squared_error(deN, pre)
print(f'mse: {round(mse1, 5)}')
mse: 0.01123
rmse=mean_squared_error(deN, pre, squared=False)   # RMSE = sqrt(MSE)
print(f'rmse: {round(rmse, 5)}')
rmse: 0.10595
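The two results differ only by the degrees-of-freedom correction: multiplying the sklearn value by n/(n-2) recovers the estimate above.

n=len(deN)
print(f'corrected mse: {round(mse1*n/(n-2), 5)}')
corrected mse: 0.01128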
