Autocorrelation & Mean of Square Error

Residual (Error)

A fitted regression model must be tested statistically, and the central quantity in these tests is the error (residual): the difference between an observation and its estimate, calculated by Equation 1.

$$\begin{align}\tag{1}\text{e}&=y-(b_0+b_1x)\\&=y-\hat{y} \end{align}$$

The errors of a regression model are assumed to satisfy the following prerequisites:

  • The errors are random variables that follow a normal distribution.
    • Because the independent variables are random variables that follow a normal distribution, the error between the response and its estimate is also a random variable following a normal distribution. This means that the error cannot be adjusted artificially.
  • The error terms are homoscedastic.
    • Many candidate regression lines are possible, as shown in Figure 1, so a probability distribution can be formed for the regression coefficients. The mean of this distribution is the coefficient produced by the least squares method. Because changes in the residuals follow changes in the regression coefficients, the estimates from the least squares model become the means of the residual distributions, and their variances equal those of the regression coefficients. As a result, the variance of the residual distribution is the same at every prediction point; this property is called homoscedasticity.
  • There is no autocorrelation between errors at different time points.
    • This means that there is no systematic relationship between the errors. For real data, especially time series, this assumption is hard to satisfy because neighboring observations are related. You can introduce additional independent variables, or reduce the relationship with regularization methods such as the lasso.
import numpy as np
import pandas as pd
from scipy import stats
from sklearn import preprocessing
import matplotlib.pyplot as plt
import FinanceDataReader as fdr
from sklearn.linear_model import LinearRegression
# toy data: a linear trend plus uniform noise
x=np.linspace(-5, 5, 100).reshape(-1,1)
r=np.random.rand(100,1)*10
y=1.2*x+r
m=LinearRegression()
m.fit(x, y)
p=m.predict(x)
plt.figure(figsize=(7,4))
plt.scatter(x, y)
plt.plot(x, p, color="red", label="regression")
# a few alternative candidate lines with random slopes and intercepts
for i, j in zip(np.random.rand(5)*3.1, np.random.rand(5)*5.1):
    y1=i*x+j
    plt.plot(x, y1, '--')
plt.xlabel("x", size=12, weight="bold")
plt.ylabel("y", size=12, weight="bold")
plt.legend(loc="best")
plt.show()
Figure 1. The least squares regression line (red) together with several other candidate regression lines (dashed).
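
As a quick, illustrative check of the first prerequisite, the residuals of the toy model above can be passed to a normality test such as scipy.stats.shapiro. This is a minimal sketch reusing y and p from the code above; since the noise added here is uniform rather than normal, the test may well reject normality.

res=(y-p).flatten()              # residuals of the toy model
stat, pval=stats.shapiro(res)    # Shapiro-Wilk normality test
print(f'statistic: {round(stat, 3)}, p-value: {round(pval, 3)}')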

Autocorrelation analysis

Correlation represents the relationship between two variables, whereas autocorrelation measures the relationship between values of a single variable at different time lags. In other words, if correlation describes the relationship between the columns (variables) of a data matrix, autocorrelation describes the relationship between values within one column at different time points. The degree of autocorrelation is expressed by the autocorrelation coefficient ($R_h$) in Equation 2.

$$\begin{align}\tag{2} R_h& =\frac{\text{Autocovariance}}{\text{Variance}}\\ &=\frac{\sum^{N-h}_{t=1} (x_t-\bar{x})(x_{t+h}-\bar{x})}{\sum^N_{t=1}(x_t-\bar{x})^2}\end{align}$$

The autocorrelation coefficient in Equation 2 can be calculated with the function pandas.Series.autocorr(lag=1). This function applies only to Series objects, that is, one-dimensional vectors consisting of a single column or row.
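
To make the contrast concrete, the following minimal sketch uses hypothetical toy data: correlation relates two columns of a DataFrame, while autocorrelation relates a single column to its own lagged values.

s=pd.Series([1.0, 2.0, 4.0, 3.0, 5.0, 7.0, 6.0, 8.0])
df=pd.DataFrame({'a': s, 'b': s*2+1})
print(df.corr())          # correlation between the columns a and b
print(s.autocorr(lag=1))  # lag-1 autocorrelation within one column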

The following example uses daily opening and closing prices of the NASDAQ composite index. A linear model is fitted with each day's opening price as the independent variable and the next day's closing price as the response; the final opening price is held out to make a new prediction.

# daily Open and Close of the NASDAQ composite index ('IXIC')
st=pd.Timestamp(2020, 5, 10)
et=pd.Timestamp(2021, 12, 22)
nd=fdr.DataReader('IXIC', st, et)[['Open','Close']]
nd.tail(3)
                Open    Close
Date
2021-12-20   14933.0  14980.9
2021-12-21   15140.4  15341.1
2021-12-22   15319.2  15521.9
# independent variable: opening price of each day (all rows but the last)
ind=nd.values[:-1,0].reshape(-1,1)
# response variable: closing price of the following day
de=nd.values[1:,1].reshape(-1,1)
# the last opening price, held out for a new prediction
new=nd.values[-1,0].reshape(-1,1)
new
array([[15319.2]])
indScaler=preprocessing.StandardScaler().fit(ind)
deScaler=preprocessing.StandardScaler().fit(de)
indN=indScaler.transform(ind)
deN=deScaler.transform(de)
# new is an opening price, so it is scaled with indScaler
newN=indScaler.transform(new)
newN
array([[1.23623733]])
mod=LinearRegression().fit(indN, deN)
pre=mod.predict(indN)
error=deN-pre            # residuals of the fitted model
mu_e=error.mean()
# lag-1 autocorrelation coefficient by Equation 2
r=np.sum((error[:-1]-mu_e)*(error[1:]-mu_e))/np.sum((error-mu_e)**2)
print(f'autocor. coeff.: {round(r, 3)}')
autocor. coeff.: 0.296
r_h=pd.Series(error.flatten()).autocorr(lag=1)
print(f'autocor. coeff.: {round(r_h, 3)}')
autocor. coeff.: 0.298

The autocorrelation coefficient lies in the range [-1, 1]; a value of 0 indicates no autocorrelation. The confidence interval for testing it is calculated as shown in Equation 3.

$$\begin{align}\tag{3}&\text{Confidence Interval}=\pm \frac{z_{1-\alpha/2}}{\sqrt{N}}\\ &\alpha:\text{significance level}\\ &N:\text{sample size} \end{align}$$

The following calculates the confidence interval at a significance level of 0.05.

# z quantiles at the 0.05 significance level, divided by sqrt(N)
ci=np.array(stats.norm.interval(0.95))/np.sqrt(len(deN))
print(f'Lower: {round(ci[0], 3)}, Upper: {round(ci[1], 3)}')
Lower: -0.097, Upper: 0.097

The null hypothesis of the test is that the autocorrelation coefficient equals 0. The coefficient computed above (0.298) lies outside the confidence interval, so the null hypothesis is rejected. In other words, the errors of the regression model are autocorrelated.
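
The decision rule can also be stated directly in code, reusing r_h and ci from above:

reject=(r_h < ci[0]) or (r_h > ci[1])   # outside the interval -> reject H0
print(f'reject H0: {reject}')           # True here, since 0.298 > 0.097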

Along with the above test method, autocorrelation can be assessed with the statistic calculated by Equation 4. This is called the Durbin-Watson test.

$$\begin{equation}\tag{4}\frac{\sum^T_{t=2}(\text{error}_t - \text{error}_{t-1})^2}{\sum^T_{t=1}\text{error}^2_t}\end{equation}$$

The test statistic above is approximately equal to $2(1-r_h)$ and lies in the range [0, 4]. If $r_h = 0$, the statistic equals 2; values close to 0 indicate strong positive autocorrelation, and values close to 4 indicate strong negative autocorrelation.

num=np.sum((error[1:]-error[:-1])**2)   # numerator of Equation 4
den=np.sum(error**2)                    # denominator of Equation 4
dw=num/den
print(f'Statistics: {round(dw, 3)}')
Statistics: 1.398
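
As a quick consistency check, the statistic should be close to $2(1-r_h)$; with r_h ≈ 0.298 computed earlier this gives about 1.404, in line with the 1.398 above.

print(f'2*(1-r_h): {round(2*(1-r_h), 3)}')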
You can also apply the durbin_watson() function of the statsmodels package.
from statsmodels.stats.stattools import durbin_watson
dw1=durbin_watson(error)    # computed along axis 0 by default
print(f'DW test statistics: {np.around(dw1, 3)}')
DW test statistics: [1.398]

Mean of Square Error

As mentioned in the estimation of the regression coefficients, the sum of all residuals produced by Equation 1 is zero or close to zero, so the sum of squared residuals is used to measure the size of the residuals of a model. As shown in Equation 5, the sum of squared residuals divided by its degrees of freedom is the Mean of Square Error (MSE).

$$\begin{align}\tag{5}\text{MSE}&=\frac{\text{SSE}}{\text{df}}\\ &=\frac{\sum^n_{i=1}(y_i-\hat{y}_i)^2}{\text{df}} \end{align}$$

The degrees of freedom in the denominator of Equation 5 are the total number of observations (n) minus the number of estimated parameters. With p independent variables plus a constant term, df = n - p - 1; in this case one independent variable is used, so df = n - 2.

Both the independent variables and the response used in the model are random variables and, by the central limit theorem, can be assumed to follow a normal distribution. Therefore the errors of the model can also be assumed to follow a normal distribution, and the mean (expected value) of the errors is zero. Using the MSE as an estimate of the variance of the error distribution, the errors can be written as follows:

$$\text{e} \sim \text{N}(0,\; \text{MSE})$$
sse=np.sum((deN-pre)**2)      # sum of squared errors (SSE)
mse=sse/(len(deN)-2)          # divided by the degrees of freedom n-2
print(f'mse: {round(mse, 5)}')
mse: 0.01128
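
If desired, the assumption e ~ N(0, MSE) can be examined with a goodness-of-fit test. The following is a minimal sketch, reusing error and mse from above, that applies scipy.stats.kstest against the corresponding normal distribution:

ks_stat, ks_p=stats.kstest(error.flatten(), 'norm', args=(0, np.sqrt(mse)))
print(f'KS statistic: {round(ks_stat, 3)}, p-value: {round(ks_p, 3)}')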

MSE can also be calculated with the function sklearn.metrics.mean_squared_error(), but this function divides by the sample size n instead of the degrees of freedom.

from sklearn.metrics import mean_squared_error
mse1=mean_squared_error(deN, pre)    # y_true first, then y_pred
print(f'mse: {round(mse1, 5)}')
mse: 0.01123
rmse=mean_squared_error(deN, pre, squared=False)   # RMSE = sqrt(MSE)
print(f'rmse: {round(rmse, 5)}')
rmse: 0.10595
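
Since mean_squared_error() divides by the sample size n, rescaling its result by n/(n-2) should recover the degree-of-freedom based MSE computed earlier; a quick sketch:

n=len(deN)
print(f'df-adjusted mse: {round(mse1*n/(n-2), 5)}')   # ≈ 0.01128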
