Residual (Error)
The fitted regression model must be statistically tested, and the main object of that test is the error: the difference between the observations and the estimates, calculated by Equation 1.

$$\begin{align}\tag{1}\text{e}&=y-(b_0+b_1x)\\&=y-\hat{y} \end{align}$$

The errors of the regression model are assumed to satisfy the following conditions:
- The errors are random variables that follow a normal distribution.
  - Because the independent variables are random variables that follow a normal distribution, the error between the response and the estimate is also a random variable that follows a normal distribution. This means that the error cannot be adjusted artificially.
- The error terms are homoscedastic.
  - Many candidate regression lines are possible, as shown in Figure 1, which means that a probability distribution can be constructed for the regression coefficients. This distribution has a mean and a variance, and its mean is the regression coefficient produced by the least-squares method. Because changes in the residuals follow changes in the regression coefficients, the estimates from the least-squares model become the means of the residual distributions, and their variances equal those of the regression coefficients. As a result, the variance of the residual distribution is the same for every prediction; this property is called homoscedasticity.
- There is no autocorrelation between errors at different time points.
  - That is, there is no systematic relationship between the errors. For real data, however, especially time series, this assumption is not easy to satisfy because successive observations are related to one another. This relationship can be reduced by using a variety of independent variables or by applying methods such as the lasso. A quick numerical check of these assumptions is sketched after the plotting code below.
import numpy as np
import pandas as pd
from scipy import stats
from sklearn import preprocessing
import matplotlib.pyplot as plt
import FinanceDataReader as fdr
from sklearn.linear_model import LinearRegression
x=np.linspace(-5, 5, 100).reshape(-1,1)
r=np.random.rand(100,1)*10
y=1.2*x+r
m=LinearRegression()
m.fit(x, y)
p=m.predict(x)
plt.figure(figsize=(7,4))
plt.scatter(x, y)
plt.plot(x, p, color="red", label="regression")
# draw a few other candidate lines with random slopes and intercepts
for i, j in zip(np.random.rand(5)*3.1, np.random.rand(5)*5.1):
    y1=i*x+j
    plt.plot(x, y1, '--')
plt.xlabel("x", size=12, weight="bold")
plt.ylabel("y", size=12, weight="bold")
plt.legend(loc="best")
plt.show()
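The following is a minimal sketch, on synthetic data, of how the residual assumptions listed above can be checked numerically; the seed, noise level, and variable names are illustrative assumptions, not part of the original example. The residual mean should be near zero, a Shapiro-Wilk test probes normality, and a residual-versus-fitted plot gives a visual impression of homoscedasticity.

import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

np.random.seed(0)                                # illustrative seed
x = np.linspace(-5, 5, 100).reshape(-1, 1)
y = 1.2*x + np.random.randn(100, 1)              # linear trend plus normal noise
model = LinearRegression().fit(x, y)
resid = y - model.predict(x)                     # residuals as in Equation 1
print('mean of residuals:', round(resid.mean(), 6))   # expected to be close to zero
w, p = stats.shapiro(resid.flatten())            # normality test of the residuals
print('Shapiro-Wilk p-value:', round(p, 3))
plt.scatter(model.predict(x), resid)             # spread should look roughly constant
plt.axhline(0, color='red')
plt.xlabel('fitted value')
plt.ylabel('residual')
plt.show()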
Autocorrelation analysis
Correlation describes the relationship between two variables, whereas autocorrelation describes the relationship between values of a single variable separated by a time lag. In other words, if correlation captures the relationship between the values of different variables measured on the same rows of a data matrix, autocorrelation captures the relationship between values within one column at different time points. The degree of autocorrelation is expressed by the autocorrelation coefficient ($R_h$) in Equation 2.
$$\begin{align}\tag{2} R_h& =\frac{\text{Autocovariance}}{\text{Variance}}\\ &=\frac{\sum^{N-h}_{t=1} (x_t-\bar{x})(x_{t+h}-\bar{x})}{\sum^N_{t=1}(x_t-\bar{x})^2}\end{align}$$

The autocorrelation coefficient in Equation 2 can be computed with the function pandas.Series.autocorr(lag=1). This function applies only to Series objects, that is, one-dimensional vectors consisting of a single column or row.
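As an illustration, the following sketch applies Equation 2 to a small toy series (the values are arbitrary, hypothetical numbers) and compares the result with pandas.Series.autocorr(), which computes the Pearson correlation between the series and its shifted copy and therefore agrees only approximately with Equation 2.

import numpy as np
import pandas as pd

s = pd.Series([2.0, 3.1, 4.3, 3.9, 5.2, 6.0, 5.8, 7.1])   # arbitrary toy values
h = 1                                                      # time lag
x = s.values
xbar = x.mean()
r_manual = np.sum((x[:-h] - xbar)*(x[h:] - xbar)) / np.sum((x - xbar)**2)  # Equation 2
print(round(r_manual, 3), round(s.autocorr(lag=h), 3))     # similar but not identical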
The following example uses the daily opening and closing prices of the NASDAQ index. A linear model is fitted with the opening price as the independent variable and the next day's closing price as the dependent variable.
import FinanceDataReader as fdr
st=pd.Timestamp(2020, 5, 10)
et=pd.Timestamp(2021, 12, 22)
nd=fdr.DataReader('IXIC', st, et)[['Open','Close']]
nd.tail(3)
Date | Open | Close
---|---|---
2021-12-20 | 14933.0 | 14980.9
2021-12-21 | 15140.4 | 15341.1
2021-12-22 | 15319.2 | 15521.9
ind=nd.values[:-1,0].reshape(-1,1)   # opening price of day t (independent variable)
de=nd.values[1:,1].reshape(-1,1)     # closing price of day t+1 (dependent variable)
new=nd.values[-1,0].reshape(-1,1)    # last opening price, kept for a new prediction
new
array([[15319.2]])
indScaler=preprocessing.StandardScaler().fit(ind)
deScaler=preprocessing.StandardScaler().fit(de)
indN=indScaler.transform(ind)
deN=deScaler.transform(de)
newN=indScaler.transform(new)   # the new value is an opening price, so use indScaler
newN
array([[1.23623733]])
mod=LinearRegression().fit(indN, deN)
pre=mod.predict(indN)
error=deN-pre                    # model errors (residuals)
mu_e=error.mean()
r=np.sum((error[:-1]-mu_e)*(error[1:]-mu_e))/np.sum((error-mu_e)**2)   # Equation 2, lag 1
print(f'autocorr. coeff.: {round(r, 3)}')
autocorr. coeff.: 0.296
r_h=pd.Series(error.flatten()).autocorr(lag=1)
print(f'autocorr. coeff.: {round(r_h, 3)}')
autocorr. coeff.: 0.298
The autocorrelation coefficient lies in the range [-1, 1]; a value of 0 indicates no autocorrelation. The confidence interval for the autocorrelation coefficient is calculated as shown in Equation 3.
$$\begin{align}\tag{3}&\text{Confidence Interval}=\pm \frac{z_{1-\alpha/2}}{\sqrt{N}}\\ &\alpha:\text{significance level}\\ &N:\text{sample size} \end{align}$$

The following computes the confidence interval at a significance level of 0.05.
ci=stats.norm.interval(0.95)/np.sqrt(len(deN))
print(f'Lower: {round(ci[0], 3)}, Upper: {round(ci[1], 3)}')
Lower: -0.097, Upper: 0.097
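Equivalently, the bound in Equation 3 can be written directly in terms of the z-quantile. The following is a small sketch that assumes deN from the code above is still in scope.

z = stats.norm.ppf(1 - 0.05/2)                    # z_{1-alpha/2} for alpha = 0.05
print(f'bound: {round(z/np.sqrt(len(deN)), 3)}')  # matches the upper bound above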
The null hypothesis of this test is that the autocorrelation coefficient equals 0. The autocorrelation coefficient computed above (about 0.3) lies outside this confidence interval, so the null hypothesis can be rejected. In other words, the errors produced by the regression model are autocorrelated.
In addition to the test above, autocorrelation can be assessed with the statistic calculated by Equation 4. This procedure is called the Durbin-Watson test.
$$\begin{equation}\tag{4}\frac{\sum^T_{t=2}(\text{error}_t - \text{error}_{t-1})^2}{\sum^T_{t=1}\text{error}^2_t}\end{equation}$$

This test statistic is approximately equal to $2(1-r_h)$ and lies in the range [0, 4]. If $r_h=0$, the statistic is about 2; values closer to 0 indicate strong positive autocorrelation, and values closer to 4 indicate strong negative autocorrelation.
num=np.sum((error[:-1]-error[1:])**2)   # numerator of Equation 4
den=np.sum(error**2)                    # denominator of Equation 4
dw=num/den
print(f'Statistic: {round(dw, 3)}')
Statistic: 1.398

You can also apply the durbin_watson() function of the statsmodels package.
from statsmodels.stats.stattools import durbin_watson
dw1=durbin_watson(error)
print(f'DW test statistic: {np.around(dw1, 3)}')
DW test statistic: [1.398]
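As a quick consistency check using the values reported above, the approximation gives $2(1-r_h) \approx 2(1-0.298)=1.404$, which is close to the Durbin-Watson statistic of 1.398 computed directly; a value this far below 2 again points to positive autocorrelation in the errors.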
Mean Squared Error
As mentioned in the estimation of the regression coefficients, the sum of all residuals produced by Equation 1 is zero or close to zero, so the sum of squared residuals is used to measure the size of the residuals in the model. As shown in Equation 5, the sum of squared residuals divided by the degrees of freedom is the Mean Squared Error (MSE).
$$\begin{align}\tag{5}\text{MSE}&=\frac{\text{SSE}}{\text{df}}\\ &=\frac{\sum^n_{i=1}(y_i-\hat{y}_i)^2}{\text{df}} \end{align}$$

The denominator (degrees of freedom) in Equation 5 is the total number of observations (n) minus the number of estimated parameters: with p independent variables and one constant term, the degrees of freedom are $n-p-1$; here, with one independent variable, that is $n-2$.
Both the independent variables and the response variable used in the model are random variables and, by the central limit theorem, can be assumed to follow a normal distribution. Therefore, the errors generated by the model can also be assumed to follow a normal distribution, and the mean (expected value) of the errors is zero. The MSE can be applied as an estimate of the variance of the error distribution, so the error distribution can be expressed as follows:
$$\text{e} \sim \text{N}(0, \text{MSE})$$

sse=np.sum((deN-pre)**2)
mse=sse/(len(deN)-2)          # SSE divided by the degrees of freedom n-2
print(f'mse: {round(mse, 5)}')
mse: 0.01128
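As a rough visual check of this assumption, the density of N(0, MSE) can be overlaid on a histogram of the errors. The following sketch assumes the error array and mse value from the code above are still in scope.

xs = np.linspace(error.min(), error.max(), 200)
plt.hist(error.flatten(), bins=30, density=True, alpha=0.6)    # empirical error distribution
plt.plot(xs, stats.norm.pdf(xs, loc=0, scale=np.sqrt(mse)), color='red', label='N(0, mse)')
plt.xlabel('error')
plt.legend(loc='best')
plt.show()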
MSE can also be calculated with the function sklearn.metrics.mean_squared_error(), but this function divides by the sample size instead of the degrees of freedom.
from sklearn.metrics import mean_squared_error
mse1=mean_squared_error(deN, pre)
print(f'mse: {round(mse1, 5)}')
mse: 0.01123
rmse=mean_squared_error(deN, pre, squared=False)   # RMSE = sqrt(MSE)
print(f'rmse: {round(rmse, 5)}')
rmse: 0.10595
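Because the two versions differ only in the divisor, the degrees-of-freedom-based MSE can be recovered from the sklearn value. The following is a sketch assuming deN and pre from the code above are still in scope.

n = len(deN)
mse_df = mean_squared_error(deN, pre) * n/(n - 2)   # rescale from divisor n to divisor n-2
print(f'mse (df-based): {round(mse_df, 5)}')        # matches the mse computed from Equation 5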