Residual (Error)
The fitted regression model must be statistically tested, and the main object of that test is the error: the difference between the observations and the estimates, calculated by Equation 1.

$$\begin{align}\tag{1}\text{e}&=y-(b_0+b_1x)\\&=y-\hat{y} \end{align}$$

The errors of the regression model are assumed to satisfy the following conditions:
- The errors are random variables that follow a normal distribution.
  - Because the independent variables are random variables that follow a normal distribution, the error between the response and the estimate is also a random variable that follows a normal distribution. This means that the error cannot be adjusted artificially.
- The error terms are homoscedastic.
  - Many candidate regression lines are possible, as shown in Figure 1, which means that a probability distribution can be constructed for the regression coefficients. This distribution has a mean and a variance, and its mean is the regression coefficient produced by the least-squares method. Because changes in the residuals follow changes in the regression coefficients, the estimates from the least-squares model become the means of the residual distributions, and their variances equal those of the regression coefficients. As a result, the variance of the residual distribution is the same for every prediction; this property is called homoscedasticity.
- There is no autocorrelation between errors at different time points.
  - That is, there is no systematic relationship between the errors. For real data, however, especially time series, this assumption is not easy to satisfy because successive observations are related to one another. This relationship can be reduced by using a variety of independent variables or by applying methods such as the lasso. A quick numerical check of these assumptions is sketched after the plotting code below.
import numpy as np
import pandas as pd
from scipy import stats
from sklearn import preprocessing
import matplotlib.pyplot as plt
import FinanceDataReader as fdr
from sklearn.linear_model import LinearRegression
x=np.linspace(-5, 5, 100).reshape(-1,1)
r=np.random.rand(100,1)*10
y=1.2*x+r
m=LinearRegression()
m.fit(x, y)
p=m.predict(x)
plt.figure(figsize=(7,4))
plt.scatter(x, y)
plt.plot(x, p, color="red", label="regression")
# draw a few other candidate lines with random slopes and intercepts
for i, j in zip(np.random.rand(5)*3.1, np.random.rand(5)*5.1):
    y1=i*x+j
    plt.plot(x, y1, '--')
plt.xlabel("x", size=12, weight="bold")
plt.ylabel("y", size=12, weight="bold")
plt.legend(loc="best")
plt.show()
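The following is a minimal sketch, on synthetic data, of how the residual assumptions listed above can be checked numerically; the seed, noise level, and variable names are illustrative assumptions, not part of the original example. The residual mean should be near zero, a Shapiro-Wilk test probes normality, and a residual-versus-fitted plot gives a visual impression of homoscedasticity.

import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

np.random.seed(0)                                # illustrative seed
x = np.linspace(-5, 5, 100).reshape(-1, 1)
y = 1.2*x + np.random.randn(100, 1)              # linear trend plus normal noise
model = LinearRegression().fit(x, y)
resid = y - model.predict(x)                     # residuals as in Equation 1
print('mean of residuals:', round(resid.mean(), 6))   # expected to be close to zero
w, p = stats.shapiro(resid.flatten())            # normality test of the residuals
print('Shapiro-Wilk p-value:', round(p, 3))
plt.scatter(model.predict(x), resid)             # spread should look roughly constant
plt.axhline(0, color='red')
plt.xlabel('fitted value')
plt.ylabel('residual')
plt.show()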
Autocorrelation analysis
Correlation describes the relationship between two variables, whereas autocorrelation describes the relationship between values of a single variable separated by a time lag. In other words, if correlation captures the relationship between the values of different variables measured on the same rows of a data matrix, autocorrelation captures the relationship between values within one column at different time points. The degree of autocorrelation is expressed by the autocorrelation coefficient ($R_h$) in Equation 2.
$$\begin{align}\tag{2} R_h& =\frac{\text{Autocovariance}}{\text{Variance}}\\ &=\frac{\sum^{N-h}_{t=1} (x_t-\bar{x})(x_{t+h}-\bar{x})}{\sum^N_{t=1}(x_t-\bar{x})^2}\end{align}$$

The autocorrelation coefficient in Equation 2 can be computed with the function pandas.Series.autocorr(lag=1). This function applies only to Series objects, that is, one-dimensional vectors consisting of a single column or row.
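As an illustration, the following sketch applies Equation 2 to a small toy series (the values are arbitrary, hypothetical numbers) and compares the result with pandas.Series.autocorr(), which computes the Pearson correlation between the series and its shifted copy and therefore agrees only approximately with Equation 2.

import numpy as np
import pandas as pd

s = pd.Series([2.0, 3.1, 4.3, 3.9, 5.2, 6.0, 5.8, 7.1])   # arbitrary toy values
h = 1                                                      # time lag
x = s.values
xbar = x.mean()
r_manual = np.sum((x[:-h] - xbar)*(x[h:] - xbar)) / np.sum((x - xbar)**2)  # Equation 2
print(round(r_manual, 3), round(s.autocorr(lag=h), 3))     # similar but not identical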
The following example uses the daily opening and closing prices of the NASDAQ index. A linear model is fitted with the opening price as the independent variable and the next day's closing price as the dependent variable.
import FinanceDataReader as fdr
st=pd.Timestamp(2020, 5, 10)
et=pd.Timestamp(2021, 12, 22)
nd=fdr.DataReader('IXIC', st, et)[['Open','Close']]
nd.tail(3)
Date | Open | Close
---|---|---
2021-12-20 | 14933.0 | 14980.9
2021-12-21 | 15140.4 | 15341.1
2021-12-22 | 15319.2 | 15521.9
ind=nd.values[:-1,0].reshape(-1,1)   # opening price of day t (independent variable)
de=nd.values[1:,1].reshape(-1,1)     # closing price of day t+1 (dependent variable)
new=nd.values[-1,0].reshape(-1,1)    # last opening price, kept for a new prediction
new
array([[15319.2]])
indScaler=preprocessing.StandardScaler().fit(ind)
deScaler=preprocessing.StandardScaler().fit(de)
indN=indScaler.transform(ind)
deN=deScaler.transform(de)
newN=indScaler.transform(new)   # the new value is an opening price, so use indScaler
newN
array([[1.23623733]])
mod=LinearRegression().fit(indN, deN)
pre=mod.predict(indN)
error=deN-pre                    # model errors (residuals)
mu_e=error.mean()
r=np.sum((error[:-1]-mu_e)*(error[1:]-mu_e))/np.sum((error-mu_e)**2)   # Equation 2, lag 1
print(f'autocorr. coeff.: {round(r, 3)}')
autocorr. coeff.: 0.296
r_h=pd.Series(error.flatten()).autocorr(lag=1)
print(f'autocorr. coeff.: {round(r_h, 3)}')
autocorr. coeff.: 0.298
The autocorrelation coefficient lies in the range [-1, 1]; a value of 0 indicates no autocorrelation. The confidence interval for the autocorrelation coefficient is calculated as shown in Equation 3.
$$\begin{align}\tag{3}&\text{Confidence Interval}=\pm \frac{z_{1-\alpha/2}}{\sqrt{N}}\\ &\alpha:\text{significance level}\\ &N:\text{sample size} \end{align}$$

The following computes the confidence interval at a significance level of 0.05.
ci=stats.norm.interval(0.95)/np.sqrt(len(deN))
print(f'Lower: {round(ci[0], 3)}, Upper: {round(ci[1], 3)}')
Lower: -0.097, Upper: 0.097
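Equivalently, the bound in Equation 3 can be written directly in terms of the z-quantile. The following is a small sketch that assumes deN from the code above is still in scope.

z = stats.norm.ppf(1 - 0.05/2)                    # z_{1-alpha/2} for alpha = 0.05
print(f'bound: {round(z/np.sqrt(len(deN)), 3)}')  # matches the upper bound above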
The null hypothesis of this test is that the autocorrelation coefficient equals 0. The autocorrelation coefficient computed above (about 0.3) lies outside this confidence interval, so the null hypothesis can be rejected. In other words, the errors produced by the regression model are autocorrelated.
In addition to the test above, autocorrelation can be assessed with the statistic calculated by Equation 4. This procedure is called the Durbin-Watson test.
$$\begin{equation}\tag{4}\frac{\sum^T_{t=2}(\text{error}_t - \text{error}_{t-1})^2}{\sum^T_{t=1}\text{error}^2_t}\end{equation}$$

This test statistic is approximately equal to $2(1-r_h)$ and lies in the range [0, 4]. If $r_h=0$, the statistic is about 2; values closer to 0 indicate strong positive autocorrelation, and values closer to 4 indicate strong negative autocorrelation.
num=np.sum((error[:-1]-error[1:])**2)   # numerator of Equation 4
den=np.sum(error**2)                    # denominator of Equation 4
dw=num/den
print(f'Statistic: {round(dw, 3)}')
Statistic: 1.398

You can also apply the durbin_watson() function of the statsmodels package.
from statsmodels.stats.stattools import durbin_watson
dw1=durbin_watson(error)
print(f'DW test statistic: {np.around(dw1, 3)}')
DW test statistic: [1.398]
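As a quick consistency check using the values reported above, the approximation gives $2(1-r_h) \approx 2(1-0.298)=1.404$, which is close to the Durbin-Watson statistic of 1.398 computed directly; a value this far below 2 again points to positive autocorrelation in the errors.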
Mean Squared Error
As mentioned in the estimation of the regression coefficients, the sum of all residuals produced by Equation 1 is zero or close to zero, so the sum of squared residuals is used to measure the size of the residuals in the model. As shown in Equation 5, the sum of squared residuals divided by the degrees of freedom is the Mean Squared Error (MSE).
$$\begin{align}\tag{5}\text{MSE}&=\frac{\text{SSE}}{\text{df}}\\ &=\frac{\sum^n_{i=1}(y_i-\hat{y}_i)^2}{\text{df}} \end{align}$$

The denominator (degrees of freedom) in Equation 5 is the total number of observations (n) minus the number of estimated parameters: with p independent variables and one constant term, the degrees of freedom are $n-p-1$; here, with one independent variable, that is $n-2$.
Both the independent variables and the response variable used in the model are random variables and, by the central limit theorem, can be assumed to follow a normal distribution. Therefore, the errors generated by the model can also be assumed to follow a normal distribution, and the mean (expected value) of the errors is zero. The MSE can be applied as an estimate of the variance of the error distribution, so the error distribution can be expressed as follows:
$$\text{e} \sim \text{N}(0, \text{MSE})$$

sse=np.sum((deN-pre)**2)
mse=sse/(len(deN)-2)          # SSE divided by the degrees of freedom n-2
print(f'mse: {round(mse, 5)}')
mse: 0.01128
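As a rough visual check of this assumption, the density of N(0, MSE) can be overlaid on a histogram of the errors. The following sketch assumes the error array and mse value from the code above are still in scope.

xs = np.linspace(error.min(), error.max(), 200)
plt.hist(error.flatten(), bins=30, density=True, alpha=0.6)    # empirical error distribution
plt.plot(xs, stats.norm.pdf(xs, loc=0, scale=np.sqrt(mse)), color='red', label='N(0, mse)')
plt.xlabel('error')
plt.legend(loc='best')
plt.show()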
MSE can also be calculated with the function sklearn.metrics.mean_squared_error(), but this function divides by the sample size instead of the degrees of freedom.
from sklearn.metrics import mean_squared_error
mse1=mean_squared_error(deN, pre)
print(f'mse: {round(mse1, 5)}')
mse: 0.01123
rmse=mean_squared_error(deN, pre, squared=False)   # RMSE = sqrt(MSE)
print(f'rmse: {round(rmse, 5)}')
rmse: 0.10595
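Because the two versions differ only in the divisor, the degrees-of-freedom-based MSE can be recovered from the sklearn value. The following is a sketch assuming deN and pre from the code above are still in scope.

n = len(deN)
mse_df = mean_squared_error(deN, pre) * n/(n - 2)   # rescale from divisor n to divisor n-2
print(f'mse (df-based): {round(mse_df, 5)}')        # matches the mse computed from Equation 5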