Regression Analysis: simple regression & regression coefficient

Contents

  1. What is Regression?
  2. Simple regression
  3. Regression Coefficient

What is Regression?

Regression is a statistical method of setting up a model for the relationship between variables and estimating new values through that model.

Figure 1 is a graph of the force (y) corresponding to a given height (x), showing an exact, directly proportional relationship in which y increases as x increases. This relationship comes from a general law of physics, so the force at any height can be predicted exactly wherever Earth's gravity acts.

import matplotlib.pyplot as plt

# F = w*g*h for a 40 kg weight at heights 0-6 m
plt.figure(figsize=(7,5))
h=range(7)
w=40
F=[w*9.8*i for i in h]
plt.plot(h, F, "o-")
plt.xlabel("Height(m)", size=13, weight="bold")
plt.ylabel("Force(N)", size=13, weight="bold")
plt.text(2.5, 1500, 'F=Wgh', color="blue", size=13, weight="bold")
plt.text(2, -600, r'w:weight (kg), g: Gravity Acceleration($m/sec^2$)', color="blue", size=12)
plt.show()
Figure 1. The change of force with height.

Figure 2 also shows y increasing as x increases, but unlike Figure 1 it is difficult to predict unknown values with a single exact line: the relationship between y and each x could be described by many different lines, such as the three drawn in the figure. In this situation, inferring an expression (a regression model) that estimates y for a new x is the final purpose of regression. When the relationship between two variables cannot be determined exactly, the model must be built stochastically. In other words, regression is a major statistical method for understanding the characteristics of data from a probabilistic perspective and inferring unknown values.

Figure 2. Y's relationship to X.
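Figure 2 itself is not generated by code in this post; the following is a minimal sketch with assumed synthetic data that produces a similar pattern, where y clearly rises with x but no single exact line passes through all the points.

import numpy as np
import matplotlib.pyplot as plt

# assumed synthetic data: a linear trend plus random noise,
# similar in spirit to Figure 2 (not the figure's original data)
np.random.seed(0)
x=np.linspace(0, 10, 50)
y=2*x+1+np.random.normal(scale=3, size=x.size)
plt.figure(figsize=(6,4))
plt.scatter(x, y)
plt.xlabel("X")
plt.ylabel("Y")
plt.show()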

Statistical inference can be broadly divided into parametric and nonparametric methods. Regression is essentially a parametric method: it assumes (or determines) a probability distribution for the data to be analyzed and draws inferences based on that distribution. The distribution assumed by a parametric method can vary widely, but regression, as in previous sections, assumes a normal distribution, because by the central limit theorem the data can be treated as approximately normal when the sample size is large.
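The normality assumption mentioned above can be illustrated with a small simulation; this is only a sketch (the exponential population and sample sizes are assumptions for illustration), showing that means of repeated samples from a skewed distribution are approximately normally distributed.

import numpy as np
import matplotlib.pyplot as plt

# sample means from a skewed (exponential) population: by the central
# limit theorem their distribution is approximately normal
np.random.seed(1)
means=[np.random.exponential(scale=1.0, size=100).mean() for _ in range(2000)]
plt.hist(means, bins=30)
plt.xlabel("sample mean")
plt.ylabel("frequency")
plt.show()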

Simple regression

Simple regression builds a regression model with one independent variable.

The following data are part of the daily open and close prices of the dollar index (DX):

import numpy as np
import numpy.linalg as la
import pandas as pd
from sympy import *
from scipy import special
from scipy import linalg
from scipy import stats
from sklearn import preprocessing
from sympy.calculus.util import continuous_domain
import matplotlib.pyplot as plt
import matplotlib as mpl
import matplotlib.font_manager as fm
import FinanceDataReader as fdr
import statsmodels.api as sm
st=pd.Timestamp(2020,5, 10)
et=pd.Timestamp(2021, 12, 22)
dxy=fdr.DataReader('DX', st, et)[['Open','Close']]
dxy.tail(3)
Open Close
Date
2021-12-20 96.620 96.545
2021-12-21 96.515 96.489
2021-12-22 96.470 96.059

The above data include two variables, the independent variable Open and the response variable Close. The Open value of the last day is held out separately and used as the independent variable for the final estimate.

ind=dxy.values[:-1,0]
de=dxy.values[:-1,1]
new=dxy.values[-1,0]
new
96.47

Standardizing the data is required as a preprocessing step before building a regression model, for the following reasons:

  1. It scales the data down.
  2. When multiple variables are used, it puts the variables on a common scale.
  3. Standardizing the response variable is not strictly required, but it is done here for the same reason as (2).

Estimates from a regression model built on standardized data, including a standardized response variable, must finally be converted back to the original scale. This requires the mean and standard deviation computed during the standardization step, and the sklearn.preprocessing.StandardScaler() class keeps these statistics.
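For reference, StandardScaler().transform() computes z = (x − mean)/std and .inverse_transform() reverses it; the following is a minimal sketch of the same arithmetic with illustrative values (the stored statistics correspond to the scaler's mean_ and scale_ attributes).

import numpy as np

x=np.array([96.620, 96.515, 96.470])
mu, sigma=x.mean(), x.std()     # the statistics the scaler stores as mean_ and scale_
z=(x-mu)/sigma                  # what .transform() computes
x_back=z*sigma+mu               # what .inverse_transform() computes
print(np.allclose(x, x_back))   # True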

#standardize the independent variable
indScaler=preprocessing.StandardScaler().fit(dxy.values[:,:-1])
#standardize the response variable
deScaler=preprocessing.StandardScaler().fit(dxy.values[:,-1].reshape(-1,1))
indNor=indScaler.transform(ind.reshape(-1,1))
indNor[:3]
array([[2.81281703],
       [2.98469087],
       [2.8945619 ]])
#note: the response is standardized with indScaler here, so the later
#inverse_transform calls also use indScaler
deNor=indScaler.transform(de.reshape(-1,1))
deNor[:3]
array([[2.98469087],
       [2.85306066],
       [2.99349417]])

The relationship between the variables Open (indNor) and Close (deNor) is shown in Figure 3. The regression line in this plot is the model computed in the code below.

plt.figure(figsize=(6, 4))
plt.scatter(indNor, deNor, label="Data")
plt.plot(indNor, 0.985*indNor-0.002, color="red", label="Regression line")  #coefficients from the model fitted below
plt.legend(loc="best")
plt.xlabel("Open")
plt.ylabel('Close')
plt.show()
Figure 3. The relationship between Close and Open.

In Figure 3, Close clearly increases as Open increases. The line through the data is the model generated by regression analysis and can be written as Equation 1:

$$\begin{align}\tag{1}&\text{y}=\beta_0+\beta_1 \text{x} +\varepsilon \\ & \text{x}: \text{independent variable}\\ & \text{y}: \text{dependent variable}\\ & \beta_0 : \text{intercept(bias)}\\ & \beta_1 : \text{regression coefficient(weight)}\\ & \varepsilon: \text{error} \end{align}$$

Equation 1 is called the regression model. Regression models can be created with classes or functions provided by various Python packages. The following are the results of applying the statsmodels.api.OLS() and sklearn LinearRegression classes. Both classes require the independent and response variables to be passed as NumPy array objects.

In the following code, add_constant() adds a bias (intercept) term to the explanatory variable, as follows:

$$X=\begin{bmatrix} x_1\\x_2\\\vdots \\ x_n \end{bmatrix} \rightarrow X0=\begin{bmatrix} 1&x_1\\1&x_2\\\vdots&\vdots \\ 1&x_n \end{bmatrix}$$
import statsmodels.api as sm
indNor0=sm.add_constant(indNor) 
indNor0.shape, de.shape
((416, 2), (416,))
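add_constant() simply prepends a column of ones; continuing the session above, an equivalent construction with plain NumPy (indNor0_manual is an illustrative name) would be:

# the same augmentation as sm.add_constant(indNor): a leading column of ones
indNor0_manual=np.column_stack([np.ones(len(indNor)), indNor])
print(np.allclose(indNor0_manual, indNor0))   # True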

The following code applies the statsmodels.api.OLS() class. Its .fit() method creates a model from the explanatory and response variables, and the regression coefficients of the fitted model are accessed through the .params attribute.

reg=sm.OLS(deNor,indNor0).fit()
print(f'reg.coeff.:{np.around(reg.params,3)}, R^2 :{np.around(reg.rsquared,3)}')
reg.coeff.:[-0.002  0.985], R^2 :0.984
reg.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.984
Model:                            OLS   Adj. R-squared:                  0.983
Method:                 Least Squares   F-statistic:                 2.469e+04
Date:                Thu, 23 Dec 2021   Prob (F-statistic):               0.00
Time:                        12:55:41   Log-Likelihood:                 266.80
No. Observations:                 416   AIC:                            -529.6
Df Residuals:                     414   BIC:                            -521.5
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0023      0.006     -0.362      0.718      -0.015       0.010
x1             0.9851      0.006    157.127      0.000       0.973       0.997
==============================================================================
Omnibus:                        2.426   Durbin-Watson:                   2.036
Prob(Omnibus):                  0.297   Jarque-Bera (JB):                2.256
Skew:                           0.107   Prob(JB):                        0.324
Kurtosis:                       2.709   Cond. No.                         1.00
==============================================================================

Notes:
  [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

A summary of the regression can be produced with the model's .summary() method. The results consist of tables for the regression model, the regression coefficients, and the distribution of the residuals. The meaning of the entries in each table is introduced in the following sections.

Regression models can also be built with the LinearRegression class in the sklearn package. As with the OLS class from the statsmodels package applied above, a model is fitted to the explanatory and response variables with the .fit() method. The coefficients and intercept of the fitted model are available through the following attributes:

  • model.coef_: returns the coefficients of the model
  • model.intercept_: returns the intercept of the model
from sklearn.linear_model import LinearRegression
mod = LinearRegression()
mod.fit(indNor, deNor)
print(f'Reg.Coeff.:{np.around(mod.coef_,3)}, bias: {np.around(mod.intercept_,3)}, R2:{np.around(mod.score(indNor, deNor),3)}')
Reg.Coeff.:[[0.985]], bias: [-0.002], R2:0.984

The regression model built with both classes is as follows:

$$\widehat{\text{Close}} = 0.985\,\text{Open} - 0.002$$

Models built with these classes have a .predict() method, which computes estimates. The following estimates correspond to the variable new specified above.

For the model from sm.OLS(), a column of ones was added to the independent variable to include the bias term. The same structure must be applied to any argument passed to the .predict() method. Therefore the term 1 must also be added to newNor, the standardized independent value used for the final estimate; the converted variable is newNor1.

newNor=indScaler.transform([[new]])
newNor
array([[1.39171374]])
newNor1=np.c_[1, newNor]
newNor1
array([[1.        , 1.39171374]])
pre=reg.predict(newNor1)
pre
array([1.36867904])
preVal=indScaler.inverse_transform(pre)
preVal
array([96.41505142])  

However, for sklearn's LinearRegression(), the intercept term is handled automatically because the parameter fit_intercept defaults to True, so the variable's dimensions do not need to be modified.

pre1=mod.predict(newNor)
pre1
array([[1.36867904]])
preVal1=indScaler.inverse_transform(pre1)
preVal1
array([[96.41505142]])
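As a check, continuing the session above, the same estimate can be reproduced by applying the fitted coefficients directly on the standardized scale and undoing the standardization with the scaler's stored statistics (b0_ and b1_ are illustrative names):

# manual prediction: y_hat = b0 + b1*x on the standardized scale,
# then back to the original scale with the scaler statistics
b0_, b1_=reg.params
yhat_std=b0_+b1_*newNor[0,0]
yhat=yhat_std*indScaler.scale_[0]+indScaler.mean_[0]
print(yhat)   # about 96.415, matching preVal and preVal1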

Regression Coefficient

The regression line is a statistically estimated equation, as shown in Figure 3. The values predicted by this model are also statistical estimates and differ from the actual observed values. Therefore, the fit of the model must be judged by the difference between the estimated and observed values.

The components of the model, the bias $\beta_0$ and the regression coefficient $\beta_1$, are parameters of the population regression model and are unknown, so they must be estimated from sample statistics. To distinguish them, the bias and regression coefficient of the sample are written as $b_0$ and $b_1$ respectively and used as unbiased estimators of the parameters. The difference between an observed value and the value estimated by the model, the error, is calculated as in Equation 2 and is called the residual.

$$\begin{align}\tag{2}\text{e}&=y-(b_0+b_1x)\\&=y-\hat{y} \end{align}$$

To minimize the error, it is more efficient to use the sum of squared errors than the simple sum of the errors across the data: because the errors take both negative and positive values, their sum is close to zero, which gives no basis for judging an optimal model. This quantity is called the Sum of Squared Errors (SSE) or Residual Sum of Squares (RSS) and is calculated as in Equation 3.

$$\begin{align}\tag{3}SSE&=\sum^n_{i=1}\text{e}_i^2\\&=\sum^n_{i=1}(y_i-\hat{y}_i)^2\\&=\sum^n_{i=1}(y_i-(b_0+b_1x_i))^2\end{align}$$
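Continuing with the standardized variables above, the SSE of the fitted line can be computed directly; this is a sketch, and the result should equal the ssr attribute reported by the statsmodels result object.

# residuals e = y - (b0 + b1*x) on the standardized scale, then SSE = sum(e^2)
yhat=reg.params[0]+reg.params[1]*indNor.ravel()
e=deNor.ravel()-yhat
sse=np.sum(e**2)
print(round(sse, 3))   # equals reg.ssr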

For the optimal regression model, $b_0$ and $b_1$ are derived so that the SSE in Equation 3 is minimized. The expression has the form of a quadratic function, as shown in Figure 4, and the minimum of such a function occurs at the point where the derivative is zero.

Figure 4. Differentiation of quadratic functions.

Thus, as shown in Equation 4, the regression coefficient and bias are obtained from $\frac{\partial(SSE)}{\partial b_0}=0$ and $\frac{\partial(SSE)}{\partial b_1}=0$, respectively.

$$\begin{align}\tag{4} b_1&=\frac{\sum^n_{i=1}(x_i-\bar{x})(y_i-\bar{y})}{\sum^n_{i=1}(x_i-\bar{x})^2}\\&=\frac{S_{xy}}{S_{xx}}\\ b_0&=\frac{\sum^n_{i=1}(y_i-b_1x_i)}{n}\\&=\bar{y}-b_1\bar{x} \end{align}$$

Estimating the regression coefficients so as to minimize the error is called the least squares method or ordinary least squares (OLS).

# b1 = Sxy/Sxx (Equation 4)
b1=(np.sum((indNor-indNor.mean())*(deNor-deNor.mean())))/np.sum((indNor-indNor.mean())**2)
print(f'reg.coeff.:{round(b1,3)}')
reg.coeff.:0.985
# b0 = mean(y) - b1*mean(x) (Equation 4)
b0=np.sum(deNor-b1*indNor)/len(deNor)
print(f'bias: {round(b0,3)}')
bias: -0.002

The above results are found in the table generated from reg.summary() in the previous section.

A simple regression model with one independent variable and one response variable makes the calculation easy, but when there is more than one independent variable, expressing the variables as arrays, in particular matrices, helps in understanding the calculation and the model. As a simple example, the independent and dependent variables of a simple regression model can be written in matrix form as in Equation 5.

$$\begin{equation}\tag{5}\begin{bmatrix} \hat{y}_1\\\hat{y}_2\\\vdots \\ \hat{y}_n \end{bmatrix}=\begin{bmatrix} 1&x_1\\1&x_2\\\vdots &\vdots \\ 1&x_n \end{bmatrix} \begin{bmatrix}b_0\\b_1 \end{bmatrix} \end{equation}$$

The coefficient matrix, the second matrix on the right-hand side of Equation 5, can be computed using the inverse of the first matrix on the right-hand side. The calculation is shown in Equation 6.

$$\begin{align}\tag{6} &Y=Xa\\&\begin{aligned}X^{-1}Y&=X^{-1}Xa\\&=a\end{aligned} \end{align} $$

In Equation 6, X must be a square matrix. A square matrix can be obtained from any matrix by multiplying it with its transpose. For example, if matrix a has dimension 3×4, the dimensions of the products with its transpose are as follows:

$$a \times a^T : (3 \times 4) \times (4 \times 3)=(3 \times 3) \; \text{or} \; a^T \times a : (4 \times 3) \times (3 \times 4)=(4 \times 4)$$
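A quick NumPy check of these dimensions (a is an arbitrary 3×4 example, not data from this post):

import numpy as np

a=np.arange(12).reshape(3, 4)              # 3x4
print((a @ a.T).shape, (a.T @ a).shape)    # (3, 3) (4, 4)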

Therefore, the calculation process of the coefficient matrix is generalized, as shown in Equation 7.

$$\begin{align} \tag{7} &Y=Xb \\&\rightarrow X^TY=X^TXb\\ & \rightarrow \begin{aligned}(X^TX)^{-1}X^TY&=(X^TX)^{-1}X^TXb\\&=b \end{aligned} \end{align}$$

The matrix operation process of the regression coefficient is as follows:

# normal equations: (X^T X) b = X^T y, solved for b
xtx=np.dot(indNor0.T, indNor0)
xty=np.dot(indNor0.T, deNor)
b=np.linalg.solve(xtx, xty)
np.around(b,3)
array([[-0.002],
       [ 0.985]])
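The same coefficients can also be obtained with a least-squares solver, which avoids forming X^T X explicitly; a sketch continuing the session above:

# equivalent solution via a least-squares routine
b_lstsq, *_=np.linalg.lstsq(indNor0, deNor, rcond=None)
print(np.around(b_lstsq, 3))   # same values as b above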
