기본 콘텐츠로 건너뛰기

벡터와 행렬에 관련된 그림들

Regression Analysis: simple regression & regression coefficient

Contents

  1. What is Regression?
  2. Simple regression
  3. Regression Coefficient

What is Regression?

Regression is a statistical method of setting up a model for the relationship between variables and estimating new values through that model.

Figure 1 is a graph of the force (y) corresponding to a constant height (x), showing the exact direct proportional relationship in which y increases as x increases. This relationship is based on data from generalized laws of physics, which can fully predict the forces applied at a certain height within the Earth where gravity acts.

plt.figure(figsize=(7,5))
h=range(7)
w=40
F=[w*9.8*i for i in h]
plt.plot(h, F, "o-")
plt.xlabel("Height(m)", size=13, weight="bold")
plt.ylabel("Force(N)", size=13, weight="bold")
plt.text(2.5, 1500, 'F=Wgh', color="blue", size=13, weight="bold")
plt.text(2, -600, r'w:weight (kg), g: Gravity Acceleration($m/sec^2$)', color="blue", size=12)
plt.show()
Figure 1. The change of power with height.

Figure 2 shows an increase in y as x increases, but it is difficult to predict unknown values with complete lines as shown in Figure 1, i.e., the relational expression of y values corresponding to each x point will have a wide variety of lines within the three lines. In this situation, inferring an expression (regression model) to estimate y for a new x is the final purpose of regression. As such, when it is difficult to determine the relationship model between two variables, building the model is stochastic. In other words, regression is a major statistical method of understanding the characteristics of data from a probabilistic perspective and inferring unknown values.

Figure 2. Y's relationship to X.

Statistical reasoning can be largely divided into parametric and nonparametric methods. Regression is essentially a parametric (parameter of a population) method that determines or assumes the probability distribution of the data you want to analyze with a probability-based analysis and then deduces it based on that distribution. The probability distribution assumed by parametric methods may vary widely, but regression assumes a normal distribution as in previous sections because it can be normalized by increasing the size of the data, as can be expressed by central limit theorem.

Simple regression

Simple regression builds a regression model with one independent variable.

The following data are part of the daily open and close price of the dollar index(DX):

import numpy as np
import numpy.linalg as la
import pandas as pd
from sympy import *
from scipy import special
from scipy import linalg
from scipy import stats
from sklearn import preprocessing
from sympy.calculus.util import continuous_domain
import matplotlib.pyplot as plt
import matplotlib as mpl
import matplotlib.font_manager as fm
import FinanceDataReader as fdr
import statsmodels.api as sm
import FinanceDataReader as fdr
st=pd.Timestamp(2020,5, 10)
et=pd.Timestamp(2021, 12, 22)
dxy=fdr.DataReader('DX', st, et)[['Open','Close']]
dxy.tail(3)
Open Close
Date
2021-12-20 96.620 96.545
2021-12-21 96.515 96.489
2021-12-22 96.470 96.059

The above data include two variables, the independent variable Open and the response Close, and use the last day's Open value separately as an independent variable for final estimation.

ind=dxy.values[:-1,0]
de=dxy.values[:-1,1]
new=dxy.values[-1,0]
new
96.47

Standardization is required as a preprocessing of data for building regression models. The following are the reasons why the processing is necessary:

  • Scale down data
  • Scale the data that occurs between variables constantly when multiple variables are used
  • Standardization of response variables is not required. However, it is executed for the same reason as number (2).

Estimates by regression models built with all data including response variables standardized should finally be reduced to the original scale. In this case, statistics of the mean and standard deviation calculated during the standardization phase are required. sklearn.processing.StandardScale() class meets these requirements.

#standardize independnt variable
indScaler=preprocessing.StandardScaler().fit(dxy.values[:,:-1])
#standardize respond variable
deScaler=preprocessing.StandardScaler().fit(dxy.values[:,-1].reshape(-1,1))
indNor=indScaler.transform(ind.reshape(-1,1))
indNor[:3]
array([[2.81281703],
       [2.98469087],
       [2.8945619 ]])
deNor=indScaler.transform(de.reshape(-1,1))
deNor[:3]
array([[2.98469087],
       [2.85306066],
       [2.99349417]])

The relationship between the variable Open (indNor) and Close (deNor) is shown in Figure 3. In this plot, the regression line is a regression model calculated from the code below.

plt.figure(figsize=(6, 4))
plt.scatter(indNor, deNor, label="Data")
plt.plot(indNor, 0.99767075*indNor-0.001, color="red", label="Regression line")
plt.legend(loc="best")
plt.xlabel("Open")
plt.ylabel('Close')
plt.show()
Figure 3. The relationship between Close and Open.

In Figure 3, the increase in Close seems clear as Open increases. Lines between the data are models generated by regression analysis and can be set as Equation 1:

$$\begin{align}\tag{1}&\text{y}=\beta_0+\beta_1 \text{x} +\varepsilon \\ & \text{x}: \text{independent variable}\\ & \text{y}: \text{dependent variable}\\ & \beta_0 : \text{intercept(bias)}\\ & \beta_1 : \text{regression coefficient(weight)}\\ & \varepsilon: \text{error} \end{align}$$

Equation 1 is called the regression model. Regression models can be created by applying classes or functions provided by Python's various packages. The following are the results of applying statsmodel.api.OLS() and linearRegression classes: Both classes must pass both independent and response variables to the numpy array object.

In the following code, add_constant() is intended to add a deviation term to a description variable, which is as follows:

$$X=\begin{bmatrix} x_1\\x_2\\\vdots \\ x_n \end{bmatrix} \rightarrow X0=\begin{bmatrix} 1&x_1\\1&x_2\\\vdots&\vdots \\ 1&x_n \end{bmatrix}$$
import statsmodels.api as sm
indNor0=sm.add_constant(indNor) 
indNor0.shape, de.shape
((416, 2), (416,))

The following code is the result of applying the statsmodel.api.OLS() class: The .fit() method of this class creates a model by applying descriptive and reactive variables. Also, the regression coefficient of the generated model is expressed using the .params property.

reg=sm.OLS(deNor,indNor0).fit()
print(f'reg.coeff.:{np.around(reg.params,3)}, R^2 :{np.around(reg.rsquared,3)}')
reg.coeff.:[-0.002  0.985], R^2 :0.984
reg.summary()
OLS Regression Results
Dep. Variable: y R-squared: 0.984
Model: OLS Adj. R-squared: 0.983
Method: Least Squares F-statistic: 2.469e+04
Date: Thu, 23 Dec 2021 Prob (F-statistic): 0.00
Time: 12:55:41 Log-Likelihood: 266.80
No. Observations: 416 AIC: -529.6
Df Residuals: 414 BIC: -521.5
Df Model: 1
Covariance Type:nonrobust

coef std err t P>|t| [0.025 0.975]
const -0.0023 0.006 -0.362 0.718 -0.015 0.010
x1 0.9851 0.006 157.127 0.000 0.973 0.997

Omnibus: 2.426 Durbin-Watson: 2.036
Prob(Omnibus): 0.297 Jarque-Bera (JB): 2.256
Skew: 0.107 Prob(JB): 0.324
Kurtosis: 2.709 Cond. No. 1.00
Notes:
  [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

A summary of regression can be represented by the model.summay() function. These results consist of tables for the regression model, the regression coefficient, and the distribution of explanatory variables. The meaning of the results in each table will be introduced in the following sections.

Regression models can also be built using linearRegression classes in sklearn packages. Create a model that fits description variables and response variables using the .fit() method, such as the OLS class in the statsmodel package applied above. The coefficients and deviations of the generated model can be represented using the following properties:

  • Model class name.coef_: return coefficient of model
  • Model class name.intercept_: returns intercept of model
from sklearn.linear_model import LinearRegression
mod = LinearRegression()
mod.fit(indNor, deNor)
print(f'Reg.Coeff.:{np.around(mod.coef_,3)}, bias: {np.around(mod.intercept_,3)}, R2:{np.around(mod.score(indNor, deNor),3)}')
Reg.Coeff.:[[0.985]], bias: [-0.002], R2:0.984

The regression model built with both classes is as follows:

$$\widehat{\text{Close}} = 0.997\text{Open} + -0.001$$

Models built in those classes have a predictor(independent) method, and estimates are calculated by this method. The following estimates correspond to the variable new specified above:

For models by sm.OLS(), a vector with all values of 1 was added to the independent variable to add the bias term. This process must be applied equally to all arguments passed to the .predict() method. Therefore, you must also add the term 1 to the newNor, an independent variable for final estimation. The variable converted by that process is newNor1.

newNor=indScaler.transform([[new]])
newNor
array([[1.39171374]])
newNor1=np.c_[1, newNor]
newNor1
array([[1.        , 1.39171374]])
pre=reg.predict(newNor1)
pre
array([1.36867904])
preVal=indScaler.inverse_transform(pre)
preVal
array([96.41505142])  

However, for sklearn.linearRegression(), the deviation term is automatically inserted by setting the factor fit_intercept=True. Therefore, you do not need to modify these variable dimensions.

pre1=mod.predict(newNor)
pre1
array([[1.36867904]])
preVal1=indScaler.inverse_transform(pre1)
preVal1
array([[96.41505142]])

Regression Coefficient

The regression line is a statistically estimated equation, as shown in Figure 3. The predicted values from this model are also statistically estimated and differ from the actual observed values. Therefore, it is necessary to determine the fit of the model by the difference between the estimated and observed values.

The components of the model, bias $\beta_0$ and the regression coefficient, $\beta_1$, are parameters for the regression model of the population and are unknown values, and should be estimated from the sample's statistics. To distinguish this, represent the bias and regression coefficients of the sample group as $b_0$ and $b_1$ respectively and use them as an unbiased estimator to estimate parameters. Of these estimates, the difference between observations and actual measurements, or error, is calculated as shown in Equation 2. The error is called residual.

$$\begin{align}\tag{2}\text{e}&=y-(b_0+b_1x)\\&=y-\hat{y} \end{align}$$

To minimize the error, it is more efficient to use the sum of squares of the error than to consider the sum of each error across the data. Because each error occurs in negative and positive values, their sum is close to zero, which cannot be used as a basis for judgment for the optimization model. This value is called Sum of Square Error (SSE) or Residual of Square Sum (RSS) and is calculated as shown in Equation 3.

$$\begin{align}\tag{3}SSE&=\sum^n_{i=1}\text{e}^2\\&=\sum^n_{i=1}(y-\hat{y})^2\\&=\sum^n_{i=1}(y-(b_0+b_1x))^2\end{align}$$

For optimal regression models, $b_0$ and $b_1$ can be derived from Equation 5.26, to minimize RSS. The expression shows the form of a secondary function as shown in Figure 4. The minimum value for these functions is generated at the point where the differential is zero.

Figure 4. Differentiation of quadratic functions.

Thus, as shown in Equation 4, the regression coefficient and deviation are the results of $\frac{\partial(sse)}{\partial b_0}=0\, \text{and}\, \frac{\partial(sse)}{\partial b_1}=0$, respectively.

$$\begin{align}\tag{4} b_1&=\frac{\sum^n_{i=1}(x_i-\bar{x})(y_i-\bar{y})}{\sum^n_{i=1}(x_i-\bar{x})^2}\\&=\frac{S_{xy}}{S_{xx}}\\ b_0&=\frac{y-b_1x}{n} \end{align}$$

Estimating regression coefficients to minimize errors is called the Least Square method or Ordinary Least Square(OLS).

b1=(np.sum((indNor-indNor.mean())*(deNor-deNor.mean())))/np.sum((indNor-indNor.mean())**2)
print(f'reg.coeff.:{round(b1,3)}')
reg.coeff.:0.985
b0=np.sum(deNor-b1*indNor)/len(deNor)
print(f'bias: {round(b0,3)}')
bias: -0.002

The above results are found in the table generated from reg.summary() in the previous section.

A simple regression model with one independent variable and one response makes the calculation easier, but if there are more than one independent variable, switching the variable to an array structure, especially a matrix, helps you understand the calculation or model. For a simple example, switching independent and dependent variables from a simple regression model to a matrix can be represented as shown in Equation 5.

$$\begin{equation}\tag{5}\begin{bmatrix} \hat{y}_1\\\hat{y}_2\\\vdots \\ \hat{y}_n \end{bmatrix}=\begin{bmatrix} 1&x_1\\1&x_2\\\vdots &\vdots \\ 1&x_n \end{bmatrix} \begin{bmatrix}b_0\\b_1 \end{bmatrix} \end{equation}$$

The coefficient matrix, the second right matrix in Equation 5 can be calculated using the inverse matrix of the first matrix of the right term. The calculation process is shown in Equation 6.

$$\begin{align}\tag{6} &Y=Xa\\&\begin{aligned}X^{-1}Y&=X^{-1}Xa\\&=a\end{aligned} \end{align} $$

In Equation 6 X must be a square matrix. A square matrix from a matrix is generated by a matrix product operation with the transposed matrix of that matrix. For example, if matrix a has a dimension of 3×4, the dimensions of the matrix product between the transposition are summarized as follows:

$$a \times a^T : (3 \times 4) \times (4 \times 3)=(3 \times 3) \; \text{or} \; a^T \times a : (4 \times 3) \times (3 \times 4)=(4 \times 4)$$

Therefore, the calculation process of the coefficient matrix is generalized, as shown in Equation 7.

$$\begin{align} \tag{7} &Y=Xb \\&\rightarrow X^TY=X^TXb\\ & \rightarrow \begin{aligned}(X^TX)^{-1}X^TY&=(X^TX)^{-1}X^TXb\\&=b \end{aligned} \end{align}$$

The matrix operation process of the regression coefficient is as follows:

xtx=np.dot(indNor0.T, indNor0)
xty=np.dot(indNor0.T, deNor)
b=np.linalg.solve(xtx, xty)
np.around(b,3)
array([[-0.002],
       [ 0.985]])

댓글

이 블로그의 인기 게시물

[Linear Algebra] 유사변환(Similarity transformation)

유사변환(Similarity transformation) n×n 차원의 정방 행렬 A, B 그리고 가역 행렬 P 사이에 식 1의 관계가 성립하면 행렬 A와 B는 유사행렬(similarity matrix)이 되며 행렬 A를 가역행렬 P와 B로 분해하는 것을 유사 변환(similarity transformation) 이라고 합니다. $$\tag{1} A = PBP^{-1} \Leftrightarrow P^{-1}AP = B $$ 식 2는 식 1의 양변에 B의 고유값을 고려한 것입니다. \begin{align}\tag{식 2} B - \lambda I &= P^{-1}AP – \lambda P^{-1}P\\ &= P^{-1}(AP – \lambda P)\\ &= P^{-1}(A - \lambda I)P \end{align} 식 2의 행렬식은 식 3과 같이 정리됩니다. \begin{align} &\begin{aligned}\textsf{det}(B - \lambda I ) & = \textsf{det}(P^{-1}(AP – \lambda P))\\ &= \textsf{det}(P^{-1}) \textsf{det}((A – \lambda I)) \textsf{det}(P)\\ &= \textsf{det}(P^{-1}) \textsf{det}(P) \textsf{det}((A – \lambda I))\\ &= \textsf{det}(A – \lambda I)\end{aligned}\\ &\begin{aligned}\because \; \textsf{det}(P^{-1}) \textsf{det}(P) &= \textsf{det}(P^{-1}P)\\ &= \textsf{det}(I)\end{aligned}\end{align} 유사행렬의 특성 유사행렬인 두 정방행렬 A와 B는 'A ~ B' 와 같...

[sympy] Sympy객체의 표현을 위한 함수들

Sympy객체의 표현을 위한 함수들 General simplify(x): 식 x(sympy 객체)를 간단히 정리 합니다. import numpy as np from sympy import * x=symbols("x") a=sin(x)**2+cos(x)**2 a $\sin^{2}{\left(x \right)} + \cos^{2}{\left(x \right)}$ simplify(a) 1 simplify(b) $\frac{x^{3} + x^{2} - x - 1}{x^{2} + 2 x + 1}$ simplify(b) x - 1 c=gamma(x)/gamma(x-2) c $\frac{\Gamma\left(x\right)}{\Gamma\left(x - 2\right)}$ simplify(c) $\displaystyle \left(x - 2\right) \left(x - 1\right)$ 위의 예들 중 객체 c의 감마함수(gamma(x))는 확률분포 등 여러 부분에서 사용되는 표현식으로 다음과 같이 정의 됩니다. 감마함수는 음이 아닌 정수를 제외한 모든 수에서 정의됩니다. 식 1과 같이 자연수에서 감마함수는 factorial(!), 부동소수(양의 실수)인 경우 적분을 적용하여 계산합니다. $$\tag{식 1}\Gamma(n) =\begin{cases}(n-1)!& n:\text{자연수}\\\int^\infty_0x^{n-1}e^{-x}\,dx& n:\text{부동소수}\end{cases}$$ x=symbols('x') gamma(x).subs(x,4) $\displaystyle 6$ factorial 계산은 math.factorial() 함수를 사용할 수 있습니다. import math math.factorial(3) 6 a=gamma(x).subs(x,4.5) a.evalf(3) 11.6 simpilfy() 함수의 알고리즘은 식에서 공통사항을 찾아 정리하...

sympy.solvers로 방정식해 구하기

sympy.solvers로 방정식해 구하기 대수 방정식을 해를 계산하기 위해 다음 함수를 사용합니다. sympy.solvers.solve(f, *symbols, **flags) f=0, 즉 동차방정식에 대해 지정한 변수의 해를 계산 f : 식 또는 함수 symbols: 식의 해를 계산하기 위한 변수, 변수가 하나인 경우는 생략가능(자동으로 인식) flags: 계산 또는 결과의 방식을 지정하기 위한 인수들 dict=True: {x:3, y:1}같이 사전형식, 기본값 = False set=True :{(x,3),(y,1)}같이 집합형식, 기본값 = False ratioal=True : 실수를 유리수로 반환, 기본값 = False positive=True: 해들 중에 양수만을 반환, 기본값 = False 예 $x^2=1$의 해를 결정합니다. solve() 함수에 적용하기 위해서는 다음과 같이 식의 한쪽이 0이 되는 형태인 동차식으로 구성되어야 합니다. $$x^2-1=0$$ import numpy as np from sympy import * x = symbols('x') solve(x**2-1, x) [-1, 1] 위 식은 계산 과정은 다음과 같습니다. $$\begin{aligned}x^2-1=0 \rightarrow (x+1)(x-1)=0 \\ x=1 \; \text{or}\; -1\end{aligned}$$ 예 $x^4=1$의 해를 결정합니다. solve() 함수의 인수 set=True를 지정하였으므로 결과는 집합(set)형으로 반환됩니다. eq=x**4-1 solve(eq, set=True) ([x], {(-1,), (-I,), (1,), (I,)}) 위의 경우 I는 복소수입니다.즉 위 결과의 과정은 다음과 같습니다. $$x^4-1=(x^2+1)(x+1)(x-1)=0 \rightarrow x=\pm \sqrt{-1}, \; \pm 1=\pm i,\; \pm1$$ 실수...