What is Regression?
Regression is a statistical method that builds a model of the relationship between variables and estimates new values through that model.
Figure 1 is a graph of the force (y) corresponding to height (x), showing an exact, directly proportional relationship in which y increases as x increases. This relationship comes from a general law of physics, so the force at a given height on Earth, where gravity acts, can be predicted exactly.
plt.figure(figsize=(7,5))
h=range(7)
w=40
F=[w*9.8*i for i in h]
plt.plot(h, F, "o-")
plt.xlabel("Height(m)", size=13, weight="bold")
plt.ylabel("Force(N)", size=13, weight="bold")
plt.text(2.5, 1500, 'F=Wgh', color="blue", size=13, weight="bold")
plt.text(2, -600, r'w:weight (kg), g: Gravity Acceleration($m/sec^2$)', color="blue", size=12)
plt.show()
Figure 2 also shows y increasing as x increases, but here it is difficult to predict unknown values with a single exact line as in Figure 1; the relationship between y and the corresponding x values could be described by a wide variety of lines, such as the three lines drawn. In this situation, inferring an expression (a regression model) to estimate y for a new x is the final purpose of regression. When the relationship between two variables cannot be determined exactly like this, building the model is a stochastic problem. In other words, regression is a major statistical method for understanding the characteristics of data from a probabilistic perspective and inferring unknown values.
Statistical inference can be broadly divided into parametric and nonparametric methods. Regression is essentially a parametric method (one based on the parameters of a population): it determines or assumes the probability distribution of the data to be analyzed and then makes inferences based on that distribution. The distribution assumed by a parametric method can vary widely, but regression assumes a normal distribution, as in previous sections, because the data can be regarded as approximately normal as their size increases, as expressed by the central limit theorem.
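As a small illustration of this point (not part of the original analysis), the sketch below draws samples from a strongly skewed exponential distribution and shows that the distribution of the sample means becomes increasingly close to normal as the sample size grows; the sample sizes are chosen arbitrarily.

import numpy as np
from scipy import stats

rng=np.random.default_rng(3)
for n in [5, 30, 300]:
    # means of 1000 samples of size n from a skewed population
    means=rng.exponential(scale=1.0, size=(1000, n)).mean(axis=1)
    # skewness and a normality test of the sample means
    print(f"n={n:>3}, skew={stats.skew(means):.3f}, normaltest p={stats.normaltest(means)[1]:.3f}")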
Simple regression
Simple regression builds a regression model with one independent variable.
The following data are part of the daily open and close prices of the dollar index (DX):
import numpy as np
import numpy.linalg as la
import pandas as pd
from sympy import *
from scipy import special
from scipy import linalg
from scipy import stats
from sklearn import preprocessing
from sympy.calculus.util import continuous_domain
import matplotlib.pyplot as plt
import matplotlib as mpl
import matplotlib.font_manager as fm
import FinanceDataReader as fdr
import statsmodels.api as sm
import FinanceDataReader as fdr
st=pd.Timestamp(2020, 5, 10)
et=pd.Timestamp(2021, 12, 22)
dxy=fdr.DataReader('DX', st, et)[['Open','Close']]
dxy.tail(3)
| Date | Open | Close |
|---|---|---|
| 2021-12-20 | 96.620 | 96.545 |
| 2021-12-21 | 96.515 | 96.489 |
| 2021-12-22 | 96.470 | 96.059 |
The above data include two variables, the independent variable Open and the response Close. The last day's Open value is set aside separately as the independent variable for the final estimation.
ind=dxy.values[:-1,0]
de=dxy.values[:-1,1]
new=dxy.values[-1,0]
new
96.47
Standardization is required as a preprocessing step for building regression models, for the following reasons:

1. It scales down the data.
2. It keeps the scales of the variables consistent when multiple variables are used.
3. Standardization of the response variable is not required, but it is carried out for the same reason as item (2).
Estimates from a regression model built with standardized data, including a standardized response variable, must finally be converted back to the original scale. For this, the mean and standard deviation computed during the standardization step are required. The sklearn.preprocessing.StandardScaler() class meets these requirements.
# standardize the independent variable
indScaler=preprocessing.StandardScaler().fit(dxy.values[:,:-1])
# standardize the response variable
deScaler=preprocessing.StandardScaler().fit(dxy.values[:,-1].reshape(-1,1))
indNor=indScaler.transform(ind.reshape(-1,1))
indNor[:3]
array([[2.81281703], [2.98469087], [2.8945619 ]])
deNor=indScaler.transform(de.reshape(-1,1))
deNor[:3]
array([[2.98469087], [2.85306066], [2.99349417]])
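As a quick check (not in the original code), the statistics stored by the fitted scaler and the round trip back to the original scale can be inspected; this sketch assumes the objects defined above.

# mean and standard deviation stored by the fitted scaler
print(indScaler.mean_, indScaler.scale_)
# inverse_transform restores the original Open values
print(np.allclose(indScaler.inverse_transform(indNor), ind.reshape(-1,1)))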
The relationship between the variables Open (indNor) and Close (deNor) is shown in Figure 3. The regression line in this plot is the regression model calculated by the code below.
plt.figure(figsize=(6, 4))
plt.scatter(indNor, deNor, label="Data")
plt.plot(indNor, 0.985*indNor-0.002, color="red", label="Regression line")
plt.legend(loc="best")
plt.xlabel("Open")
plt.ylabel('Close')
plt.show()
In Figure 3, Close clearly increases as Open increases. The line through the data is the model generated by regression analysis and can be written as Equation 1:
$$\begin{align}\tag{1}&\text{y}=\beta_0+\beta_1 \text{x} +\varepsilon \\ & \text{x}: \text{independent variable}\\ & \text{y}: \text{dependent variable}\\ & \beta_0 : \text{intercept(bias)}\\ & \beta_1 : \text{regression coefficient(weight)}\\ & \varepsilon: \text{error} \end{align}$$

Equation 1 is called the regression model. Regression models can be created with classes or functions provided by various Python packages. The following are the results of applying the statsmodels.api.OLS() and sklearn LinearRegression classes. Both classes require the independent and response variables as NumPy array objects.
In the following code, add_constant() adds a bias (intercept) term to the explanatory variable, as follows:
import statsmodels.api as sm
indNor0=sm.add_constant(indNor)
indNor0.shape, de.shape
((416, 2), (416,))
The following code applies the statsmodels.api.OLS() class. The .fit() method of this class creates a model from the explanatory and response variables, and the regression coefficients of the generated model are accessed through the .params attribute.
reg=sm.OLS(deNor, indNor0).fit()
print(f'reg.coeff.:{np.around(reg.params,3)}, R^2 :{np.around(reg.rsquared,3)}')
reg.coeff.:[-0.002 0.985], R^2 :0.984
reg.summary()
OLS Regression Results

| Dep. Variable: | y | R-squared: | 0.984 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.983 |
| Method: | Least Squares | F-statistic: | 2.469e+04 |
| Date: | Thu, 23 Dec 2021 | Prob (F-statistic): | 0.00 |
| Time: | 12:55:41 | Log-Likelihood: | 266.80 |
| No. Observations: | 416 | AIC: | -529.6 |
| Df Residuals: | 414 | BIC: | -521.5 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -0.0023 | 0.006 | -0.362 | 0.718 | -0.015 | 0.010 |
| x1 | 0.9851 | 0.006 | 157.127 | 0.000 | 0.973 | 0.997 |

| Omnibus: | 2.426 | Durbin-Watson: | 2.036 |
|---|---|---|---|
| Prob(Omnibus): | 0.297 | Jarque-Bera (JB): | 2.256 |
| Skew: | 0.107 | Prob(JB): | 0.324 |
| Kurtosis: | 2.709 | Cond. No. | 1.00 |
Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
A summary of the regression can be displayed with the model's .summary() method. The results consist of tables for the regression model, the regression coefficients, and the distribution of the residuals. The meaning of the results in each table will be introduced in the following sections.
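The individual statistics in these tables can also be read directly from the fitted results object through standard statsmodels attributes; a brief sketch using the reg model above:

# key statistics of the fitted OLS results object
print(reg.params)                       # coefficients (const, x1)
print(reg.bse)                          # standard errors
print(reg.tvalues, reg.pvalues)         # t statistics and p-values
print(reg.conf_int())                   # 95% confidence intervals
print(reg.rsquared, reg.rsquared_adj)   # R^2 and adjusted R^2
print(reg.aic, reg.bic)                 # information criteria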
Regression models can also be built with the LinearRegression class in the sklearn package. A model is fitted to the explanatory and response variables with the .fit() method, as with the OLS class of the statsmodels package applied above. The coefficients and intercept of the fitted model are available through the following attributes:

- model.coef_: returns the coefficients of the model
- model.intercept_: returns the intercept of the model
from sklearn.linear_model import LinearRegression
mod = LinearRegression()
mod.fit(indNor, deNor)
print(f'Reg.Coeff.:{np.around(mod.coef_,3)}, bias: {np.around(mod.intercept_,3)}, R2:{np.around(mod.score(indNor, deNor),3)}')
Reg.Coeff.:[[0.985]], bias: [-0.002], R2:0.984
The regression model built with both classes is as follows:
$$\widehat{\text{Close}} = 0.985\,\text{Open} - 0.002$$

Models built with these classes provide a .predict() method, and estimates are calculated by this method. The following estimates correspond to the variable new specified above:
For models created with sm.OLS(), a column of ones was added to the independent variable to include the bias term. The same processing must be applied to every argument passed to the .predict() method. Therefore, the term 1 must also be added to newNor, the independent variable for the final estimation; the variable produced by this process is newNor1.
newNor=indScaler.transform([[new]])
newNor
array([[1.39171374]])
newNor1=np.c_[1, newNor]
newNor1
array([[1. , 1.39171374]])
pre=reg.predict(newNor1)
pre
array([1.36867904])
preVal=indScaler.inverse_transform(pre)
preVal
array([96.41505142])
In contrast, for sklearn's LinearRegression() the intercept term is added automatically because the argument fit_intercept=True by default. Therefore, the dimensions of these variables do not need to be modified.
pre1=mod.predict(newNor)
pre1
array([[1.36867904]])
preVal1=indScaler.inverse_transform(pre1)
preVal1
array([[96.41505142]])
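The two estimates agree apart from the array shape, and they can be compared with the observed Close of the last day (96.059 in the table above); a brief check, assuming the objects defined earlier:

# both classes return the same estimate (shapes differ: (1,) vs (1,1))
print(np.allclose(preVal, preVal1))
# compare the estimate with the observed Close of the last day
print(np.around(preVal1.ravel()[0], 3), dxy.values[-1, 1])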
Regression Coefficient
The regression line is a statistically estimated equation, as shown in Figure 3. The predicted values from this model are also statistically estimated and differ from the actual observed values. Therefore, it is necessary to determine the fit of the model by the difference between the estimated and observed values.
The components of the model, the bias $\beta_0$ and the regression coefficient $\beta_1$, are parameters of the population regression model. They are unknown values and must be estimated from sample statistics. To distinguish them, the bias and regression coefficient of the sample are denoted $b_0$ and $b_1$ respectively and are used as unbiased estimators of the parameters. For these estimates, the difference between the observed and estimated values, that is, the error, is calculated as shown in Equation 2. This error is called the residual.
$$\begin{align}\tag{2}\text{e}&=y-(b_0+b_1x)\\&=y-\hat{y} \end{align}$$

To minimize the error, it is more efficient to use the sum of the squared errors than the plain sum of the errors across the data. Because the individual errors take both negative and positive values, their sum is close to zero and cannot be used as a criterion for optimizing the model. The squared quantity is called the Sum of Squared Errors (SSE) or Residual Sum of Squares (RSS) and is calculated as shown in Equation 3.
$$\begin{align}\tag{3}SSE&=\sum^n_{i=1}\text{e}_i^2\\&=\sum^n_{i=1}(y_i-\hat{y}_i)^2\\&=\sum^n_{i=1}(y_i-(b_0+b_1x_i))^2\end{align}$$

For an optimal regression model, $b_0$ and $b_1$ are derived so as to minimize the RSS in Equation 3. The expression has the form of a quadratic function, as shown in Figure 4. The minimum of such a function occurs at the point where the derivative is zero.
Thus, as shown in Equation 4, the regression coefficient and bias are the solutions of $\frac{\partial(SSE)}{\partial b_0}=0$ and $\frac{\partial(SSE)}{\partial b_1}=0$, respectively.
$$\begin{align}\tag{4} b_1&=\frac{\sum^n_{i=1}(x_i-\bar{x})(y_i-\bar{y})}{\sum^n_{i=1}(x_i-\bar{x})^2}\\&=\frac{S_{xy}}{S_{xx}}\\ b_0&=\frac{\sum^n_{i=1}(y_i-b_1x_i)}{n}\\&=\bar{y}-b_1\bar{x} \end{align}$$

Estimating the regression coefficients so as to minimize the errors is called the least squares method or Ordinary Least Squares (OLS).
b1=(np.sum((indNor-indNor.mean())*(deNor-deNor.mean())))/np.sum((indNor-indNor.mean())**2)
print(f'reg.coeff.:{round(b1,3)}')
reg.coeff.:0.985
b0=np.sum(deNor-b1*indNor)/len(deNor)
print(f'bias: {round(b0,3)}')
bias: -0.002
The above results are found in the table generated from reg.summary() in the previous section.
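As noted above, the raw residuals of a least squares fit sum to approximately zero, which is why the squared form of Equation 3 is used; a short check with the quantities just computed (reg.ssr is the statsmodels attribute holding the residual sum of squares):

# residuals of the manually fitted line
e=deNor-(b0+b1*indNor)
print(np.around(np.sum(e), 6))      # sum of residuals: approximately 0
print(np.around(np.sum(e**2), 3))   # SSE (RSS) of Equation 3
print(np.around(reg.ssr, 3))        # the same quantity from the statsmodels model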
A simple regression model with one independent variable and one response keeps the calculation easy, but when there is more than one independent variable, switching to an array structure, in particular a matrix, helps in understanding the calculation and the model. As a simple example, the independent and dependent variables of a simple regression model can be written in matrix form as shown in Equation 5.
$$\begin{equation}\tag{5}\begin{bmatrix} \hat{y}_1\\\hat{y}_2\\\vdots \\ \hat{y}_n \end{bmatrix}=\begin{bmatrix} 1&x_1\\1&x_2\\\vdots &\vdots \\ 1&x_n \end{bmatrix} \begin{bmatrix}b_0\\b_1 \end{bmatrix} \end{equation}$$

The coefficient matrix, the second matrix on the right-hand side of Equation 5, could be computed with the inverse of the first matrix on the right-hand side. This calculation is shown in Equation 6.
$$\begin{align}\tag{6} &Y=Xb\\&\begin{aligned}X^{-1}Y&=X^{-1}Xb\\&=b\end{aligned} \end{align} $$

In Equation 6, however, X must be a square matrix to be invertible. A square matrix can be generated from any matrix by a matrix product with its transpose. For example, if a matrix A has dimensions 3×4, the dimensions of the products with its transpose are summarized as follows:
$$A \times A^T : (3 \times 4) \times (4 \times 3)=(3 \times 3) \quad \text{or} \quad A^T \times A : (4 \times 3) \times (3 \times 4)=(4 \times 4)$$

Therefore, the calculation of the coefficient matrix is generalized as shown in Equation 7.
$$\begin{align} \tag{7} &Y=Xb \\&\rightarrow X^TY=X^TXb\\ & \rightarrow \begin{aligned}(X^TX)^{-1}X^TY&=(X^TX)^{-1}X^TXb\\&=b \end{aligned} \end{align}$$

The matrix calculation of the regression coefficients is as follows:
xtx=np.dot(indNor0.T, indNor0)
xty=np.dot(indNor0.T, deNor)
b=np.linalg.solve(xtx, xty)
np.around(b,3)
array([[-0.002], [ 0.985]])
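The same coefficient vector can also be obtained with NumPy's general least squares routines, which solve Equation 7 without forming the normal equations explicitly; a brief cross-check:

# least squares solution of indNor0 @ b = deNor, equivalent to Equation 7
b_ls, res, rank, sv=np.linalg.lstsq(indNor0, deNor, rcond=None)
print(np.around(b_ls, 3))
# alternatively via the pseudo-inverse (X^T X)^{-1} X^T
print(np.around(np.linalg.pinv(indNor0) @ deNor, 3))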