기본 콘텐츠로 건너뛰기

[ML] 결정트리(Decision Tree) 모델

Analysis of variance

Contents

Analysis of variance

Analysis of variance(ANOVA) is a statistical method that tests the null hypothesis that all groups have the same mean by comparing the variation within and between groups in two or more groups. The tests for two groups have applied a normal or t distribution, but to compare more groups, use an F distribution that compares the degree of variability between groups.

The data for ANOVA consist of the nominal variables (factors) that are being compared and the values for each factor, i.e., response variable. Each factor can be classified into several sub-groups and the factors in this group are called treatment (factor levels. The analysis of a single response corresponding to the factor level is called one-way ANOVAA} (anova), or multiple responses are called manova Multivariate analysis is beyond the scope of this book, but one-way and two-way variance analysis can be the foundation for that analysis.

The null hypothesis of ANOVA is as follows:

$$\text{H0}: \mu_1 = \mu_2 = \cdots = \mu_n$$

TAnalysis of variance assumes:

  • Each population follows a normal distribution.
  • All populations have the same variance.
  • Observations must be independent.

ANOVA

As shown in Table 1, one-way ANOVA has no distinction between factor and treatment. Therefore, treatment becomes a descriptive variable and analyzes the differences between the variables.

Table 1. Data structure for one-way ANOVA
Treatment1 2$\cdots$ t
sample$x_{11}$$x_{12}$$\cdots$ $x_{1t}$
$x_{21}$ $x_{22}$$\cdots$ $x_{2t}$
$\vdots$ $\vdots$$\ddots$ $\vdots$
$x_{n1}$ $x_{n2}$$\cdots$ $x_{nt}$
Mean$\bar{X_1}$ $\bar{X_2}$$\cdots$$\bar{X_t}$
Grand mean$\bar{X_{..}}$

From Table 1, each value can be represented by a generalized model, as shown in Equation 1.

$$\begin{align}\tag{1} &\text{Individual value} = \text{Group mean + error}\\ &x_{ij} = \mu_{i} + e_{ij} \\ & i: 1, 2, \cdots, n\\ & j: 1, 2, \cdots, t \end{align}$$

The model assumes that each variable is independent and conforms to a normal distribution, so the error term (e) also follows a normal distribution with zero mean and a constant variance. In other words, it can be expressed as follows:

$$\text{e} \sim N(0, \sigma^2)$$

This error term can be represented by decomposition as shown in Equation 2.

$$\begin{align}\tag{2}&\text{total error} = \text{intergroup error}-\text{within-group error}\\ &(x_{ij}- x) = (\overline{X_{j}}-\overline{X}) + (x_{ij} - \overline{X_j}) \end{align} $$

The sum of the squares on both sides of Equation 2 is as follows:

$$\begin{align}&\sum^t_{j=1}\sum^n_{i=1}(x_{ij}-\bar{X})^2\\ &=\sum^t_{j=1}\sum^n_{i=1}(\bar{x_j}-\bar{x})^2+\sum^t_{j=1}\sum^n_{i=1}(x_{ij}-\bar{X_j})^2+\sum \text{Cov} \\ &\sum \text{Cov} :\text{Sum of covariances between two groups}\end{align}$$

Because ANOVA assumes that each group is independent, the terms associated with covariance in the expression above are zero. Thus, a relationship of Equation 3 is established between sum of squares.

$$\begin{align}\sum^t_{j=1}\sum^n_{i=1}(x_{ij}-\bar{X})^2 &=\sum^t_{j=1}\sum^n_{i=1}(\bar{x_j}-\bar{x})^2+\sum^t_{j=1}\sum^n_{i=1}(x_{ij}-\bar{X_j})^2\\ \text{SST}&= \text{SSTr}+\text{SSE}\\ \text{SST}: &\quad\text{sum of squares total}\\ \text{SSTr}:&\quad \text{sum of squares treatment} \\ \text{SSE}: &\quad\text{sum of squares error} \end{align}$$

The sum of squares in 3 divided by the corresponding degrees of freedom is called mean of squares and defined as in Equation 4.

$$\begin{align}\tag{4}&\begin{aligned} \text{MST}&=\frac{\sum^t_{j=1}\sum^n_{i=1}(x_{ij}-\bar{X})^2}{n \cdot t-1}\\&=\frac{\text{SST}}{n \cdot t-1}\\ \text{MSTr}&=\frac{\sum^t_{j=1}\sum^n_{i=1}(\bar{x_j}-\bar{x})^2}{t-1}\\&=\frac{\text{SSTr}}{t-1}\\ \text{MSE}&=\frac{\sum^t_{j=1}\sum^n_{i=1}(x_{ij}-\bar{X_j})^2}{n-t}\\&=\frac{\text{SSE}}{n\cdot t-t} \end{aligned}\\ &\text{MST}: \text{mean of squares total}\\&\text{MSTr}: \text{mean of squares treatment}\\&\text{MSE}: \text{mean of squares error} \end{align}$$

Example 1)
  Let's calculate str, sse, mstr, and mse for arbitrary matrices in 10 × 4 dimensions.

import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import FinanceDataReader as fdr
np.random.seed(1)
da=np.random.randint(1, 11, size=(10, 4))
da
array([[ 6,  9, 10,  6],
       [ 1,  1,  2,  8],
       [ 7, 10,  3,  5],
       [ 6,  3,  5,  3],
       [ 5,  8,  8, 10],
       [ 2,  8,  1,  7],
       [10, 10,  8,  7],
       [10,  2,  1,  2],
       [ 9,  9,  4, 10],
       [ 9,  8,  4,  7]])

Calculate each statistic for the above data da.

n, t=da.shape
N=n*t
print(f'total size: {N}, # of variable: {t}, size per var: {n}')
total size: 40, # of variable: 4, size per var: 10
grandMean=da.mean(axis=None)
groupMean=np.mean(da, axis=0)
print(f'grand mean: {np.around(grandMean, 3)}, group mean: {groupMean}')
grand mean: 6.1, group mean: [6.5 6.8 4.6 6.5]
sst=np.sum((da.flatten()-grandMean)**2)
mst=sst/(N-1)
print(f'SST:{round(sst,3)}, MST:{round(mst,3)}')
SST:377.6, MST:9.682
sstr=np.sum((groupMean-grandMean)**2*np.repeat(len(da), 4))
mstr=sstr/(t-1)
print(f'SSTr:{round(sstr,3)}, MSTr:{round(mstr,3)}')
SSTr:30.6, MSTr:10.2
sse=np.sum((da-groupMean)**2)
mse=sse/(N-t)
print(f'SSE:{round(sse,3)}, MSE:{round(mse,3)}')
SSE:347.0, MSE:9.639

Determine whether or not the difference is significant by testing the difference between the intergroup and within-group errors among the overall errors. Use the F distribution for this determination. In other words, the data for ANOVA follow a normal distribution that is independent, so their squared sum or squared means follow a chi-square distribution. Also, the ratio of each variable that follows this distribution is F distribution. Thus, the ratio of the intergroup squared mean (MSTr) to the within-group squared mean (MSE) of the total error mean (MST), as shown in Equation 3, follows the F distribution.

$$\begin{align}\tag{3}F&=\frac{\text{MSTr}}{\text{MSE}}\\&={\Large {\frac{\frac{SSTr}{t-1}}{\frac{SSE}{n \cdot t-t}}}}\end{align}$$

The F value, the result of Equation 3 is the test statistic and is tested based on the confidence interval or the significant probability (p-value)

Example 2)
  Perform a ANOVA on the data da in Example 1.

Fratio=mstr/mse
print(f'F ratio: {round(Fratio, 3)}')
F ratio: 1.058
ci=stats.f.interval(0.95, t-1, N-t)
print(f'lower: {round(ci[0],3)}, upper: {round(ci[1],3)}')
lower: 0.071, upper: 3.505
pVal=stats.f.sf(Fratio, t-1, N-t)
print(f'p-value: {round(pVal, 3)}')
p-value: 0.379

One way-Anova analysis can be analyzed using the function scipy.stats.f_oneway()

re=stats.f_oneway(da[:,0],da[:,1],da[:,2],da[:,3])
print(f'statistics: {round(re[0], 3)}, p-value: {round(re[1], 3)}')
statistics: 1.058, p-value: 0.379

Based on the results above, the null hypothesis cannot be rejected because the F ratio exists within the confidence interval and the p value of the test statistic is also greater than the significance level (0.05). That is, the population means for each group are the same.

Example 3)
  Let's run a variance analysis of the daily changes in Dow Jones (dj), nasdaq (na), S\&P 500 (snp), and VIX (vix) over a period of time.

This data can be called by the following code, and the data for analysis (da) is generated by calculating the daily rate of change in each data's Open and Close.

st=pd.Timestamp(2020,5, 10)
et=pd.Timestamp(2021, 12, 16)
code=["DJI", "IXIC", "INX", "VIX"]
nme=['dj','na','snp','vix']
da=pd.DataFrame()
for i in code:
    x=fdr.DataReader(i,st, et)
    da=pd.concat([da, (x["Close"]-x["Open"])/x["Open"]*100], axis=1)
da.columns=nme
da.head(2)
dj na snp vix
2020-05-11 00:00:00 -0.142065 1.51741 23.076923 -3.127196
2020-05-12 00:00:00 -2.173727 -2.41404 37.500000 16.051985

If there are missing values in the above data, the data should be improved by replacing them with other values or deleting them. You can use the pd.isna() and np.where() functions to locate missing values. It also applies the .dropna$^{\ref{dropna}}$ method to delete rows or columns where missing values exist.

If the material contains zero, it can be treated as missing values in various statistical functions, which can cause problems in calculations. In addition, for this example, 0 can be replaced by a value immediately before without a rate of change. Alternatively, you can use the dropna() method to delete rows or columns containing missing values. In this case, dropna() does not recognize zero as a missing value, so replace this value with np.nan and apply replace() method. However, this can lead to data loss problems.

len(np.where(da.isna())[0])
58
da1=da.replace(0, method='ffill')
da1=da1.dropna()
np.where(da1==da1.isna())
(array([], dtype=int64), array([], dtype=int64))

One way ANOVA can apply stats.f_oneway() function.

re=stats.f_oneway(da1['dj'],da1['na'],da1['snp'],da1['vix'])
print(f'statistics: {round(re[0], 3)}, p-value:{round(re[1], 3)}')
statistics: 1.12, p-value:0.34

Based on the significance level of 0.05, the above results show that the p-value is larger than the significance level, so the null hypothesis that all groups have the same population cannot be rejected.

The test statistics above are calculated by Equation 3. The ratio of numerate to denominator in the equation follows F distribution. The confidence intervals can be calculated based on the F distribution in which the degrees of freedom of numerator and denominator are parameters.

df1=da.shape[1]-1
df2=da.shape[0]*da.shape[1]-da.shape[1]
df1, df2
(3, 1660)
#confidence interval
ci=stats.f.interval(0.95, df1, df2)
print(f'lower: {round(ci[0], 3)}, upper: {round(ci[1], 3)}')
lower: 0.072, upper: 3.124
#p-Values
pVal=stats.f.sf(re[0], df1, df2)
print(f'p-value: {round(pVal, 3)}')
p-value: 0.34

The null hypothesis cannot be rejected because the test statistic exists within the confidence interval. This conclusion is the same as by p-value. The above results do not provide information such as sum of squares and mean squares for analysis. Instead, the anova_lm() function in the statmodels.api module provides more information about ANOVA. statmodels is a typical statistical package of python containing various statistical functions.

The argument of anova_lm() becomes the class instance of the regression model to be introduced in the next section. You can create objects to use the methods and properties contained in the class, which are called instance. For example, you can use the ols class within the statsmodels.formula.api module in the statmodels package to create a regression model for the relationship between a independent variable(descriptive variable) and a response variable as follows:

$$\text{response variable}= \text{weight} \cdot \text{independent variable}+\text{error}$$

In the data for ANOVA, treatment is an independent variable, and the value corresponding to each treatment is a response.

The data in this example are rearranged into independent variables and response variables, shown in Table 2.

Table 2. Data displayed as independent and response variables
treatmentvalue
dj-0.142065
$\vdots$$\vdots$
na1.51741
$\vdots$$\vdots$
snp 23.076923
$\vdots$$\vdots$
vix -3.127196
$\vdots$$\vdots$

To apply the anova_lm() function, you must create a regression model using the following steps:

  1. import statsmodels.api as sm
  2. from statsmodels.formula.api import ols
  3. model =ols('response ∼ C(independent)', data) #class instance
  4. re=model.fit()
  5. sm.anova\_lm(re)

The C() function was applied in the above relationship for argument delivery of function ols(). C() means that the variable is nominal.

To apply the anova_lm() function, switch the data in this example to the format shown in Table 3. Take the following steps:

  1. Each variable dj, na, snp, and vix are represented by 1,2,3,4.: apply np.repeat()
  2. The values of each variable are converted to a one-dimensional vector.: np.ravel() application
  3. Switch the above generated objects to pd.DataFrame objects.
da2=pd.DataFrame({'cl':np.repeat([1,2,3,4],da1.shape[0]), 'd':np.ravel(da1.values, order="F")})
da2.head(3)
cld
01-0.142065
11-2.173727
21-1.916239
da.shape
(416, 4)

Apply anova_lm() function to the data. The results are the same as above.

import statsmodels.api as sm
from statsmodels.formula.api import ols
model=ols('d ~ C(cl)', data=da2).fit()
np.around(sm.stats.anova_lm(model), 3)
df sum_sq mean_sq F PR(>F)
C(cl) 3.0 125.533 41.8443 1.12 0.334
Residual 1400.0 52308.303 37.363 NaN NaN

Two-Way ANOVA

For one-way anova, it is a method of estimating the relationship between each effect of the nominal explanatory variable (factor) and the continuous response variable. If there are more than one factor, the variation to consider increases. In this multi-factor ANOVA, the analysis of data containing two factors is called two-way anova, and the data structure is shown in Table 3.

Table 3. Data Structure for Two-Way ANOVA
factor 1(α)factor 2(α)summean
level 1 level 2level k
1$y_{11}$$y_{12}$$y_{1k}$$T_{1.}$$\bar{y}_{1.}$
2$y_{21}$$y_{22}$$y_{2k}$$T_{2.}$$\bar{y}_{2.}$
$\vdots$$\vdots$$\vdots$$\vdots$$\vdots$$\vdots$$\vdots$
n$y_{n1}$$y_{n2}$$y_{nk}$$T_{n.}$$\bar{y}_{n.}$
sum$T_{.1}$$T_{.2}$$T_{.k}$$T$
mean$\bar{y}_{.1}$$\bar{y}_{.2}$$\bar{y}_{.k}$$\bar{y}$
level= treatment

Table 3 summarizes the following:

  • factor(α, β, explanatory): 2
  • response variable(y)
  • treatment: n in factor 1 (α) and k in factor 2 (β)

A two-way ANOVA is based on the following assumptions:

  • The population of the response variable corresponds to a normal distribution.
  • The population variances between factors must be the same.
  • The population of each factor is independent.

Each response variable in Table 3 can be estimated by Equation 4:

$$\begin{align}\tag{4} &y_{ijl}=\mu + \alpha_i+\beta_j+(\alpha \beta_{ij})+e_{ijl}\\ &i: 1, 2, \cdots, n\\ &j: 1, 2, \cdot, k\\ &l:\text{Number of response variables corresponding to i, j} \end{align}$$

The model (Equation 3) consists of the main effect, which is the effect of the n-levels and the k-levels of the β, and the interaction effect, which consists of the error terms with each observation. The model can be represented by the sum of each variation, as shown in Equation 5.

$$\begin{align}\tag{5} &\text{SST}=\text{SSA}+\text{SSB}+\text{SSAB}+\text{SSE}\\ &\begin{aligned} \sum(y_{ijk}-\bar{y}_{...})^2&= \sum(y_{i..}-\bar{y}_{...})^2\\&+ \sum(y_{.j.}-\bar{y}_{...})^2\\&+ \sum(\bar{y}_{ij.}-\bar{y}_{i..}-\bar{y}_{.j.}+\bar{y}_{...})^2\\&+ \sum(y_{ijk}-\bar{y}_{i..}-\bar{y}_{.j.}-\bar{y}_{ij.})^2 \end{aligned}\\ & \text{SST:Total variation}\\ & \text{SSA: variation of factor 1}\\ & \text{SSB: variation of factor 2}\\ & \text{SSAB: variation related in interaction}\\ & \text{SSE: error variation} \end{align}$$

Test the following hypothesis based on the model above:

  • H0: Effects on factors do not exist.
  • H1: Effects exist on factors.

The analysis above should not have interactions between factors. That is, the results combined between the specific factors must equal the sum of each value. If this assumption is violated, the interaction between the two factors should be considered, and it is generally considered reasonable if the ratio of the maximum to minimum mean of the data is not greater than 0.5. This means that each variable is independent.

The results of a two-way ANOVA can be summarized in Table 4:

Table 4. Results of a two-way ANOVA
Effectsample #dfSSMSS$=\frac{\text{SS}}{\text{df}}$F
factor A aa-1SSAMSA$\frac{\text{MSA}}{\text{MSE}} \sim \text{F}$
factor B bb-1SSBMSB$\frac{\text{MSB}}{\text{MSE}} \sim \text{F}$
interaction AB ab(a-1)(b-1)SSABMSAB$\frac{\text{MSAB}}{\text{MSE}} \sim \text{F}$
Error Variation nDifference in df SSEMSE
Total Variation nn-1SST

As shown in the two-way ANOVA model, the same process as a one-way ANOVA is developed except for the addition of the main effects. Two-way ANOVA can also use the anova.anova_lm() function.

Example 4)
  The following are the results of three training methods for each of the five groups. Tests whether there are effects between groups and training methods for this material.

groupType
A BC
1 7,13,8,9,8,711,8,8,7,14,10 13,8,11,10,8,12
2 11,14,8,8,14,7 12,9,9,10,8,13 12,141,2,11,13,10
3 8,8,7,9,9,10 14,9, 8,13,8,12 9,7,11,12,13,8
4 14,11,11,9,10,7 14,11,14,11,13,13 9,9,8,7,13,9
5 8,12,12,12,11,7 10,7,8,13,14,11, 7,11,7,9,12,9

To apply the anova\_lm() function, you must clarify the descriptive and response variables and pass them to arguments in the regression model in the following format:

$$\text{response} \sim \text{descriptive}$$

Therefore, in the table above, each factor group and type is a descriptive variable, and the values corresponding to the factors are responses, so you must switch the data structure as shown in the results below. The transition of this data structure is applied by various functions such as np.repeat(), np.tile(), and the process is as shown in Example 3.

data=np.array([ 7, 13,  8,  9,  8,  7, 11, 14,  8,  8, 14,  
               7,  8,  8,  7,  9,  9, 10, 14, 11, 11,  9, 10,  
               7,  8, 12, 12, 12, 11,  7, 11,  8,  8,  7, 
               14, 10, 12,  9,  9, 10,  8, 13, 14,  9,  8, 13, 
               8, 12, 14, 11, 14, 11, 13, 13, 10,  7,  8, 13, 14, 
               11, 13,  8, 11, 10,  8, 12, 12, 14, 12, 11, 13, 10,  
               9,  7, 11, 12, 13,  8,  9,  9,  8,  7, 13,  9,  7, 11,  
               7,  9, 12,  9])
d={'type':np.repeat(['a','b','c'], 30), 
   'group': np.tile(np.repeat([1,2,3,4,5],6),3), 
   'val':data}
df=pd.DataFrame.from_dict(d)
df.head(5)
type group val
0 a 1 7
1 a 1 13
2 a 1 8
3 a 1 9
4 a 1 8

The regression model passed to the anova_lm() function is created by the ols() class. The interaction between each factor in the model is represented by *.<\p> $$\begin{align}&\text{val} \sim \text{C(group)+C(type)+C(group)*C(type)}\\ & \qquad || \\ &\text{val} \sim \text{C(group)*C(type)} \end{align}$$

import statsmodels.api as sm
from statsmodels.formula.api import ols
mod=ols('val~C(group)*C(type)', data=df).fit()
np.around(sm.stats.anova_lm(mod),3)
df sum_sq mean_sq F PR(>F)
C(group) 4.0 24.333 6.083 1.239 0.302
C(type) 2.0 18.200 9.100 1.854 0.164
C(group):C(type) 8.0 61.800 7.725 1.574 0.147
Residual 75.0 368.167 4.909 NaN NaN

All of the following null hypotheses cannot be rejected because the above results show that the significance probability for the main effects for both factor groups and types is greater than the significance level of 0.05, and the interaction is also greater than 0.05.

$$\begin{align}&\text{H0}_1:\text{The effect of the group does not exist.} \\ &\text{H0}_2:\text{The effect of the type does not exist.}\\ &\text{H0}_3:\text{The effect of type and group interaction does not exist.} \end{align}$$

Example 5)
  Determines the effect of the increase in daily changes in nasdaq(na) and VIX(vix) on the daily changes in Google(go).

Generates nominal variables indicating the daily variation of na and vix as increasing and decreasing. This process uses the pd.cut()$^{\ref{cut}}$ function. This nominal variable is used in the model and each group (class) must be numerically represented. For this example, -1 and 1 are assigned to show decrease and increase, respectively.

st=pd.Timestamp(2020,5, 10)
et=pd.Timestamp(2021, 12, 20)
naO=fdr.DataReader("IXIC",st, et)
na1=(naO["Close"]-naO["Open"])/naO["Open"]*100
vixO=fdr.DataReader("VIX",st, et)
vix1=(vixO["Close"]-vixO["Open"])/vixO["Open"]*100
goO=fdr.DataReader("GOOGL",st, et)
go=(goO["Close"]-goO["Open"])/goO["Open"]*100
go.tail(2)
Date
2021-12-17   -0.543860
2021-12-20    1.147857
dtype: float64
na=pd.cut(na1, bins=[-30,0,30], labels=[-1, 1])
vix=pd.cut(vix1, bins=[-30,0,30], labels=[-1, 1])
vix[-2:]
Date
2021-12-17    1
2021-12-20   -1
dtype: category
Categories (2, int64): [-1 < 1]
na[-2:]
Date
2021-12-17    1
2021-12-20    1
dtype: category
Categories (2, int64): [-1 < 1] 

Apply the pd.DataFrame() function to merge each data.

da=pd.DataFrame([na, vix, go]).T
da.columns=['na','vix','go']
da.tail(3)
na vix go
Date
2021-12-16 -1.0 1.0 -1.851934
2021-12-17 1.0 1.0 -0.543860
2021-12-20 1.0 -1.0 1.147857

This process may include missing values in the material. The inclusion of missing values can be determined by DataFrame.isna. In this example, the DataFrame.replace() function is applied to replace the immediate value with a missing value.

da1=da.replace(np.nan, method="ffill")
da1=da.replace(0, method="ffill")
np.where(da1.isna())
(array([ 22,  38,  38, 141, 141, 162, 162, 183, 203, 229, 229, 395, 395]),
 array([1, 0, 2, 0, 2, 0, 2, 1, 1, 0, 2, 0, 2]))

Create a regression model using the ols() class for data da1 and run an analysis of variance of that model using anova_lm() function.

import statsmodels.api as sm
from statsmodels.formula.api import ols
model=ols('go~C(na)*C(vix)', data=da1).fit()
np.around(sm.stats.anova_lm(model), 3)
df sum_sq mean_sq F PR(>F)
C(na) 1.0 202.242 202.242 187.090 0.000
C(vix) 1.0 9.746 9.746 9.015 0.003
C(na):C(vix) 1.0 2.957 2.957 2.736 0.099
Residual 401.0 433.476 1.081 NaN NaN

Based on the above results, the null hypothesis of the main effects of na, vix and the effect of the interaction of the two variables based on the significance levels of 0.05 may be rejected. In other words, they affect the daily rate of change in go.

댓글

이 블로그의 인기 게시물

[Linear Algebra] 유사변환(Similarity transformation)

유사변환(Similarity transformation) n×n 차원의 정방 행렬 A, B 그리고 가역 행렬 P 사이에 식 1의 관계가 성립하면 행렬 A와 B는 유사행렬(similarity matrix)이 되며 행렬 A를 가역행렬 P와 B로 분해하는 것을 유사 변환(similarity transformation) 이라고 합니다. $$\tag{1} A = PBP^{-1} \Leftrightarrow P^{-1}AP = B $$ 식 2는 식 1의 양변에 B의 고유값을 고려한 것입니다. \begin{align}\tag{식 2} B - \lambda I &= P^{-1}AP – \lambda P^{-1}P\\ &= P^{-1}(AP – \lambda P)\\ &= P^{-1}(A - \lambda I)P \end{align} 식 2의 행렬식은 식 3과 같이 정리됩니다. \begin{align} &\begin{aligned}\textsf{det}(B - \lambda I ) & = \textsf{det}(P^{-1}(AP – \lambda P))\\ &= \textsf{det}(P^{-1}) \textsf{det}((A – \lambda I)) \textsf{det}(P)\\ &= \textsf{det}(P^{-1}) \textsf{det}(P) \textsf{det}((A – \lambda I))\\ &= \textsf{det}(A – \lambda I)\end{aligned}\\ &\begin{aligned}\because \; \textsf{det}(P^{-1}) \textsf{det}(P) &= \textsf{det}(P^{-1}P)\\ &= \textsf{det}(I)\end{aligned}\end{align} 유사행렬의 특성 유사행렬인 두 정방행렬 A와 B는 'A ~ B' 와 같

[matplotlib] 히스토그램(Histogram)

히스토그램(Histogram) 히스토그램은 확률분포의 그래픽적인 표현이며 막대그래프의 종류입니다. 이 그래프가 확률분포와 관계가 있으므로 통계적 요소를 나타내기 위해 많이 사용됩니다. plt.hist(X, bins=10)함수를 사용합니다. x=np.random.randn(1000) plt.hist(x, 10) plt.show() 위 그래프의 y축은 각 구간에 해당하는 갯수이다. 빈도수 대신 확률밀도를 나타내기 위해서는 위 함수의 매개변수 normed=True로 조정하여 나타낼 수 있다. 또한 매개변수 bins의 인수를 숫자로 전달할 수 있지만 리스트 객체로 지정할 수 있다. 막대그래프의 경우와 마찬가지로 각 막대의 폭은 매개변수 width에 의해 조정된다. y=np.linspace(min(x)-1, max(x)+1, 10) y array([-4.48810153, -3.54351935, -2.59893717, -1.65435499, -0.70977282, 0.23480936, 1.17939154, 2.12397372, 3.0685559 , 4.01313807]) plt.hist(x, y, normed=True) plt.show()

R 미분과 적분

내용 expression 미분 2차 미분 mosaic를 사용한 미분 적분 미분과 적분 R에서의 미분과 적분 함수는 expression()함수에 의해 생성된 표현식을 대상으로 합니다. expression expression(문자, 또는 식) 이 표현식의 평가는 eval() 함수에 의해 실행됩니다. > ex1<-expression(1+0:9) > ex1 expression(1 + 0:9) > eval(ex1) [1] 1 2 3 4 5 6 7 8 9 10 > ex2<-expression(u, 2, u+0:9) > ex2 expression(u, 2, u + 0:9) > ex2[1] expression(u) > ex2[2] expression(2) > ex2[3] expression(u + 0:9) > u<-0.9 > eval(ex2[3]) [1] 0.9 1.9 2.9 3.9 4.9 5.9 6.9 7.9 8.9 9.9 미분 D(표현식, 미분 변수) 함수로 미분을 실행합니다. 이 함수의 표현식은 expression() 함수로 생성된 객체이며 미분 변수는 다음 식의 분모의 변수를 의미합니다. $$\frac{d}{d \text{변수}}\text{표현식}$$ 이 함수는 어떤 함수의 미분의 결과를 표현식으로 반환합니다. > D(expression(2*x^3), "x") 2 * (3 * x^2) > eq<-expression(log(x)) > eq expression(log(x)) > D(eq, "x") 1/x > eq2<-expression(a/(1+b*exp(-d*x))); eq2 expression(a/(1 + b * exp(-d * x))) > D(eq2, "x") a * (b * (exp(-d * x) * d))/(1 + b