Contents
Analysis of variance
Analysis of variance(ANOVA) is a statistical method that tests the null hypothesis that all groups have the same mean by comparing the variation within and between groups in two or more groups. The tests for two groups have applied a normal or t distribution, but to compare more groups, use an F distribution that compares the degree of variability between groups.
The data for ANOVA consist of the nominal variables (factors) that are being compared and the values for each factor, i.e., response variable. Each factor can be classified into several sub-groups and the factors in this group are called treatment (factor levels. The analysis of a single response corresponding to the factor level is called one-way ANOVAA} (anova), or multiple responses are called manova Multivariate analysis is beyond the scope of this book, but one-way and two-way variance analysis can be the foundation for that analysis.
The null hypothesis of ANOVA is as follows:
$$\text{H0}: \mu_1 = \mu_2 = \cdots = \mu_n$$TAnalysis of variance assumes:
- Each population follows a normal distribution.
- All populations have the same variance.
- Observations must be independent.
ANOVA
As shown in Table 1, one-way ANOVA has no distinction between factor and treatment. Therefore, treatment becomes a descriptive variable and analyzes the differences between the variables.
Treatment | 1 | 2 | $\cdots$ | t |
---|---|---|---|---|
sample | $x_{11}$ | $x_{12}$ | $\cdots$ | $x_{1t}$ |
$x_{21}$ | $x_{22}$ | $\cdots$ | $x_{2t}$ | |
$\vdots$ | $\vdots$ | $\ddots$ | $\vdots$ | |
$x_{n1}$ | $x_{n2}$ | $\cdots$ | $x_{nt}$ | |
Mean | $\bar{X_1}$ | $\bar{X_2}$ | $\cdots$ | $\bar{X_t}$ |
Grand mean | $\bar{X_{..}}$ |
From Table 1, each value can be represented by a generalized model, as shown in Equation 1.
$$\begin{align}\tag{1} &\text{Individual value} = \text{Group mean + error}\\ &x_{ij} = \mu_{i} + e_{ij} \\ & i: 1, 2, \cdots, n\\ & j: 1, 2, \cdots, t \end{align}$$The model assumes that each variable is independent and conforms to a normal distribution, so the error term (e) also follows a normal distribution with zero mean and a constant variance. In other words, it can be expressed as follows:
$$\text{e} \sim N(0, \sigma^2)$$This error term can be represented by decomposition as shown in Equation 2.
$$\begin{align}\tag{2}&\text{total error} = \text{intergroup error}-\text{within-group error}\\ &(x_{ij}- x) = (\overline{X_{j}}-\overline{X}) + (x_{ij} - \overline{X_j}) \end{align} $$The sum of the squares on both sides of Equation 2 is as follows:
$$\begin{align}&\sum^t_{j=1}\sum^n_{i=1}(x_{ij}-\bar{X})^2\\ &=\sum^t_{j=1}\sum^n_{i=1}(\bar{x_j}-\bar{x})^2+\sum^t_{j=1}\sum^n_{i=1}(x_{ij}-\bar{X_j})^2+\sum \text{Cov} \\ &\sum \text{Cov} :\text{Sum of covariances between two groups}\end{align}$$Because ANOVA assumes that each group is independent, the terms associated with covariance in the expression above are zero. Thus, a relationship of Equation 3 is established between sum of squares.
$$\begin{align}\sum^t_{j=1}\sum^n_{i=1}(x_{ij}-\bar{X})^2 &=\sum^t_{j=1}\sum^n_{i=1}(\bar{x_j}-\bar{x})^2+\sum^t_{j=1}\sum^n_{i=1}(x_{ij}-\bar{X_j})^2\\ \text{SST}&= \text{SSTr}+\text{SSE}\\ \text{SST}: &\quad\text{sum of squares total}\\ \text{SSTr}:&\quad \text{sum of squares treatment} \\ \text{SSE}: &\quad\text{sum of squares error} \end{align}$$The sum of squares in 3 divided by the corresponding degrees of freedom is called mean of squares and defined as in Equation 4.
$$\begin{align}\tag{4}&\begin{aligned} \text{MST}&=\frac{\sum^t_{j=1}\sum^n_{i=1}(x_{ij}-\bar{X})^2}{n \cdot t-1}\\&=\frac{\text{SST}}{n \cdot t-1}\\ \text{MSTr}&=\frac{\sum^t_{j=1}\sum^n_{i=1}(\bar{x_j}-\bar{x})^2}{t-1}\\&=\frac{\text{SSTr}}{t-1}\\ \text{MSE}&=\frac{\sum^t_{j=1}\sum^n_{i=1}(x_{ij}-\bar{X_j})^2}{n-t}\\&=\frac{\text{SSE}}{n\cdot t-t} \end{aligned}\\ &\text{MST}: \text{mean of squares total}\\&\text{MSTr}: \text{mean of squares treatment}\\&\text{MSE}: \text{mean of squares error} \end{align}$$Example 1)
Let's calculate str, sse, mstr, and mse for arbitrary matrices in 10 × 4 dimensions.
import numpy as np import pandas as pd from scipy import stats import matplotlib.pyplot as plt import FinanceDataReader as fdr
np.random.seed(1) da=np.random.randint(1, 11, size=(10, 4)) da
array([[ 6, 9, 10, 6], [ 1, 1, 2, 8], [ 7, 10, 3, 5], [ 6, 3, 5, 3], [ 5, 8, 8, 10], [ 2, 8, 1, 7], [10, 10, 8, 7], [10, 2, 1, 2], [ 9, 9, 4, 10], [ 9, 8, 4, 7]])
Calculate each statistic for the above data da.
n, t=da.shape N=n*t print(f'total size: {N}, # of variable: {t}, size per var: {n}')
total size: 40, # of variable: 4, size per var: 10
grandMean=da.mean(axis=None) groupMean=np.mean(da, axis=0) print(f'grand mean: {np.around(grandMean, 3)}, group mean: {groupMean}')
grand mean: 6.1, group mean: [6.5 6.8 4.6 6.5]
sst=np.sum((da.flatten()-grandMean)**2) mst=sst/(N-1) print(f'SST:{round(sst,3)}, MST:{round(mst,3)}')
SST:377.6, MST:9.682
sstr=np.sum((groupMean-grandMean)**2*np.repeat(len(da), 4)) mstr=sstr/(t-1) print(f'SSTr:{round(sstr,3)}, MSTr:{round(mstr,3)}')
SSTr:30.6, MSTr:10.2
sse=np.sum((da-groupMean)**2) mse=sse/(N-t) print(f'SSE:{round(sse,3)}, MSE:{round(mse,3)}')
SSE:347.0, MSE:9.639
Determine whether or not the difference is significant by testing the difference between the intergroup and within-group errors among the overall errors. Use the F distribution for this determination. In other words, the data for ANOVA follow a normal distribution that is independent, so their squared sum or squared means follow a chi-square distribution. Also, the ratio of each variable that follows this distribution is F distribution. Thus, the ratio of the intergroup squared mean (MSTr) to the within-group squared mean (MSE) of the total error mean (MST), as shown in Equation 3, follows the F distribution.
$$\begin{align}\tag{3}F&=\frac{\text{MSTr}}{\text{MSE}}\\&={\Large {\frac{\frac{SSTr}{t-1}}{\frac{SSE}{n \cdot t-t}}}}\end{align}$$The F value, the result of Equation 3 is the test statistic and is tested based on the confidence interval or the significant probability (p-value)
Example 2)
Perform a ANOVA on the data da in Example 1.
Fratio=mstr/mse print(f'F ratio: {round(Fratio, 3)}')
F ratio: 1.058
ci=stats.f.interval(0.95, t-1, N-t) print(f'lower: {round(ci[0],3)}, upper: {round(ci[1],3)}')
lower: 0.071, upper: 3.505
pVal=stats.f.sf(Fratio, t-1, N-t) print(f'p-value: {round(pVal, 3)}')
p-value: 0.379
One way-Anova analysis can be analyzed using the function scipy.stats.f_oneway()
re=stats.f_oneway(da[:,0],da[:,1],da[:,2],da[:,3]) print(f'statistics: {round(re[0], 3)}, p-value: {round(re[1], 3)}')
statistics: 1.058, p-value: 0.379
Based on the results above, the null hypothesis cannot be rejected because the F ratio exists within the confidence interval and the p value of the test statistic is also greater than the significance level (0.05). That is, the population means for each group are the same.
Example 3)
Let's run a variance analysis of the daily changes in Dow Jones (dj), nasdaq (na), S\&P 500 (snp), and VIX (vix) over a period of time.
This data can be called by the following code, and the data for analysis (da) is generated by calculating the daily rate of change in each data's Open and Close.
st=pd.Timestamp(2020,5, 10) et=pd.Timestamp(2021, 12, 16) code=["DJI", "IXIC", "INX", "VIX"] nme=['dj','na','snp','vix'] da=pd.DataFrame() for i in code: x=fdr.DataReader(i,st, et) da=pd.concat([da, (x["Close"]-x["Open"])/x["Open"]*100], axis=1) da.columns=nme da.head(2)
dj | na | snp | vix | |
---|---|---|---|---|
2020-05-11 00:00:00 | -0.142065 | 1.51741 | 23.076923 | -3.127196 |
2020-05-12 00:00:00 | -2.173727 | -2.41404 | 37.500000 | 16.051985 |
If there are missing values in the above data, the data should be improved by replacing them with other values or deleting them. You can use the pd.isna()
and np.where()
functions to locate missing values. It also applies the .dropna$^{\ref{dropna}}$ method to delete rows or columns where missing values exist.
If the material contains zero, it can be treated as missing values in various statistical functions, which can cause problems in calculations. In addition, for this example, 0 can be replaced by a value immediately before without a rate of change. Alternatively, you can use the dropna()
method to delete rows or columns containing missing values. In this case, dropna() does not recognize zero as a missing value, so replace this value with np.nan
and apply replace()
method. However, this can lead to data loss problems.
len(np.where(da.isna())[0])
58
da1=da.replace(0, method='ffill') da1=da1.dropna() np.where(da1==da1.isna())
(array([], dtype=int64), array([], dtype=int64))
One way ANOVA can apply stats.f_oneway()
function.
re=stats.f_oneway(da1['dj'],da1['na'],da1['snp'],da1['vix']) print(f'statistics: {round(re[0], 3)}, p-value:{round(re[1], 3)}')
statistics: 1.12, p-value:0.34
Based on the significance level of 0.05, the above results show that the p-value is larger than the significance level, so the null hypothesis that all groups have the same population cannot be rejected.
The test statistics above are calculated by Equation 3. The ratio of numerate to denominator in the equation follows F distribution. The confidence intervals can be calculated based on the F distribution in which the degrees of freedom of numerator and denominator are parameters.
df1=da.shape[1]-1 df2=da.shape[0]*da.shape[1]-da.shape[1] df1, df2
(3, 1660)
#confidence interval ci=stats.f.interval(0.95, df1, df2) print(f'lower: {round(ci[0], 3)}, upper: {round(ci[1], 3)}')
lower: 0.072, upper: 3.124
#p-Values pVal=stats.f.sf(re[0], df1, df2) print(f'p-value: {round(pVal, 3)}')
p-value: 0.34
The null hypothesis cannot be rejected because the test statistic exists within the confidence interval. This conclusion is the same as by p-value. The above results do not provide information such as sum of squares and mean squares for analysis. Instead, the anova_lm()
function in the statmodels.api module provides more information about ANOVA. statmodels is a typical statistical package of python containing various statistical functions.
The argument of anova_lm() becomes the class instance of the regression model to be introduced in the next section. You can create objects to use the methods and properties contained in the class, which are called instance. For example, you can use the ols
class within the statsmodels.formula.api module in the statmodels package to create a regression model for the relationship between a independent variable(descriptive variable) and a response variable as follows:
In the data for ANOVA, treatment is an independent variable, and the value corresponding to each treatment is a response.
The data in this example are rearranged into independent variables and response variables, shown in Table 2.
treatment | value |
---|---|
dj | -0.142065 |
$\vdots$ | $\vdots$ |
na | 1.51741 |
$\vdots$ | $\vdots$ |
snp | 23.076923 |
$\vdots$ | $\vdots$ |
vix | -3.127196 |
$\vdots$ | $\vdots$ |
To apply the anova_lm()
function, you must create a regression model using the following steps:
- import statsmodels.api as sm
- from statsmodels.formula.api import ols
- model =ols('response ∼ C(independent)', data) #class instance
- re=model.fit()
- sm.anova\_lm(re)
The C()
function was applied in the above relationship for argument delivery of function ols()
. C() means that the variable is nominal.
To apply the anova_lm()
function, switch the data in this example to the format shown in Table 3. Take the following steps:
- Each variable dj, na, snp, and vix are represented by 1,2,3,4.: apply
np.repeat()
- The values of each variable are converted to a one-dimensional vector.:
np.ravel()
application - Switch the above generated objects to pd.DataFrame objects.
da2=pd.DataFrame({'cl':np.repeat([1,2,3,4],da1.shape[0]), 'd':np.ravel(da1.values, order="F")}) da2.head(3)
cl | d | |
---|---|---|
0 | 1 | -0.142065 |
1 | 1 | -2.173727 |
2 | 1 | -1.916239 |
da.shape
(416, 4)
Apply anova_lm() function to the data. The results are the same as above.
import statsmodels.api as sm from statsmodels.formula.api import ols model=ols('d ~ C(cl)', data=da2).fit() np.around(sm.stats.anova_lm(model), 3)
df | sum_sq | mean_sq | F | PR(>F) | |
---|---|---|---|---|---|
C(cl) | 3.0 | 125.533 | 41.8443 | 1.12 | 0.334 |
Residual | 1400.0 | 52308.303 | 37.363 | NaN | NaN |
Two-Way ANOVA
For one-way anova, it is a method of estimating the relationship between each effect of the nominal explanatory variable (factor) and the continuous response variable. If there are more than one factor, the variation to consider increases. In this multi-factor ANOVA, the analysis of data containing two factors is called two-way anova, and the data structure is shown in Table 3.
factor 1(α) | factor 2(α) | sum | mean | |||
---|---|---|---|---|---|---|
level 1 | level 2 | … | level k | |||
1 | $y_{11}$ | $y_{12}$ | … | $y_{1k}$ | $T_{1.}$ | $\bar{y}_{1.}$ |
2 | $y_{21}$ | $y_{22}$ | … | $y_{2k}$ | $T_{2.}$ | $\bar{y}_{2.}$ |
$\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ |
n | $y_{n1}$ | $y_{n2}$ | … | $y_{nk}$ | $T_{n.}$ | $\bar{y}_{n.}$ |
sum | $T_{.1}$ | $T_{.2}$ | … | $T_{.k}$ | $T$ | |
mean | $\bar{y}_{.1}$ | $\bar{y}_{.2}$ | … | $\bar{y}_{.k}$ | $\bar{y}$ | |
level= treatment |
Table 3 summarizes the following:
- factor(α, β, explanatory): 2
- response variable(y)
- treatment: n in factor 1 (α) and k in factor 2 (β)
A two-way ANOVA is based on the following assumptions:
- The population of the response variable corresponds to a normal distribution.
- The population variances between factors must be the same.
- The population of each factor is independent.
Each response variable in Table 3 can be estimated by Equation 4:
$$\begin{align}\tag{4} &y_{ijl}=\mu + \alpha_i+\beta_j+(\alpha \beta_{ij})+e_{ijl}\\ &i: 1, 2, \cdots, n\\ &j: 1, 2, \cdot, k\\ &l:\text{Number of response variables corresponding to i, j} \end{align}$$The model (Equation 3) consists of the main effect, which is the effect of the n-levels and the k-levels of the β, and the interaction effect, which consists of the error terms with each observation. The model can be represented by the sum of each variation, as shown in Equation 5.
$$\begin{align}\tag{5} &\text{SST}=\text{SSA}+\text{SSB}+\text{SSAB}+\text{SSE}\\ &\begin{aligned} \sum(y_{ijk}-\bar{y}_{...})^2&= \sum(y_{i..}-\bar{y}_{...})^2\\&+ \sum(y_{.j.}-\bar{y}_{...})^2\\&+ \sum(\bar{y}_{ij.}-\bar{y}_{i..}-\bar{y}_{.j.}+\bar{y}_{...})^2\\&+ \sum(y_{ijk}-\bar{y}_{i..}-\bar{y}_{.j.}-\bar{y}_{ij.})^2 \end{aligned}\\ & \text{SST:Total variation}\\ & \text{SSA: variation of factor 1}\\ & \text{SSB: variation of factor 2}\\ & \text{SSAB: variation related in interaction}\\ & \text{SSE: error variation} \end{align}$$Test the following hypothesis based on the model above:
- H0: Effects on factors do not exist.
- H1: Effects exist on factors.
The analysis above should not have interactions between factors. That is, the results combined between the specific factors must equal the sum of each value. If this assumption is violated, the interaction between the two factors should be considered, and it is generally considered reasonable if the ratio of the maximum to minimum mean of the data is not greater than 0.5. This means that each variable is independent.
The results of a two-way ANOVA can be summarized in Table 4:
Effect | sample # | df | SS | MSS$=\frac{\text{SS}}{\text{df}}$ | F |
---|---|---|---|---|---|
factor A | a | a-1 | SSA | MSA | $\frac{\text{MSA}}{\text{MSE}} \sim \text{F}$ |
factor B | b | b-1 | SSB | MSB | $\frac{\text{MSB}}{\text{MSE}} \sim \text{F}$ |
interaction AB | ab | (a-1)(b-1) | SSAB | MSAB | $\frac{\text{MSAB}}{\text{MSE}} \sim \text{F}$ |
Error Variation | n | Difference in df | SSE | MSE | |
Total Variation | n | n-1 | SST |
As shown in the two-way ANOVA model, the same process as a one-way ANOVA is developed except for the addition of the main effects. Two-way ANOVA can also use the anova.anova_lm()
function.
Example 4)
The following are the results of three training methods for each of the five groups. Tests whether there are effects between groups and training methods for this material.
group | Type | ||
---|---|---|---|
A | B | C | |
1 | 7,13,8,9,8,7 | 11,8,8,7,14,10 | 13,8,11,10,8,12 |
2 | 11,14,8,8,14,7 | 12,9,9,10,8,13 | 12,141,2,11,13,10 |
3 | 8,8,7,9,9,10 | 14,9, 8,13,8,12 | 9,7,11,12,13,8 |
4 | 14,11,11,9,10,7 | 14,11,14,11,13,13 | 9,9,8,7,13,9 |
5 | 8,12,12,12,11,7 | 10,7,8,13,14,11, | 7,11,7,9,12,9 |
To apply the anova\_lm() function, you must clarify the descriptive and response variables and pass them to arguments in the regression model in the following format:
$$\text{response} \sim \text{descriptive}$$Therefore, in the table above, each factor group and type is a descriptive variable, and the values corresponding to the factors are responses, so you must switch the data structure as shown in the results below. The transition of this data structure is applied by various functions such as np.repeat()
, np.tile()
, and the process is as shown in Example 3.
data=np.array([ 7, 13, 8, 9, 8, 7, 11, 14, 8, 8, 14, 7, 8, 8, 7, 9, 9, 10, 14, 11, 11, 9, 10, 7, 8, 12, 12, 12, 11, 7, 11, 8, 8, 7, 14, 10, 12, 9, 9, 10, 8, 13, 14, 9, 8, 13, 8, 12, 14, 11, 14, 11, 13, 13, 10, 7, 8, 13, 14, 11, 13, 8, 11, 10, 8, 12, 12, 14, 12, 11, 13, 10, 9, 7, 11, 12, 13, 8, 9, 9, 8, 7, 13, 9, 7, 11, 7, 9, 12, 9]) d={'type':np.repeat(['a','b','c'], 30), 'group': np.tile(np.repeat([1,2,3,4,5],6),3), 'val':data} df=pd.DataFrame.from_dict(d) df.head(5)
type | group | val | |
---|---|---|---|
0 | a | 1 | 7 |
1 | a | 1 | 13 |
2 | a | 1 | 8 |
3 | a | 1 | 9 |
4 | a | 1 | 8 |
The regression model passed to the anova_lm()
function is created by the ols() class. The interaction between each factor in the model is represented by *.<\p>
$$\begin{align}&\text{val} \sim \text{C(group)+C(type)+C(group)*C(type)}\\
& \qquad || \\
&\text{val} \sim \text{C(group)*C(type)} \end{align}$$
import statsmodels.api as sm from statsmodels.formula.api import ols mod=ols('val~C(group)*C(type)', data=df).fit() np.around(sm.stats.anova_lm(mod),3)
df | sum_sq | mean_sq | F | PR(>F) | |
---|---|---|---|---|---|
C(group) | 4.0 | 24.333 | 6.083 | 1.239 | 0.302 |
C(type) | 2.0 | 18.200 | 9.100 | 1.854 | 0.164 |
C(group):C(type) | 8.0 | 61.800 | 7.725 | 1.574 | 0.147 |
Residual | 75.0 | 368.167 | 4.909 | NaN | NaN |
All of the following null hypotheses cannot be rejected because the above results show that the significance probability for the main effects for both factor groups and types is greater than the significance level of 0.05, and the interaction is also greater than 0.05.
$$\begin{align}&\text{H0}_1:\text{The effect of the group does not exist.} \\ &\text{H0}_2:\text{The effect of the type does not exist.}\\ &\text{H0}_3:\text{The effect of type and group interaction does not exist.} \end{align}$$Example 5)
Determines the effect of the increase in daily changes in nasdaq(na) and VIX(vix) on the daily changes in Google(go).
Generates nominal variables indicating the daily variation of na and vix as increasing and decreasing. This process uses the pd.cut()$^{\ref{cut}}$ function. This nominal variable is used in the model and each group (class) must be numerically represented. For this example, -1 and 1 are assigned to show decrease and increase, respectively.
st=pd.Timestamp(2020,5, 10) et=pd.Timestamp(2021, 12, 20) naO=fdr.DataReader("IXIC",st, et) na1=(naO["Close"]-naO["Open"])/naO["Open"]*100 vixO=fdr.DataReader("VIX",st, et) vix1=(vixO["Close"]-vixO["Open"])/vixO["Open"]*100 goO=fdr.DataReader("GOOGL",st, et) go=(goO["Close"]-goO["Open"])/goO["Open"]*100 go.tail(2)
Date 2021-12-17 -0.543860 2021-12-20 1.147857 dtype: float64
na=pd.cut(na1, bins=[-30,0,30], labels=[-1, 1]) vix=pd.cut(vix1, bins=[-30,0,30], labels=[-1, 1]) vix[-2:]
Date 2021-12-17 1 2021-12-20 -1 dtype: category Categories (2, int64): [-1 < 1]
na[-2:]
Date 2021-12-17 1 2021-12-20 1 dtype: category Categories (2, int64): [-1 < 1]
Apply the pd.DataFrame()
function to merge each data.
da=pd.DataFrame([na, vix, go]).T da.columns=['na','vix','go'] da.tail(3)
na | vix | go | |
---|---|---|---|
Date | |||
2021-12-16 | -1.0 | 1.0 | -1.851934 |
2021-12-17 | 1.0 | 1.0 | -0.543860 |
2021-12-20 | 1.0 | -1.0 | 1.147857 |
This process may include missing values in the material. The inclusion of missing values can be determined by DataFrame.isna
. In this example, the DataFrame.replace()
function is applied to replace the immediate value with a missing value.
da1=da.replace(np.nan, method="ffill") da1=da.replace(0, method="ffill") np.where(da1.isna())
(array([ 22, 38, 38, 141, 141, 162, 162, 183, 203, 229, 229, 395, 395]), array([1, 0, 2, 0, 2, 0, 2, 1, 1, 0, 2, 0, 2]))
Create a regression model using the ols() class for data da1 and run an analysis of variance of that model using anova_lm() function.
import statsmodels.api as sm from statsmodels.formula.api import ols model=ols('go~C(na)*C(vix)', data=da1).fit() np.around(sm.stats.anova_lm(model), 3)
df | sum_sq | mean_sq | F | PR(>F) | |
---|---|---|---|---|---|
C(na) | 1.0 | 202.242 | 202.242 | 187.090 | 0.000 |
C(vix) | 1.0 | 9.746 | 9.746 | 9.015 | 0.003 |
C(na):C(vix) | 1.0 | 2.957 | 2.957 | 2.736 | 0.099 |
Residual | 401.0 | 433.476 | 1.081 | NaN | NaN |
Based on the above results, the null hypothesis of the main effects of na, vix and the effect of the interaction of the two variables based on the significance levels of 0.05 may be rejected. In other words, they affect the daily rate of change in go.
댓글
댓글 쓰기