Comparison of two independent groups
The means of two independent random variables, X and Y, each of which follows a normal distribution, can be compared by applying a hypothesis test:
$$\begin{align}\bar{X}&=\frac{\sum^{n_X}_{i=1} X_i}{n_X} \sim N\left(\mu_X, \frac{\sigma^2_X}{n_X}\right)\\ \bar{Y}&=\frac{\sum^{n_Y}_{i=1} Y_i}{n_Y} \sim N\left(\mu_Y, \frac{\sigma^2_Y}{n_Y}\right)\end{align}$$Even if each sample group is not drawn from a normal distribution, the sample means are approximately normally distributed by the central limit theorem. To compare the two groups, set the following null hypothesis for the difference between the means:
$$\text{H0} : \mu_X -\mu_Y =0$$The test statistic is calculated from the combined distribution of the two groups. That is, the mean and variance of the combined probability distribution of X and Y are calculated as shown in Equation 1.
$$\begin{align}\tag{1} E(\bar{X}-\bar{Y})&=E(\bar{X})-E(\bar{Y})\\ &=\mu_X -\mu_Y\\ \text{Var}(\bar{X}-\bar{Y}) &=\text{Var}(\bar{X})+\text{Var}(\bar{Y})-2\,\text{Cov}(\bar{X},\bar{Y})\\ &=\frac{\sigma^2_X}{n_X}+\frac{\sigma^2_Y}{n_Y}\\ \text{Cov}(\bar{X},\bar{Y})&=0\quad \because \; X, Y: \text{independent}\end{align}$$In Equation 1, Cov denotes the covariance, which accounts for the interaction between the two groups; it is zero under the assumption that the two groups are independent. As a result, the combined probability distribution of the two groups is as follows:
$$\text{N}\left(\mu_X-\mu_Y, \frac{\sigma^2_X}{n_X}+\frac{\sigma^2_Y}{n_Y}\right)$$Based on this distribution, the test statistic is evaluated at the specified significance level. The probability distribution used for the test is chosen according to the sample size, as shown in Table 1. In general the normal distribution is used, but when the sample size is small the t distribution is applied.
Sample size | Assumption | Method |
---|---|---|
small | equal variances | t distribution with the pooled standard deviation |
small | different variances | t distribution with each sample's standard deviation; degrees of freedom calculated separately |
large | - | normal distribution |
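As a quick numerical check of Equation 1, the following sketch simulates two independent normal samples repeatedly and compares the empirical mean and variance of the difference of sample means with $\mu_X-\mu_Y$ and $\frac{\sigma^2_X}{n_X}+\frac{\sigma^2_Y}{n_Y}$. All parameter values here are arbitrary choices for illustration.

```python
import numpy as np

rng=np.random.default_rng(0)
mu_x, mu_y, sd_x, sd_y=1.0, 0.5, 2.0, 3.0     # arbitrary illustrative parameters
n_x, n_y, n_sim=20, 25, 100000

# sampling distribution of the difference of the two sample means
xbar=rng.normal(mu_x, sd_x, (n_sim, n_x)).mean(axis=1)
ybar=rng.normal(mu_y, sd_y, (n_sim, n_y)).mean(axis=1)
diff=xbar-ybar

print(round(diff.mean(), 4), mu_x-mu_y)                 # empirical vs. theoretical mean
print(round(diff.var(), 4), sd_x**2/n_x+sd_y**2/n_y)    # empirical vs. theoretical variance
```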
Equal Variances in Small Samples
Typically, if the sample size is less than or equal to 30, the t distribution is used instead of the normal distribution. In addition, if the two samples can be assumed to come from populations with the same variance, for example samples from the same or a similar population, the combined distribution can be represented by the following expression:
$$\bar{X}-\bar{Y} \sim \text{N}\left( \mu_X-\mu_Y, \sigma^2 \left(\frac{1}{n_X}+\frac{1}{n_Y}\right)\right)$$For an analysis based on the standard normal distribution, i.e., a z-test, the z statistic is calculated as follows:
$$z=\frac{\bar{X}-\mu_X}{\frac{\sigma}{\sqrt{n_X}}}$$However, in practice the population standard deviation (σ) is often unknown. In this case, the sample standard deviation is used as a substitute, and when the sample is small the statistic follows a t distribution with the degrees of freedom as a parameter. Therefore, the following statistic is calculated based on the t distribution:
$$\begin{aligned}&t=\frac{\bar{X}-\mu_X}{\frac{\text{s}}{\sqrt{n_X}}} \sim t(x:0, 1, n_X-1)\\ &\bar{X}=\frac{X_1+X_2+\cdots+X_{n_X}}{n_X}\\&\text{s}^2=\frac{\sum^{n_X}_{i=1}(X_i-\bar{X})^2}{n_X-1} \end{aligned}$$If the two distributions can be assumed to have the same variance, the pooled standard deviation is used as the combined standard deviation of the two groups. As shown in Equation 2, the pooled variance is a weighted mean of the two sample variances, with weights given by each group's degrees of freedom.
$$\begin{equation}\tag{2} \text{s}^2_p=\frac{(n_1-1)s^2_1+(n_2-1)s^2_2}{n_1+n_2-2} \end{equation}$$As a result, if the sample is small and equal variances can be assumed, the distribution of the difference between the two sample means is:
$$\bar{X}-\bar{Y} \sim N\left(\mu_X-\mu_Y, \text{s}^2_p\left(\frac{1}{n_X}+\frac{1}{n_Y}\right)\right)$$ Test the following hypothesis based on the distribution above: $$H0: \mu_X=\mu_Y, \quad H1: \mu_X \neq \mu_Y$$This test is called a t-test and can be performed with the methods in Python's scipy.stats.
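Before applying the procedure to real data, a minimal sketch with two small synthetic samples shows Equation 2 and the resulting t statistic, and checks the result against scipy.stats.ttest_ind, which pools the variances by default (equal_var=True). The sample values here are arbitrary.

```python
import numpy as np
from scipy import stats

# two small synthetic samples assumed to come from populations with equal variance
x=np.array([5.1, 4.8, 5.3, 5.0, 4.9, 5.2])
y=np.array([4.6, 4.9, 4.7, 5.0, 4.5])

n1, n2=len(x), len(y)
s1, s2=x.std(ddof=1), y.std(ddof=1)

sp=np.sqrt(((n1-1)*s1**2+(n2-1)*s2**2)/(n1+n2-2))   # pooled standard deviation (Equation 2)
se=sp*np.sqrt(1/n1+1/n2)                            # standard error of the mean difference
t_manual=(x.mean()-y.mean())/se

t_scipy, p_scipy=stats.ttest_ind(x, y)              # equal_var=True by default
print(round(t_manual, 4), round(t_scipy, 4))        # the two t statistics agree
```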
Example 1)
The following data are the daily rates of change in the closing prices of the NASDAQ Composite index (IXIC) and Alphabet (GOOGL).
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import FinanceDataReader as fdr
```
```python
st=pd.Timestamp(2021, 4, 1)
et=pd.Timestamp(2021, 12, 16)
na=fdr.DataReader("IXIC", st, et)["Close"]
go=fdr.DataReader("GOOGL", st, et)["Close"]
nac=na.pct_change()[1:]*100
goc=go.pct_change()[1:]*100
nac.head(2)
```
```
Date
2021-04-05    1.672836
2021-04-06   -0.052533
Name: Close, dtype: float64
```
```python
goc.head(2)
```
```
Date
2021-04-05    4.187287
2021-04-06   -0.437142
Name: Close, dtype: float64
```
Suppose the variances of each population in the two samples above are equal, and test the following null hypothesis:
$$H0: \mu_{\text{nac}} = \mu_{\text{goc}},\; H1: \mu_{\text{nac}} \neq \mu_{\text{goc}}$$

```python
val=pd.DataFrame([[nac.mean(), nac.std(), len(nac)],
                  [goc.mean(), goc.std(), len(goc)]])
val.index=['nac','goc']
val.columns=['mean', 'std', 'size']
np.around(val, 4)
```
 | mean | std | size |
---|---|---|---|
nac | 0.0707 | 0.9736 | 180 |
goc | 0.1784 | 1.3401 | 180 |
Calculate the pooled standard deviation (sp) and the standard error (se) of the difference between the two means, using the unbiased estimates, i.e., the sample standard deviations.
```python
n1=val.iloc[0,2]
n2=val.iloc[1,2]
s1=val.iloc[0,1]
s2=val.iloc[1,1]
sp=np.sqrt(((n1-1)*s1**2+(n2-1)*s2**2)/(n1+n2-2))
round(sp, 4)
```
1.1712
```python
se=sp*np.sqrt((1/n1+1/n2))
round(se, 4)
```
0.1235
The test statistic standardizes the difference between the two means.
```python
tP, tN=(val.iloc[0,0]-val.iloc[1,0])/se, (val.iloc[1,0]-val.iloc[0,0])/se
round(tP, 4), round(tN, 4)
```
(-0.8721, 0.8721)
```python
df=n1+n2-2
ci=stats.t.interval(0.95, df)
print(f"Lower bound:{round(ci[0],4)}, Upper bound: {round(ci[1], 4)}")
```
Lower bound:-1.9666, Upper bound: 1.9666
Based on the results above, the confidence interval contains both test statistics. In other words, the observed difference between the two sample means could easily arise by chance, so there is no evidence that the two population means differ. Therefore, there is no reasonable ground to reject the null hypothesis.
Let's calculate the p-value for the test statistics computed above. Because this is a two-sided test, the tail probabilities for the two test statistics are symmetric, so the p-value is twice the one-sided tail probability:
```python
pVal=2*stats.t.sf(tN, df)
round(pVal, 4)
```
0.3837
You cannot reject the null hypothesis because the p-value is much greater than the significance level of 0.05. The above analysis can be carried out directly with the stats.ttest_ind() function.
```python
val, pv=stats.ttest_ind(nac.values, goc.values)
print(f'statistics: {round(val, 3)} , p-value: {round(pv, 3)}')
```
statistics: -0.872 , p-value: 0.384
Different Variances in Small Samples
For small samples, perform a hypothesis test based on the t distribution under two assumptions:
Assumption 1: each population follows a normal distribution
Assumption 2: the two population variances are equal
For Assumption 1, if the sample is large, it is reasonable to assume a normal distribution by the central limit theorem. Assumption 2, however, is judged from the spread of each sample: if the ratio of the two sample standard deviations lies between 0.5 and 2, the variances can be treated as roughly equal.
$$0.5 \le \frac{\text{s}_1}{\text{s}_2} \le 2$$If this condition is not met, or if the assumption of homogeneity is difficult to justify for other reasons, the pooled variance cannot be applied. Instead, the variance of the combined probability distribution is built from each sample variance separately. Because the sample is small, the t distribution is used, which raises the question of which degrees of freedom to choose. For the pooled variance the degrees of freedom are $n_1 +n_2-2$, but when each sample variance is used separately, a modified value is needed. Typically, Welch's corrected degrees of freedom (df) are applied (Equation 3).
$$\begin{equation}\tag{3} df=\frac{\left(\frac{s^2_1}{n_1}+\frac{s^2_2}{n_2} \right)^2}{\frac{1}{n_1-1}\left(\frac{s^2_1}{n_1}\right)^2+\frac{1}{n_2-1}\left(\frac{s^2_2}{n_2}\right)^2} \end{equation}$$If the resulting value is not an integer, round it to an integer. Alternatively, the smaller of $n_1 -1$ and $n_2 -1$ can be used as the degrees of freedom.
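For reference, Equation 3 can be wrapped in a small helper function; this is a minimal sketch, not a scipy routine, and the numbers passed to it below are arbitrary illustrative values.

```python
import numpy as np

def welch_df(s1, s2, n1, n2):
    """Welch's corrected degrees of freedom (Equation 3) from sample standard deviations and sizes."""
    v1, v2=s1**2/n1, s2**2/n2
    return (v1+v2)**2/(v1**2/(n1-1)+v2**2/(n2-1))

# illustrative values only (similar in scale to the stock data used below)
print(round(welch_df(0.98, 0.75, 168, 174), 1))
```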
Example 2)
Test whether the daily rates of change in the closing prices of the Dow Jones (dji) and NASDAQ (na) indexes have the same population mean. The two series are downloaded with different start dates (April 10 and April 20, 2021), so their sample sizes differ.
```python
st=pd.Timestamp(2021, 4, 10)
st2=pd.Timestamp(2021, 4, 20)
et=pd.Timestamp(2021, 12, 16)
dji=fdr.DataReader('DJI', st, et)["Close"]
na=fdr.DataReader('IXIC', st2, et)["Close"]
djic=dji.pct_change()[1:]*100
nac=na.pct_change()[1:]*100
djic.tail(2)
```
```
Date
2021-12-15    1.072085
2021-12-16   -0.084756
Name: Close, dtype: float64
```
The mean, standard deviation, and size of the two data sets are as follows.
```python
val=pd.DataFrame([[nac.mean(), nac.std(), len(nac)],
                  [djic.mean(), djic.std(), len(djic)]])
val.index=['dac','djic']
val.columns=['mean', 'std', 'size']
np.around(val, 4)
```
 | mean | std | size |
---|---|---|---|
dac | 0.0621 | 0.9790 | 168 |
djic | 0.0383 | 0.7533 | 174 |
As the following code shows, the ratio of the two sample variances is about 1.69, noticeably different from 1. Here the variances are treated as unequal, so the pooled variance is not applied and Welch's approach is used instead.
```python
n1=val.iloc[0,2]
n2=val.iloc[1,2]
s1=val.iloc[0,1]
s2=val.iloc[1,1]
ratio=s1**2/s2**2
print(f'ratio:{round(ratio, 4)}')
```
ratio:1.6889
The standard error of the difference, the degrees of freedom, and the t statistic are calculated as follows:
```python
sp=np.sqrt(s1**2/(n1-1)+s2**2/(n2-1))
print(f'Standard error of the difference: {round(sp, 4)}')
```
Standard error of the difference: 0.095
```python
df=int((s1**2/n1+s2**2/n2)**2/(1/(n1-1)*(s1**2/n1)**2+1/(n2-1)*(s2**2/n2)**2))
print(f'd.f: {df}')
```
d.f: 313
```python
t=abs(val.iloc[0,0]-val.iloc[1,0])/sp
print(f't statistics: {round(t, 4)}')
```
t statistics: 0.2505
The t statistic lies well within the confidence interval of the t distribution with df degrees of freedom, so the null hypothesis cannot be rejected at the 0.05 significance level.
```python
ci=stats.t.interval(0.9, df)
print(f'Lower: {round(ci[0],4)}, Upper: {round(ci[1],4)}')
```
Lower: -1.6497, Upper: 1.6497
The p-value corresponding to the t statistic is much greater than the significance level of 0.05, so the conclusion is the same as that drawn from the confidence interval above.
```python
pVal=2*stats.t.sf(t, df)
print(f" p-value: {round(pVal, 4)}")
```
p-value: 0.8023
The above process can be carried out with the scipy.stats.ttest_ind() function. For this example, homogeneity of variance cannot be assumed, so the parameter equal_var=False is passed. The result differs slightly from the manual calculation above because the manual calculation used the absolute difference and divided each sample variance by n−1 rather than n, but the conclusion is the same.
```python
re=stats.ttest_ind(djic.values, nac.values, equal_var=False)
print(f"t-statistics: {round(re[0],4)}, p-value: {round(re[1],4)}")
```
t-statistics: -0.2512, p-value: 0.8018
Large Samples
According to the central limit theorem, large samples fit the normal distribution. Generally, if the number of data points is more than 30, the sample mean is assumed to follow a normal distribution. In this case, the assumption of equal population variances is not necessary, and the difference between the two sample means can also be assumed to be normally distributed. Therefore, the mean and variance of the combined distribution of X−Y in large samples are calculated as in Equation 4:
$$\begin{align}\tag{4} &\hat{\mu}=\mu_x-\mu_y\\ &\hat{\sigma}^2=\begin{cases}\frac{\sigma^2_x}{n_x}+\frac{\sigma^2_y}{n_y} &\sigma^2(\text{pop-variance}): \text{known}\\\frac{\text{s}^2_x}{n_x}+\frac{\text{s}^2_y}{n_y} & \sigma^2: \text{unknown}\end{cases} \end{align}$$

Example 3)
Test the null hypothesis that the daily rates of change in the closing prices of the Dow Jones (dji) and NASDAQ (na) indexes over the same period have the same population mean.
```python
st=pd.Timestamp(2020, 4, 10)
et=pd.Timestamp(2021, 12, 16)
dji=fdr.DataReader('DJI', st, et)["Close"]
na=fdr.DataReader('IXIC', st, et)["Close"]
djic=dji.pct_change()[1:]*100
nac=na.pct_change()[1:]*100
```
The procedure is the same as the t-test for unequal variances in small samples, except that for large data a normal distribution is applied.
```python
val=pd.DataFrame([[djic.mean(), djic.std(), len(djic)],
                  [nac.mean(), nac.std(), len(nac)]])
val.index=['dji','na']
val.columns=['mean', 'std', 'size']
np.around(val, 4)
```
 | mean | std | size |
---|---|---|---|
dji | 0.1065 | 1.0645 | 425 |
na | 0.1537 | 1.3036 | 425 |
```python
n1=val.iloc[0,2]
n2=val.iloc[1,2]
s1=val.iloc[0,1]
s2=val.iloc[1,1]
sp=np.sqrt(s1**2/(n1-1)+s2**2/(n2-1))
print(f'Standard error of the difference: {round(sp, 4)}')
```
Standard error of the difference: 0.0817
```python
z=abs(val.iloc[0,0]-val.iloc[1,0])/sp
ci=stats.norm.interval(0.95)
print(f'Lower: {round(ci[0],2)}, Upper: {round(ci[1],2)}')
```
Lower: -1.96, Upper: 1.96
Based on the results above, the z statistic lies within the confidence interval corresponding to the 0.05 significance level in the standard normal distribution. Therefore, the null hypothesis that the population means of the two groups are the same cannot be rejected. The same conclusion follows from the p-value:
```python
pVal=2*stats.norm.sf(z)
print(f"p-value: {round(pVal, 2)}")
```
p-value: 0.8022