
Comparison of two independent groups

Contents

  1. Equal Variances in Small Samples
  2. Different Variances in Small Samples
  3. Large sample


The means of two independent random variables, X and Y, each following a normal distribution, can be compared with a hypothesis test. The two sample means are distributed as:

$$\begin{align}\bar{X}&=\frac{\sum^{n_X}_{i=1} X_i}{n_X} \sim N\left(\mu_X, \frac{\sigma^2_X}{n_X}\right)\\ \bar{Y}&=\frac{\sum^{n_Y}_{i=1} Y_i}{n_Y} \sim N\left(\mu_Y, \frac{\sigma^2_Y}{n_Y}\right)\end{align}$$

Even if the individual samples are not drawn from normal distributions, the sample means are approximately normal by the central limit theorem. To compare the two groups, set the following null hypothesis for the difference between the means:

$$\text{H0} : \mu_X -\mu_Y =0$$

The test statistic is calculated from the distribution of the difference between the two sample means. The mean and variance of this combined distribution of X and Y are given in Equation 1.

$$\begin{align}\tag{1} E(\bar{X}-\bar{Y})&=E(\bar{X})-E(\bar{Y})\\ &=\mu_X -\mu_Y\\ \text{Var}(\bar{X}-\bar{Y}) &=\text{Var}(\bar{X})+\text{Var}(\bar{Y})-2\,\text{Cov}(\bar{X},\bar{Y})\\ &=\frac{\sigma^2_X}{n_X}+\frac{\sigma^2_Y}{n_Y}\\ \text{Cov}(\bar{X},\bar{Y})&=0\quad \because \; X, Y: \text{independent}\end{align}$$

In Equation 1, Cov denotes the covariance, which accounts for the interaction between the two groups; it vanishes under the assumption that X and Y are independent. As a result, the difference of the sample means follows:

$$\bar{X}-\bar{Y} \sim \text{N}\left(\mu_X-\mu_Y, \frac{\sigma^2_X}{n_X}+\frac{\sigma^2_Y}{n_Y}\right)$$
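The variance addition rule in Equation 1 is easy to check numerically. The following sketch, with population parameters chosen arbitrarily for illustration, simulates many pairs of independent samples and compares the empirical variance of the difference of sample means with the theoretical value:

import numpy as np

rng=np.random.default_rng(0)
# hypothetical populations, chosen only for illustration
mu_x, sigma_x, n_x=1.0, 2.0, 40
mu_y, sigma_y, n_y=0.5, 3.0, 50
# differences of sample means over many repetitions
diffs=[rng.normal(mu_x, sigma_x, n_x).mean()-rng.normal(mu_y, sigma_y, n_y).mean() for _ in range(10000)]
print(round(np.var(diffs), 4))                    # empirical variance, close to 0.28
print(round(sigma_x**2/n_x+sigma_y**2/n_y, 4))    # theoretical value: 4/40 + 9/50 = 0.28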

Based on this distribution, the test statistic is evaluated at a specified significance level. The probability distribution used for the test depends on the sample size, as shown in Table 1: in general the normal distribution is used, but for small samples the t distribution is applied.

Table 1. Comparing Two Samples

Sample size | Variance assumption  | Method
------------|----------------------|------------------------------------------------------------
small       | equal variances      | t distribution with the pooled standard deviation
small       | different variances  | t distribution with each sample's standard deviation; degrees of freedom calculated separately (Welch)
large       | -                    | normal distribution

Equal Variances in Small Samples

Typically, if the sample size is less than or equal to 30, the t distribution is used instead of the normal distribution. In addition, if the two samples can be assumed to have the same variance, for example because they come from the same or similar populations, the distribution of the difference of the sample means can be written as:

$$\bar{X}-\bar{Y} \sim \text{N}\left( \mu_X-\mu_Y, \sigma^2 \left(\frac{1}{n_X}+\frac{1}{n_Y}\right)\right)$$

For an analysis based on the standard normal distribution, i.e., a z-test, the statistic for a single sample mean is calculated as:

$$z=\frac{\bar{X}-\mu_X}{\frac{\sigma}{\sqrt{n_X}}}$$

However, in reality the population standard deviation (σ) is often unknown. In that case the sample standard deviation is used in its place, and for small samples the resulting statistic follows a t distribution parameterized by its degrees of freedom. Therefore, the following statistics are calculated based on the t distribution:

$$\begin{aligned}&t=\frac{\bar{X}-\mu_X}{\frac{\text{s}}{\sqrt{n_X}}} \sim t(n_X-1)\\ &\bar{X}=\frac{X_1+X_2+\cdots+X_{n_X}}{n_X}\\&\text{s}^2=\frac{\sum^{n_X}_{i=1}(X_i-\bar{X})^2}{n_X-1} \end{aligned}$$
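As a quick check of these formulas, the sketch below computes the t statistic by hand for an arbitrary simulated sample and compares it with scipy.stats.ttest_1samp(); the sample and the hypothesized mean are assumptions made only for illustration:

import numpy as np
from scipy import stats

rng=np.random.default_rng(1)
x=rng.normal(0.2, 1.0, 15)   # small hypothetical sample
mu0=0                        # hypothesized population mean
t_manual=(x.mean()-mu0)/(x.std(ddof=1)/np.sqrt(len(x)))   # t statistic per the formula above
res=stats.ttest_1samp(x, mu0)
print(round(t_manual, 4)==round(res.statistic, 4))        # True: the two calculations agree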

If you can assume that the two distributions have the same variance, use the pooled standard deviation as the combined standard deviation of the two groups. As shown in Equation 2, the pooled variance is a weighted average of the two sample variances, weighted by their degrees of freedom.

$$\begin{equation}\tag{2} \text{s}^2_p=\frac{(n_1-1)s^2_1+(n_2-1)s^2_2}{n_1+n_2-2} \end{equation}$$
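Equation 2 translates directly into code. The helper below is a minimal sketch (the name pooled_sd is ours); applied to the summary statistics of Example 1 further down, it reproduces the pooled value computed there:

import numpy as np

def pooled_sd(s1, n1, s2, n2):
    # pooled standard deviation, Equation 2
    return np.sqrt(((n1-1)*s1**2+(n2-1)*s2**2)/(n1+n2-2))

print(round(pooled_sd(0.9736, 180, 1.3401, 180), 3))   # 1.171, matching Example 1 up to rounding of the inputs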

As a result, if the data are small and the variances can be assumed equal, the distribution of the difference of the sample means is:

$$\bar{X}-\bar{Y} \sim N\left(\mu_X-\mu_Y, \text{s}^2_p\left(\frac{1}{n_X}+\frac{1}{n_Y}\right)\right)$$

Based on this distribution, test the following hypothesis:

$$H0: \mu_X=\mu_Y, \quad H1: \mu_X \neq \mu_Y$$

This test is called the t-test and can be performed with the scipy.stats module in Python.

Example 1)
  The following data are the daily percentage changes in the closing prices of the NASDAQ composite index (IXIC) and Google (GOOGL).

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import FinanceDataReader as fdr
st=pd.Timestamp(2021, 4, 1)
et=pd.Timestamp(2021, 12, 16)
na=fdr.DataReader("IXIC", st, et)["Close"]    # NASDAQ composite index
go=fdr.DataReader("GOOGL", st, et)["Close"]   # Google (Alphabet)
nac=na.pct_change()[1:]*100                   # daily percent change of the closing price
goc=go.pct_change()[1:]*100
nac.head(2)
Date
2021-04-05    1.672836
2021-04-06   -0.052533
Name: Close, dtype: float64
goc.head(2)
Date
2021-04-05    4.187287
2021-04-06   -0.437142
Name: Close, dtype: float64

Assuming the population variances of the two samples are equal, test the following null hypothesis:

$$H0: \mu_{\text{nac}} = \mu_{\text{goc}},\; H1: \mu_{\text{nac}} \neq \mu_{\text{goc}}$$
val=pd.DataFrame([[nac.mean(), nac.std(), len(nac)],[goc.mean(), goc.std(), len(goc)]])
val.index=['nac','goc']
val.columns=['mean', 'std', 'size']
np.around(val, 4)
       mean     std  size
nac  0.0707  0.9736   180
goc  0.1784  1.3401   180

Calculate the pooled standard deviation (sp) and the standard error (se) of the mean difference from the unbiased estimates, i.e., the sample standard deviations.

n1=val.iloc[0,2]
n2=val.iloc[1,2]
s1=val.iloc[0,1]
s2=val.iloc[1,1]
sp=np.sqrt(((n1-1)*s1**2+(n2-1)*s2**2)/(n1+n2-2))   # pooled standard deviation, Equation 2
round(sp, 4)
1.1712
se=sp*np.sqrt(1/n1+1/n2)   # standard error of the mean difference
round(se, 4)
0.1235

The test statistic standardizes the difference between the two means.

tP,tN=(val.iloc[0,0]-val.iloc[1,0])/se,(val.iloc[1,0]-val.iloc[0,0])/se
round(tP, 4), round(tN, 4)
(-0.8721, 0.8721)
df=n1+n2-2
ci=stats.t.interval(0.95, df)
print(f"Lower bound:{round(ci[0],4)}, Upper bound: {round(ci[1], 4)}")
Lower bound:-1.9666, Upper bound: 1.9666

Based on the results above, both test statistics fall inside the confidence interval. In other words, the observed difference between the two sample means could plausibly arise by chance, so the population means cannot be considered different. Therefore, there is no reasonable ground to reject the null hypothesis.

Let's calculate the p-value for the test statistics computed above. Because the test is two-sided and the t distribution is symmetric, the p-value is twice the one-sided tail probability:

pVal=2*stats.t.sf(tN, df)
round(pVal, 4)
0.3837

You cannot reject the null hypothesis because the p-value is also well above the significance level of 0.05. The above analysis can be obtained directly with the stats.ttest_ind() function.

stat, pv=stats.ttest_ind(nac.values, goc.values)
print(f'statistics: {round(stat, 3)} , p-value: {round(pv, 3)}')
statistics: -0.872 , p-value: 0.384

Different Variances in Small Samples

For small samples, perform a hypothesis test based on the t distribution under two assumptions:

Assumption 1: each population follows a normal distribution
Assumption 2: the two populations have equal variances

For Assumption 1, if the sample is reasonably large, normality is justified by the central limit theorem. Assumption 2, however, is judged from the sample variances: a common rule of thumb is that the variances can be treated as equal if their ratio lies between 0.5 and 2.

$$0.5 \le \frac{\text{s}^2_1}{\text{s}^2_2} \le 2$$

If this condition is not met, or if the assumption of equal variances is difficult to justify for other reasons, the pooled variance cannot be applied. Instead, each sample's variance enters the variance of the combined distribution separately. Because the sample is small, the test is still based on the t distribution, which raises the question of which degrees of freedom to use. With the pooled variance the degrees of freedom are $n_1+n_2-2$; with separate sample variances, a modified value is used, typically Welch's adjusted degrees of freedom (df) (Equation 3).

$$\begin{equation}\tag{3} df=\frac{\left(\frac{s^2_1}{n_1}+\frac{s^2_2}{n_2} \right)^2}{\frac{1}{n_1-1}\left(\frac{s^2_1}{n_1}\right)^2+\frac{1}{n_2-1}\left(\frac{s^2_2}{n_2}\right)^2} \end{equation}$$

If the resulting degrees of freedom are not an integer, truncate the value to an integer, as in the sketch below. Alternatively, the smaller of $n_1-1$ and $n_2-1$ can be used as the degrees of freedom.
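Equation 3 can be wrapped in a small helper function. The following is a minimal sketch (the name welch_df is ours), truncating the result to an integer; Example 2 below applies the same formula inline:

def welch_df(s1, n1, s2, n2):
    # Welch's adjusted degrees of freedom, Equation 3
    v1, v2=s1**2/n1, s2**2/n2
    return int((v1+v2)**2/(v1**2/(n1-1)+v2**2/(n2-1)))

print(welch_df(0.9790, 168, 0.7533, 174))   # 313, the value obtained in Example 2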

Example 2)
  Test whether the daily closing-price change rates of the Dow Jones index (dji) and the NASDAQ index (na), sampled over slightly different periods, share the same population mean.

$$H0: \mu_{djic}-\mu_{nac}=0, \; H1: \mu_{djic}-\mu_{nac} \neq 0$$
st=pd.Timestamp(2021,4, 10)
st2=pd.Timestamp(2021,4, 20)
et=pd.Timestamp(2021,12,16)
dji=fdr.DataReader('DJI',st, et)["Close"]
na=fdr.DataReader('IXIC', st2, et)["Close"]
djic=dji.pct_change()[1:]*100
nac=na.pct_change()[1:]*100
djic.tail(2)
Date
2021-12-15    1.072085
2021-12-16   -0.084756
Name: Close, dtype: float64

The mean, standard deviation, and size of each sample are as follows.

val=pd.DataFrame([[nac.mean(), nac.std(), len(nac)],[djic.mean(), djic.std(), len(djic)]])
val.index=['nac','djic']
val.columns=['mean', 'std', 'size']
np.around(val, 4)
        mean     std  size
nac   0.0621  0.9790   168
djic  0.0383  0.7533   174

As the following code shows, the ratio of the two sample variances is about 1.69, close to the upper bound of the rule of thumb above. In this example the population variances are therefore not assumed to be equal, and the pooled variance is not applied.

n1=val.iloc[0,2]
n2=val.iloc[1,2]
s1=val.iloc[0,1]
s2=val.iloc[1,1]
ratio=s1**2/s2**2
print(f'ratio:{round(ratio, 4)}')
ratio:1.6889

The standard error, Welch degrees of freedom, and t statistic are calculated as follows:

se=np.sqrt(s1**2/n1+s2**2/n2)   # standard error with separate sample variances
print(f'standard error: {round(se, 4)}')
standard error: 0.0947
df=int((s1**2/n1+s2**2/n2)**2/((s1**2/n1)**2/(n1-1)+(s2**2/n2)**2/(n2-1)))   # Equation 3
print(f'df: {df}')
df: 313
t=abs(val.iloc[0,0]-val.iloc[1,0])/se
print(f't statistic: {round(t, 4)}')
t statistic: 0.2512

The statistic lies inside the confidence interval corresponding to a 0.05 significance level of the t distribution with df degrees of freedom, so the null hypothesis cannot be rejected.

ci=stats.t.interval(0.95, df)
print(f'Lower: {round(ci[0],4)}, Upper: {round(ci[1],4)}')
Lower: -1.9676, Upper: 1.9676

The p-value corresponding to the t statistic is much greater than the significance level of 0.05, which agrees with the conclusion from the confidence interval.

pVal=2*stats.t.sf(t, df)
print(f"p-value: {round(pVal, 4)}")
p-value: 0.8018

The above process can be carried out directly with the scipy.stats.ttest_ind() function. Because equal variances cannot be assumed in this example, the parameter equal_var=False is specified, which performs Welch's t-test and reproduces the results calculated above (up to the sign, which depends on the argument order).

re=stats.ttest_ind(djic.values, nac.values, equal_var=False)
print(f"t-statistics: {round(re[0],4)}, p-value: {round(re[1],4)}")
t-statistics: -0.2512, p-value: 0.8018

Large sample

According to the central limit theorem, the means of large samples approximately follow a normal distribution. Generally, if the number of data points is more than 30, normality is assumed. In this case the assumption of equal population variances is not necessary, and the difference in means of the two samples can also be assumed to be normally distributed. Therefore, the mean and variance of the combined distribution of X−Y for large samples are calculated as in Equation 4:

$$\begin{align}\tag{4} &\hat{\mu}=\mu_X-\mu_Y\\ &\hat{\sigma}^2=\begin{cases}\frac{\sigma^2_X}{n_X}+\frac{\sigma^2_Y}{n_Y} &\sigma^2 (\text{population variance}): \text{known}\\\frac{\text{s}^2_X}{n_X}+\frac{\text{s}^2_Y}{n_Y} & \sigma^2: \text{unknown}\end{cases} \end{align}$$
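Equation 4 leads directly to a two-sample z-test. The following helper is a minimal sketch (the name two_sample_z is ours) for the usual case where the population variances are unknown and the sample variances are substituted:

import numpy as np
from scipy import stats

def two_sample_z(x, y):
    # two-sample z statistic and two-sided p-value, Equation 4 with sample variances
    se=np.sqrt(x.var(ddof=1)/len(x)+y.var(ddof=1)/len(y))
    z=(x.mean()-y.mean())/se
    return z, 2*stats.norm.sf(abs(z))

Example 3 below carries out the same computation step by step.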

Example 3)
  Test the following null hypothesis about the daily closing-price change rates of the Dow Jones (dji) and NASDAQ (na) indices over the same period.

$$H0: \mu_{dji}-\mu_{na}=0$$
st=pd.Timestamp(2020,4, 10)
et=pd.Timestamp(2021, 12, 16)
dji=fdr.DataReader('DJI',st, et)["Close"]
na=fdr.DataReader('IXIC', st, et)["Close"]
djic=dji.pct_change()[1:]*100
nac=na.pct_change()[1:]*100

This is the same as the t-test with unequal variances for small samples, except that for large data the normal distribution is applied.

val=pd.DataFrame([[djic.mean(), djic.std(), len(djic)],[nac.mean(), nac.std(), len(nac)]])
val.index=['dji','na']
val.columns=['mean', 'std', 'size']
np.around(val, 4)
      mean     std  size
dji 0.1065  1.0645   425
na  0.1537  1.3036   425
n1=val.iloc[0,2]
n2=val.iloc[1,2]
s1=val.iloc[0,1]
s2=val.iloc[1,1]
se=np.sqrt(s1**2/(n1-1)+s2**2/(n2-1))   # standard error of the mean difference
print(f'standard error: {round(se, 4)}')
standard error: 0.0817
z=abs(val.iloc[0,0]-val.iloc[1,0])/se
print(f'z statistic: {round(z, 4)}')
z statistic: 0.5775
ci=stats.norm.interval(0.95)
print(f'Lower: {round(ci[0],4)}, Upper: {round(ci[1],4)}')
Lower: -1.96, Upper: 1.96

Based on the results above, the z statistic lies within the confidence interval corresponding to a 0.05 significance level in the standard normal distribution. Therefore, the null hypothesis that the population means of the two groups are equal cannot be rejected. The p-value leads to the same conclusion:

pVal=2*stats.norm.sf(z)
print(f"p-value: {round(pVal, 4)}")
p-value: 0.5636
