

Comparison of two independent groups

Contents

  1. Equal Variances in Small Samples
  2. Different Variances in Small Samples
  3. Large sample

Comparison of two independent groups

The means of two independent random variables, X and Y, each of which follows a normal distribution, can be compared with a hypothesis test. The sample means have the following distributions:

$$\begin{align}\bar{X}&=\frac{\sum^{n_X}_{i=1} X_i}{n_X} \sim N\left(\mu_X, \frac{\sigma^2_X}{n_X}\right)\\ \bar{Y}&=\frac{\sum^{n_Y}_{i=1} Y_i}{n_Y} \sim N\left(\mu_Y, \frac{\sigma^2_Y}{n_Y}\right)\end{align}$$

Even if the populations are not assumed to be normal, the sample means are approximately normally distributed by the central limit theorem. To compare the two groups, set the following null hypothesis for the difference between the means:

$$\text{H0} : \mu_X -\mu_Y =0$$

The test statistic is calculated from the combined distribution of the two groups. That is, the mean and variance of the distribution of the difference between the two sample means are calculated as shown in Equation 1.

$$\begin{align}\tag{1} E(\bar{X}-\bar{Y})&=E(\bar{X})-E(\bar{Y})\\ &=\mu_X -\mu_Y\\ \text{Var}(\bar{X}-\bar{Y}) &=\text{Var}(\bar{X})+\text{Var}(\bar{Y})-2\text{Cov}(\bar{X},\bar{Y})\\ &=\frac{\sigma^2_X}{n_X}+\frac{\sigma^2_Y}{n_Y}\\ \text{Cov}(\bar{X},\bar{Y})&=0\quad \because \; X, Y: \text{independent}\end{align}$$

In Equation 1, Cov stands for covariance, which accounts for the interaction between the two groups; it is zero because X and Y are assumed to be independent. As a result, the combined probability distribution of the two groups is as follows:

$$\text{N}\left(\mu_X-\mu_Y, \frac{\sigma^2_X}{n_X}+\frac{\sigma^2_Y}{n_Y}\right)$$
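
This combined distribution can be checked with a short simulation. The sketch below is only a minimal illustration with hypothetical population parameters (mu_x, sigma_x, and so on are arbitrary choices, not values from the examples that follow); the empirical mean and variance of the difference between the sample means should be close to the theoretical values above.

import numpy as np

# hypothetical population parameters, chosen only for illustration
mu_x, sigma_x, n_x = 1.0, 2.0, 40
mu_y, sigma_y, n_y = 0.5, 3.0, 50
rng = np.random.default_rng(0)
diffs = []
for _ in range(10000):
    x = rng.normal(mu_x, sigma_x, n_x)   # sample from X
    y = rng.normal(mu_y, sigma_y, n_y)   # sample from Y
    diffs.append(x.mean() - y.mean())    # difference of the sample means
diffs = np.array(diffs)
# the empirical values should be close to mu_x - mu_y and sigma_x^2/n_x + sigma_y^2/n_y
print(round(diffs.mean(), 3), mu_x - mu_y)
print(round(diffs.var(), 3), sigma_x**2/n_x + sigma_y**2/n_y)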

Based on this distribution, the test statistic is evaluated at the specified significance level. The reference distribution used for the test depends on the sample size, as summarized in Table 1: in general the normal distribution is used, but when the sample size is small the t distribution is applied.

Table 1. Comparing Two Samples

Sample size | Variance assumption  | Method
small       | equal variances      | t distribution with the pooled standard deviation
small       | different variances  | t distribution with each sample's standard deviation; degrees of freedom calculated separately
large       | -                    | normal distribution

Equal Variances in Small Samples

Typically, if the sample size is less than or equal to 30, the t distribution is used instead of the normal distribution. In addition, if the two samples can be assumed to come from populations with the same variance, the combined distribution can be represented by the following expression:

$$\bar{X}-\bar{Y} \sim \text{N}\left( \mu_X-\mu_Y, \sigma^2 \left(\frac{1}{n_X}+\frac{1}{n_Y}\right)\right)$$

For an analysis based on the standard normal distribution, i.e., a z test, the z statistic for a single sample mean is calculated as follows:

$$z=\frac{\bar{X}-\mu_X}{\frac{\sigma}{\sqrt{n_X}}}$$

However, in reality the population standard deviation (σ) is often unknown. In this case the sample standard deviation is used as a substitute, and if the sample is small the statistic follows a t distribution with the degrees of freedom as a parameter. Therefore, the following statistic is calculated based on the t distribution:

$$\begin{aligned}&t=\frac{\bar{X}-\mu_X}{\frac{\text{s}}{\sqrt{n_X}}} \sim t(n_X-1)\\ &\bar{X}=\frac{X_1+X_2+\cdots+X_{n_X}}{n_X}\\&\text{s}^2=\frac{\sum^{n_X}_{i=1}(X_i-\bar{X})^2}{n_X-1} \end{aligned}$$
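
As a quick check, the manual statistic above can be compared with scipy's one-sample t-test. The following is a minimal sketch with hypothetical data x and a hypothetical population mean mu0; the manually computed value should match the statistic returned by stats.ttest_1samp().

import numpy as np
from scipy import stats

# hypothetical sample and population mean, for illustration only
x = np.array([2.1, 1.8, 2.5, 2.0, 1.9, 2.3, 2.2, 1.7])
mu0 = 2.0
n = len(x)
# t = (sample mean - mu0) / (s / sqrt(n)), with the unbiased standard deviation s
t_manual = (x.mean() - mu0) / (x.std(ddof=1) / np.sqrt(n))
t_scipy, p = stats.ttest_1samp(x, mu0)
print(round(t_manual, 4), round(t_scipy, 4))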

If the two distributions can be assumed to have the same variance, the pooled standard deviation is used as the combined standard deviation of the two groups. As shown in Equation 2, the pooled variance is the weighted mean of the two sample variances, with weights given by each group's degrees of freedom.

$$\begin{equation}\tag{2} \text{s}^2_p=\frac{(n_1-1)s^2_1+(n_2-1)s^2_2}{n_1+n_2-2} \end{equation}$$
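
Equation 2 can be written directly as a small helper. pooled_std() below is a hypothetical function defined only for this sketch (it is not part of scipy), applied to hypothetical sample statistics.

import numpy as np

def pooled_std(s1, n1, s2, n2):
    # pooled standard deviation of Equation 2: weighted mean of the two sample variances
    return np.sqrt(((n1 - 1)*s1**2 + (n2 - 1)*s2**2) / (n1 + n2 - 2))

# hypothetical standard deviations and sample sizes
print(round(pooled_std(0.97, 20, 1.34, 25), 4))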

As a result, for small samples with equal variances, the distribution of the difference between the two sample means is:

$$\bar{X}-\bar{Y} \sim N\left(\mu_X-\mu_Y, \text{s}^2_p\left(\frac{1}{n_X}+\frac{1}{n_Y}\right)\right)$$

Test the following hypothesis based on the distribution above:

$$H0: \mu_X=\mu_Y, \quad H1: \mu_X \neq \mu_Y$$

Such a test is called a t-test and can be carried out with functions from Python's scipy.stats module.

Example 1)
  The following data are the daily rates of change of the closing prices of the NASDAQ composite index (IXIC) and Alphabet (GOOGL).

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
from scipy import stats
import FinanceDataReader as fdr
st=pd.Timestamp(2021,4, 1)
et=pd.Timestamp(2021, 12,16)
na=fdr.DataReader("IXIC",st, et)["Close"]    # NASDAQ composite closing prices
go=fdr.DataReader("GOOGL", st, et)["Close"]  # Alphabet (GOOGL) closing prices
nac=na.pct_change()[1:]*100   # daily percentage change (%)
goc=go.pct_change()[1:]*100
nac.head(2)
Date
2021-04-05    1.672836
2021-04-06   -0.052533
Name: Close, dtype: float64
goc.head(2)
Date
2021-04-05    4.187287
2021-04-06   -0.437142
Name: Close, dtype: float64

Assume that the population variances of the two samples above are equal, and test the following null hypothesis:

$$H0: \mu_{\text{nac}} = \mu_{\text{goc}},\; H1: \mu_{\text{nac}} \neq \mu_{\text{goc}}$$
val=pd.DataFrame([[nac.mean(), nac.std(), len(nac)],[goc.mean(), goc.std(), len(goc)]])
val.index=['nac','goc']
val.columns=['mean', 'std', 'size']
np.around(val, 4)
     mean     std  size
nac  0.0707  0.9736   180
goc  0.1784  1.3401   180

Calculate the pooled standard deviation (sp) and the standard error (se) of the difference between the means, using the unbiased estimates, i.e., the sample standard deviations.

n1=val.iloc[0,2]
n2=val.iloc[1,2]
s1=val.iloc[0,1]
s2=val.iloc[1,1]
# pooled standard deviation (Equation 2)
sp=np.sqrt(((n1-1)*s1**2+(n2-1)*s2**2)/(n1+n2-2))
round(sp, 4)
1.1712
# standard error of the difference between the two means
se=sp*np.sqrt((1/n1+1/n2))
round(se, 4)
0.1235

The test statistic standardizes the difference between the two means.

tP,tN=(val.iloc[0,0]-val.iloc[1,0])/se,(val.iloc[1,0]-val.iloc[0,0])/se
round(tP, 4), round(tN, 4)
(-0.8721, 0.8721)
df=n1+n2-2
ci=stats.t.interval(0.95, df)
print(f"Lower bound:{round(ci[0],4)}, Upper bound: {round(ci[1], 4)}")
Lower bound:-1.9666, Upper bound: 1.9666

Based on the results above, both test statistics lie within the confidence interval. In other words, the observed difference between the two sample means can arise by chance, so there is no evidence that the two population means differ. Therefore, there is no reasonable ground to reject the null hypothesis.

Let's calculate the p-value for the test statistic above. Because this is a two-sided test, the tail probabilities of the two statistics are symmetric, so the one-sided tail probability is doubled:

pVal=2*stats.t.sf(tN, df)
round(pVal, 4)
0.3837

The null hypothesis cannot be rejected because the p-value is also much greater than the significance level of 0.05. The above analysis can be performed directly with the stats.ttest_ind() function.

val, pv=stats.ttest_ind(nac.values, goc.values)
print(f'statistics: {round(val, 3)} , p-value: {round(pv, 3)}')
statistics: -0.872 , p-value: 0.384

Different Variances in Small Samples

For small samples, perform a hypothesis test based on the t distribution under two assumptions:

Assumption 1: each population follows a normal distribution
Assumption 2: the two populations have equal variances

For Assumption 1, if the sample size is large, the sample mean can reasonably be assumed to be normal by the central limit theorem. Assumption 2, however, is judged from the variance of each sample: if the ratio of the two sample variances lies between 0.5 and 2, the variances can be assumed to be homogeneous.

$$0.5 \le \frac{\text{s}^2_1}{\text{s}^2_2} \le 2$$
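
This rule of thumb can be expressed as a one-line check. The helper below is a hypothetical sketch; it only applies the ratio criterion above and is not a formal test of equal variances.

def roughly_equal_variances(s1, s2):
    # rule of thumb: ratio of the two sample variances between 0.5 and 2
    ratio = s1**2 / s2**2
    return 0.5 <= ratio <= 2

# hypothetical sample standard deviations
print(roughly_equal_variances(0.98, 0.75))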

If the condition above is not met, or if the assumption of homogeneity is difficult to justify for other reasons, the pooled variance cannot be applied. Instead, the variance of the combined distribution is built from each sample's own variance. Because the sample is small, the test is based on the t distribution, which raises the problem of choosing the degrees of freedom. With the pooled variance the degrees of freedom are $n_1 +n_2-2$, but when each sample variance is used separately, modified degrees of freedom are needed. Typically, Welch's corrected degrees of freedom (df) are applied (Equation 3).

$$\begin{equation}\tag{3} df=\frac{\left(\frac{s^2_1}{n_1}+\frac{s^2_2}{n_2} \right)^2}{\frac{1}{n_1-1}\left(\frac{s^2_1}{n_1}\right)^2+\frac{1}{n_2-1}\left(\frac{s^2_2}{n_2}\right)^2} \end{equation}$$

If the resulting degrees of freedom are not an integer, round to the nearest integer (or truncate). Alternatively, the smaller of $n_1 -1$ and $n_2 -1$ can be used as the degrees of freedom.
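
Equation 3 can also be expressed as a short function. welch_df() below is a hypothetical helper written only for this sketch, compared with the simpler alternative of taking the smaller of $n_1-1$ and $n_2-1$.

def welch_df(s1, n1, s2, n2):
    # Welch's corrected degrees of freedom, Equation 3
    v1, v2 = s1**2/n1, s2**2/n2
    return (v1 + v2)**2 / (v1**2/(n1 - 1) + v2**2/(n2 - 1))

# hypothetical sample standard deviations and sizes
print(round(welch_df(0.98, 168, 0.75, 174)), min(168 - 1, 174 - 1))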

Example 2)
  Test whether the daily change rates of the closing prices of the Dow Jones index (dji) and the NASDAQ index (na), collected over slightly different periods and therefore with different sample sizes, have the same population mean.

$$H0: \mu_{\text{djic}}-\mu_{\text{nac}}=0, \; H1: \mu_{\text{djic}}-\mu_{\text{nac}} \neq 0$$
st=pd.Timestamp(2021,4, 10)    # start date for DJI
st2=pd.Timestamp(2021,4, 20)   # start date for IXIC, giving a slightly different period and size
et=pd.Timestamp(2021,12,16)
dji=fdr.DataReader('DJI',st, et)["Close"]
na=fdr.DataReader('IXIC', st2, et)["Close"]
djic=dji.pct_change()[1:]*100   # daily percentage change (%)
nac=na.pct_change()[1:]*100
djic.tail(2)
Date
2021-12-15    1.072085
2021-12-16   -0.084756
Name: Close, dtype: float64

The mean, standard deviation, and size of each data set are as follows.

val=pd.DataFrame([[nac.mean(), nac.std(), len(nac)],[djic.mean(), djic.std(), len(djic)]])
val.index=['nac','djic']
val.columns=['mean', 'std', 'size']
np.around(val, 4)
        mean     std  size
nac   0.0621  0.9790   168
djic  0.0383  0.7533   174

The ratio of the two sample variances, computed below, is about 1.69. Although this lies within the rough range above, equal variances are not assumed in this example, so the pooled variance is not applied and the test proceeds with each sample's own variance (Welch's method).

n1=val.iloc[0,2]
n2=val.iloc[1,2]
s1=val.iloc[0,1]
s2=val.iloc[1,1]
ratio=s1**2/s2**2
print(f'ratio:{round(ratio, 4)}')
ratio:1.6889

The standard error of the difference, the degrees of freedom, and the t statistic are calculated as follows:

# standard error of the difference between the two means (each sample variance divided by its size)
sp=np.sqrt(s1**2/n1+s2**2/n2)
print(f'Standard error of the difference: {round(sp, 4)}')
Standard error of the difference: 0.0947
# Welch's corrected degrees of freedom (Equation 3), truncated to an integer
df=int((s1**2/n1+s2**2/n2)**2/(1/(n1-1)*(s1**2/n1)**2+1/(n2-1)*(s2**2/n2)**2))
print(f'df: {df}')
df: 313
t=abs(val.iloc[0,0]-val.iloc[1,0])/sp
print(f't statistic: {round(t, 4)}')
t statistic: 0.2512

The t statistic falls within the confidence interval corresponding to a 0.05 significance level of the t distribution with df degrees of freedom (location 0, scale 1), so the null hypothesis cannot be rejected.

ci=stats.t.interval(0.95, df)
print(f'Lower: {round(ci[0],4)}, Upper: {round(ci[1],4)}')
Lower: -1.9676, Upper: 1.9676

The p-value corresponding to the t statistic is much greater than the significance level of 0.05, leading to the same conclusion as the confidence interval above.

pVal=2*stats.t.sf(t, df)
print(f" p-value: {round(pVal, 4)}")
 p-value: 0.8018

The above process can be performed directly with the scipy.stats.ttest_ind() function. Because equal variances are not assumed in this example, the parameter equal_var=False is specified. The sign of the statistic depends on the order in which the two data sets are passed, but the result and the conclusion agree with the calculation above.

re=stats.ttest_ind(djic.values, nac.values, equal_var=False)
print(f"t-statistics: {round(re[0],4)}, p-value: {round(re[1],4)}")
t-statistics: -0.2512, p-value: 0.8018

Large sample

According to the central limit theorem, the means of large samples follow an approximately normal distribution. Generally, if the number of data points is more than 30, a normal distribution is assumed. In this case the assumption of equal population variances is not necessary, and the difference between the two sample means can also be assumed to be normally distributed. Therefore, the mean and variance of the combined distribution of X−Y for large samples are calculated as in Equation 4:

$$\begin{align}\tag{4} &\hat{\mu}=\mu_X-\mu_Y\\ &\hat{\sigma}^2=\begin{cases}\frac{\sigma^2_X}{n_X}+\frac{\sigma^2_Y}{n_Y} &\sigma^2\,(\text{population variance}): \text{known}\\ \frac{\text{s}^2_X}{n_X}+\frac{\text{s}^2_Y}{n_Y} & \sigma^2: \text{unknown}\end{cases} \end{align}$$
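
Equation 4 leads directly to a two-sample z test for large samples. The sketch below defines a hypothetical helper two_sample_z() under the assumption that the population variances are unknown, so the sample variances are used instead; the data are simulated for illustration only.

import numpy as np
from scipy import stats

def two_sample_z(x, y):
    # large-sample z test for the difference of two means (Equation 4, variances unknown)
    se = np.sqrt(x.var(ddof=1)/len(x) + y.var(ddof=1)/len(y))
    z = (x.mean() - y.mean()) / se
    p = 2*stats.norm.sf(abs(z))   # two-sided p-value
    return z, p

# hypothetical data, for illustration only
rng = np.random.default_rng(1)
x = rng.normal(0.10, 1.0, 400)
y = rng.normal(0.15, 1.3, 400)
z, p = two_sample_z(x, y)
print(round(z, 4), round(p, 4))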

Example 3)
  Test the following null hypothesis about the daily rate of change of the closing prices of the Dow Jones (dji) and NASDAQ (na) indexes over the same period.

$$H0: \mu_{dji}-\mu_{na}=0$$
st=pd.Timestamp(2020,4, 10)
et=pd.Timestamp(2021, 12, 16)
dji=fdr.DataReader('DJI',st, et)["Close"]
na=fdr.DataReader('IXIC', st, et)["Close"]
djic=dji.pct_change()[1:]*100
nac=na.pct_change()[1:]*100

The procedure is the same as the t-test for unequal variances in small samples, except that for large samples the normal distribution is applied.

val=pd.DataFrame([[djic.mean(), djic.std(), len(djic)],[nac.mean(), nac.std(), len(nac)]])
val.index=['dji','na']
val.columns=['mean', 'std', 'size']
np.around(val, 4)
mean std size
dji 0.1065 1.0645 425
na 0.1537 1.3036 425
n1=val.iloc[0,2]
n2=val.iloc[1,2]
s1=val.iloc[0,1]
s2=val.iloc[1,1]
# standard error of the difference between the two means (Equation 4, population variances unknown)
sp=np.sqrt(s1**2/n1+s2**2/n2)
print(f'Standard error of the difference: {round(sp, 4)}')
Standard error of the difference: 0.0816
z=abs(val.iloc[0,0]-val.iloc[1,0])/sp
print(f'z statistic: {round(z, 2)}')
z statistic: 0.58
# interval of the standard normal distribution at the 0.05 significance level
ci=stats.norm.interval(0.95)
print(f'Lower: {round(ci[0],2)}, Upper: {round(ci[1],2)}')
Lower: -1.96, Upper: 1.96

Based on the results above, the z statistic lies within the confidence interval corresponding to a 0.05 significance level of the standard normal distribution. Therefore, the null hypothesis that the population means of the two groups are the same cannot be rejected. The same conclusion follows from the p-value:

pVal=2*stats.norm.sf(z)
print(f" p-value: {round(pVal, 2)}")
 p-value: 0.56
