
Hypothesis test

Contents

  1. Null and Alternative hypotheses
  2. One-sided and two-sided tests

Hypothesis

Statistical inference consists of establishing a tentative hypothesis about a parameter of a population based on statistics calculated from a sample, and then testing whether to accept or reject that hypothesis. In the test stage, the sample statistic that serves as the basis for judgment is called the test statistic. The probability of a statistic at least as extreme as the observed test statistic is called the p-value. By comparing the p-value with the significance level, acceptance or rejection of the hypothesis is determined.

  • p-value < significance level: reject the hypothesis assumed to be true
  • p-value ≥ significance level: fail to reject the hypothesis assumed to be true
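
This rule can be made concrete with a small helper function (hypothetical, shown only for illustration; the two p-values are taken from the examples later in this post):

def decide(p_value, alpha=0.05):
    # compare the significance probability with the significance level
    return 'reject H0' if p_value < alpha else 'fail to reject H0'
print(decide(0.0093))   # reject H0
print(decide(0.5162))   # fail to reject H0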

Power and Sample size

Power is the probability of rejecting a false null hypothesis. For example, a power of 90% means there is a 10% chance of failing to reject an incorrect null hypothesis; this is the type II error shown in Table 1. Power increases as the sample size increases, so obtaining a desired power requires an appropriate number of samples.
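
The sample size required for a target power can be computed with the statsmodels package (a minimal sketch, assuming statsmodels is installed; the effect size of 0.5 is a hypothetical value chosen for illustration):

from statsmodels.stats.power import TTestPower
# solve for the sample size of a one-sample t-test that reaches
# 90% power at alpha=0.05 for a hypothetical effect size of 0.5
n=TTestPower().solve_power(effect_size=0.5, alpha=0.05, power=0.9)
print(round(n, 1))   # ≈ 44 observations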

Null and Alternative hypotheses

An analyst can hypothesize that the mean of the sample means serves as an estimate of the population mean and test the statistical validity of this hypothesis. Because it asserts no statistically significant difference, the analyst does not expect this hypothesis to be rejected; it is called the null hypothesis (H0). The opposing hypothesis, which is expected to be accepted when the null hypothesis is rejected, is called the alternative hypothesis (Ha). The test of the null hypothesis is based on information from the sample, that is, the test statistic. Therefore, it includes the possibility of the errors shown in Table 1:

Table 1. Types of error

                H0 true               Ha true
  Accept H0     correct decision      type II error (β)
  Reject H0     type I error (α)      correct decision

The significance level is the probability of making a type I error, that is, rejecting the null hypothesis when it is true. By setting the significance level explicitly, the analyst can control the probability of a type I error: lowering it means the null hypothesis is rejected only on stronger evidence. In contrast, the type II error, which is related to power, cannot be adjusted directly by the analyst. A hypothesis test that controls only the type I error in this way is called a significance test.

In summary, hypothesis testing is the process of establishing a hypothesis about the estimated parameter using the statistics of a sample and determining whether that hypothesis is appropriate. The method consists of the following steps:

  1. Establishing a hypothesis.
    • A hypothesis consists of a null hypothesis (H0) and an alternative hypothesis (H1 or Ha).
    • The alternative hypothesis is the claim that requires proof; the opposing or existing claim is called the null hypothesis.
    • For example, the claim that the population mean ($\mu$) "will be greater than the sample mean" requires proof, so it is the alternative hypothesis. Conversely, "the population mean is less than or equal to the sample mean" is the null hypothesis.
      H0: μ ≤ $\bar{X}$, H1: μ > $\bar{X}$
  2. Calculation of the basic statistics of the sample, that is, the obtained data: sample mean, standard deviation, etc.
  3. Calculate the test statistic
    • The value on which the hypothesis test is based (a sketch of the full procedure follows this list).
    • For example, if the population can be assumed to be normally distributed, the standard score z can be used as the basis for the test. In this case, z is the test statistic. $$Z=\frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \quad \text{or} \quad Z=\frac{\bar{X}-\mu}{s/\sqrt{n}}$$
  4. Set the region in which to reject the null hypothesis
    • The x value at which this region starts is called the threshold for rejection (critical value), and the region to be rejected is called the critical region.
    • For example, if the test is performed at a 95% confidence level, the confidence interval for the test statistic is expressed as follows. $$\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}$$ If the value falls outside this interval, the null hypothesis can be rejected.
  5. Conclusion
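
As a minimal sketch of these five steps, the following one-sided z-test uses a hypothetical sample and hypothetical values mu0 and sigma (none of these numbers come from the examples below):

import numpy as np
from scipy import stats
rng=np.random.default_rng(0)
data=rng.normal(0.2, 1.0, 50)                    # step 2: a hypothetical sample
mu0, sigma=0.0, 1.0                              # step 1: H0: mu <= 0, H1: mu > 0
z=(data.mean()-mu0)/(sigma/np.sqrt(len(data)))   # step 3: test statistic
zCrit=stats.norm.ppf(0.95)                       # step 4: one-sided threshold at alpha=0.05
pVal=stats.norm.sf(z)                            # significance probability
print(f'z: {z:.3f}, critical value: {zCrit:.3f}, p value: {pVal:.4f}')   # step 5

If z exceeds the critical value, or equivalently the p-value falls below 0.05, the null hypothesis is rejected.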

Example 1)
  The probability that the daily rate of change between a stock's opening and closing prices increases is 0.53. In one sample, the increase occurred 10 times over 30 trading days. Can this result be generalized?

This problem can be modeled with a negative binomial distribution, that is, the distribution of the number of Bernoulli trials required to reach the r-th success. In this distribution, the total number of trials is the random variable x, and the number of successes (r) and the success probability (p) are the parameters.

$$x \sim NB(10, 0.53)$$

The mean and variance of this negative binomial distribution are

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
r, p=10, 0.53                                   # number of successes, success probability
mu, var=stats.nbinom.stats(r, p, moments='mv')  # mean and variance of the failure count
print(f'mean: {np.round(mu,3)}, variance: {np.round(var, 3)}')
mean: 8.868, variance: 16.732

The null hypothesis for this problem is:

$$H_0: x = 30$$

The interval() method of each distribution class in the scipy.stats module returns the lower and upper bounds of a two-sided confidence interval. That is reasonable when the reference distribution is symmetric, but this test is one-sided: only an upper bound is meaningful here. Instead, use the ppf() method, which returns the variable value corresponding to a given cumulative probability. The critical value at the significance level α = 0.05 is therefore calculated as

cv=stats.nbinom.ppf(0.95, r, p)    # 95% quantile of the failure count
print(f'critical value: {round(cv, 3)}')
critical value: 16.0

Note that scipy's nbinom is parameterized by the number of failures before the r-th success, so at the 0.05 significance level the critical value of 16 failures corresponds to 16 + 10 = 26 total trials. The observed value of x = 30 trials lies outside this interval; that is, it is difficult to accept the null hypothesis.
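
This failure-count convention can be checked directly (a minimal sketch; scipy's nbinom describes the number of failures before the r-th success):

print(stats.nbinom.mean(r, p))     # expected failures: r(1-p)/p ≈ 8.868
print(stats.nbinom.mean(r, p)+r)   # expected total trials ≈ 18.868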

The above result can be confirmed using the significance probability (p-value), which is the probability of a result at least as extreme as the observed one under the null hypothesis. It can be calculated with the survival function, sf(x) = 1 − cdf(x), which each distribution class in the stats module provides as the sf() method.

k=30-r                           # observed failures: 30 trials minus 10 successes
pVal=stats.nbinom.sf(k, r, p)    # P(X > k) = 1 - cdf(k)
print(f'p value: {np.round(pVal, 4)}')
p value: 0.0093

Compared with the significance level of 0.05, the above significance probability is very small. That is, the null hypothesis can be rejected.

A visualization of the above results is shown in Figure 1.

point=stats.nbinom.ppf(1-pVal, r, p)   # variable value at cumulative probability 1 - p value
point
20.0
plt.figure(figsize=(6,3))
x=np.arange(41)               # use an array (not range) so the boolean masks below work
y=stats.nbinom.pmf(x, r, p)   # pmf of NB(10, 0.53) at each x
plt.plot(x, y, label="NB(10, 0.53)")
plt.fill_between(x, 0, y, where=(x<=cv), facecolor="skyblue", label="1-α")
plt.fill_between(x, 0, y, where=(x>=cv), facecolor="silver", label="α")
plt.fill_between(x, 0, y, where=(x>=point), facecolor="teal", label="p value", alpha=0.1)
plt.axvline(k, linestyle="--", color='red', label="k(20)")
plt.legend(loc="best")
plt.xlabel("x", size="13", weight="bold")
plt.ylabel("pmf", size="13", weight="bold")
plt.text(8,0.02, '0.95', size="13", weight="bold")
plt.text(18,0.02, '0.05', size="13", weight="bold")
plt.text(20,0.01, '0.0093', size="13", weight="bold", color="teal")
plt.show()
Figure 1. Confidence interval and significance probability in the negative binomial distribution (α = 0.05).

One-sided and two-sided tests

Consider some recent data from a single stock. From this data, the hypotheses for estimating the population mean can be written as

$$\begin{align} \text{hypothesis 1} \quad & \text{H0}: \mu=\bar{X}, \; \text{H1}: \mu \neq \bar{X}\\ \text{hypothesis 2} \quad & \text{H0}: \mu \ge \bar{X}, \; \text{H1}: \mu < \bar{X} \end{align}$$

In the case of Hypothesis 1, the test asks whether the population mean agrees with the sample mean, and direction is irrelevant: it does not matter whether the value lies to the left or right of the mean of the normal distribution. This case is called a two-tailed (two-sided) test. In contrast, Hypothesis 2 has a direction: it tests whether the population mean lies at a location larger than the sample mean. This is called a one-sided test. The two cases use different critical values, as the sketch below shows.
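
The difference shows up in the standard-normal critical values at α = 0.05 (a minimal sketch):

twoSided=stats.norm.ppf([0.025, 0.975])   # ±1.96: alpha/2 in each tail
oneSided=stats.norm.ppf(0.95)             # 1.645: alpha entirely in one tail
print(twoSided, oneSided)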

Example 2)
The following is the daily closing-price change data of the Philadelphia Semiconductor Index ETF (ticker: SOXX). Using this data as the population, two-sided and one-sided tests are performed to determine whether the mean of a sampled distribution can be used as an unbiased estimate of the population mean.

import FinanceDataReader as fdr
st=pd.Timestamp(2021, 1, 1)
et=pd.Timestamp(2021, 12, 17)
da=fdr.DataReader('SOXX', st, et)["Close"]   # daily closing prices
da1=da.pct_change()[1:]*100                  # daily percentage changes
da1.index=range(len(da1))
da1.head(2)
0    2.044546
1   -0.324414
Name: Close, dtype: float64
mu=da1.mean()
std=da1.std()
print(f'Pop.mean: {round(mu, 4)}, Pop.std: {round(std, 4)}')
Pop.mean: 0.1469, Pop.std: 1.9027

Samples of size 10 are drawn repeatedly from the above data; the code below collects 101 sample means using the pandas sample() method. The mean of these sample means is then calculated, and the confidence interval is obtained by applying stats.norm.interval() at a significance level of 0.05.

smplData=np.array([da1.sample(n=10).mean()])                 # first sample mean
for i in range(100):
    smplData=np.append(smplData, da1.sample(n=10).mean())    # 100 more sample means
smplData[:3]
array([ 0.59807416, -0.94670731,  0.60892061])
BarX=smplData.mean()
round(BarX,4)
0.0883
lb, ub=stats.norm.interval(0.95, mu, std)   # two-sided 95% interval of Norm(mu, std)
np.around(pd.DataFrame([lb, ub], index=['Lower', 'Upper']), 4)
0
Lower -3.5824
Upper 3.8763

The interval() function returns the two-sided interval shown in Figure 2. According to the result, the sample mean falls within the confidence interval of the normal distribution parameterized by the population mean and population standard deviation. That is, the null hypothesis cannot be rejected at the 0.05 significance level.

plt.figure(figsize=(6,3))
x=np.linspace(-10, 10.01, 1000)
y=[stats.norm.pdf(i, mu, std) for i in x]
plt.plot(x, y, label=f"Norm({round(mu, 2)}, {round(std, 2)})")
plt.fill_between(x, 0, y, where=(x>=lb)&(x<=ub), facecolor="skyblue", label=r"1-$\alpha$")
plt.fill_between(x, 0, y, where=(x<=lb) | (x>=ub), facecolor="red", alpha=0.3, label=r"$\alpha$")
plt.legend(loc="best")
plt.xlabel("x", size="13", weight='bold')
plt.ylabel("pdf", size="13", weight='bold')
plt.ylim(0, 0.22)
plt.xticks([])
plt.text(-0.5,  0.050, 0.95, size="13", weight='bold')
plt.text(-8 , 0.025, 0.025, size="13", weight='bold', color="red")
plt.text(5.1,  0.025, 0.025, size="13", weight='bold', color="red")
plt.text(lb-1,  -0.015, round(lb, 2), size="12", weight="bold", color="blue")
plt.text(ub-1,  -0.015, round(ub, 2), size="12", weight="bold", color="blue")
plt.show()
Figure 2. Two-tailed test for the null hypothesis "sample mean = population mean".

For a one-sided test, the entire significance level lies in one tail, so at a significance level of 0.05 the threshold corresponds to the standard score $Z_{0.05}$ or $Z_{1-0.05}$, as shown in Figure 3. In this example, the null hypothesis (H0) is μ ≥ $\bar{X}$, so $Z_{1-0.05}$ is the threshold.

In a one-sided test, the critical value can be calculated using the stats.norm.ppf() method.

CP=stats.norm.ppf(1-0.05, mu, std)   # upper-tail critical point at alpha=0.05
print(f'Critical Point: {round(CP, 4)}')
Critical Point: 3.2767

As shown in Figure 3, the sample mean 0.0883 is included in the acceptance region x ≤ 3.2767. Therefore, the null hypothesis cannot be rejected. The significance probability (p-value) for the sample mean is calculated as follows.

pval=stats.norm.sf(BarX, mu, std)   # P(X > BarX) under Norm(mu, std)
print(f'p value: {round(pval, 4)}')
p value: 0.5162
plt.figure(figsize=(6,3))
x=np.linspace(-10, 10.01, 1000)
y=[stats.norm.pdf(i, mu, std) for i in x]
plt.plot(x, y, label=f"Norm({round(mu, 2)}, {round(std, 2)})")
plt.fill_between(x, 0, y, where=(x <=CP), facecolor="skyblue", label=r"1-$\alpha$")
plt.fill_between(x, 0, y, where=(x >=CP), facecolor="red", alpha=0.3, label=r"$\alpha$")
plt.legend(loc="best")
plt.xlabel("x", size="13", weight='bold')
plt.ylabel("pdf", size="13", weight='bold')
plt.ylim(0, 0.22)
plt.xticks([])
plt.text(-0.5,  0.050, 0.95, size="13", weight='bold')
plt.text(4.1,  0.025, 0.05, size="13", weight='bold', color="red")
plt.text(CP-1,  -0.015, round(CP, 2), size="12", weight="bold", color="blue")
plt.show()
Figure 3. One-sided test for the null hypothesis "population mean ≥ sample mean".

As the above result shows, the p-value is much larger than the significance level, so the conclusion is the same as that based on the confidence interval: the null hypothesis cannot be rejected.
