Hypothesis test

Contents

  1. Null and Alternative hypotheses
  2. One-sided and two-sided tests

Hypothesis

Statistical inference consists of establishing a tentative hypothesis about a population parameter based on statistics calculated from a sample, and then testing whether to accept or reject that hypothesis. In the test stage, the sample statistic that serves as the basis for the judgment is called the test statistic. The probability of obtaining a statistic at least as extreme as the observed test statistic is called the p-value. Acceptance or rejection is determined by comparing the p-value with the significance level:

  • p-value < significance level: reject the hypothesis assumed to be true
  • p-value > significance level: fail to reject the hypothesis assumed to be true

Power and Sample size

Power is the probability of rejecting a false null hypothesis. For example, a power of 90% means there is a 10% chance of failing to reject an incorrect null hypothesis; this failure is the type II error shown in Table 1. Power increases as the sample size increases, so an appropriate number of samples is needed to achieve the desired power.
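
As an illustration, the sample size required for a target power can be solved directly. The sketch below uses TTestIndPower from statsmodels; the two-sample t-test setting and the medium effect size (Cohen's d = 0.5) are assumptions for illustration, not values from the text.

from statsmodels.stats.power import TTestIndPower
# hypothetical setting: medium effect size, alpha = 0.05, target power = 0.9
n_needed = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.9)
print(f'samples per group: {round(n_needed, 1)}')   # on the order of 85 per group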

Null and Alternative hypotheses

An analyst can hypothesize that the mean of the sample means can be used as an estimate of the population mean and test the statistical validity of this hypothesis. The analyst does not expect this hypothesis to be rejected, because it does not represent a statistically significant difference; such a hypothesis is called the null hypothesis (H0). The opposing hypothesis, which requires proof, is called the alternative hypothesis (Ha). The test of the null hypothesis is based on information from the sample, that is, on the test statistic, so it admits the possibility of the errors shown in Table 1:

Table 1. Types of error

              H0 true                  Ha true
H0 accept     right decision           type II error ($\beta$)
H0 reject     type I error ($\alpha$)  right decision

The significance level is the probability of making a type I error, that is, of rejecting the null hypothesis when it is actually true. By setting the significance level, the analyst can control the probability of a type I error: lowering the significance level makes the rejection criterion stricter, so a true null hypothesis is less likely to be rejected. In contrast, the type II error, which is related to power, cannot be adjusted directly by the analyst. A hypothesis test that controls only the type I error in this way is called a significance test.
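
This control can be illustrated with a minimal simulation: when the null hypothesis is true by construction, the rejection rate of a two-sided z-test approaches the chosen significance level. The setup below (a standard normal population, samples of size 30) is a hypothetical assumption.

import numpy as np
from scipy import stats
rng = np.random.default_rng(3)
alpha, n, trials = 0.05, 30, 10000
reject = 0
for _ in range(trials):
    x = rng.normal(0, 1, n)                     # H0 is true: the population mean is 0
    z = x.mean()/(x.std(ddof=1)/np.sqrt(n))     # test statistic
    reject += 2*stats.norm.sf(abs(z)) < alpha   # compare two-sided p value with alpha
print(f'type I error rate: {reject/trials}')    # close to 0.05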

In summary, hypothesis testing is the process of establishing a hypothesis about a population parameter using the statistics of a sample and determining whether that hypothesis is appropriate. The method consists of the following steps:

  1. Establishing a hypothesis.
    • A hypothesis consists of a null hypothesis (H0) and an alternative hypothesis (H1 or Ha).
    • The alternative hypothesis is the claim that requires proof; the claim that opposes it, or the existing claim, is the null hypothesis.
    • For example, the claim that the population mean ($\mu$) "will be greater than the sample mean" requires proof, so it is an alternative hypothesis. Conversely, "the population mean is less than or equal to the sample mean" is the null hypothesis.
      H0: μ ≤ $\bar{X}$, H1: μ > $\bar{X}$
  2. Calculation of basic statistics of samples, that is, obtained data: sample mean, standard deviation, etc.
  3. Calculate the test statistic
    • The value on which the test of the hypothesis is based.
    • For example, if the population can be assumed to be normally distributed, the standard score z can be used as the basis for the test. In this case, z is the test statistic. $$Z=\frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \; \text{or} \; \frac{\bar{X}-\mu}{s/\sqrt{n}}$$
  4. Set the region in which the null hypothesis is rejected
    • The x value at which this region starts is called the critical value, and the region to be rejected is called the critical region.
    • For example, if the test is performed at the 95% confidence level, the confidence interval for the test statistic is expressed as follows. $$\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}$$ If the value falls outside this interval, the null hypothesis can be rejected.
  5. Draw a conclusion (see the sketch below).
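
The sketch below implements the five steps; the sample values and the hypothesized mean (50) are hypothetical assumptions for illustration.

import numpy as np
from scipy import stats
# Step 1: H0: mu = 50, H1: mu != 50 (hypothetical)
mu0 = 50
# Step 2: basic statistics of the sample (hypothetical data)
x = np.array([52.1, 48.3, 51.7, 49.9, 53.2, 50.8, 47.6, 52.4])
xbar, s, n = x.mean(), x.std(ddof=1), len(x)
# Step 3: the test statistic, using s in place of sigma
z = (xbar - mu0)/(s/np.sqrt(n))
# Step 4: the two-sided critical region at alpha = 0.05 is |z| > 1.96
# Step 5: conclusion via the p value
pval = 2*stats.norm.sf(abs(z))
print(f'z: {z:.3f}, p value: {pval:.4f}, reject H0: {pval < 0.05}')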

Example 1)
  The probability that the daily rate of change between the opening and closing prices of a stock shows an increase is 0.53. In one sample, such increases occurred 10 times in 30 days. Can this result be generalized?

This problem can be modeled with a negative binomial distribution, that is, the distribution of the number of Bernoulli trials required to reach the r-th success. In scipy's parameterization, the random variable x is the number of failures before the r-th success, and the number of successes (r) and the success probability (p) are the parameters; the total number of trials is therefore x + r.

$$x \sim NB(10, 0.53)$$

The mean and variance of this negative binomial distribution are

import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt
from scipy import stats
r, p=10, 0.53
mu, var=stats.nbinom.stats(r, p, moments='mv')
print(f'mean: {np.round(mu,3)}, variance: {np.round(var, 3)}')
mean: 8.868, variance: 16.732
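
Note that since scipy's nbinom counts the failures before the r-th success, the mean number of total trials is mu + r, which equals r/p:

print(f'mean of total trials: {np.round(mu+r, 3)}, r/p: {np.round(r/p, 3)}')
mean of total trials: 18.868, r/p: 18.868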

The null hypothesis for this problem is:

$$H0: x=30$$

The interval() method of each distribution class in the scipy.stats module returns the lower and upper bounds of a two-sided confidence interval. This is appropriate when the reference distribution is symmetric. For a one-sided test on an asymmetric distribution such as the negative binomial, however, only an upper or a lower bound is needed, so interval() is not suitable. Instead, use the ppf() method, which returns the variable value corresponding to a given cumulative probability. The critical value at the significance level α = 0.05 is calculated as

cv=stats.nbinom.ppf(0.95, r, p)
print(f'critical value: {round(cv, 3)}')
critical value: 16.0

Since scipy's nbinom counts failures, the critical value of 16 failures corresponds to 16 + 10 = 26 total trials at the significance level of 0.05. The hypothesized value of x = 30 trials lies outside this confidence interval; that is, it is difficult to accept the null hypothesis.

The above result can be confirmed using the significance probability (p-value). This is the probability of a result more extreme than the hypothesized one, and it can be calculated with the survival function, which is equivalent to subtracting from 1 the cumulative probability up to the variable x. The following code applies the sf() method of the distribution class in the stats module.

k=30-r                          # number of failures corresponding to 30 total trials
pVal=stats.nbinom.sf(k, r, p)   # P(X > k): probability of a more extreme result
print(f'p value: {np.round(pVal, 4)}')
p value: 0.0093

Compared with the significance level of 0.05, the above significance probability is very small. That is, the null hypothesis can be rejected.
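
As a cross-check, the same probability follows from the identity between the negative binomial survival function and the binomial CDF: more than k failures before the r-th success means fewer than r successes in the first k + r trials.

checkVal=stats.binom.cdf(r-1, k+r, p)   # P(at most 9 successes in 30 trials)
print(f'p value (binomial form): {np.round(checkVal, 4)}')
p value (binomial form): 0.0093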

A visualization of the above results is shown in Figure 1.

point=stats.nbinom.ppf(1-pVal, r, p)
point
20.0
plt.figure(figsize=(6,3))
x=np.arange(41)              # array is needed for the boolean masks below
y=stats.nbinom.pmf(x, r, p)  # pmf(k, r, p): k varies, r and p are the parameters
plt.plot(x, y, label="NB(10, 0.53)")
plt.fill_between(x, 0, y, where=(x<=cv), facecolor="skyblue", label="1-α")
plt.fill_between(x, 0, y, where=(x>=cv), facecolor="silver", label="α")
plt.fill_between(x, 0, y, where=(x>=point), facecolor="teal", label="p value", alpha=0.1)
plt.axvline(k, linestyle="--", color='red', label="k(20)")
plt.legend(loc="best")
plt.xlabel("x", size="13", weight="bold")
plt.ylabel("pmf", size="13", weight="bold")
plt.text(8,0.02, '0.95', size="13", weight="bold")
plt.text(18,0.02, '0.05', size="13", weight="bold")
plt.text(20,0.01, '0.0093', size="13", weight="bold", color="teal")
plt.show()
Figure 1. Confidence interval and significance probability in the negative binomial distribution (α=0.05).

One-sided and two-sided tests

Consider recent data from a single stock. From these data, hypotheses for estimating the population mean can be written as

$$\begin{align} \text{hypothesis 1} \quad & \text{H0}: \mu=\bar{X}, \; \text{H1}: \mu \neq \bar{X}\\ \text{hypothesis 2} \quad & \text{H0}: \mu\ge \bar{X}, \; \text{H1}: \mu < \bar{X} \end{align}$$

Hypothesis 1 tests whether the population mean agrees with the sample mean, and direction is irrelevant: it does not matter whether the statistic lies to the left or right of the mean of the normal distribution. This case is called a two-tailed (two-sided) test. Hypothesis 2, in contrast, is directional: it tests whether the population mean lies at or above the sample mean. This is called a one-sided test.
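
For example, on the standard normal distribution, the two cases give different thresholds at α = 0.05, because a two-sided test splits the significance level across both tails:

alpha=0.05
print(np.round(stats.norm.ppf(1-alpha/2), 3))   # two-sided threshold: 1.96
print(np.round(stats.norm.ppf(1-alpha), 3))     # one-sided threshold: 1.645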

Example 2)
  The following is the daily closing-price change data of the Philadelphia Semiconductor Index ETF (SOXX). Using these data as the population, two-sided and one-sided tests are performed to determine whether the sample mean of the sampling distribution can be used as an unbiased estimate of the population mean.

import FinanceDataReader as fdr
st=pd.Timestamp(2021,1, 1)
et=pd.Timestamp(2021, 12, 17)
da=fdr.DataReader('SOXX', st, et)["Close"]
da1=da.pct_change()[1:]*100
da1.index=range(len(da1))
da1.head(2)
0    2.044546
1   -0.324414
Name: Close, dtype: float64
mu=da1.mean()
std=da1.std()
print(f'Pop.mean: {round(mu, 4)}, Pop.std: {round(std, 4)}')
Pop.mean: 0.1469, Pop.std: 1.9027

Samples of size 10 are repeatedly drawn from the above data by applying the pandas Series.sample() method, and the mean of each sample is collected. The mean of these sample means is then calculated as follows, and the confidence interval is obtained by applying stats.norm.interval() at the significance level of 0.05.

smplData=np.array([da1.sample(n=10).mean()])
for i in range(100):
    smplData=np.append(smplData, da1.sample(n=10).mean())
smplData[:3]
array([ 0.59807416, -0.94670731,  0.60892061])
BarX=smplData.mean()
round(BarX,4)
0.0883
lb, ub=stats.norm.interval(0.95, mu, std) 
np.around(pd.DataFrame([lb, ub], index=['Lower', 'Upper']), 4)
            0
Lower -3.5824
Upper  3.8763

The interval() function performs a two-sided test, as shown in Figure 2. According to the result, the sample mean falls within the confidence interval of the normal distribution that takes the population mean and population standard deviation as parameters. That is, the null hypothesis cannot be rejected at the significance level of 0.05.

plt.figure(figsize=(6,3))
x=np.linspace(-10, 10.01, 1000)
y=[stats.norm.pdf(i, mu, std) for i in x]
plt.plot(x, y, label=f"Norm({round(mu, 2)}, {round(std, 2)})")
plt.fill_between(x, 0, y, where=(x>=lb)&(x<=ub), facecolor="skyblue", label=r"1-$\alpha$")
plt.fill_between(x, 0, y, where=(x<=lb) | (x>=ub), facecolor="red", alpha=0.3, label=r"$\alpha$")
plt.legend(loc="best")
plt.xlabel("x", size="13", weight='bold')
plt.ylabel("pdf", size="13", weight='bold')
plt.ylim(0, 0.22)
plt.xticks([])
plt.text(-0.5,  0.050, 0.95, size="13", weight='bold')
plt.text(-8 , 0.025, 0.025, size="13", weight='bold', color="red")
plt.text(5.1,  0.025, 0.025, size="13", weight='bold', color="red")
plt.text(lb-1,  -0.015, round(lb, 2), size="12", weight="bold", color="blue")
plt.text(ub-1,  -0.015, round(ub, 2), size="12", weight="bold", color="blue")
plt.show()
Figure 2. Two-tailed test for the null hypothesis "sample mean = population mean".

For a one-sided test, the significance level lies entirely on one side, so if the significance level is 0.05, as shown in Figure 3, the threshold corresponds to the standard score $Z_{0.05}$ or $Z_{1-0.05}$. In this example, the null hypothesis (H0) is μ ≥ $\bar{X}$, so $Z_{1-0.05}$ is the threshold.

In a one-sided test, the critical value can be calculated using the stats.norm.ppf() method.

CP=stats.norm.ppf(1-0.05, mu, std)
print(f'Critical Point: {round(CP, 4)}')
Critical Point: 3.2767

As shown in Figure 3, the sample mean 0.0883 is included in the acceptance region x ≤ 3.2767. Therefore, the null hypothesis cannot be rejected. The significance probability (p-value) for the sample mean is calculated as follows.

pval=stats.norm.sf(BarX, mu, std)
print(f'p value: {round(pval, 4)}')
p value: 0.5162
plt.figure(figsize=(6,3))
x=np.linspace(-10, 10.01, 1000)
y=[stats.norm.pdf(i, mu, std) for i in x]
plt.plot(x, y, label=f"Norm({round(mu, 2)}, {round(std, 2)})")
plt.fill_between(x, 0, y, where=(x <=CP), facecolor="skyblue", label=r"1-$\alpha$")
plt.fill_between(x, 0, y, where=(x >=CP), facecolor="red", alpha=0.3, label=r"$\alpha$")
plt.legend(loc="best")
plt.xlabel("x", size="13", weight='bold')
plt.ylabel("pdf", size="13", weight='bold')
plt.ylim(0, 0.22)
plt.xticks([])
plt.text(-0.5,  0.050, 0.95, size="13", weight='bold')
plt.text(4.1,  0.025, 0.05, size="13", weight='bold', color="red")
plt.text(CP-1,  -0.015, round(CP, 2), size="12", weight="bold", color="blue")
plt.show()
Figure 3. One-sided test for the null hypothesis "population mean ≥ sample mean".

As the above result shows, the p-value is much larger than the significance level, so the null hypothesis cannot be rejected, which is the same conclusion as the one based on the confidence interval.
