Normality Test

Contents

  1. Q-Q plot
  2. Shapiro-Wilk test
  3. Kolmogorov-Smirnov Test

Normality Test

According to the central limit theorem, the distribution of sample means approaches the normal distribution as the number of samples in the data increases. However, for raw data other than means, it is sometimes important to check conformity with the normal distribution. For example, in regression analysis the differences between the observed values and the values predicted by the regression model are called residuals, and the analysis is performed on the assumption that the residuals conform to the normal distribution. Whether this assumption holds is one basis for judging the fit of the established model.
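
As a minimal illustration (an added sketch, not part of the original analysis), the skewness and excess kurtosis of raw uniform random numbers are far from those of a normal distribution, while those of sample means computed from the same kind of data are close to zero:

import numpy as np
from scipy import stats
np.random.seed(0)
raw=np.random.rand(10000)                        # uniform raw data: clearly not normal
means=np.random.rand(1000, 30).mean(axis=1)      # means of 1000 samples of size 30
print(np.around([stats.skew(raw), stats.kurtosis(raw)], 3))      # excess kurtosis of a uniform is about -1.2
print(np.around([stats.skew(means), stats.kurtosis(means)], 3))  # both close to 0 for the sample means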

The normality test uses the following methods:

  • Quantile-Quantile (Q-Q) plot: determination by visual analysis
  • Shapiro-Wilk test: used primarily when the number of samples is small (n < 2000)
  • Kolmogorov-Smirnov test: used when n > 2000

Q-Q plot

The Q-Q (Quantile-Quantile) plot compares the distributions of two groups by dividing each data set into quantiles with the inverse cumulative distribution and then plotting, for the same probability values, the quantiles of one group against those of the other. The inverse cumulative distribution is the inverse function of the cumulative distribution. For example, the cumulative and inverse cumulative distributions of the standard normal distribution can be represented as in Figure 1.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from sympy import *
rng=np.linspace(-3, 3, 1000)
pdf=[stats.norm.pdf(i) for i in rng]           # standard normal PDF
cdf=[stats.norm.cdf(i) for i in rng]           # standard normal CDF
inverCdf=[stats.norm.ppf(i) for i in cdf]      # inverse CDF (percent point function)
plt.figure(figsize=(10, 4))
plt.subplot(1,2,1)
plt.title("(A) PDF & CDF", size=12, weight="bold")
plt.plot(rng, pdf, label="PDF")
plt.plot(rng, cdf, label="CDF")
plt.xlabel("X", size=12, weight="bold")
plt.ylabel("Probability", size=12, weight="bold")
plt.legend(loc="best")
plt.subplot(1,2,2)
plt.title("(B) Inverse CDF = I(X)", size=12, weight="bold")
plt.plot(cdf, inverCdf, label="Inverse CDF")
plt.xlabel("CDF", size=12, weight="bold")
plt.ylabel("X", size=12, weight="bold")        # the inverse CDF returns x values
plt.legend(loc="best")
plt.show()
Figure 1. (A) The PDF and CDF, and (B) the inverse cumulative distribution of the standard normal distribution.

If the distributions of two groups A and B differ only in location and scale, their inverse cumulative distributions (quantiles) evaluated at the same probabilities have the following linear relationship:

$$\begin{align}I_A(x) &= mI_B(x)+n\\ m,n &\in \mathbb{R} \end{align}$$

Therefore, if the points lie close to a straight line and B follows a normal distribution, A also follows a normal distribution. As a visual method this relationship cannot express the degree of conformity numerically, but it is sufficient for judging approximate normality.
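
The sketch below (an illustrative addition; the sample sizes and parameters are arbitrary) draws two samples from normal distributions with different means and scales and plots their quantiles against each other; the points fall approximately on a straight line, as the relation above predicts. It reuses the modules imported in the code above.

A=np.random.normal(2, 3, 500)        # sample from N(2, 3^2)
B=np.random.normal(0, 1, 500)        # sample from N(0, 1)
p=np.linspace(0.01, 0.99, 99)        # common probability values
qa=np.quantile(A, p)                 # quantiles (inverse CDF) of A
qb=np.quantile(B, p)                 # quantiles (inverse CDF) of B
plt.scatter(qb, qa, s=10)
plt.xlabel("Quantiles of B", size=12)
plt.ylabel("Quantiles of A", size=12)
plt.show()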

The following is a Q-Q plot of uniformly distributed random numbers. The Q-Q plot is created by the scipy.stats.probplot() function. This function returns the theoretical quantiles and the ordered sample values that make up the plot, together with the slope and intercept of the fitted regression line and the correlation coefficient r of the fit; the square of r is the coefficient of determination (R2), which lies in the range [0, 1]. In the resulting plot of the following code, a coefficient of 1 means the points fall perfectly on the regression line, while a value near 0 means no relationship is identified. Therefore, a quantitative judgment is possible from this result.

x=np.random.rand(1000)
val, para=stats.probplot(x, plot=plt)
print(f'standardized value: {np.around(val[0][:3], 3)}\nraw data: {np.around(val[1][:3], 3)}')
standardized value: [-3.198 -2.932 -2.784]
raw data: [0.    0.001 0.003]
print(f'slope of line: {round(para[0],3)} \nintercept: {round(para[1],3)} \ncorr.coeff.: {round(para[2],3)}')
slope of line: 0.29 
intercept: 0.499 
corr.coeff.: 0.974

Example 1)
  Perform a normality test by Q-Q plot on the daily change rates of the NASDAQ composite index (na) and the CBOE Volatility Index, VIX (vix), over a certain period.

import FinanceDataReader as fdr
st=pd.Timestamp(2020,6,1)
et=pd.Timestamp(2021, 12, 16)
naO=fdr.DataReader("IXIC",st, et)
vixO=fdr.DataReader("VIX", st, et)
na=(naO["Close"]-naO["Open"])/naO["Open"]*100
vix=(vixO["Close"]-vixO["Open"])/vixO["Open"]*100
na.tail(2)
Date
2021-12-15    2.198848
2021-12-16   -2.870927
dtype: float64
vix.tail(2)
Date
2021-12-15   -10.694444
2021-12-16    10.829741
dtype: float64
val, para=stats.probplot(na, plot=plt)
np.around(para, 3)
array([0.99 , 0.004, 0.983])
val2, para2=stats.probplot(vix, plot=plt)
np.around(para2,3)
array([ 7.109, -0.646,  0.931])

Shapiro-Wilk test

The Shapiro-Wilk test examines whether the samples $x_1, x_2, \cdots, x_n$ conform to the normal distribution by testing the null hypothesis (H0: the data follow the normal distribution) with the following W statistic:

$$\begin{align}&W=\frac{\left(\sum^m_{i=1} a_i\left(x_{(n+1-i)}-x_{(i)}\right)\right)^2}{\sum^n_{i=1} (x_i-\bar{x})^2}\\ &x_{(i)}: i\text{-th smallest value of the sorted sample}\\ &n: \text{sample size},\quad m: \text{number of weights} \end{align}$$

In the expression above, $a_i$ is the weight applied to the difference between the i-th largest and i-th smallest values. The weights are derived from the order statistics of the standard normal distribution and can be looked up in the Shapiro-Wilk table. The number of weights, m, is determined by the sample size n as follows. (You can run the test with the function scipy.stats.shapiro(), so this book does not include the Shapiro-Wilk table; in the following example only the part of the table needed to illustrate the process is used.)

$$m=\begin{cases}\frac{n}{2}& \text{if } n \text{ is even}\\\frac{n-1}{2} & \text{if } n \text{ is odd} \end{cases}$$

In the W statistic above, the index i of $x_{(i)}$ runs over the integers 1, ..., m. The W value therefore represents the ratio of the weighted differences between the larger and smaller values to the overall spread of the data. The W for the Shapiro-Wilk test is calculated by the following process:

  1. Sort the data in ascending order.
  2. Calculate SS: \[SS=\sum^n_{i=1} (x_i-\bar{x})^2\]
  3. Calculate b, the numerator of W: $$b=\sum^m_{i=1}a_i(x_{(n+1-i)}-x_{(i)})$$
  4. Calculate the test statistic: $$ W=\frac{b^2}{SS}$$ (a small sketch combining steps 1 to 4 follows this list)
  5. Determine the p-value from the Shapiro-Wilk table.
    The Shapiro-Wilk table lists the weights ($a_i$) for each sample size and the W values corresponding to specific significance probabilities.
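
Steps 1 to 4 can be combined into a small helper, sketched below; the weights a must be supplied from the Shapiro-Wilk table for the given sample size, and the function name shapiro_w is illustrative rather than part of any library. Applied to the age data of Example 2 with the six tabulated weights, it should reproduce the W value calculated step by step there.

def shapiro_w(x, a):
    xs=np.sort(x)                                  # step 1: sort the data in ascending order
    n=len(xs)
    m=len(a)                                       # number of weights from the table
    ss=np.sum((xs-xs.mean())**2)                   # step 2: SS
    b=np.sum(a*(xs[n-1-np.arange(m)]-xs[:m]))      # step 3: b = sum of a_i*(x_(n+1-i) - x_(i))
    return b**2/ss                                 # step 4: W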

Example 2)
 Let's conduct a Shapiro-Wilk test on whether the following age data conform to the normal distribution.

age=np.array([65,61,63,86,70,55,74,35,72,68,45,58])
ageSort=np.sort(age)
da=pd.DataFrame([age, ageSort], index=['age', 'ageSort'])
da
         0   1   2   3   4   5   6   7   8   9  10  11
age      65  61  63  86  70  55  74  35  72  68  45  58
ageSort  35  45  55  58  61  63  65  68  70  72  74  86

The sample size is n = 12, so there are $m=\frac{12}{2}=6$ weights. Each weight is taken from the Shapiro-Wilk table as follows:

n=len(age)
m=np.array([n/2 if n%2==0 else (n-1)/2], dtype=np.int16)
m
array([6], dtype=int16)
ss=np.sum((ageSort-np.mean(ageSort))**2)
round(ss,3)
2008.667
a=np.array([0.5475,0.3325,0.2347,0.1586,0.0922,0.0303])
age1=[]
for i in range(m[0]):
    age1.append(ageSort[n-(i+1)]-ageSort[i])
age1 
[51, 29, 17, 12, 7, 2]
b=np.sum(a*age1)
b
44.1641
w=b**2/ss
round(w, 4)
0.971

For n = 12 the Shapiro-Wilk table gives the following significance probabilities and W values. The significance probability corresponding to the calculated w is estimated by linear interpolation between these tabulated values. First, determine the equation of the straight line through the two points; this can easily be done by hand, but here the solve() function of the sympy module is used:

tab=pd.DataFrame([[0.5, 0.943],[0.9, 0.973]], columns=['significant Prob.','W'])
tab
   significant Prob.      W
0                0.5  0.943
1                0.9  0.973
tab1=tab.values
slope=(tab1[0,0]-tab1[1,0])/(tab1[0,1]-tab1[1,1])
round(slope, 4)
13.333
x, bias=symbols('x bias')
eq=slope*x+bias
print(eq)
bias + 13.3333333333333*x
eq1=eq.subs(x,tab1[0,1])
eq1
bias+12.5733333333333
bias1=solve(Eq(eq1, tab1[0,0]), bias)
bias1
[-12.0733333333333]
linear=eq.subs(bias, bias1[0])
N(linear,4)
13.33x−12.07
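
Substituting the calculated statistic w (about 0.971) into this interpolation line gives the approximate significance probability:

N(linear.subs(x, w), 3)    # interpolated significance probability, about 0.87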

The above calculation process can be replaced by the function scipy.stats.shapiro(). This function returns the W statistic and the p-value.

statics, pVal=stats.shapiro(age)
print(f'statistics: {round(statics, 3)}, p-value: {round(pVal, 3)}')
statistics: 0.971, p-value: 0.922

The p-value of the above result differs from the approximate value calculated earlier by interpolation. However, because both values are greater than a significance level of 0.05 or 0.1, the null hypothesis that the age data follow a normal distribution cannot be rejected. These results can be compared with a Q-Q plot analysis:

val, para=stats.probplot(age, plot=plt, rvalue=True)
print(f'reg.coeff.: {np.around(para[0], 3)} \nbias: {np.around(para[1], 3)}\ncorr.coeff.: {np.around(para[2], 3)}')
reg.coeff.: 14.264 
bias: 62.667
corr.coeff.: 0.979

Example 3)
 Let's run a Shapiro-Wilk test on the na and vix data used in Example 1.

statics, pVal=stats.shapiro(na.values)
print(f'statistics: {round(statics, 3)}, p-value: {round(pVal, 6)}')
statistics: 0.967, p-value: 0.0
statics2, pVal2=stats.shapiro(vix.values)
print(f'statistics: {round(statics2, 3)}, p-value: {round(pVal2, 6)}')

At a significance level of 0.05, the above results, unlike the impression given by the Q-Q plots, do not support adopting the null hypothesis that the data conform to a normal distribution.

Kolmogorov-Smirnov Test

The Kolmogorov-Smirnov (K-S) test is used to determine whether a sample comes from a population with a specific distribution. The K-S test is based on the empirical cumulative distribution function (ECDF): given N sorted data points $Y_1, Y_2, \dots, Y_N$, the ECDF is defined as follows:

$$\text{ECDF}=\frac{n(i)}{N}$$

In the expression above, n(i) is the number of points less than $Y_i$, where the $Y_i$ are sorted in ascending order. The ECDF is a step function that increases by $\displaystyle \frac{1}{N}$ at the value of each sorted data point.

Figure 2 shows the normal cumulative distribution function and the empirical distribution function for 100 standard normal random numbers. The K-S test is based on the maximum distance between these two curves.

np.random.seed(3)
da=np.sort(np.random.randn(100))
np.around(da[:3], 3)
array([-2.916, -2.419, -2.248])
ECDF=[]
N=len(da)
for i in range(N):
    ECDF.append((i+1)/N)
ECDF[:3]
[0.01, 0.02, 0.03]
NCDF=[stats.norm.cdf(i) for i in da]
np.around(NCDF[:3], 3)
array([0.002, 0.008, 0.012])
plt.figure(figsize=(6, 4))
plt.plot(da, ECDF, color="blue", label="ECDF")
plt.plot(da, NCDF, color="red", label="NCDF")
plt.xlabel("X", size="13")
plt.ylabel("Cumul. Prob.", size="13")
plt.legend(loc="best")
plt.title("100 random number", size="14")
plt.show()
Figure 2. Empirical cumulative distribution function (ECDF) and normal cumulative distribution function (NCDF) for 100 standard normal random numbers.

This test has several important limitations.

  • It applies to continuous distributions only.
  • It is more sensitive near the center of the distribution than at the tails.
  • The distribution must be fully specified. If the location, scale, and shape parameters are estimated from the data, the critical region of the K-S test is no longer valid and generally has to be determined by simulation.

The hypotheses for the K-S test are as follows:

  • H0: Data follow a specific distribution.
  • H1: Data do not follow a particular distribution.
$$\text{Test statistic:}\quad D=\max_{1 \le i \le N} \left(F(Y_i)-\frac{i-1}{N},\; \frac{i}{N}-F(Y_i) \right)$$

F is the theoretical cumulative distribution function of the continuous distribution being tested, with location, scale, and shape parameters that are not estimated from the data. The statistic D is compared with the critical value from the Kolmogorov table to decide whether to reject H0 at the chosen significance level. The critical value is determined by the sample size and the significance level in the K-S table; if the sample size n is greater than 50, it is approximated as follows:

α         0.001      0.01       0.02       0.05       0.10       0.15       0.20
n > 50    1.95/√n    1.63/√n    1.52/√n    1.36/√n    1.22/√n    1.14/√n    1.07/√n

Comparing the test statistic D with the critical value leads to the following conclusion:

$$ D > \text{critical value} \;\rightarrow\; \text{reject } H_0$$
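
For reference, the D statistic of the formula above can be computed directly from a standardized sample; the sketch below is a generic helper (the function name ks_statistic is illustrative, not a scipy function) that compares the sorted data with the standard normal CDF, and it should give a value comparable to the statistic returned by stats.kstest() in the example that follows.

def ks_statistic(data):
    y=np.sort(data)                    # sort the standardized sample in ascending order
    N=len(y)
    i=np.arange(1, N+1)
    F=stats.norm.cdf(y)                # theoretical CDF F(Y_i)
    dPlus=np.max(i/N-F)                # i/N - F(Y_i)
    dMinus=np.max(F-(i-1)/N)           # F(Y_i) - (i-1)/N
    return max(dPlus, dMinus)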

Example 4)
K-S test for the na and vix data used in Example 1.

The analysis proceeds in the following steps:

  1. Standardize the data
  2. Sort in ascending order
  3. Calculate the standard normal cumulative and empirical cumulative distributions
  4. Calculate the test statistic
  5. Calculate the critical value

1) Nasdaq data

#standardization and sort of data
naN=(na-na.mean())/na.std()
naN2=np.sort(naN)  
np.around(naN2, 4)[:3]
array([-3.4007, -3.1598, -3.1574])
emCDF1=[]    # empirical CDF
N1=len(naN2)    # sample size
for i in range(N1): 
    emCDF1.append((i+1)/N1) 
np.around(emCDF1[:3],3)
array([0.003, 0.005, 0.008])
norCDF1=[stats.norm.cdf(i) for i in naN2]
np.around(norCDF1[:3],3)
array([0.   , 0.001, 0.001])
D1=np.max(abs(np.array(emCDF1)-np.array(norCDF1))) 
print(f'statistics: {round(D1,3)}')
statistics: 0.063
crit1=1.36/np.sqrt(N1)
print(f'critical value: {round(crit1,3)}')
critical value: 0.069

The test statistic D is less than the critical value; in other words, H0 cannot be rejected. The above analysis can be run with scipy.stats.kstest(), which returns the test statistic and the p-value. The p-value from this function is greater than the significance level of 0.05, so the same conclusion is reached.

statics1, pVal1=stats.kstest(naN2, 'norm')
print(f'statistics: {round(statics1, 3)}, p-value: {round(pVal1, 3)}')
statistics: 0.066, p-value: 0.065

2) vix data

vixN=(vix-vix.mean())/vix.std()
vixN2=np.sort(vixN) # sorting
np.around(vixN2, 4)[:3]
array([-2.8079, -2.4982, -2.3345])
emCDF=[]
N=len(vixN2)  # sample size
for i in range(N): 
    emCDF.append((i+1)/N) 
np.around(emCDF[:3],3)
array([0.003, 0.005, 0.008])
norCDF=[stats.norm.cdf(i) for i in vixN2]
np.around(norCDF[:3],3)
array([0.002, 0.006, 0.01 ])
D=np.max(abs(np.array(emCDF)-np.array(norCDF))) 
print(f'statistics: {round(D,3)}')
statistics: 0.1
crit=1.36/np.sqrt(N)
print(f'critical value: {round(crit,3)}')
critical value: 0.068
statics, pVal=stats.kstest(vixN2, 'norm')
print(f'statistics: {round(statics, 3)}, p-value: {round(pVal, 3)}')
statistics: 0.1, p-value: 0.001

Since D is greater than the critical value, the null hypothesis can be rejected. This agrees with the conclusion drawn from the p-value of stats.kstest().

Figure 3 shows the standard normal and empirical cumulative distributions for na and vix.

plt.figure(figsize=(10, 4))
plt.subplot(1,2,1)
plt.plot(naN2, emCDF1, color="blue", label="emCDF")
plt.plot(naN2, norCDF1, color="red", label="norCDF")
plt.xlabel("X", size="13")
plt.ylabel("Cumul. Prob.", size="13")
plt.legend(loc="best")
plt.title("Nasdaq K-S Test", size="14")
plt.subplot(1,2,2)
plt.plot(vixN2, emCDF, color="blue", label="emCDF")
plt.plot(vixN2, norCDF, color="red", label="norCDF")
plt.xlabel("X", size="13")
plt.legend(loc="best")
plt.title("VIX K-S Test", size="14")
plt.show()
Figure 3. Empirical and standard normal cumulative distributions of the daily change rates of the NASDAQ index and the VIX over a certain period.

In general, the Shapiro-Wilk test is conservative compared with other methods and is useful for small samples, but above a certain sample size it becomes sensitive to even small deviations from normality, which can be explored with the sketch below.
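
The following sketch (an added illustration with arbitrary parameters, reusing the modules imported above) applies stats.shapiro() to small and large samples drawn from a t-distribution with 30 degrees of freedom, which is very close to a normal distribution; with the large sample the returned p-value is typically far smaller even though the practical deviation from normality is negligible.

np.random.seed(2)
small=stats.t.rvs(df=30, size=100)       # nearly normal distribution, small sample
large=stats.t.rvs(df=30, size=5000)      # same distribution, large sample
print(stats.shapiro(small))              # p-value usually well above 0.05
print(stats.shapiro(large))              # p-value typically much smaller despite the tiny deviation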
