
Normality Test

Contents

  1. Q-Q plot
  2. Shapiro-Wilk test
  3. Kolmogorov-Smirnov Test

Normality Test

According to the central limit theorem, the distribution of sample means approaches a normal distribution as the number of samples increases. For raw data other than means, however, it is often important to check whether the data themselves conform to a normal distribution. For example, in regression the differences between the observed values and the values predicted by the regression model are called residuals, and the analysis is performed under the assumption that the residuals follow a normal distribution. Whether this assumption holds is one basis for judging the fit of the established model.

The normality test uses the following methods (a short scipy preview follows this list):

  • Quantile-Quantile (Q-Q) plot: judgment by visual analysis
  • Shapiro-Wilk test: primarily used for small samples (n < 2000)
  • Kolmogorov-Smirnov test: used when n > 2000
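The snippet below is a minimal preview of these three approaches using scipy and matplotlib, applied to an arbitrary synthetic sample; the variable names are chosen only for illustration, and each method is treated in detail in the sections that follow.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

sample = np.random.randn(300)            # synthetic data, for illustration only
stats.probplot(sample, plot=plt)         # 1. Q-Q plot (visual judgment)
plt.show()
w, p_sw = stats.shapiro(sample)          # 2. Shapiro-Wilk test (small samples)
d, p_ks = stats.kstest(sample, 'norm')   # 3. Kolmogorov-Smirnov test (large samples)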

Q-Q plot

A Q-Q (Quantile-Quantile) plot compares the distributions of two groups by dividing each data set into quantiles and plotting, for the same fraction values, the quantiles of one group against those of the other; the quantiles are obtained from the inverse cumulative distribution. The inverse cumulative distribution is the inverse function of the cumulative distribution. For example, the cumulative and inverse cumulative distributions of the standard normal distribution can be represented as in Figure 1.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats 
from sympy import *
rng=np.linspace(-3, 3, 1000)
pdf=[stats.norm.pdf(i) for i in rng]        # probability density function
cdf=[stats.norm.cdf(i) for i in rng]        # cumulative distribution function
inverCdf=[stats.norm.ppf(i) for i in cdf]   # inverse CDF (quantile function)
plt.figure(figsize=(10, 4))
plt.subplot(1,2,1)
plt.title("(A) PDF & CDF", size=12, weight="bold")
plt.plot(rng, pdf, label="PDF")
plt.plot(rng, cdf, label="CDF")
plt.xlabel("X", size=12, weight="bold")
plt.ylabel("Probability", size=12, weight="bold")
plt.legend(loc="best")
plt.subplot(1,2,2)
plt.title("(B) Inverse CDF = I(X)", size=12, weight="bold")
plt.plot(cdf, inverCdf, label="Inverse CDF")
plt.xlabel("CDF", size=12, weight="bold")
plt.ylabel("Probability", size=12, weight="bold")
plt.legend(loc="best")
plt.show()
Figure 1. (A) The PDF and CDF, and (B) the inverse cumulative distribution of the standard normal distribution.

If two groups A and B have the same distribution, their inverse cumulative distributions (quantile functions) have the following linear relationship:

$$\begin{align}I_A(x) &= mI_B(x)+n\\ m,n &\in \mathbb{R} \end{align}$$

If B follows a normal distribution, then A also follows a normal distribution. Because this is a qualitative (visual) method, it cannot numerically express the degree of conformity, but it can be used to judge approximate normality.
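As a quick numerical check of this linear relationship, the sketch below (with arbitrarily chosen parameters) compares the quantile function of A ~ N(5, 2²) with that of the standard normal B ~ N(0, 1); the quantiles lie exactly on the line with slope 2 and intercept 5.

p=np.linspace(0.01, 0.99, 99)
qA=stats.norm.ppf(p, loc=5, scale=2)   # I_A(p) for A ~ N(5, 2^2)
qB=stats.norm.ppf(p)                   # I_B(p) for B ~ N(0, 1)
np.allclose(qA, 2*qB+5)
True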

The following is a Q-Q plot of uniform random numbers on the interval [0, 1). The Q-Q plot is created by the scipy.stats.probplot() function. This function returns the standardized (theoretical) quantiles and the ordered sample values, draws the plot, and also returns the slope and intercept of the fitted regression line together with the correlation coefficient r of the fit; the square of r is the coefficient of determination (R²), which lies in the range [0, 1]. In the resulting plot of the following code, this value is 1 if the points fall perfectly on the regression line and close to 0 if no linear relationship exists. Therefore, a quantitative judgment is possible from this result as well.

x=np.random.rand(1000)
val, para=stats.probplot(x, plot=plt)
print(f'standardized value: {np.around(val[0][:3], 3)}\nraw data: {np.around(val[1][:3], 3)}')
standardized value: [-3.198 -2.932 -2.784]
raw data: [0.    0.001 0.003]
print(f'slope of line: {round(para[0],3)} \nintercept: {round(para[1],3) } \ndeter.coeff.: {round(para[2],3)}')
slope of line: 0.29 
intercept: 0.499 
deter.coeff.: 0.974

Example 1)
  Perform a normality test by Q-Q plot on the daily variation rates of the NASDAQ index (na) and the S&P 500 volatility index VIX (vix) over a period of time.

import FinanceDataReader as fdr
st=pd.Timestamp(2020,6,1)
et=pd.Timestamp(2021, 12, 16)
naO=fdr.DataReader("IXIC",st, et)
vixO=fdr.DataReader("VIX", st, et)
na=(naO["Close"]-naO["Open"])/naO["Open"]*100
vix=(vixO["Close"]-vixO["Open"])/vixO["Open"]*100
na.tail(2)
Date
2021-12-15    2.198848
2021-12-16   -2.870927
dtype: float64
vix.tail(2)
Date
2021-12-15   -10.694444
2021-12-16    10.829741
dtype: float64
val, para=stats.probplot(na, plot=plt)
np.around(para, 3)
array([0.99 , 0.004, 0.983])
val2, para2=stats.probplot(vix, plot=plt)
np.around(para2,3)
array([ 7.109, -0.646,  0.931])

Shapiro-Wilk test

This test checks samples $x_1, x_2, \cdots, x_n$ for conformity with the normal distribution, testing the null hypothesis (H0: the data follow a normal distribution) with the following W statistic:

$$\begin{align}&W=\frac{\left(\sum^m_{i=1} a_i\left(x_{(n+1-i)}-x_{(i)}\right)\right)^2}{\sum^n_{i=1} (x_i-\bar{x})^2}\\ &x_{(i)}: i\text{-th smallest value of the sorted sample}\\ &n: \text{sample size} \end{align}$$

In the expression above, $a_i$ is the weight for the difference between the paired values. The weights are calculated from statistics such as the mean and standard deviation of the sorted data and can be looked up in the Shapiro-Wilk table. The number of weights, m, is determined from the sample size as follows. (In practice you can run the test with the function scipy.stats.shapiro(), so the full Shapiro-Wilk table is not reproduced here; in the following example only the part of the table that is needed is used to illustrate the process.)

$$m=\begin{cases}\frac{n}{2}& \text{for even } n\\\frac{n-1}{2} & \text{for odd } n \end{cases}$$

In the W statistic above, the index i of $x_{(i)}$ runs over the integers 1 to m. That is, W is the ratio of the squared weighted sum of differences between paired large and small values to the overall spread of the data. The W statistic for the Shapiro-Wilk test is calculated by the following process (a compact code sketch follows the list):

  1. Sort the data.
  2. Calculate SS: $$SS=\sum^n_{i=1} (x_i-\bar{x})^2$$
  3. Calculate b, the numerator of W: $$b=\sum^m_{i=1}a_i(x_{(n+1-i)}-x_{(i)})$$
  4. Calculate the test statistic: $$W=\frac{b^2}{SS}$$
  5. Determine the p-value based on the Shapiro-Wilk table.
    The Shapiro-Wilk table lists the weights ($a_i$) for each sample size and the W values corresponding to specific significance probabilities.
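Putting steps 1–4 together, a compact sketch of the W computation could look as follows; the weight vector a must still be taken from the Shapiro-Wilk table for the given n, and the function name is only for illustration.

def shapiro_W(x, a):
    # sketch of the Shapiro-Wilk W statistic for data x and table weights a (len(a) == m)
    xs=np.sort(np.asarray(x, dtype=float))          # 1. sort the data
    ss=np.sum((xs-xs.mean())**2)                    # 2. SS
    m=len(a)
    b=np.sum(np.asarray(a)*(xs[::-1][:m]-xs[:m]))   # 3. b = sum a_i (x_(n+1-i) - x_(i))
    return b**2/ss                                  # 4. W = b^2/SS

Calling shapiro_W(age, a) with the age data and the weights of the next example reproduces the W value computed step by step below.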

Example 2)
 Let's conduct a Shapiro-Wilk test to check whether the following age data follow a normal distribution.

age=np.array([65,61,63,86,70,55,74,35,72,68,45,58])
ageSort=np.sort(age)
da=pd.DataFrame([age, ageSort], index=['age', 'ageSort'])
da
          0   1   2   3   4   5   6   7   8   9  10  11
age      65  61  63  86  70  55  74  35  72  68  45  58
ageSort  35  45  55  58  61  63  65  68  70  72  74  86

The sample size is n = 12, so there are $m=\frac{12}{2}=6$ weights. Each weight is taken from the Shapiro-Wilk table as follows:

n=len(age)
m=np.array([n/2 if n%2==0 else (n-1)/2], dtype=np.int16)
m
array([6], dtype=int16)
ss=np.sum((ageSort-np.mean(ageSort))**2)
round(ss,3)
2008.667
a=np.array([0.5475,0.3325,0.2347,0.1586,0.0922,0.0303])   # weights a_i for n=12 from the Shapiro-Wilk table
age1=[]
for i in range(m[0]):
    age1.append(ageSort[n-(i+1)]-ageSort[i])   # x_(n+1-i) - x_(i)
age1 
[51, 29, 17, 12, 7, 2]
b=np.sum(a*age1)
b
44.1641
w=b**2/ss
round(w, 4)
0.971

For n = 12, the Shapiro-Wilk table gives the following pairs of significance probability and W value. The significance probability for the W calculated above can be estimated by linear interpolation between these values. First, determine the equation of the straight line through the two points; this can easily be done by hand, but here the solve() function of the sympy module is used:

tab=pd.DataFrame([[0.5, 0.943],[0.9, 0.973]], columns=['significant Prob.','W'])
tab
   significant Prob.      W
0                0.5  0.943
1                0.9  0.973
tab1=tab.values
slope=(tab1[0,0]-tab1[1,0])/(tab1[0,1]-tab1[1,1])
round(slope, 4)
13.333
x, bias=symbols('x bias')
eq=slope*x+bias
print(eq)
bias + 13.3333333333333*x
eq1=eq.subs(x,tab1[0,1])
eq1
bias+12.5733333333333
bias1=solve(Eq(eq1, tab1[0,0]), bias)
bias1
[-12.0733333333333]
linear=eq.subs(bias, bias1[0])
N(linear,4)
13.33x−12.07
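Substituting the W value computed above into this interpolation line gives the approximate significance probability, about 0.87:

pApprox=linear.subs(x, w)   # interpolated p-value at W ≈ 0.971
N(pApprox, 3)
0.874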

The above calculation process can be replaced by the function scipy.stats.shapiro(). This function returns the W statistic and the p-value.

statics, pVal=stats.shapiro(age)
print(f'statistics: {round(statics, 3)}, p-value: {round(pVal, 3)}')
statistics: 0.971, p-value: 0.922

The p-value of this result differs somewhat from the approximate p-value calculated above by interpolation. However, because both values are greater than the significance levels of 0.05 or 0.1, you cannot reject the null hypothesis that the age data follow a normal distribution. Comparing these results with a Q-Q plot analysis:

val, para=stats.probplot(age, plot=plt, rvalue=True)
print(f'reg.coeff.: {np.around(para[0], 3)} \nbias: {np.around(para[1], 3)}\ndeter.coeff.: {np.around(para[2], 3)}')
reg.coeff.: 14.264 
bias: 62.667
deter.coeff.: 0.979

Example 3)
 Let's run a Shapiro-Wilk test on the na and vix series used in Example 1.

statics, pVal=stats.shapiro(na.values)
print(f'statistics: {round(statics, 3)}, p-value: {round(pVal, 6)}')
statistics: 0.967, p-value: 0.0

At a significance level of 0.05, this result rejects the null hypothesis that the daily variation rates conform to a normal distribution, even though the Q-Q plot in Example 1 (r = 0.983 for na) suggested an approximately normal distribution.

Kolmogorov-Smirnov Test

The Kolmogorov-Smirnov (K-S) test is used to determine whether a sample comes from a population with a specific distribution. The K-S test is based on the empirical cumulative distribution function (ECDF): given N sorted data points Y1, Y2, …, YN, the ECDF is defined as follows:

$$\text{ECDF}=\frac{n(i)}{N}$$

In the expression above, n(i) is the number of points less than $Y_i$, where the $Y_i$ are sorted in ascending order. The ECDF is a step function that increases by $\displaystyle \frac{1}{N}$ at the value of each sorted data point.

Figure 2 shows the normal cumulative and empirical cumulative distribution functions for 100 standard normal random numbers. The K-S test is based on the maximum distance between these two curves.

np.random.seed(3)
da=np.sort(np.random.randn(100))
np.around(da[:3], 3)
array([-2.916, -2.419, -2.248])
ECDF=[]
N=len(da)
for i in range(N):
    ECDF.append((i+1)/N)
ECDF[:3]
[0.01, 0.02, 0.03]
NCDF=[stats.norm.cdf(i) for i in da]
np.around(NCDF[:3], 3)
array([0.002, 0.008, 0.012])
plt.figure(figsize=(6, 4))
plt.plot(da, ECDF, color="blue", label="ECDF")
plt.plot(da, NCDF, color="red", label="NCDF")
plt.xlabel("X", size="13")
plt.ylabel("Cumul. Prob.", size="13")
plt.legend(loc="best")
plt.title("100 random number", size="14")
plt.show()
Figure 2. Empirical cumulative distribution (ECDF) and normal cumulative distribution function (NCDF) for standard normal random numbers.

This test has several important limitations:

  • It applies only to continuous distributions.
  • It is more sensitive near the center of the distribution than at the tails.
  • The distribution must be fully specified. In other words, if location, scale, and shape parameters are estimated from the data, the critical region of the K-S test is no longer valid; it generally has to be determined by simulation.

The hypotheses for the K-S test are as follows:

  • H0: Data follow a specific distribution.
  • H1: Data do not follow a particular distribution.
$$\text{test statistic: } D=\max_{1 \le i \le N}\left(F(Y_i)-\frac{i-1}{N},\; \frac{i}{N}-F(Y_i) \right)$$

F is the theoretical cumulative distribution function of the continuous distribution being tested, and its location, scale, and shape parameters must not be estimated from the data. The test statistic D is compared with the critical value from the Kolmogorov table to decide whether to reject the null hypothesis at the chosen significance level. The critical value is determined by the sample size and significance level from the K-S table; if the sample size is greater than 50, it is calculated as follows:

α:        0.001      0.01       0.02       0.05       0.10       0.15       0.20
n > 50:   1.95/√n    1.63/√n    1.52/√n    1.36/√n    1.22/√n    1.14/√n    1.07/√n
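For reference, the numerators in this table can be reproduced (approximately) from the large-sample approximation $c(\alpha)=\sqrt{-\ln(\alpha/2)/2}$, so that the critical value is $c(\alpha)/\sqrt{n}$; this formula is used here only as an assumed check of the table entries.

for alpha in [0.001, 0.01, 0.02, 0.05, 0.1, 0.15, 0.2]:
    print(alpha, round(np.sqrt(-np.log(alpha/2)/2), 2))
0.001 1.95
0.01 1.63
0.02 1.52
0.05 1.36
0.1 1.22
0.15 1.14
0.2 1.07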

The comparison of test statistics D and critical value leads to the following conclusions:

$$ D > \text{critical value} \rightarrow \text{reject } H_0$$

Example 4)
K-S test for the na and vix series used in Example 1.

The analysis proceeds through the following steps:

  1. Standardize the data
  2. Sort in ascending order
  3. Calculate the standard normal cumulative and empirical cumulative distributions
  4. Calculate the test statistic
  5. Calculate the critical value

1) Nasdaq data

#standardization and sort of data
naN=(na-na.mean())/na.std()
naN2=np.sort(naN)  
np.around(naN2, 4)[:3]
array([-3.4007, -3.1598, -3.1574])
emCDF1=[]    # empirical CDF
N1=len(naN2)    # sample size
for i in range(N1): 
    emCDF1.append((i+1)/N1) 
np.around(emCDF1[:3],3)
array([0.003, 0.005, 0.008])
norCDF1=[stats.norm.cdf(i) for i in naN2]
np.around(norCDF1[:3],3)
array([0.   , 0.001, 0.001])
D1=np.max(abs(np.array(emCDF1)-np.array(norCDF1))) 
print(f'statistics: {round(D1,3)}')
statistics: 0.063
crit1=1.36/np.sqrt(N1)
print(f'critical value: {round(crit1,3)}')
critical value: 0.069

The test statistic D is less than the critical value; in other words, H0 cannot be rejected. The same analysis can be run with scipy.stats.kstest(), which returns the test statistic and the p-value. (The slightly larger statistic reported by kstest arises because it also considers the distances $F(Y_i)-\frac{i-1}{N}$, as in the definition of D above.) Since the p-value is above the significance level of 0.05, the same conclusion is reached.

statics1, pVal1=stats.kstest(naN2, 'norm')
print(f'statistics: {round(statics1, 3)}, p-value: {round(pVal1, 3)}')
statistics: 0.066, p-value: 0.065

2) vix data

vixN=(vix-vix.mean())/vix.std()
vixN2=np.sort(vixN) # sorting
np.around(vixN2, 4)[:3]
array([-2.8079, -2.4982, -2.3345])
emCDF=[]
N=len(vixN2)  # sample size
for i in range(N): 
    emCDF.append((i+1)/N) 
np.around(emCDF[:3],3)
array([0.003, 0.005, 0.008])
norCDF=[stats.norm.cdf(i) for i in vixN2]
np.around(norCDF[:3],3)
array([0.002, 0.006, 0.01 ])
D=np.max(abs(np.array(emCDF)-np.array(norCDF))) 
print(f'statistics: {round(D,3)}')
statistics: 0.1
crit=1.36/np.sqrt(N)
print(f'critical value: {round(crit,3)}')
critical value: 0.068
statics, pVal=stats.kstest(vixN2, 'norm')
print(f'statistics: {round(statics, 3)}, p-value: {round(pVal, 3)}')
statistics: 0.1, p-value: 0.001

Since D is greater than the critical value, the null hypothesis can be rejected. This agrees with the conclusion from the p-value returned by stats.kstest().

Figure 3 shows the standard normal and empirical cumulative distributions for na and vix.

plt.figure(figsize=(10, 4))
plt.subplot(1,2,1)
plt.plot(naN2, emCDF1, color="blue", label="emCDF")
plt.plot(naN2, norCDF1, color="red", label="norCDF")
plt.xlabel("X", size="13")
plt.ylabel("Cumul. Prob.", size="13")
plt.legend(loc="best")
plt.title("Nasdaq K-S Test", size="14")
plt.subplot(1,2,2)
plt.plot(vixN2, emCDF, color="blue", label="emCDF")
plt.plot(vixN2, norCDF, color="red", label="norCDF")
plt.xlabel("X", size="13")
plt.legend(loc="best")
plt.title("VIX K-S Test", size="14")
plt.show()
Figure 3. Empirical cumulative distribution and normal cumulative distribution of the daily variation rates of the NASDAQ index and VIX over the analysis period.

Generally, the Shapiro-Wilk test is conservative compared with other methods and is useful for small data sets, but it becomes sensitive to sample size once the sample grows beyond a certain size.
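A rough illustration of this size sensitivity is sketched below, assuming slightly heavy-tailed data drawn from a t-distribution with 10 degrees of freedom; the exact p-values depend on the random seed, but with a small sample the test typically fails to reject normality, while the same mild deviation is typically rejected once the sample is large.

np.random.seed(1)
small=stats.t.rvs(df=10, size=30)     # mild heavy tails, small sample
large=stats.t.rvs(df=10, size=3000)   # same distribution, large sample
_, pSmall=stats.shapiro(small)
_, pLarge=stats.shapiro(large)
print(round(pSmall, 3), round(pLarge, 6))   # pSmall typically > 0.05, pLarge typically < 0.05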
