Normality Test
According to the central limit theorem, the distribution of sample means approaches a normal distribution as the sample size increases. However, for raw data other than means, it is sometimes important to check whether the data themselves conform to a normal distribution. For example, in regression, the differences between the observed values and the values predicted by the model are called residuals, and regression analysis is performed on the assumption that the residuals follow a normal distribution. Whether this assumption holds is a basis for judging the fit of the established model.
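As a quick illustration of the first point (a minimal sketch; the uniform source distribution and the sample sizes are arbitrary choices), the means of repeated samples drawn from a clearly non-normal distribution are themselves approximately normal:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# 1000 sample means, each from a sample of size 50 drawn from a uniform distribution
means = rng.uniform(size=(1000, 50)).mean(axis=1)
# skewness and excess kurtosis near 0 suggest an approximately normal shape
print(stats.skew(means), stats.kurtosis(means))
```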
The normality test uses the following methods:
- Quantile-Quantile (Q-Q) plot: determination by visual analysis
- Shapiro-Wilk test: primarily used for small samples (n < 2000)
- Kolmogorov-Smirnov test: used when n > 2000
Q-Q plot
A Q-Q (quantile-quantile) plot compares the distributions of two groups by dividing each data set into quantiles and plotting the inverse cumulative distribution values of the two groups against each other at the same quantile levels. The inverse cumulative distribution is the inverse function of the cumulative distribution. For example, the cumulative and inverse cumulative distributions of the standard normal distribution can be represented as in Figure 1.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
from sympy import *
```

```python
rng = np.linspace(-3, 3, 1000)
pdf = [stats.norm.pdf(i) for i in rng]
cdf = [stats.norm.cdf(i) for i in rng]
inverCdf = [stats.norm.ppf(i) for i in cdf]
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.title("(A) PDF & CDF", size=12, weight="bold")
plt.plot(rng, pdf, label="PDF")
plt.plot(rng, cdf, label="CDF")
plt.xlabel("X", size=12, weight="bold")
plt.ylabel("Probability", size=12, weight="bold")
plt.legend(loc="best")
plt.subplot(1, 2, 2)
plt.title("(B) Inverse CDF = I(X)", size=12, weight="bold")
plt.plot(cdf, inverCdf, label="Inverse CDF")
plt.xlabel("CDF", size=12, weight="bold")
plt.ylabel("Probability", size=12, weight="bold")
plt.legend(loc="best")
plt.show()
```
If two groups A and B have the same distribution, the values of their inverse cumulative distributions will have the following linear relationship:
$$\begin{align}I_A(x) &= mI_B(x)+n\\ m,n &\in \mathbb{R} \end{align}$$

If B follows a normal distribution, then A also follows a normal distribution. Of course, as a qualitative method this relationship cannot numerically express the degree of conformity, but it can establish approximate normality.
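As a brief numerical check of this linear relationship (a minimal sketch; the parameters and sample sizes are arbitrary), the quantiles of two normal samples with different means and scales are almost perfectly correlated:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(0, 1, 1000)   # standard normal sample
B = rng.normal(5, 2, 1000)   # normal sample with a different mean and scale
q = np.linspace(0.01, 0.99, 99)
IA = np.quantile(A, q)       # inverse cumulative distribution (quantiles) of A
IB = np.quantile(B, q)       # inverse cumulative distribution (quantiles) of B
# the two sets of quantiles are almost perfectly linearly related
print(np.corrcoef(IA, IB)[0, 1])
```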
The following is a Q-Q plot of uniformly distributed random numbers. The Q-Q plot is created by the scipy.stats.probplot() function. This function returns the standardized (theoretical) quantiles and the ordered original values, draws the graph, and also returns the slope and intercept of the fitted regression line together with the correlation coefficient r; the coefficient of determination (R²) is its square and lies in the range [0, 1]. In the plot produced by the following code, a value of 1 indicates a perfect fit to the regression line, and a value near 0 indicates that no relationship is identified. Therefore, a quantitative judgment is possible from this result.
```python
x = np.random.rand(1000)
val, para = stats.probplot(x, plot=plt)
print(f'standardized value: {np.around(val[0][:3], 3)}\nraw data: {np.around(val[1][:3], 3)}')
```
```
standardized value: [-3.198 -2.932 -2.784]
raw data: [0. 0.001 0.003]
```
```python
print(f'slope of line: {round(para[0], 3)}\nintercept: {round(para[1], 3)}\ndeter.coeff.: {round(para[2], 3)}')
```
```
slope of line: 0.29
intercept: 0.499
deter.coeff.: 0.974
```
Example 1)
Perform a normality test using Q-Q plots on the daily change rates of the NASDAQ index (na) and the S&P 500 volatility index, VIX (vix), over a given period.
```python
import FinanceDataReader as fdr

st = pd.Timestamp(2020, 6, 1)
et = pd.Timestamp(2021, 12, 16)
naO = fdr.DataReader("IXIC", st, et)
vixO = fdr.DataReader("VIX", st, et)
na = (naO["Close"] - naO["Open"]) / naO["Open"] * 100
vix = (vixO["Close"] - vixO["Open"]) / vixO["Open"] * 100
na.tail(2)
```
```
Date
2021-12-15    2.198848
2021-12-16   -2.870927
dtype: float64
```
```python
vix.tail(2)
```
```
Date
2021-12-15   -10.694444
2021-12-16    10.829741
dtype: float64
```
```python
val, para = stats.probplot(na, plot=plt)
np.around(para, 3)
```
array([0.99 , 0.004, 0.983])
```python
val2, para2 = stats.probplot(vix, plot=plt)
np.around(para2, 3)
```
array([ 7.109, -0.646, 0.931])
Shapiro-Wilk Test
The Shapiro-Wilk test checks samples $x_1, x_2, \cdots, x_n$ for conformity with the normal distribution, testing the null hypothesis (H0: the data follow a normal distribution) with the following W statistic:
$$\begin{align}&W=\frac{\left(\sum^n_{i=1} a_ix_{(i)}\right)^2}{\sum^n_{i=1} (x_i-\bar{x})^2}\\ &x_{(i)}=x_{n+1-i}-x_i\\ &n: \text{sample size} \end{align}$$

In the expression above, $a_i$ is the weight for the difference between the two values. The weights are calculated from statistics such as the mean and standard deviation of the sorted data and can be looked up in the Shapiro-Wilk table. In addition, m is calculated from the sample size as follows. (The test can be performed with the function scipy.stats.shapiro(), so this book does not reproduce the Shapiro-Wilk table; the following example uses only the parts of the table needed to illustrate the process.)
$$m=\begin{cases}\frac{n}{2}& \text{for even } n\\ \frac{n-1}{2} & \text{for odd } n \end{cases}$$
In the W statistic above, the index i of $x_{(i)}$ runs over the integers in [1, m]. That is, the W value represents the ratio of the weighted differences between paired small and large values to the overall spread of the data. The W for the Shapiro-Wilk test is calculated by the following process:
- Sort the data
- Calculate SS $$SS=\sum^n_{i=1} (x_i-\bar{x})^2$$
- Calculate b, whose square is the numerator of W $$b=\sum^m_{i=1}a_i(x_{n+1-i}-x_i)$$
- Calculate the test statistic (W) $$W=\frac{b^2}{SS}$$
- Determine the p-value from the Shapiro-Wilk table
The Shapiro-Wilk table provides the weights ($a_i$) for a given sample size and the W values corresponding to specific significance probabilities.
Example 2)
Let's conduct a Shapiro-Wilk test to determine whether the following age data conform to a normal distribution.
```python
age = np.array([65, 61, 63, 86, 70, 55, 74, 35, 72, 68, 45, 58])
ageSort = np.sort(age)
da = pd.DataFrame([age, ageSort], index=['age', 'ageSort'])
da
```
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| age | 65 | 61 | 63 | 86 | 70 | 55 | 74 | 35 | 72 | 68 | 45 | 58 |
| ageSort | 35 | 45 | 55 | 58 | 61 | 63 | 65 | 68 | 70 | 72 | 74 | 86 |
Since the sample size is n = 12, there are $m=\frac{12}{2}=6$ weights. Each weight is determined from the Shapiro-Wilk table as follows:
```python
n = len(age)
m = np.array([n/2 if n % 2 == 0 else (n-1)/2], dtype=np.int16)
m
```
array([6], dtype=int16)
```python
ss = np.sum((ageSort - np.mean(ageSort))**2)
round(ss, 3)
```
2008.667
```python
# weights a_i for n = 12 from the Shapiro-Wilk table
a = np.array([0.5475, 0.3325, 0.2347, 0.1586, 0.0922, 0.0303])
age1 = []
for i in range(m[0]):
    age1.append(ageSort[n-(i+1)] - ageSort[i])
age1
```
[51, 29, 17, 12, 7, 2]
```python
b = np.sum(a*age1)
b
```
44.1641
```python
w = b**2/ss
round(w, 4)
```
0.971
For n = 12, the Shapiro-Wilk table gives the significance probabilities and W values shown below. The significance probability for the calculated w is then estimated by linear interpolation between these tabulated points. First, a straight-line equation is determined from the two tabulated values; this could easily be done by hand, but here the solve() function of the sympy module is used:
```python
tab = pd.DataFrame([[0.5, 0.943], [0.9, 0.973]], columns=['significant Prob.', 'W'])
tab
```
| | significant Prob. | W |
|---|---|---|
| 0 | 0.5 | 0.943 |
| 1 | 0.9 | 0.973 |
```python
tab1 = tab.values
slope = (tab1[0,0] - tab1[1,0]) / (tab1[0,1] - tab1[1,1])
round(slope, 4)
```
13.3333
```python
x, bias = symbols('x bias')
eq = slope*x + bias
print(eq)
```
bias + 13.3333333333333*x
```python
eq1 = eq.subs(x, tab1[0,1])
eq1
```
bias + 12.5733333333333
```python
bias1 = solve(Eq(eq1, tab1[0,0]), bias)
bias1
```
[-12.0733333333333]
```python
linear = eq.subs(bias, bias1[0])
N(linear, 4)
```
13.33x - 12.07
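Substituting the computed W value (w ≈ 0.971) into this line completes the interpolation and gives the approximate significance probability, about 0.87; a minimal sketch using the names defined above:

```python
# substitute the computed w into the interpolating line; approximately 0.87
N(linear.subs(x, w), 3)
```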
The above calculation process can be replaced by the function scipy.stats.shapiro(). This function returns the W statistic and the p-value.
```python
statics, pVal = stats.shapiro(age)
print(f'statistics: {round(statics, 3)}, p-value: {round(pVal, 3)}')
```
statistics: 0.971, p-value: 0.922
The p-value above differs from the approximate value calculated earlier by interpolation. However, because both values are greater than a significance level of 0.05 or 0.1, the null hypothesis that the age data follow a normal distribution cannot be rejected. Comparing these results with a Q-Q plot analysis:
```python
val, para = stats.probplot(age, plot=plt, rvalue=True)
print(f'reg.coeff.: {np.around(para[0], 3)}\nbias: {np.around(para[1], 3)}\ndeter.coeff.: {np.around(para[2], 3)}')
```
```
reg.coeff.: 14.264
bias: 62.667
deter.coeff.: 0.979
```
Example 3)
Let's run a Shapiro-Wilk test on the na and vix data used in Example 1.
```python
statics, pVal = stats.shapiro(na.values)
print(f'statistics: {round(statics, 3)}, p-value: {round(pVal, 6)}')
```
statistics: 0.967, p-value: 0.0
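The same call can be applied to vix; a sketch (run it to obtain the corresponding statistic and p-value):

```python
statics2, pVal2 = stats.shapiro(vix.values)
print(f'statistics: {round(statics2, 3)}, p-value: {round(pVal2, 6)}')
```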
At a significance level of 0.05, these results, unlike the Q-Q plot, do not allow us to accept the null hypothesis that the data conform to a normal distribution.
Kolmogorov-Smirnov Test
The Kolmogorov-Smirnov (K-S) test is used to determine whether a sample comes from a population with a specific distribution. The K-S test is based on the empirical cumulative distribution function (ECDF). Given N sorted data points $Y_1, Y_2, \cdots, Y_N$, the ECDF is defined as follows:
$$\text{ECDF}=\frac{n(i)}{N}$$

In the expression above, $n(i)$ is the number of points less than $Y_i$, with the $Y_i$ sorted in ascending order. The ECDF is a step function that increases by $\displaystyle \frac{1}{N}$ at the value of each sorted data point.
Figure 2 shows plots of the normal cumulative and empirical cumulative distribution functions for 100 standard normal random numbers. The K-S test is based on the maximum distance between these two curves.
```python
np.random.seed(3)
da = np.sort(np.random.randn(100))
np.around(da[:3], 3)
```
array([-2.916, -2.419, -2.248])
```python
ECDF = []
N = len(da)
for i in range(N):
    ECDF.append((i+1)/N)
ECDF[:3]
```
[0.01, 0.02, 0.03]
```python
NCDF = [stats.norm.cdf(i) for i in da]
np.around(NCDF[:3], 3)
```
array([0.002, 0.008, 0.012])
```python
plt.figure(figsize=(6, 4))
plt.plot(da, ECDF, color="blue", label="ECDF")
plt.plot(da, NCDF, color="red", label="NCDF")
plt.xlabel("X", size="13")
plt.ylabel("Cumul. Prob.", size="13")
plt.legend(loc="best")
plt.title("100 random number", size="14")
plt.show()
```
This test has several important limitations.
- It applies only to continuous distributions.
- It is more sensitive near the center of the distribution than at the tails.
- The distribution must be fully specified. In other words, if the location, scale, and shape parameters are estimated from the data, the critical region of the K-S test is no longer valid; in that case the critical value should generally be determined by simulation, as in the sketch below.
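The following is a minimal sketch of such a simulation (a Lilliefors-style parametric bootstrap is assumed here; the function name, n_sim, and the seed are illustrative choices, not from the original):

```python
import numpy as np
from scipy import stats

def simulated_ks_critical(n, alpha=0.05, n_sim=2000, seed=0):
    """Simulated critical value of the K-S statistic for samples of size n
    whose mean and standard deviation are estimated from the data."""
    rng = np.random.default_rng(seed)
    d = []
    for _ in range(n_sim):
        sample = rng.normal(size=n)
        # re-estimate the parameters from each simulated sample,
        # exactly as is done with the observed data
        z = (sample - sample.mean()) / sample.std(ddof=1)
        d.append(stats.kstest(z, 'norm').statistic)
    return np.quantile(d, 1 - alpha)

print(round(simulated_ks_critical(100), 3))
```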
The hypotheses for the K-S test are as follows:
- H0: Data follow a specific distribution.
- H1: Data do not follow a particular distribution.
Let F be the theoretical cumulative distribution function of the continuous variable, with the location, scale, and shape parameters not estimated from the data. The test statistic is the maximum distance between the theoretical and empirical distribution functions: $$D=\max_{1 \le i \le N} \left|F(Y_i)-\text{ECDF}(Y_i)\right|$$ The critical value from the Kolmogorov table determines whether to reject the null hypothesis at the chosen significance level. The critical value depends on the sample size and the significance level, but if the sample size is greater than 50, it is calculated as follows:
| n \ α | 0.001 | 0.01 | 0.02 | 0.05 | 0.1 | 0.15 | 0.2 |
|---|---|---|---|---|---|---|---|
| n > 50 | $\frac{1.95}{\sqrt{n}}$ | $\frac{1.63}{\sqrt{n}}$ | $\frac{1.52}{\sqrt{n}}$ | $\frac{1.36}{\sqrt{n}}$ | $\frac{1.22}{\sqrt{n}}$ | $\frac{1.14}{\sqrt{n}}$ | $\frac{1.07}{\sqrt{n}}$ |
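For convenience, the large-sample row of this table can be wrapped in a small helper function (a sketch; the coefficients are simply the tabulated values above):

```python
import numpy as np

# coefficients from the n > 50 row of the K-S table
KS_COEFF = {0.001: 1.95, 0.01: 1.63, 0.02: 1.52, 0.05: 1.36,
            0.1: 1.22, 0.15: 1.14, 0.2: 1.07}

def ks_critical(n, alpha=0.05):
    """Approximate K-S critical value for sample size n > 50."""
    return KS_COEFF[alpha] / np.sqrt(n)

print(round(ks_critical(100), 3))  # 1.36/10 = 0.136
```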
Comparing the test statistic D with the critical value leads to the following conclusion:
$$D > \text{critical value} \rightarrow \text{reject } H_0$$

Example 4)
K-S test for the na and vix used in Example 1.
The analysis proceeds through the following steps:
- Standardize the data
- Sort in ascending order
- Calculate the standard normal cumulative and empirical cumulative distributions
- Calculate the test statistic
- Calculate the critical value
1) Nasdaq data
```python
# standardization and sorting of data
naN = (na - na.mean())/na.std()
naN2 = np.sort(naN)
np.around(naN2, 4)[:3]
```
array([-3.4007, -3.1598, -3.1574])
```python
emCDF1 = []        # empirical CDF
N1 = len(naN2)     # sample size
for i in range(N1):
    emCDF1.append((i+1)/N1)
np.around(emCDF1[:3], 3)
```
array([0.003, 0.005, 0.008])
```python
norCDF1 = [stats.norm.cdf(i) for i in naN2]
np.around(norCDF1[:3], 3)
```
array([0. , 0.001, 0.001])
```python
D1 = np.max(abs(np.array(emCDF1) - np.array(norCDF1)))
print(f'statistics: {round(D1, 3)}')
```
statistics: 0.063
```python
crit1 = 1.36/np.sqrt(N1)
print(f'critical value: {round(crit1, 3)}')
```
critical value: 0.069
The test statistic D is less than the critical value; in other words, H0 cannot be rejected. The above analysis can be run with scipy.stats.kstest(), which returns the test statistic and the p-value. Here the p-value is greater than the significance level of 0.05, so the same conclusion is reached.
```python
statics1, pVal1 = stats.kstest(naN2, 'norm')
print(f'statistics: {round(statics1, 3)}, p-value: {round(pVal1, 3)}')
```

```
statistics: 0.066, p-value: 0.065
```
2) vix data
```python
vixN = (vix - vix.mean())/vix.std()  # standardization
vixN2 = np.sort(vixN)                # sorting
np.around(vixN2, 4)[:3]
```
array([-2.8079, -2.4982, -2.3345])
```python
emCDF = []        # empirical CDF
N = len(vixN2)    # sample size
for i in range(N):
    emCDF.append((i+1)/N)
np.around(emCDF[:3], 3)
```
array([0.003, 0.005, 0.008])
```python
norCDF = [stats.norm.cdf(i) for i in vixN2]
np.around(norCDF[:3], 3)
```

```
array([0.002, 0.006, 0.01 ])
```

```python
D = np.max(abs(np.array(emCDF) - np.array(norCDF)))
print(f'statistics: {round(D, 3)}')
```
statistics: 0.1
```python
crit = 1.36/np.sqrt(N)
print(f'critical value: {round(crit, 3)}')
```
critical value: 0.068
```python
statics, pVal = stats.kstest(vixN2, 'norm')
print(f'statistics: {round(statics, 3)}, p-value: {round(pVal, 3)}')
```

```
statistics: 0.1, p-value: 0.001
```
D > critical value, so the null hypothesis can be rejected. This is the same conclusion indicated by the p-value from stats.kstest().
Figure 3 shows the standard normal and empirical cumulative distributions for na and vix.
```python
plt.figure(figsize=(10, 4))
plt.subplot(1, 2, 1)
plt.plot(naN2, emCDF1, color="blue", label="emCDF")
plt.plot(naN2, norCDF1, color="red", label="norCDF")
plt.xlabel("X", size="13")
plt.ylabel("Cumul. Prob.", size="13")
plt.legend(loc="best")
plt.title("Nasdaq K-S Test", size="14")
plt.subplot(1, 2, 2)
plt.plot(vixN2, emCDF, color="blue", label="emCDF")
plt.plot(vixN2, norCDF, color="red", label="norCDF")
plt.xlabel("X", size="13")
plt.legend(loc="best")
plt.title("VIX K-S Test", size="14")
plt.show()
```
In general, the Shapiro-Wilk test is conservative compared with the other methods and is useful for small samples, but it becomes overly sensitive once the sample size exceeds a certain level, as illustrated below.
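This size sensitivity is easy to demonstrate (a minimal sketch; the slightly heavy-tailed t distribution and the sample sizes are arbitrary choices): for a population that deviates only trivially from normal, the Shapiro-Wilk p-value typically collapses as n grows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
for n in (50, 500, 4000):
    x = rng.standard_t(df=20, size=n)  # close to normal but slightly heavy-tailed
    stat, p = stats.shapiro(x)
    # the p-value typically shrinks as n grows, rejecting a trivial deviation
    print(n, round(p, 4))
```

This behavior is consistent with the earlier guideline of preferring the K-S test when n > 2000.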