
Normality Test

Contents

  1. Q-Q plot
  2. Shapiro-Wilk test
  3. Kolmogorov-Smirnov Test

Normality Test

According to the central limit theorem, the distribution of sample means approaches a normal distribution as the number of samples increases. For raw data other than means, however, it is often important to check whether the data themselves conform to a normal distribution. For example, in regression the differences between the observed values and the values predicted by the regression model are called residuals, and the analysis is performed under the assumption that the residuals follow a normal distribution. Whether this assumption holds is one basis for judging the fit of the established model.

The normality test uses the following methods (a short scipy preview follows this list):

  • Quantile-Quantile (Q-Q) plot: judgment by visual analysis
  • Shapiro-Wilk test: primarily used for small samples (n < 2000)
  • Kolmogorov-Smirnov test: used when n > 2000
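The snippet below is a minimal preview of these three approaches using scipy and matplotlib, applied to an arbitrary synthetic sample; the variable names are chosen only for illustration, and each method is treated in detail in the sections that follow.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

sample = np.random.randn(300)            # synthetic data, for illustration only
stats.probplot(sample, plot=plt)         # 1. Q-Q plot (visual judgment)
plt.show()
w, p_sw = stats.shapiro(sample)          # 2. Shapiro-Wilk test (small samples)
d, p_ks = stats.kstest(sample, 'norm')   # 3. Kolmogorov-Smirnov test (large samples)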

Q-Q plot

A Q-Q (Quantile-Quantile) plot compares the distributions of two groups by dividing each data set into quantiles and plotting, for the same fraction values, the quantiles of one group against those of the other; the quantiles are obtained from the inverse cumulative distribution. The inverse cumulative distribution is the inverse function of the cumulative distribution. For example, the cumulative and inverse cumulative distributions of the standard normal distribution can be represented as in Figure 1.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats 
from sympy import *
rng=np.linspace(-3, 3, 1000)
pdf=[stats.norm.pdf(i) for i in rng]        # probability density function
cdf=[stats.norm.cdf(i) for i in rng]        # cumulative distribution function
inverCdf=[stats.norm.ppf(i) for i in cdf]   # inverse CDF (quantile function)
plt.figure(figsize=(10, 4))
plt.subplot(1,2,1)
plt.title("(A) PDF & CDF", size=12, weight="bold")
plt.plot(rng, pdf, label="PDF")
plt.plot(rng, cdf, label="CDF")
plt.xlabel("X", size=12, weight="bold")
plt.ylabel("Probability", size=12, weight="bold")
plt.legend(loc="best")
plt.subplot(1,2,2)
plt.title("(B) Inverse CDF = I(X)", size=12, weight="bold")
plt.plot(cdf, inverCdf, label="Inverse CDF")
plt.xlabel("CDF", size=12, weight="bold")
plt.ylabel("Probability", size=12, weight="bold")
plt.legend(loc="best")
plt.show()
Figure 1. (A) The PDF and CDF, and (B) the inverse cumulative distribution of the standard normal distribution.

If two groups A and B have the same distribution, their inverse cumulative distributions (quantile functions) have the following linear relationship:

$$\begin{align}I_A(x) &= mI_B(x)+n\\ m,n &\in \mathbb{R} \end{align}$$

If B follows a normal distribution, then A also follows a normal distribution. Because this is a qualitative (visual) method, it cannot numerically express the degree of conformity, but it can be used to judge approximate normality.
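As a quick numerical check of this linear relationship, the sketch below (with arbitrarily chosen parameters) compares the quantile function of A ~ N(5, 2²) with that of the standard normal B ~ N(0, 1); the quantiles lie exactly on the line with slope 2 and intercept 5.

p=np.linspace(0.01, 0.99, 99)
qA=stats.norm.ppf(p, loc=5, scale=2)   # I_A(p) for A ~ N(5, 2^2)
qB=stats.norm.ppf(p)                   # I_B(p) for B ~ N(0, 1)
np.allclose(qA, 2*qB+5)
True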

The following is a Q-Q plot of uniform random numbers on the interval [0, 1). The Q-Q plot is created by the scipy.stats.probplot() function. This function returns the standardized (theoretical) quantiles and the ordered sample values, draws the plot, and also returns the slope and intercept of the fitted regression line together with the correlation coefficient r of the fit; the square of r is the coefficient of determination (R²), which lies in the range [0, 1]. In the resulting plot of the following code, this value is 1 if the points fall perfectly on the regression line and close to 0 if no linear relationship exists. Therefore, a quantitative judgment is possible from this result as well.

x=np.random.rand(1000)
val, para=stats.probplot(x, plot=plt)
print(f'standardized value: {np.around(val[0][:3], 3)}\nraw data: {np.around(val[1][:3], 3)}')
standardized value: [-3.198 -2.932 -2.784]
raw data: [0.    0.001 0.003]
print(f'slope of line: {round(para[0],3)} \nintercept: {round(para[1],3) } \ndeter.coeff.: {round(para[2],3)}')
slope of line: 0.29 
intercept: 0.499 
deter.coeff.: 0.974

Example 1)
  Perform a normality test by Q-Q plot on the daily variation rates of the NASDAQ index (na) and the S&P 500 volatility index VIX (vix) over a period of time.

import FinanceDataReader as fdr
st=pd.Timestamp(2020,6,1)
et=pd.Timestamp(2021, 12, 16)
naO=fdr.DataReader("IXIC",st, et)
vixO=fdr.DataReader("VIX", st, et)
na=(naO["Close"]-naO["Open"])/naO["Open"]*100
vix=(vixO["Close"]-vixO["Open"])/vixO["Open"]*100
na.tail(2)
Date
2021-12-15    2.198848
2021-12-16   -2.870927
dtype: float64
vix.tail(2)
Date
2021-12-15   -10.694444
2021-12-16    10.829741
dtype: float64
val, para=stats.probplot(na, plot=plt)
np.around(para, 3)
array([0.99 , 0.004, 0.983])
val2, para2=stats.probplot(vix, plot=plt)
np.around(para2,3)
array([ 7.109, -0.646,  0.931])

Shapiro-Wilk test

This test checks samples $x_1, x_2, \cdots, x_n$ for conformity with the normal distribution, testing the null hypothesis (H0: the data follow a normal distribution) with the following W statistic:

$$\begin{align}&W=\frac{\left(\sum^m_{i=1} a_i\left(x_{(n+1-i)}-x_{(i)}\right)\right)^2}{\sum^n_{i=1} (x_i-\bar{x})^2}\\ &x_{(i)}: i\text{-th smallest value of the sorted sample}\\ &n: \text{sample size} \end{align}$$

In the expression above, $a_i$ is the weight for the difference between the paired values. The weights are calculated from statistics such as the mean and standard deviation of the sorted data and can be looked up in the Shapiro-Wilk table. The number of weights, m, is determined from the sample size as follows. (In practice you can run the test with the function scipy.stats.shapiro(), so the full Shapiro-Wilk table is not reproduced here; in the following example only the part of the table that is needed is used to illustrate the process.)

$$m=\begin{cases}\frac{n}{2}& \text{for even } n\\\frac{n-1}{2} & \text{for odd } n \end{cases}$$

In the W statistic above, the index i of $x_{(i)}$ runs over the integers 1 to m. That is, W is the ratio of the squared weighted sum of differences between paired large and small values to the overall spread of the data. The W statistic for the Shapiro-Wilk test is calculated by the following process (a compact code sketch follows the list):

  1. Sort the data.
  2. Calculate SS: $$SS=\sum^n_{i=1} (x_i-\bar{x})^2$$
  3. Calculate b, the numerator of W: $$b=\sum^m_{i=1}a_i(x_{(n+1-i)}-x_{(i)})$$
  4. Calculate the test statistic: $$W=\frac{b^2}{SS}$$
  5. Determine the p-value based on the Shapiro-Wilk table.
    The Shapiro-Wilk table lists the weights ($a_i$) for each sample size and the W values corresponding to specific significance probabilities.
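Putting steps 1–4 together, a compact sketch of the W computation could look as follows; the weight vector a must still be taken from the Shapiro-Wilk table for the given n, and the function name is only for illustration.

def shapiro_W(x, a):
    # sketch of the Shapiro-Wilk W statistic for data x and table weights a (len(a) == m)
    xs=np.sort(np.asarray(x, dtype=float))          # 1. sort the data
    ss=np.sum((xs-xs.mean())**2)                    # 2. SS
    m=len(a)
    b=np.sum(np.asarray(a)*(xs[::-1][:m]-xs[:m]))   # 3. b = sum a_i (x_(n+1-i) - x_(i))
    return b**2/ss                                  # 4. W = b^2/SS

Calling shapiro_W(age, a) with the age data and the weights of the next example reproduces the W value computed step by step below.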

Example 2)
 Let's conduct a Shapiro-Wilk test to check whether the following age data follow a normal distribution.

age=np.array([65,61,63,86,70,55,74,35,72,68,45,58])
ageSort=np.sort(age)
da=pd.DataFrame([age, ageSort], index=['age', 'ageSort'])
da
          0   1   2   3   4   5   6   7   8   9  10  11
age      65  61  63  86  70  55  74  35  72  68  45  58
ageSort  35  45  55  58  61  63  65  68  70  72  74  86

The sample size is n = 12, so there are $m=\frac{12}{2}=6$ weights. Each weight is taken from the Shapiro-Wilk table as follows:

n=len(age)
m=np.array([n/2 if n%2==0 else (n-1)/2], dtype=np.int16)
m
array([6], dtype=int16)
ss=np.sum((ageSort-np.mean(ageSort))**2)
round(ss,3)
2008.667
a=np.array([0.5475,0.3325,0.2347,0.1586,0.0922,0.0303])   # weights a_i for n=12 from the Shapiro-Wilk table
age1=[]
for i in range(m[0]):
    age1.append(ageSort[n-(i+1)]-ageSort[i])   # x_(n+1-i) - x_(i)
age1 
[51, 29, 17, 12, 7, 2]
b=np.sum(a*age1)
b
44.1641
w=b**2/ss
round(w, 4)
0.971

For n = 12, the Shapiro-Wilk table gives the following pairs of significance probability and W value. The significance probability for the W calculated above can be estimated by linear interpolation between these values. First, determine the equation of the straight line through the two points; this can easily be done by hand, but here the solve() function of the sympy module is used:

tab=pd.DataFrame([[0.5, 0.943],[0.9, 0.973]], columns=['significant Prob.','W'])
tab
   significant Prob.      W
0                0.5  0.943
1                0.9  0.973
tab1=tab.values
slope=(tab1[0,0]-tab1[1,0])/(tab1[0,1]-tab1[1,1])
round(slope, 4)
13.333
x, bias=symbols('x bias')
eq=slope*x+bias
print(eq)
bias + 13.3333333333333*x
eq1=eq.subs(x,tab1[0,1])
eq1
bias+12.5733333333333
bias1=solve(Eq(eq1, tab1[0,0]), bias)
bias1
[-12.0733333333333]
linear=eq.subs(bias, bias1[0])
N(linear,4)
13.33x−12.07
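Substituting the W value computed above into this interpolation line gives the approximate significance probability, about 0.87:

pApprox=linear.subs(x, w)   # interpolated p-value at W ≈ 0.971
N(pApprox, 3)
0.874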

The above calculation process can be replaced by the function scipy.stats.shapiro(). This function returns the W statistic and the p-value.

statics, pVal=stats.shapiro(age)
print(f'statistics: {round(statics, 3)}, p-value: {round(pVal, 3)}')
statistics: 0.971, p-value: 0.922

The p-value of this result differs somewhat from the approximate p-value calculated above by interpolation. However, because both values are greater than the significance levels of 0.05 or 0.1, you cannot reject the null hypothesis that the age data follow a normal distribution. Comparing these results with a Q-Q plot analysis:

val, para=stats.probplot(age, plot=plt, rvalue=True)
print(f'reg.coeff.: {np.around(para[0], 3)} \nbias: {np.around(para[1], 3)}\ndeter.coeff.: {np.around(para[2], 3)}')
reg.coeff.: 14.264 
bias: 62.667
deter.coeff.: 0.979

Example 3)
 Let's run a Shapiro-Wilk test on the na and vix series used in Example 1.

statics, pVal=stats.shapiro(na.values)
print(f'statistics: {round(statics, 3)}, p-value: {round(pVal, 6)}')
statistics: 0.967, p-value: 0.0

At a significance level of 0.05, this result rejects the null hypothesis that the daily variation rates conform to a normal distribution, even though the Q-Q plot in Example 1 (r = 0.983 for na) suggested an approximately normal distribution.

Kolmogorov-Smirnov Test

The Kolmogorov-Smirnov (K-S) test is used to determine whether a sample comes from a population with a specific distribution. The K-S test is based on the empirical cumulative distribution function (ECDF): given N sorted data points Y1, Y2, …, YN, the ECDF is defined as follows:

$$\text{ECDF}=\frac{n(i)}{N}$$

In the expression above, n(i) is the number of points less than $Y_i$, where the $Y_i$ are sorted in ascending order. The ECDF is a step function that increases by $\displaystyle \frac{1}{N}$ at the value of each sorted data point.

Figure 2 shows the normal cumulative and empirical cumulative distribution functions for 100 standard normal random numbers. The K-S test is based on the maximum distance between these two curves.

np.random.seed(3)
da=np.sort(np.random.randn(100))
np.around(da[:3], 3)
array([-2.916, -2.419, -2.248])
ECDF=[]
N=len(da)
for i in range(N):
    ECDF.append((i+1)/N)
ECDF[:3]
[0.01, 0.02, 0.03]
NCDF=[stats.norm.cdf(i) for i in da]
np.around(NCDF[:3], 3)
array([0.002, 0.008, 0.012])
plt.figure(figsize=(6, 4))
plt.plot(da, ECDF, color="blue", label="ECDF")
plt.plot(da, NCDF, color="red", label="NCDF")
plt.xlabel("X", size="13")
plt.ylabel("Cumul. Prob.", size="13")
plt.legend(loc="best")
plt.title("100 random number", size="14")
plt.show()
Figure 2. Empirical cumulative distribution (ECDF) and normal cumulative distribution function (NCDF) for standard normal random numbers.

This test has several important limitations:

  • It applies only to continuous distributions.
  • It is more sensitive near the center of the distribution than at the tails.
  • The distribution must be fully specified. In other words, if location, scale, and shape parameters are estimated from the data, the critical region of the K-S test is no longer valid; it generally has to be determined by simulation.

The hypotheses for the K-S test are as follows:

  • H0: Data follow a specific distribution.
  • H1: Data do not follow a particular distribution.
$$\text{test statistic: } D=\max_{1 \le i \le N}\left(F(Y_i)-\frac{i-1}{N},\; \frac{i}{N}-F(Y_i) \right)$$

F is the theoretical cumulative distribution function of the continuous distribution being tested, and its location, scale, and shape parameters must not be estimated from the data. The test statistic D is compared with the critical value from the Kolmogorov table to decide whether to reject the null hypothesis at the chosen significance level. The critical value is determined by the sample size and significance level from the K-S table; if the sample size is greater than 50, it is calculated as follows:

α:        0.001      0.01       0.02       0.05       0.10       0.15       0.20
n > 50:   1.95/√n    1.63/√n    1.52/√n    1.36/√n    1.22/√n    1.14/√n    1.07/√n
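For reference, the numerators in this table can be reproduced (approximately) from the large-sample approximation $c(\alpha)=\sqrt{-\ln(\alpha/2)/2}$, so that the critical value is $c(\alpha)/\sqrt{n}$; this formula is used here only as an assumed check of the table entries.

for alpha in [0.001, 0.01, 0.02, 0.05, 0.1, 0.15, 0.2]:
    print(alpha, round(np.sqrt(-np.log(alpha/2)/2), 2))
0.001 1.95
0.01 1.63
0.02 1.52
0.05 1.36
0.1 1.22
0.15 1.14
0.2 1.07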

The comparison of test statistics D and critical value leads to the following conclusions:

$$ D > \text{critical value} \rightarrow \text{reject } H_0$$

Example 4)
K-S test for the na and vix series used in Example 1.

The analysis proceeds through the following steps:

  1. Standardize the data
  2. Sort in ascending order
  3. Calculate the standard normal cumulative and empirical cumulative distributions
  4. Calculate the test statistic
  5. Calculate the critical value

1) Nasdaq data

#standardization and sort of data
naN=(na-na.mean())/na.std()
naN2=np.sort(naN)  
np.around(naN2, 4)[:3]
array([-3.4007, -3.1598, -3.1574])
emCDF1=[]    # empirical CDF
N1=len(naN2)    # sample size
for i in range(N1): 
    emCDF1.append((i+1)/N1) 
np.around(emCDF1[:3],3)
array([0.003, 0.005, 0.008])
norCDF1=[stats.norm.cdf(i) for i in naN2]
np.around(norCDF1[:3],3)
array([0.   , 0.001, 0.001])
D1=np.max(abs(np.array(emCDF1)-np.array(norCDF1))) 
print(f'statistics: {round(D1,3)}')
statistics: 0.063
crit1=1.36/np.sqrt(N1)
print(f'critical value: {round(crit1,3)}')
critical value: 0.069

The test statistic D is less than the critical value; in other words, H0 cannot be rejected. The same analysis can be run with scipy.stats.kstest(), which returns the test statistic and the p-value. (The slightly larger statistic reported by kstest arises because it also considers the distances $F(Y_i)-\frac{i-1}{N}$, as in the definition of D above.) Since the p-value is above the significance level of 0.05, the same conclusion is reached.

statics1, pVal1=stats.kstest(naN2, 'norm')
print(f'statistics: {round(statics1, 3)}, p-value: {round(pVal1, 3)}')
statistics: 0.066, p-value: 0.065

2) vix data

vixN=(vix-vix.mean())/vix.std()
vixN2=np.sort(vixN) # sorting
np.around(vixN2, 4)[:3]
array([-2.8079, -2.4982, -2.3345])
emCDF=[]
N=len(vixN2)  # sample size
for i in range(N): 
    emCDF.append((i+1)/N) 
np.around(emCDF[:3],3)
array([0.003, 0.005, 0.008])
norCDF=[stats.norm.cdf(i) for i in vixN2]
np.around(norCDF[:3],3)
array([0.002, 0.006, 0.01 ])
D=np.max(abs(np.array(emCDF)-np.array(norCDF))) 
print(f'statistics: {round(D,3)}')
statistics: 0.1
crit=1.36/np.sqrt(N)
print(f'critical value: {round(crit,3)}')
critical value: 0.068
statics, pVal=stats.kstest(vixN2, 'norm')
print(f'statistics: {round(statics, 3)}, p-value: {round(pVal, 3)}')
statistics: 0.1, p-value: 0.001

Since D is greater than the critical value, the null hypothesis can be rejected. This agrees with the conclusion from the p-value returned by stats.kstest().

Figure 3 shows the standard normal and empirical cumulative distributions for na and vix.

plt.figure(figsize=(10, 4))
plt.subplot(1,2,1)
plt.plot(naN2, emCDF1, color="blue", label="emCDF")
plt.plot(naN2, norCDF1, color="red", label="norCDF")
plt.xlabel("X", size="13")
plt.ylabel("Cumul. Prob.", size="13")
plt.legend(loc="best")
plt.title("Nasdaq K-S Test", size="14")
plt.subplot(1,2,2)
plt.plot(vixN2, emCDF, color="blue", label="emCDF")
plt.plot(vixN2, norCDF, color="red", label="norCDF")
plt.xlabel("X", size="13")
plt.legend(loc="best")
plt.title("VIX K-S Test", size="14")
plt.show()
Figure 3. Empirical cumulative distribution and normal cumulative distribution of the daily variation rates of the NASDAQ index and VIX over the analysis period.

Generally, the Shapiro-Wilk test is conservative compared with other methods and is useful for small data sets, but it becomes sensitive to sample size once the sample grows beyond a certain size.
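A rough illustration of this size sensitivity is sketched below, assuming slightly heavy-tailed data drawn from a t-distribution with 10 degrees of freedom; the exact p-values depend on the random seed, but with a small sample the test typically fails to reject normality, while the same mild deviation is typically rejected once the sample is large.

np.random.seed(1)
small=stats.t.rvs(df=10, size=30)     # mild heavy tails, small sample
large=stats.t.rvs(df=10, size=3000)   # same distribution, large sample
_, pSmall=stats.shapiro(small)
_, pLarge=stats.shapiro(large)
print(round(pSmall, 3), round(pLarge, 6))   # pSmall typically > 0.05, pLarge typically < 0.05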
