
Uniform and Normal distribution

Contents

  1. Uniform Distribution
    1. Mean and Variance
  2. Normal (Gaussian) Distribution
    1. Probability Density Function (PDF)
    2. Cumulative Distribution Function (CDF)

Uniform Distribution

If the probability density function of a random variable X is constant over the range [a, b] as in Equation 1, then the variable is said to be uniformly distributed.

$$\begin{align}\tag{1}f(x)=\begin{cases}\frac{1}{b-a}& \quad a \lt x \lt b\\ 0&\quad \text{otherwise} \end{cases}\end{align}$$

A uniform distribution is expressed as

$$X \sim \text{Uniform(a, b)}$$
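As an aside, the same distribution can be represented numerically with the uniform class of scipy.stats, where loc=a and scale=b−a define the support [a, b]. The sketch below is only illustrative; the endpoints are arbitrary.

from scipy import stats

# X ~ Uniform(a, b): in scipy, loc=a and scale=b-a give the support [a, b]
a, b = 2, 8                                  # illustrative endpoints
X = stats.uniform(loc=a, scale=b - a)
print(X.pdf(5))                              # constant density 1/(b-a) = 0.1666...
print(X.cdf(7) - X.cdf(3))                   # P(3 < X < 7) = 4/6 = 0.6666...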

Example 1)
  If the variable X is uniformly distributed over the range [0, 10], calculate the following probabilities:

$$\begin{align}&f(x)=\frac{1}{10-0}=\frac{1}{10}\\&P(a \le X \le b)=\int^b_a \frac{1}{10} \, dx, \quad 0 \le a \lt b \le 10 \end{align}$$
import numpy as np
import pandas as pd 
from sympy import *
import matplotlib.pyplot as plt
from scipy import stats  
a, b, x=symbols("a b x")
f=Rational(1, 10)
f
$\quad \color{blue}{\displaystyle \frac{1}{10}}$

1) 2 $\le$ X $\le$ 9

F1=f.integrate((x, 2, 9))
F1
$\quad \color{blue}{\displaystyle \frac{7}{10}}$

2) 1 $\le$ X $\le$ 4

F1=f.integrate((x, 1, 4))
F1
$\quad \color{blue}{\displaystyle \frac{3}{10}}$

3) X $\ge$ 6

F1=f.integrate((x, 6, 10))
F1
$\quad \color{blue}{\displaystyle \frac{2}{5}}$

Example 2)
 Assume a bus leaves at 7:00 and stops at a specific stop exactly every 15 minutes. If a passenger arrives at the stop at a random time between 7:00 and 7:30, what is the probability that the waiting time is
1) less than 5 minutes?
2) at least 12 minutes?

1) Waiting time less than 5 minutes
Since buses arrive at 7:15 and 7:30, the waiting time is less than 5 minutes when the passenger arrives at the stop between 7:10 and 7:15 or between 7:25 and 7:30.

$$P(10 \lt x \lt 15)+P(25 \lt x \lt 30)$$
x=symbols("x")
f=Rational(1, 30)
f.integrate((x, 10, 15))+f.integrate((x, 25, 30))
$\quad \color{blue}{\displaystyle \frac{1}{3}}$

2) Waiting time of at least 12 minutes
The waiting time is at least 12 minutes when the passenger arrives between 7:00 and 7:03 or between 7:15 and 7:18.

f.integrate((x, 0, 3))+f.integrate((x, 15, 18))
$\quad \color{blue}{\displaystyle \frac{1}{5}}$
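The bus-stop probabilities above can also be checked with scipy.stats.uniform; this is a minimal sketch assuming the arrival time, measured in minutes after 7:00, is Uniform(0, 30).

from scipy import stats

T = stats.uniform(loc=0, scale=30)                 # arrival time in minutes after 7:00
# 1) waiting time < 5 min: arrival in (10, 15) or (25, 30)
p_less5 = (T.cdf(15) - T.cdf(10)) + (T.cdf(30) - T.cdf(25))
# 2) waiting time >= 12 min: arrival in (0, 3) or (15, 18)
p_atleast12 = (T.cdf(3) - T.cdf(0)) + (T.cdf(18) - T.cdf(15))
print(round(p_less5, 4), round(p_atleast12, 4))    # 0.3333 0.2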

Mean and Variance

The mean and variance are calculated as in Equation 2 by applying their respective definitions; they can also be obtained from the moment generating function.

$$\begin{align}\tag{2} E(X)&=\int^b_a x\frac{1}{b-a}\,dx\\&=\frac{1}{b-a}\frac{x^2}{2} \Big\vert ^b_a\\&=\frac{b+a}{2}\\ E(X^2)&=\int^b_a x^2\frac{1}{b-a}\,dx\\&=\frac{1}{b-a}\frac{x^3}{3} \Big\vert ^b_a\\&=\frac{b^3-a^3}{3(b-a)}\\&=\frac{a^2+ab+b^2}{3}\\ Var(X)&=E(X^2)-(E(X))^2\\&=\frac{a^2+ab+b^2}{3}-\left(\frac{b+a}{2}\right)^2\\&=\frac{(b-a)^2}{12} \end{align}$$
a, b, x, t=symbols("a b x t")
f=1/(b-a)
E=integrate(x*f, (x, a, b))
factor(E)
$\qquad \color{blue}{\displaystyle \frac{a + b}{2}}$
Var=integrate(x**2*f, (x, a, b))-E**2
factor(Var)
$\qquad \color{blue}{\displaystyle \frac{\left(a - b\right)^{2}}{12}}$

In the code above, the factor() function is used to present the expression produced by sympy in factored form.
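The closed forms in Equation 2 can also be cross-checked numerically; the following sketch compares the moments reported by scipy's stats.uniform with (a+b)/2 and (b−a)²/12 for arbitrarily chosen endpoints.

from scipy import stats

a, b = 3, 11                                          # arbitrary endpoints
m, v = stats.uniform(loc=a, scale=b - a).stats(moments="mv")
print(m, (a + b) / 2)                                 # 7.0 7.0
print(v, (b - a) ** 2 / 12)                           # 5.333... 5.333...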

Normal (Gaussian) Distribution

When large-scale data from various phenomena are examined, the probability is highest at the mean and decreases symmetrically on both sides of it, producing a bell shape. This distribution is called the normal distribution. In particular, the distribution of means of large samples is close to the normal distribution regardless of the underlying variable, so it is the central distribution for studying various characteristics of data.

Figure 1 shows a histogram (blue) of the frequencies of sample means of stock price data for a specific period, together with the theoretical normal distribution curve (red line) whose parameters are the mean and variance of those sample means.

Figure 1. Histogram of sample means with the corresponding theoretical normal distribution curve.

The theoretical normal distribution (red line) in Figure 1 is shown separately in Figure 2. The y-axis of this distribution represents the probability density (pdf); the highest point occurs at the mean, and the curve is symmetric about it. Since the total area under the curve is 1, the probability that the variable falls in a particular interval around the mean can be calculated, and conversely an interval can be found for a given probability. The mean, mode, and median of this distribution are all equal.

Figure 2. The relationship between probability and standard deviation in a normal distribution.

As shown in Figure 2, the normal distribution is a bell-shaped distribution symmetric about the mean. Probabilities can be assigned to intervals that extend multiples of the standard deviation σ from the mean μ, as verified numerically after the equations below.

$$\begin{align} &P(\mu-\sigma \lt X \lt \mu+\sigma) \approx 0.68\\&P(\mu-1.96\sigma \lt X \lt \mu+1.96\sigma) \approx 0.95\\ &P(\mu-2.58\sigma \lt X \lt \mu+2.58\sigma) \approx 0.99 \end{align}$$
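These coverage probabilities can be confirmed with the normal CDF; the sketch below evaluates P(μ − kσ < X < μ + kσ) for k = 1, 1.96, 2.58 (the result is the same for any mean and standard deviation).

from scipy import stats

mu, sigma = 0, 1                              # any values give the same coverages
for k in (1, 1.96, 2.58):
    p = stats.norm.cdf(mu + k*sigma, mu, sigma) - stats.norm.cdf(mu - k*sigma, mu, sigma)
    print(k, round(p, 4))                     # 0.6827, 0.95, 0.9901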

As a result, in a normal distribution the mean indicates the location of the center of the distribution and the variance indicates the degree of dispersion. As introduced in the sampling distribution, as the number of sample means increases, their distribution approaches a normal distribution. This phenomenon is called the Central Limit Theorem (CLT). The theorem provides a basis for statistical analysis of data whose distribution cannot be specified: as shown in Figure 3, the distribution of means of samples drawn from real data can, by the CLT, be assumed to be normal, so various inferential methods such as linear regression and ANOVA can be applied. A small simulation illustrating this is sketched after Figure 3.

Figure 3. Distribution of (a) raw data and (b) sample mean data.
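The following is a minimal simulation of the CLT described above: sample means drawn from a clearly non-normal (exponential) population cluster around the population mean with variance close to σ²/n. The population, sample size, and number of samples are arbitrary choices for illustration.

import numpy as np

gen = np.random.default_rng(0)
pop = gen.exponential(scale=2.0, size=100000)         # skewed, non-normal population
means = gen.choice(pop, size=(2000, 50)).mean(axis=1) # means of 2000 samples of size 50
print(round(means.mean(), 3), round(pop.mean(), 3))   # close to the population mean
print(round(means.var(), 3), round(pop.var()/50, 3))  # close to sigma^2 / n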

As shown in Figure 4, the overall shape of the normal distribution is determined by the mean and variance. In other words, the parameters of the normal distribution are the mean and variance.

Figure 4. Effect of (a) the mean and (b) the standard deviation on the shape of the normal distribution.

Probability Density Function (PDF)

A normal distribution of the random variable X with parameters mean (μ) and variance (σ²) is expressed as follows.

$$X \sim N(\mu, \sigma^2)$$

The probability density function (pdf) of this distribution is as Equation 3.

$$\begin{equation}\tag{3} f(x)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \quad -\infty \lt x\lt \infty \end{equation}$$

A variable that follows a normal distribution can be converted into a new variable Z with mean 0 and variance 1 by standardizing it as in Equation 4.

$$\begin{equation}\tag{4} Z=\frac{X-\mu_x}{\sigma_x} \end{equation}$$

All data following a normal distribution can be transformed in this way; in the resulting distribution, the mean and variance are fixed values, which is convenient for many statistical analyses. This distribution is called the standard normal distribution, is expressed as in Equation 5, and its PDF is simpler than the general normal PDF shown above.

$$\begin{align}\tag{5} &Z \sim N(0, 1)\\ & f(z) = \frac{1}{\sqrt{2 \pi}}\exp(-\frac{z^2}{2}), \quad z \in \mathbb{R} \end{align}$$

A random variable following a normal distribution can be standardized by using the zscore() function of the scipy.stats module. Standardization can also be applied by the StandardScaler() class from the sklearn.preprocessing module.
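The sketch below shows both standardization routes on a small random sample (the sample itself is arbitrary): scipy.stats.zscore works on a 1-D array, while sklearn's StandardScaler expects a 2-D array.

import numpy as np
from scipy import stats
from sklearn.preprocessing import StandardScaler

x = np.random.default_rng(1).normal(loc=5, scale=3, size=1000)   # arbitrary sample
z1 = stats.zscore(x)                                             # (x - mean) / std
z2 = StandardScaler().fit_transform(x.reshape(-1, 1)).ravel()
print(round(z1.mean(), 4), round(z1.std(), 4))                   # ~0.0 ~1.0
print(np.allclose(z1, z2))                                       # True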

The mean and variance of the above standard normal distribution are 0 and 1, respectively. These statistics can be verified using the definitions of expected value and variance.

z=symbols('z', real=True)
f=1/sqrt(2*pi)*exp(-z**2/2)            # standard normal PDF
E=integrate(z*f, (z, -oo, oo))         # E(Z)
E
0
Ex2=integrate(z**2*f, (z, -oo, oo))    # E(Z^2)
Ex2
1
var=Ex2-E**2                           # Var(Z) = E(Z^2) - E(Z)^2
var
1
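The same moments can be read directly from scipy's standard normal distribution as a quick numeric cross-check of the symbolic result above.

from scipy import stats

m, v = stats.norm.stats(moments="mv")   # standard normal: loc=0, scale=1 by default
print(m, v)                             # 0.0 1.0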

The pdf of the normal distribution can be calculated with the stats.norm.pdf(x, loc=0, scale=1) function. Here x is the value of the random variable, and loc and scale represent the mean and standard deviation, respectively. For the standard normal distribution, the mean and standard deviation are 0 and 1, which are the defaults for this function. The following plots the curve of the standard normal distribution, showing that it is symmetric with respect to the mean 0.

plt.figure(figsize=(8, 4))
rng=np.linspace(-5, 5, 100)
y=[stats.norm.pdf(i) for i in rng]     # standard normal PDF values
plt.plot(rng, y, label="N(0, 1)")
plt.xlabel("x", fontsize="15", fontweight="bold")
plt.ylabel("f(x)", fontsize="15", fontweight="bold")
plt.axvline(x=0, color="red", linestyle="--", label="mean")
plt.legend(loc="best")
plt.show()

Cumulative Distribution Function(CDF)

The cumulative distribution function is defined as the accumulated probability density over the specified range; for a continuous variable, it is calculated as in Equation 6. The CDF F(x) of the standard normal distribution is commonly written as Φ(x); a numerical check of Equation 6 appears right after it.

$$\begin{align}\tag{6} F(x)&=\Phi(x)\\&=P(Z\le x)\\&=\frac{1}{\sqrt{2\pi}}\int^x_{-\infty} \exp\left(-\frac{t^2}{2}\right)\,dt \end{align}$$
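Equation 6 can be verified by integrating the standard normal PDF numerically and comparing with stats.norm.cdf; the sketch below uses scipy.integrate.quad with x = 1.5 as an arbitrary evaluation point.

import numpy as np
from scipy import stats
from scipy.integrate import quad

x = 1.5                                                # arbitrary evaluation point
pdf = lambda t: np.exp(-t**2 / 2) / np.sqrt(2 * np.pi) # integrand of Equation 6
F_num, _ = quad(pdf, -np.inf, x)
print(round(F_num, 6), round(stats.norm.cdf(x), 6))    # both ~0.933193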

The CDF Φ(x) in Equation 6 can be expressed in terms of the error function (erf) as follows.

$$\Phi(x)=\frac{1}{2} \left(1+\text{erf}\left(\frac{x}{\sqrt{2}}\right)\right)$$

The cumulative distribution function must satisfy all of the following conditions.

  • $\displaystyle \lim_{x \to \infty}\Phi(x)=1, \; \lim_{x \to -\infty}\Phi(x)=0$
  • $\displaystyle \Phi(0) = \frac{1}{2}$
  • $\displaystyle \Phi(-x)=1-\Phi(x), \quad x \in \mathbb{R}$
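The erf identity and the properties listed above can be checked numerically; this minimal sketch uses scipy.special.erf and the standard normal CDF with an arbitrary test value x = 1.2.

import numpy as np
from scipy import stats
from scipy.special import erf

x = 1.2
phi_erf = 0.5 * (1 + erf(x / np.sqrt(2)))                     # Phi(x) via the erf identity
print(np.isclose(phi_erf, stats.norm.cdf(x)))                 # True
print(stats.norm.cdf(0))                                      # 0.5
print(np.isclose(stats.norm.cdf(-x), 1 - stats.norm.cdf(x)))  # True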

Since the standard normal distribution is obtained from a normal distribution by a simple rescaling of the data, the two have essentially the same shape. The PDF and CDF of a normal variable X can therefore be evaluated by substituting the standardized value z = (x − μ)/σ into the standard normal distribution.

Example 3)
 For the distribution $X \sim N(-10, 4)$ (variance 4, so σ = 2), calculate the following probabilities directly from the normal distribution and again from the transformed standard normal distribution.

  1. P(X < 0)
  2. P(-7 < X < 3)
  3. P(X > -3 | X > -5)

The following plot compares this normal distribution with its standardized form.

plt.figure(figsize=(8, 4))
rng=np.linspace(-25, 5, 1000)
p=[stats.norm.pdf(i, -10, np.sqrt(4)) for i in rng]   # scale is the standard deviation, sqrt of variance 4
plt.plot(rng, p, label="N(-10, 4)")
rngZ=(rng-(-10))/np.sqrt(4)
pZ=[stats.norm.pdf(i) for i in rngZ]
plt.plot(rngZ, pZ, label="N(0, 1)")
plt.title(r'Normal Distribution $\rightarrow$ Standard Normal Distribution', fontsize="15", fontweight="bold")
plt.xlabel("x", fontsize="15", fontweight="bold")
plt.ylabel("f(x)", fontsize="15", fontweight="bold")
plt.legend(loc="best")
plt.show()

1) $\displaystyle P(X \lt 0)=P\left(Z \lt \frac{0-(-10)}{2}\right)$

mu=-10
var=4
cf=stats.norm.cdf(0, mu, np.sqrt(var))
round(cf,4)
1.0
cfS=stats.norm.cdf((0-mu)/np.sqrt(var)) #Standard ND
round(cfS, 4)
1.0

2) $\displaystyle P(-7 \lt X \lt 3)=P(\frac{-7-(-10)}{2} \lt Z \lt \frac{3-(-10)}{2})$

mu=-10
var=4
cf=stats.norm.cdf(3, mu, np.sqrt(var))-stats.norm.cdf(-7, mu, np.sqrt(var))
round(cf, 4)
0.0668
cfS=stats.norm.cdf((3-mu)/np.sqrt(var))-stats.norm.cdf((-7-mu)/np.sqrt(var))
round(cfS, 4)
0.0668

3) $P(X \gt -3 \mid X \gt -5)$ is a conditional probability.

$$\begin{align} P(X \gt -3 | X \gt -5)&=\frac{P(X \gt -3, X \gt -5)}{P(X \gt -5)}\\ &=\frac{P(X \gt -3)}{P(X \gt -5)}\\&=\frac{1-P(X \le -3)}{1-P(X \le -5)}\end{align}$$

As in the expression above, the complement of the cumulative probability (1 − CDF) is called the survival function, and it can be computed with the sf() method of each distribution class in the scipy.stats module.

mu=-10
var=4
cf=(1-stats.norm.cdf(-3, mu, np.sqrt(var)))/(1-stats.norm.cdf(-5, mu, np.sqrt(var)))
round(cf, 4)
0.0375
survivalf=stats.norm.sf(-3, mu, np.sqrt(var))/stats.norm.sf(-5, mu, np.sqrt(var))
round(survivalf, 4)
0.0375
cfS=(1-stats.norm.cdf((-3-mu)/np.sqrt(var)))/(1-stats.norm.cdf((-5-mu)/np.sqrt(var)))
round(cfS, 4)
0.0375
survivalf=stats.norm.sf((-3-mu)/np.sqrt(var))/stats.norm.sf((-5-mu)/np.sqrt(var))
round(survivalf, 4)
0.0375

Converting a normal distribution to the standard normal distribution can be viewed as creating a new variable by a linear transformation of the raw data. In the same way, the mean and variance of a variable Y generated by the linear transformation aX + b of the random variable X can be calculated as follows (a numerical check is sketched after the equations).

$$\begin{align} &X \sim N(\mu_x, \sigma^2_x)\\ &Y=aX+b, \quad a, b \in \mathbb{R}\\ &\mu_y=a \mu_x+b\\ & \sigma^2_y=a^2\sigma^2_x\\ &Y \sim N(a\mu_x+b, a^2\sigma^2_x) \end{align}$$
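As noted above, a quick Monte Carlo sketch confirms the linear-transformation rule: the sample mean and variance of Y = aX + b match $a\mu_x+b$ and $a^2\sigma^2_x$. The parameter values are arbitrary.

import numpy as np

gen = np.random.default_rng(2)
mu_x, sigma_x, a, b = 1.0, 2.0, 3.0, -4.0          # arbitrary parameters
X = gen.normal(mu_x, sigma_x, size=200000)
Y = a * X + b
print(round(Y.mean(), 2), a * mu_x + b)            # ~ -1.0  -1.0
print(round(Y.var(), 2), a**2 * sigma_x**2)        # ~ 36.0  36.0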

Example 4)
The daily rates of change X and Y of the stock prices of two companies follow the normal distributions below.

$$X \sim N(0.18, 4.43), \quad Y \sim N(0.33, 11.5)$$

Assuming that X and Y are independent, calculate the following probabilities.

1) P(Z > 0.7) for the sum Z = X + Y of the two random variables.

$$\begin{align} Z&=X+Y\\ E(Z)&=E(X)+E(Y)\\&=0.18+0.33\\&=0.51\\ Var(Z)&=Var(X)+Var(Y)\\&=4.43+11.5\\&=15.93\\ Z &\sim N(0.51, 15.93) \end{align}$$
muz=0.18+0.33              # E(Z) = E(X) + E(Y)
varz=4.43+11.5             # Var(Z) = Var(X) + Var(Y) for independent X and Y
pthan07=stats.norm.sf(0.7, muz, np.sqrt(varz))
round(pthan07, 4)
0.481

2) What is the probability that Y − X is greater than 0.3? It is expressed as P(Y − X > 0.3) and is calculated for the new random variable $Z = Y - X$, whose mean is $\mu_y - \mu_x$ and whose variance is $\sigma^2_x + \sigma^2_y$, since the variances of independent variables add even for a difference.

mux=0.18; muy=0.33
varx=4.43; vary=11.5
# Var(Y-X) = Var(X) + Var(Y) for independent X and Y
pthan03=stats.norm.sf(0.3, muy-mux, np.sqrt(varx+vary))
round(pthan03, 4)
0.485
