Uniform and Normal distribution

Contents

  1. Uniform Distribution
    1. Mean and Variance
  2. Normal (Gaussian) Distribution
    1. Probability Density Function(PDF)
    2. Cumulative Distribution Function(CDF)

Uniform Distribution

If the probability density function of a random variable X is constant over the range [a, b] as in Equation 1, then the variable is said to be uniformly distributed.

$$\begin{align}\tag{1}f(x)=\begin{cases}\frac{1}{b-a}& \quad a \lt x \lt b\\ 0&\quad \text{otherwise} \end{cases}\end{align}$$

A uniform distribution is expressed as

$$X \sim \text{Uniform(a, b)}$$

Example 1)
  If the variable X is uniformly distributed in the range [0,10], calculate the following probabilities. Here a and b in the integral denote the endpoints of the interval of interest.

$$\begin{align}&f(x)=\frac{1}{10-0}\\&P(a \le X \le b)=\int^b_a \frac{1}{10-0} \, dx \quad 0 \le a \lt b \le 10 \end{align}$$
import numpy as np
import pandas as pd 
from sympy import *
import matplotlib.pyplot as plt
from scipy import stats  
a, b, x=symbols("a b x")
f=Rational(1, 10)
f
$\quad \color{blue}{\displaystyle \frac{1}{10}}$

1) 2 $\le$ X $\le$ 9

F1=f.integrate((x, 2, 9))
F1
$\quad \color{blue}{\displaystyle \frac{7}{10}}$

2) 1 $\le$ X $\le$ 4

F1=f.integrate((x, 1, 4))
F1
$\quad \color{blue}{\displaystyle \frac{3}{10}}$

3) X $\ge$ 6

F1=f.integrate((x, 6, 10))
F1
$\quad \color{blue}{\displaystyle \frac{2}{5}}$
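
The three probabilities of Example 1 can also be cross-checked numerically. The following is a minimal sketch, not part of the original calculation: it assumes scipy.stats.uniform, whose loc and scale arguments correspond to a and b-a, and the helper name interval_prob is introduced here only for illustration.

```python
from scipy import stats

# Uniform(a=0, b=10): scipy parameterizes it as loc=a, scale=b-a
U = stats.uniform(loc=0, scale=10)

def interval_prob(dist, lo, hi):
    """P(lo <= X <= hi) computed from the CDF (hypothetical helper)."""
    return dist.cdf(hi) - dist.cdf(lo)

p1 = interval_prob(U, 2, 9)    # theory: 7/10
p2 = interval_prob(U, 1, 4)    # theory: 3/10
p3 = U.sf(6)                   # P(X >= 6), theory: 2/5
print(round(p1, 4), round(p2, 4), round(p3, 4))
```

The survival function sf() returns 1 - cdf(), so it gives the upper-tail probability directly.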

Example 2)
 Assume that buses stop at a specific stop exactly every 15 minutes, starting at 7:00. If a passenger arrives at the stop at a time that is uniformly distributed between 7:00 and 7:30, what is the probability that the waiting time is
1) less than 5 minutes?
2) at least 12 minutes?

1) x $\lt$ 5 minutes
The waiting time is less than 5 minutes if the passenger arrives at the stop between 7:10 and 7:15 or between 7:25 and 7:30.

$$P(10 \lt x \lt 15)+P(25 \lt x \lt 30)$$
x=symbols("x")
f=Rational(1, 30)
f.integrate((x, 10, 15))+f.integrate((x, 25, 30))
$\quad \color{blue}{\displaystyle \frac{1}{3}}$

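2) x $\ge$ 12 minutes
The waiting time is at least 12 minutes if the passenger arrives at the stop between 7:00 and 7:03 or between 7:15 and 7:18.

$$P(0 \lt x \lt 3)+P(15 \lt x \lt 18)$$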

f.integrate((x, 0, 3))+f.integrate((x, 15, 18))
$\quad \color{blue}{\displaystyle \frac{1}{5}}$
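
The two answers of Example 2 can be checked by simulation. The sketch below is an illustration under stated assumptions: arrival times are drawn uniformly over the 30 minutes after 7:00, buses come at minutes 0, 15 and 30, and the seed and sample size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed, chosen arbitrarily
arrival = rng.uniform(0, 30, 200_000)     # arrival time in minutes after 7:00

# The wait is the time remaining until the next multiple of 15 minutes
wait = 15 - (arrival % 15)

p_less_5 = np.mean(wait < 5)              # theory: 1/3
p_atleast_12 = np.mean(wait >= 12)        # theory: 1/5
print(round(p_less_5, 3), round(p_atleast_12, 3))
```

The simulated proportions should be close to 1/3 and 1/5, with small deviations that shrink as the number of simulated passengers grows.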

Mean and Variance

The mean and variance are obtained as in Equation 2 by applying their respective definitions. They can also be derived from the moment generating function.

$$\begin{align}\tag{2} E(X)&=\int^b_a x\frac{1}{b-a}\,dx\\&=\frac{1}{b-a}\frac{x^2}{2} \large{\vert} ^b_a\\&=\frac{b+a}{2}\\ E(X^2)&=\int^b_a x^2\frac{1}{b-a}\,dx\\&=\frac{1}{b-a}\frac{x^3}{3} \large{\vert} ^b_a\\&=\frac{b^3-a^3}{3(b-a)}\\&=\frac{a^2+ab+b^2}{3}\\ Var(X)&=E(X^2)-(E(X))^2\\&=\frac{a^2+ab+b^2}{3}-\left(\frac{b+a}{2}\right)^2\\&=\frac{(b-a)^2}{12} \end{align}$$
a, b, x, t=symbols("a b x t")
f=1/(b-a)
E=integrate(x*f, (x, a, b))
factor(E)
$\qquad \color{blue}{\displaystyle \frac{a + b}{2}}$
Var=integrate(x**2*f, (x, a, b))-E**2
factor(Var)
$\qquad \color{blue}{\displaystyle \frac{\left(a - b\right)^{2}}{12}}$

In the code above, the function factor() is used to present the expression generated by sympy in factorized form. Since $(a-b)^2=(b-a)^2$, the result agrees with Equation 2.
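
The closed forms for the mean and variance can be compared with the values scipy reports for a concrete interval. This is a small sketch with arbitrarily chosen endpoints (a_val=2, b_val=8 are assumptions for illustration, not values from the text).

```python
from scipy import stats

a_val, b_val = 2.0, 8.0                     # arbitrary illustrative endpoints
U = stats.uniform(loc=a_val, scale=b_val - a_val)

mean_formula = (a_val + b_val) / 2          # E(X) = (a + b) / 2
var_formula = (b_val - a_val) ** 2 / 12     # Var(X) = (b - a)^2 / 12

print(U.mean(), mean_formula)               # both should be 5.0
print(U.var(), var_formula)                 # both should be 3.0
```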

Normal (Gaussian) Distribution

When large-scale data on various phenomena are examined, the density is often highest at the mean and decreases to the same degree on both sides of the mean, giving a bell-shaped curve. This distribution is called the normal distribution. In particular, the distribution of the means of large samples is close to the normal distribution regardless of the distribution of the underlying variable, so it is the central distribution for studying various characteristics of the data.

Figure 1 shows a bar graph (blue) of the frequency of sample means of stock price data for a specific period, together with the theoretical normal distribution curve (red line) whose parameters are the mean and variance of the sample means.

Figure 1. Histogram of the sample means of stock price data and the corresponding theoretical normal distribution curve.

The theoretical normal distribution (red line) in Figure 1 is shown separately in Figure 2. The y-axis represents the probability density (pdf), the highest point lies at the mean of the distribution, and both sides are symmetrical. Also, since the total area under the curve is 1, the probability of a particular interval of the variable can be calculated relative to the mean, and conversely an interval can be found from a given probability. The mean, mode, and median of this distribution are all equal.

Figure 2. The relationship between probability and standard deviation in a normal distribution.

As shown in Figure 2, the normal distribution is a symmetrical bell-shaped distribution with respect to the mean. This distribution can specify probabilities corresponding to multiples of the standard deviation σ from the mean μ.

$$\begin{align} &P(\mu - \sigma \le X \le \mu + \sigma) \approx 0.68\\&P(\mu - 1.96\sigma \le X \le \mu + 1.96\sigma) \approx 0.95\\ &P(\mu - 2.58\sigma \le X \le \mu + 2.58\sigma) \approx 0.99 \end{align}$$
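
These coverage probabilities do not depend on μ or σ, so they can be checked on the standard normal distribution. The sketch below uses stats.norm.cdf() and stats.norm.ppf(); the dictionary name coverage is introduced here only for illustration.

```python
from scipy import stats

coverage = {}
for k in (1.0, 1.96, 2.576):
    # P(mu - k*sigma <= X <= mu + k*sigma) equals P(-k <= Z <= k)
    coverage[k] = stats.norm.cdf(k) - stats.norm.cdf(-k)
    print(k, round(coverage[k], 4))

# Multiplier that leaves 0.5% in each tail, i.e. 99% in the middle
k99 = stats.norm.ppf(0.995)
print(round(k99, 3))
```

The last value shows that the 99% multiplier is about 2.576, usually rounded to 2.58.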

As a result, in a normal distribution, the mean indicates the location of the center of the distribution and the variance indicates the degree of dispersion. As introduced in the sampling distribution, as the sample size increases, the distribution of the sample mean approaches a normal distribution. This phenomenon is called the Central Limit Theorem (CLT). This theorem provides a basis for the statistical analysis of various data whose distribution cannot be specified. In other words, as shown in Figure 3, the distribution of the means of samples drawn from real data can be assumed to be normal by the CLT, so various inferential statistics such as linear regression and ANOVA can be performed.
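
The Central Limit Theorem can be illustrated with a short simulation. The sketch below is an assumption-laden example, not the stock data of the figures: it draws samples from a strongly skewed exponential distribution (mean 1, variance 1) and looks at the distribution of their means; the seed, sample size and number of samples are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)           # arbitrary fixed seed
n, reps = 50, 20_000                     # sample size and number of samples

# Exponential(1) raw data: skewed, far from normal
samples = rng.exponential(scale=1.0, size=(reps, n))
sample_means = samples.mean(axis=1)

# CLT: the sample means are approximately N(1, 1/n)
se_theory = 1 / np.sqrt(n)
print(round(sample_means.mean(), 3), round(sample_means.std(), 3), round(se_theory, 3))

# Share of sample means within 1.96 standard errors of the true mean
within = np.mean(np.abs(sample_means - 1.0) <= 1.96 * se_theory)
print(round(within, 3))                  # close to 0.95 if the normal approximation holds
```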

Figure 3. Distribution of (a) raw data and (b) sample mean data.

As shown in Figure 4, the overall shape of the normal distribution is determined by the mean and variance. In other words, the parameters of the normal distribution are the mean and variance.

Figure 4. Effect of (a) mean and (b) standard deviation on the shape of the normal distribution.

Probability Density Function(PDF)

A random variable X that follows the normal distribution with parameters mean (μ) and variance (σ²) is expressed as follows.

$$X \sim N(\mu, \sigma^2)$$

The probability density function (pdf) of this distribution is as Equation 3.

$$\begin{equation}\tag{3} f(x)=\frac{1}{\sqrt{2\pi \sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \quad -\infty \lt x\lt \infty \end{equation}$$

A variable that follows a normal distribution can be converted into a new variable Z with mean 0 and variance 1 by standardization, as in Equation 4.

\begin{equation}\tag{4} Z=\frac{X-\mu_x}{\sigma_x} \end{equation}

All data following a normal distribution can be transformed as above. In the resulting distribution the mean and variance are fixed at 0 and 1, which makes it convenient for various statistical analyses. This distribution is called the standard normal distribution and is expressed as in Equation 5. Its PDF is also simpler than the normal distribution PDF shown above.

$$\begin{align}\tag{5} &Z \sim N(0, 1)\\ & f(z) = \frac{1}{\sqrt{2 \pi}}\exp(-\frac{z^2}{2}), \quad z \in \mathbb{R} \end{align}$$

A random variable following a normal distribution can be standardized by using the zscore() function of the scipy.stats module. Standardization can also be applied by the StandardScaler() class from the sklearn.preprocessing module.
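
As a minimal sketch of this standardization, the example below applies stats.zscore() to a hypothetical sample (the data, seed and parameters are assumptions for illustration) and compares it with Equation 4 evaluated with the sample mean and standard deviation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)                  # arbitrary fixed seed
x = rng.normal(loc=-10, scale=2, size=1000)     # hypothetical sample

z_scipy = stats.zscore(x)                       # population std (ddof=0) by default
z_manual = (x - x.mean()) / x.std()             # Equation 4 with sample estimates

print(np.allclose(z_scipy, z_manual))           # True
print(round(abs(z_scipy.mean()), 6), round(z_scipy.std(), 6))
```

StandardScaler() also divides by the population standard deviation (ddof=0), so for a single column it should give the same values as zscore() with its default settings.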

The mean and variance of the above standard normal distribution are 0 and 1, respectively. These values can be verified using the definitions of expected value and variance.

z=symbols('z', real=True)
f=1/sqrt(2*pi)*exp(-z**2/2)
E=integrate(z*f, (z, -oo, oo))
E
0
Ex2=integrate(z**2*f, (z, -oo, oo))
Ex2
1
var=Ex2-E**2
var
1

The pdf of the normal distribution can be calculated by the stats.norm.pdf(x, loc=0, scale=1) method. In this method, x is the value of a random variable, and loc and scale represent the mean and standard deviation, respectively. For the standard normal distribution, the mean and standard deviation are 0 and 1, which is the default for this method. The following shows the curve of the standard normal distribution, indicating that it is symmetrical with respect to the mean 0.

plt.figure(figsize=(8, 4))
x=[stats.norm.pdf(i) for i in np.linspace(-5, 5, 100)]
plt.plot(np.linspace(-5, 5, 100), x, label="N(0, 1)")
plt.xlabel("x", fontsize="15", fontweight="bold")
plt.ylabel("f(x)", fontsize="15", fontweight="bold")
plt.axvline(x=0, color="red", linestyle="--", label="mean")
plt.legend(loc="best")
plt.show()
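
For a non-standard normal distribution, the loc and scale arguments must be passed explicitly. The following sketch writes the density out by hand and compares it with stats.norm.pdf(); the function name normal_pdf and the values of mu and sigma are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) written out by hand (hypothetical helper)."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

mu, sigma = -10.0, 2.0                               # illustrative values
xs = np.linspace(mu - 5 * sigma, mu + 5 * sigma, 2001)

by_hand = normal_pdf(xs, mu, sigma)
by_scipy = stats.norm.pdf(xs, loc=mu, scale=sigma)   # scale is sigma, not sigma^2
print(np.allclose(by_hand, by_scipy))                # True

# Area under the density over +-5 sigma, approximated by a Riemann sum
area = np.sum(by_hand) * (xs[1] - xs[0])
print(round(area, 4))
```

Passing the variance instead of the standard deviation as scale is a common mistake and silently produces a wider curve.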

Cumulative Distribution Function(CDF)

The cumulative distribution function is defined as the probability accumulated up to a specified value. Therefore, for a continuous variable, it is calculated by integrating the probability density function as in Equation 6. In general, the CDF, F(x), of the standard normal distribution is also written as Φ(x).

$$\begin{align}\tag{6} F(x)&=\Phi(x)\\&=P(Z\le x)\\&=\frac{1}{\sqrt{2\pi}}\int^x_{-\infty} \exp\left(-\frac{t^2}{2}\right)\,dt \end{align}$$

Φ(x) in Equation 6 has no elementary closed form, but it can be expressed with the error function (erf) as follows.

$$\Phi(x)=\frac{1}{2} \left(1+\text{erf}\left(\frac{x}{\sqrt{2}}\right)\right)$$

The cumulative distribution function must satisfy all of the following conditions.

  • $\displaystyle \lim_{x \to \infty}\Phi(x)=1, \; \lim_{x \to -\infty}\Phi(x)=0$
  • $\displaystyle \Phi(0) = \frac{1}{2}$
  • $\displaystyle \Phi(-x)=1-\Phi(x), \quad x \in \mathbb{R}$
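
The erf representation and the listed properties can be checked numerically. This is a small sketch using math.erf from the standard library; the function name phi is introduced here for illustration.

```python
import math
from scipy import stats

def phi(x):
    """Standard normal CDF expressed through the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

for x in (-2.0, 0.0, 0.5, 1.96):
    # Columns: x, phi(x), scipy's cdf, and phi(-x) + phi(x), which should be 1
    print(x, round(phi(x), 4), round(stats.norm.cdf(x), 4), round(phi(-x) + phi(x), 4))
```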

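A normal distribution differs from the standard normal distribution only by a shift and rescaling of the data and has essentially the same shape. Therefore, the CDF of a normally distributed variable X can be calculated by substituting $z=(x-\mu)/\sigma$ into the standard normal CDF, that is, $P(X \le x)=\Phi\left(\frac{x-\mu}{\sigma}\right)$, and the PDF of X is the standard normal PDF evaluated at z and divided by σ.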

Example 3)
 For the distribution $X \sim N(-10, 4)$, calculate the following probabilities directly from the normal distribution and also from the transformed standard normal distribution.

  1. P(X < 0)
  2. P(-7 < X < 3)
  3. P(X >-3 | X >-5)

The following code plots this normal distribution together with the corresponding standard normal distribution.

plt.figure(figsize=(8, 4))
rng=np.linspace(-25, 5, 1000)
p=[stats.norm.pdf(i, -10, np.sqrt(4)) for i in rng] # scale is the standard deviation, sqrt(4)=2
plt.plot(rng, p, label="N(-10, 4)")
rngZ=(rng-(-10))/np.sqrt(4)
pZ=[stats.norm.pdf(i) for i in rngZ]
plt.plot(rngZ, pZ, label="N(0, 1)")
plt.title(r'Normal Distribution $\rightarrow$ Standard Normal Distribution', fontsize="15", fontweight="bold")
plt.xlabel("x", fontsize="15", fontweight="bold")
plt.ylabel("f(x)", fontsize="15", fontweight="bold")
plt.legend(loc="best")
plt.show()

1) $\displaystyle P(X \lt 0)=P\left(Z \lt \frac{0-(-10)}{2}\right)$

mu=-10
var=4
cf=stats.norm.cdf(0, mu, np.sqrt(var))
round(cf,4)
1.0
cfS=stats.norm.cdf((0-mu)/np.sqrt(var)) #Standard ND
round(cfS, 4)
1.0

2) $\displaystyle P(-7 \lt X \lt 3)=P(\frac{-7-(-10)}{2} \lt Z \lt \frac{3-(-10)}{2})$

mu=-10
var=4
cf=stats.norm.cdf(3, mu, np.sqrt(var))-stats.norm.cdf(-7, mu, np.sqrt(var))
round(cf, 4)
0.0668
cfS=stats.norm.cdf((3-mu)/np.sqrt(var))-stats.norm.cdf((-7-mu)/np.sqrt(var))
round(cfS, 4)
0.0668

3) $P(X \gt -3 | X \gt -5)$ is a conditional probability.

$$\begin{align} P(X \gt -3 | X \gt -5)&=\frac{P(X \gt -3, X \gt -5)}{P(X \gt -5)}\\ &=\frac{P(X \gt -3)}{P(X \gt -5)}\\&=\frac{1-P(X \le -3)}{1-P(X \le -5)}\end{align}$$

As in the above expression, the probability obtained by subtracting the cumulative probability from 1 is called the survival probability, and it can be calculated by applying the sf() method of each distribution class in the scipy.stats module.

mu=-10
var=4
cf=(1-stats.norm.cdf(-3, mu, np.sqrt(var)))/(1-stats.norm.cdf(-5, mu, np.sqrt(var)))
round(cf, 4)
0.0375
survivalf=stats.norm.sf(-3, mu, np.sqrt(var))/stats.norm.sf(-5, mu, np.sqrt(var))
round(survivalf, 4)
0.0375
cfS=(1-stats.norm.cdf((-3-mu)/np.sqrt(var)))/(1-stats.norm.cdf((-5-mu)/np.sqrt(var)))
round(cfS, 4)
0.0375
survivalf=stats.norm.sf((-3-mu)/np.sqrt(var))/stats.norm.sf((-5-mu)/np.sqrt(var))
round(survivalf, 4)
0.0375

Converting a normal distribution to the standard normal distribution can be regarded as creating a new variable by a linear transformation of the raw data. In the same way, the mean and variance of a variable Y generated from the random variable X by the linear transformation aX+b can be calculated as follows.

$$\begin{align} &X \sim N(\mu_x, \sigma^2_x)\\ &Y=aX+b, \quad a, b \in \mathbb{R}\\ &\mu_y=a \mu_x+b\\ & \sigma^2_y=a^2\sigma^2_x\\ &Y \sim N(a\mu_x+b, a^2\sigma^2_x) \end{align}$$
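
The transformation rule can be checked by computing the same probability from Y and from X. The sketch below uses illustrative parameters and coefficients (all values are assumptions, with a > 0 so that the inequality direction is preserved).

```python
from scipy import stats

mu_x, sigma_x = 1.0, 2.0        # illustrative parameters of X
a, b = 3.0, -2.0                # illustrative coefficients, a > 0 here

mu_y = a * mu_x + b             # mean of Y = aX + b
sigma_y = abs(a) * sigma_x      # standard deviation of Y, since Var(Y) = a^2 Var(X)

y = 4.0
# For a > 0, P(Y <= y) = P(X <= (y - b) / a)
p_from_y = stats.norm.cdf(y, mu_y, sigma_y)
p_from_x = stats.norm.cdf((y - b) / a, mu_x, sigma_x)
print(round(p_from_y, 4), round(p_from_x, 4))
```

Standardization is the special case a = 1/σ and b = -μ/σ, which gives mean 0 and variance 1.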

Example 4)
The daily change rates X and Y of the stock prices of two companies are random variables that follow the normal distributions below.

$$X \sim N(0.18, 4.43), \quad Y \sim N(0.33, 11.5)$$

Under the assumption that they are independent of each other, calculate the following probabilities.

1) P(Z > 0.7) for the sum Z = X + Y of the two random variables

$$\begin{align} Z&=X+Y\\ E(Z)&=E(X)+E(Y)\\&=0.18+0.33\\&=0.51\\ Var(Z)&=Var(X)+Var(Y)\\&=4.43+11.5\\&=15.93\\ Z &\sim N(0.51, 15.93) \end{align}$$
muz=0.18+0.33
varz=4.43+11.5
pthan07=stats.norm.sf(0.7, muz, np.sqrt(varz))
round(pthan07, 4)
0.481

2) The probability that Y-X is greater than 0.3. It is expressed as P(Y-X > 0.3) and is calculated for a new random variable Z=Y-X. Because X and Y are independent, Var(Y-X)=Var(Y)+Var(X).

mux=0.18; muy=0.33
varx=4.43; vary=11.5
pthan03=stats.norm.sf(0.3, muy-mux, np.sqrt(vary+varx)) # variances add for a difference of independent variables
round(pthan03, 4)
0.485
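
For independent variables, the variance of a difference is the sum of the variances, just as for a sum. The simulation below is a sketch that checks this for Example 4; the seed and sample size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)                  # arbitrary fixed seed
n = 400_000
x = rng.normal(0.18, np.sqrt(4.43), n)          # scale is the standard deviation
y = rng.normal(0.33, np.sqrt(11.5), n)          # drawn independently of x

# Both variances should be near 4.43 + 11.5 = 15.93
print(round(np.var(x + y), 2), round(np.var(y - x), 2))

p_sum = np.mean(x + y > 0.7)                    # theory: about 0.481
p_diff = np.mean(y - x > 0.3)                   # theory: about 0.485
print(round(p_sum, 3), round(p_diff, 3))
```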
