Uniform Distribution
If the probability density function of a random variable X is constant over the range [a, b] as in Equation 1, then the variable is said to be uniformly distributed.
$$\begin{align}\tag{1}f(x)=\begin{cases}\frac{1}{b-a}& \quad a \lt x \lt b\\ 0&\quad \text{otherwise} \end{cases}\end{align}$$
A uniform distribution is expressed as
$$X \sim \text{Uniform}(a, b)$$
Example 1)
If the variable X is uniformly distributed in the range [0,10], try calculating the following probabilities:
import numpy as np
import pandas as pd
from sympy import *
import matplotlib.pyplot as plt
from scipy import stats
a, b, x=symbols("a b x")
f=Rational(1, 10)  # pdf of Uniform(0, 10): 1/(b-a)
f
$\quad \color{blue}{\displaystyle \frac{1}{10}}$
1) 2 $\le$ X $\le$ 9
F1=f.integrate((x, 2, 9))  # integrate the constant pdf over [2, 9]
F1
$\quad \color{blue}{\displaystyle \frac{7}{10}}$
2) 1 $\le$ X $\le$ 4
F1=f.integrate((x, 1, 4))
F1
$\quad \color{blue}{\displaystyle \frac{3}{10}}$
3) X $\ge$ 6
F1=f.integrate((x, 6, 10))  # the pdf is 0 above 10, so the upper limit is 10
F1
$\quad \color{blue}{\displaystyle \frac{2}{5}}$
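The three probabilities of Example 1 can also be cross-checked numerically. The following is a minimal sketch that uses `scipy.stats.uniform`, which parameterizes Uniform(a, b) by `loc=a` and `scale=b-a`; the names `low`, `high`, and `U` are chosen here only for illustration.

```python
from scipy import stats

# Uniform(a, b) in SciPy: loc is the lower bound, scale is the width b-a.
low, high = 0, 10
U = stats.uniform(loc=low, scale=high - low)

p1 = U.cdf(9) - U.cdf(2)   # P(2 <= X <= 9)
p2 = U.cdf(4) - U.cdf(1)   # P(1 <= X <= 4)
p3 = U.sf(6)               # P(X >= 6) = 1 - F(6)

print(round(p1, 4), round(p2, 4), round(p3, 4))
```

The values should agree with the fractions 7/10, 3/10, and 2/5 obtained by integration.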
Example 2)
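Assume that buses stop at a specific stop exactly every 15 minutes, starting at 7:00. If a passenger arrives at the stop at a time uniformly distributed between 7:00 and 7:30, what is the probability that the waiting time is
1) less than 5 minutes?
2) at least 12 minutes?
1) waiting time < 5 minutes
The passenger waits less than 5 minutes when arriving between 7:10 and 7:15 or between 7:25 and 7:30. In the code, x is the arrival time in minutes after 7:00.
x=symbols("x")
f=Rational(1, 30)  # pdf of the arrival time, uniform over 30 minutes
f.integrate((x, 10, 15))+f.integrate((x, 25, 30))
$\quad \color{blue}{\displaystyle \frac{1}{3}}$
2) waiting time $\ge$ 12 minutes
The passenger waits at least 12 minutes when arriving within 3 minutes after a bus has left, that is, between 7:00 and 7:03 or between 7:15 and 7:18.
f.integrate((x, 0, 3))+f.integrate((x, 15, 18))
$\quad \color{blue}{\displaystyle \frac{1}{5}}$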
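The two answers of Example 2 can be checked by simulation. The sketch below assumes buses at minutes 0, 15, and 30 after 7:00 and a uniformly distributed arrival time; the seed, the sample size, and the variable names are arbitrary choices made here.

```python
import numpy as np

gen = np.random.default_rng(0)               # fixed seed, chosen arbitrarily
arrival = gen.uniform(0, 30, 1_000_000)      # arrival time in minutes after 7:00

# Waiting time = minutes remaining until the next multiple of 15.
wait = (15 - arrival % 15) % 15

p_less_5 = np.mean(wait < 5)          # should be close to 1/3
p_at_least_12 = np.mean(wait >= 12)   # should be close to 1/5
print(p_less_5, p_at_least_12)
```

Because this is a Monte Carlo estimate, the printed values only approximate 1/3 and 1/5.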
Mean and Variance
The mean and variance follow from their definitions, as in Equation 2. They can also be derived from the moment generating function.
$$\begin{align}\tag{2} E(X)&=\int^b_a x\frac{1}{b-a}\,dx\\&=\frac{1}{b-a}\frac{x^2}{2} \large{\vert} ^b_a\\&=\frac{b+a}{2}\\ E(X^2)&=\int^b_a x^2\frac{1}{b-a}\,dx\\&=\frac{1}{b-a}\frac{x^3}{3} \large{\vert} ^b_a\\&=\frac{b^3-a^3}{3(b-a)}\\&=\frac{a^2+ab+b^2}{3}\\ Var(X)&=E(X^2)-(E(X))^2\\&=\frac{a^2+ab+b^2}{3}-\left(\frac{b+a}{2}\right)^2\\&=\frac{(b-a)^2}{12} \end{align}$$
a, b, x, t=symbols("a b x t")
f=1/(b-a)  # pdf of Uniform(a, b)
E=integrate(x*f, (x, a, b))  # E(X)
factor(E)
$\qquad \color{blue}{\displaystyle \frac{a + b}{2}}$
Var=integrate(x**2*f, (x, a, b))-E**2  # Var(X)=E(X^2)-(E(X))^2
factor(Var)
$\qquad \color{blue}{\displaystyle \frac{\left(a - b\right)^{2}}{12}}$
In the code above, the function factor() is used to display the expression generated by sympy in factorized form.
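The closed forms (a+b)/2 and (b-a)²/12 can be compared with the values reported by `scipy.stats.uniform` for concrete endpoints. This is only a numerical spot check; the endpoints 2 and 8 are arbitrary values picked for illustration.

```python
from scipy import stats

low, high = 2.0, 8.0                          # arbitrary illustrative endpoints a, b
U = stats.uniform(loc=low, scale=high - low)

mean_formula = (low + high) / 2               # (a+b)/2
var_formula = (high - low) ** 2 / 12          # (b-a)^2/12

print(U.mean(), mean_formula)
print(U.var(), var_formula)
```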
Normal (Gaussian) Distribution
When large data sets for many phenomena are examined, the observations are often most frequent near the mean, and the frequency falls off symmetrically on both sides, giving a bell shape. This distribution is called the normal distribution. In particular, the distribution of the means of sufficiently large samples is close to the normal distribution under fairly general conditions, whatever the distribution of the underlying variable, so it is the central distribution for studying the characteristics of data.
Figure 1 shows a histogram (blue) of the sample means of stock price data for a specific period, together with the theoretical normal distribution curve (red line) whose parameters are the mean and variance of those sample means.
The theoretical normal distribution (red line) in Figure 1 is shown separately in Figure 2. The y-axis of this normal distribution represents the probability density (pdf); the curve peaks at the mean and is symmetric about it. Since the total area under the curve is 1, the probability of any interval of the variable can be calculated as the area over that interval, and conversely the interval corresponding to a given probability can be found. The mean, mode, and median of this distribution are all equal.
As shown in Figure 2, the normal distribution is a symmetrical bell-shaped distribution with respect to the mean. This distribution can specify probabilities corresponding to multiples of the standard deviation σ from the mean μ.
$$\begin{align} &P(\mu - \sigma \le X \le \mu + \sigma) \approx 0.68\\&P(\mu - 1.96\sigma \le X \le \mu + 1.96\sigma) \approx 0.95\\ &P(\mu - 2.58\sigma \le X \le \mu + 2.58\sigma) \approx 0.99 \end{align}$$
As a result, in a normal distribution, the mean indicates the location of the center of the distribution and the variance indicates the degree of dispersion. As introduced in the sampling distribution, as the sample size increases, the distribution of the sample mean approaches a normal distribution. This phenomenon is called the Central Limit Theorem (CLT). This theorem provides a basis for statistical analysis of data whose distribution cannot be specified. In other words, as shown in Figure 3, the distribution of the means of samples drawn from real data can be assumed to be normal by the CLT, so inferential methods such as linear regression and ANOVA can be applied.
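The coverage probabilities for multiples of σ and the idea behind the CLT can both be explored numerically. The sketch below first evaluates Φ(k)-Φ(-k) with `stats.norm.cdf` and recovers the two-sided 95% and 99% multipliers with `stats.norm.ppf`, then simulates sample means from a skewed population. The exponential population, the sample size n=50, the number of repetitions, and the seed are all assumptions made for this illustration.

```python
import numpy as np
from scipy import stats

# Probability within k standard deviations of the mean: Phi(k) - Phi(-k).
for k in (1, 1.96, 2.58):
    print(k, round(stats.norm.cdf(k) - stats.norm.cdf(-k), 4))

# Multipliers that give two-sided 95% and 99% coverage.
z95 = stats.norm.ppf(0.975)
z99 = stats.norm.ppf(0.995)
print(round(z95, 3), round(z99, 3))

# CLT illustration: means of samples from a skewed Exponential(1) population.
gen = np.random.default_rng(0)
n, reps = 50, 20_000
sample_means = gen.exponential(scale=1.0, size=(reps, n)).mean(axis=1)

# Exponential(1) has mean 1 and variance 1, so the CLT suggests the
# sample means are roughly N(1, 1/n).
within = np.mean(np.abs(sample_means - 1.0) <= 1.96 / np.sqrt(n))
print(round(within, 3))   # expected to be near 0.95
```

Even though the population is far from normal, the share of sample means within 1.96 standard errors of the population mean should come out near 0.95.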
As shown in Figure 4, the overall shape of the normal distribution is determined by the mean and variance. In other words, the parameters of the normal distribution are the mean and variance.
Probability Density Function (PDF)
A normal distribution of the random variable X with the parameters mean (μ) and variance (σ²) is expressed as follows.
$$X \sim N(\mu, \sigma^2)$$
The probability density function (pdf) of this distribution is given by Equation 3.
$$\begin{equation}\tag{3} f(x)=\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right), \quad -\infty \lt x\lt \infty \end{equation}$$
A variable that follows a normal distribution can be converted into a new variable Z with mean 0 and variance 1 by standardizing it as in Equation 4.
$$\begin{equation}\tag{4} Z=\frac{X-\mu_x}{\sigma_x} \end{equation}$$
Any data following a normal distribution can be transformed in this way. The resulting distribution has a fixed mean and variance and is used in various statistical analyses. It is called the standard normal distribution and is expressed as Equation 5; its PDF is also simpler than the general normal PDF in Equation 3.
$$\begin{align}\tag{5} &Z \sim N(0, 1)\\ & f(z) = \frac{1}{\sqrt{2 \pi}}\exp\left(-\frac{z^2}{2}\right), \quad z \in \mathbb{R} \end{align}$$
A random variable following a normal distribution can be standardized with the zscore() function of the scipy.stats module. Standardization can also be applied with the StandardScaler() class from the sklearn.preprocessing module.
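As a rough illustration of standardization, the sketch below applies `stats.zscore` to simulated data and compares it with Z=(X-mean)/std computed by hand. Note that `zscore` uses the mean and standard deviation estimated from the data (with `ddof=0` by default), not the population parameters. The sample itself, its parameters, and the seed are made up for this example.

```python
import numpy as np
from scipy import stats

gen = np.random.default_rng(0)
data = gen.normal(loc=-10, scale=2, size=1000)   # illustrative sample

z_scipy = stats.zscore(data)                     # default ddof=0
z_manual = (data - data.mean()) / data.std()     # (X - mean)/std with sample estimates

print(np.allclose(z_scipy, z_manual))
print(z_scipy.mean(), z_scipy.std())             # approximately 0 and 1
```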
The mean and variance of the standard normal distribution are 0 and 1, respectively. These values can be checked using the definitions of expected value and variance.
z=symbols('z', real=True)
f=1/sqrt(2*pi)*exp(-z**2/2)  # pdf of N(0, 1), Equation 5
E=integrate(z*f, (z, -oo, oo))  # E(Z)
E
0
Ex2=integrate(z**2*f, (z, -oo, oo))  # E(Z^2)
Ex2
1
var=Ex2-E**2  # Var(Z)=E(Z^2)-(E(Z))^2
var
1
The pdf of the normal distribution can be calculated with the stats.norm.pdf(x, loc=0, scale=1) method. Here, x is the value of the random variable, and loc and scale are the mean and standard deviation, respectively. For the standard normal distribution, the mean and standard deviation are 0 and 1, which are the defaults of this method.
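To make the role of `loc` and `scale` concrete, the sketch below writes out the normal density $\frac{1}{\sigma\sqrt{2\pi}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$ by hand and compares it with `stats.norm.pdf`. The helper `normal_pdf` and the parameter values are hypothetical names and numbers introduced only for this comparison.

```python
import numpy as np
from scipy import stats

def normal_pdf(x, m, s):
    """Normal density written out by hand; s is the standard deviation."""
    return np.exp(-(x - m) ** 2 / (2 * s ** 2)) / (s * np.sqrt(2 * np.pi))

xs = np.linspace(-5, 5, 11)
m, s = 1.0, 2.0   # arbitrary illustrative mean and standard deviation

same_general = np.allclose(normal_pdf(xs, m, s), stats.norm.pdf(xs, loc=m, scale=s))
same_standard = np.allclose(normal_pdf(xs, 0, 1), stats.norm.pdf(xs))  # defaults loc=0, scale=1
print(same_general, same_standard)
```

Passing the variance instead of the standard deviation as `scale` is an easy mistake to make, and a comparison like this one would expose it.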
The following shows the curve of the standard normal distribution, indicating that it is symmetrical with respect to the mean 0.
plt.figure(figsize=(8, 4))
xs=np.linspace(-5, 5, 100)
plt.plot(xs, stats.norm.pdf(xs), label="N(0, 1)")  # pdf() accepts arrays
plt.xlabel("x", fontsize=15, fontweight="bold")
plt.ylabel("f(x)", fontsize=15, fontweight="bold")
plt.axvline(x=0, color="red", linestyle="--", label="mean")
plt.legend(loc="best")
plt.show()
Cumulative Distribution Function(CDF)
The cumulative distribution function is defined as the probability accumulated up to a given value. For a continuous variable, it is therefore the integral of the probability density function from $-\infty$ to x, as in Equation 6. The CDF, F(x), of the standard normal distribution is usually written as Φ(x).
$$\begin{align}\tag{6} F(x)&=\Phi(x)\\&=P(Z\le x)\\&=\frac{1}{\sqrt{2\pi}}\int^x_{-\infty} \exp\left(-\frac{t^2}{2}\right)\,dt \end{align}$$
Φ(x) in Equation 6 can be expressed with the error function (erf) as follows.
$$\Phi(x)=\frac{1}{2} \left(1+\text{erf}\left(\frac{x}{\sqrt{2}}\right)\right)$$
The cumulative distribution function Φ satisfies all of the following conditions.
- $\displaystyle \lim_{x \to \infty}\Phi(x)=1, \; \lim_{x \to -\infty}\Phi(x)=0$
- $\displaystyle \Phi(0) = \frac{1}{2}$
- $\displaystyle \Phi(-x)=1-\Phi(x), \quad x \in \mathbb{R}$
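The erf expression for Φ(x) and the listed properties can be verified with a few lines of code. The sketch below defines a helper `phi` (a name chosen here) with `math.erf` and compares it with `stats.norm.cdf`; the test points are arbitrary.

```python
import math
from scipy import stats

def phi(x):
    """Standard normal CDF written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

for x0 in (-2.0, 0.0, 1.5):
    print(x0, round(phi(x0), 6), round(stats.norm.cdf(x0), 6))

print(phi(0.0))                                # Phi(0) = 1/2
print(math.isclose(phi(-1.5), 1 - phi(1.5)))   # Phi(-x) = 1 - Phi(x)
print(phi(10.0), phi(-10.0))                   # approaching the limits 1 and 0
```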
The normal distribution and the standard normal distribution differ only in the location and scale of the data and have essentially the same shape. Therefore, probabilities for a normal variable X can be calculated from the standard normal distribution by replacing x with z=(x-μ)/σ; for the PDF, the standard normal density must additionally be divided by σ.
Example 3)
For $X \sim N(-10, 4)$, calculate the following probabilities both directly from this normal distribution and from the transformed standard normal distribution.
- P(X < 0)
- P(-7 < X < 3)
- P(X > -3 | X > -5)
The following plot compares this normal distribution with the corresponding standard normal distribution.
plt.figure(figsize=(8, 4))
rng=np.linspace(-25, 5, 1000)
p=stats.norm.pdf(rng, -10, np.sqrt(4))  # scale is the standard deviation, sqrt(4)=2
plt.plot(rng, p, label="N(-10, 4)")
rngZ=(rng-(-10))/np.sqrt(4)  # standardize the x values
pZ=stats.norm.pdf(rngZ)
plt.plot(rngZ, pZ, label="N(0, 1)")
plt.title(r'Normal Distribution $\rightarrow$ Standard Normal Distribution', fontsize=15, fontweight="bold")
plt.xlabel("x", fontsize=15, fontweight="bold")
plt.ylabel("f(x)", fontsize=15, fontweight="bold")
plt.legend(loc="best")
plt.show()
1) $\displaystyle P(X \lt 0)=P\left(Z \lt \frac{0-(-10)}{2}\right)$
mu=-10
var=4
cf=stats.norm.cdf(0, mu, np.sqrt(var))  # directly from N(-10, 4)
round(cf,4)
1.0
cfS=stats.norm.cdf((0-mu)/np.sqrt(var)) #Standard ND
round(cfS, 4)
1.0
2) $\displaystyle P(-7 \lt X \lt 3)=P(\frac{-7-(-10)}{2} \lt Z \lt \frac{3-(-10)}{2})$
mu=-10
var=4
cf=stats.norm.cdf(3, mu, np.sqrt(var))-stats.norm.cdf(-7, mu, np.sqrt(var))  # directly from N(-10, 4)
round(cf, 4)
0.0668
cfS=stats.norm.cdf((3-mu)/np.sqrt(var))-stats.norm.cdf((-7-mu)/np.sqrt(var))  # standardized, N(0, 1)
round(cfS, 4)
0.0668
3) $P(X \gt -3 | X \gt -5)$ is a conditional probability.
$$\begin{align} P(X \gt -3 | X \gt -5)&=\frac{P(X \gt -3, X \gt -5)}{P(X \gt -5)}\\ &=\frac{P(X \gt -3)}{P(X \gt -5)}\\&=\frac{1-P(X \le -3)}{1-P(X \le -5)}\end{align}$$
As in the above expression, one minus the cumulative probability, 1-F(x)=P(X > x), is called the survival probability. It can be calculated with the sf() method of each distribution class in the scipy.stats module.
mu=-10
var=4
# P(X > -3)/P(X > -5) from the cdf of N(-10, 4)
cf=(1-stats.norm.cdf(-3, mu, np.sqrt(var)))/(1-stats.norm.cdf(-5, mu, np.sqrt(var)))
round(cf, 4)
0.0375
# the same ratio with the survival function, sf()=1-cdf()
survivalf=stats.norm.sf(-3, mu, np.sqrt(var))/stats.norm.sf(-5, mu, np.sqrt(var))
round(survivalf, 4)
0.0375
# the same calculations with standardized values and N(0, 1)
cfS=(1-stats.norm.cdf((-3-mu)/np.sqrt(var)))/(1-stats.norm.cdf((-5-mu)/np.sqrt(var)))
round(cfS, 4)
0.0375
survivalf=stats.norm.sf((-3-mu)/np.sqrt(var))/stats.norm.sf((-5-mu)/np.sqrt(var))
round(survivalf, 4)
0.0375
Converting the normal distribution to the standard normal distribution can be regarded as creating a new variable by a linear transformation of the raw data. In the same way, the mean and variance of another variable Y=aX+b generated by a linear transformation of the random variable X can be calculated as follows.
$$\begin{align} E(Y)&=aE(X)+b\\ Var(Y)&=a^2Var(X) \end{align}$$
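The rules E(aX+b)=aE(X)+b and Var(aX+b)=a²Var(X) can be illustrated by simulation. The sketch below writes the standardization of X ~ N(-10, 4) as a linear transformation with a=1/σ and b=-μ/σ; the seed, the sample size, and the variable names are assumptions made for this example.

```python
import numpy as np

gen = np.random.default_rng(0)
mu_x, sigma_x = -10.0, 2.0                       # X ~ N(-10, 4)
a_coef, b_coef = 1 / sigma_x, -mu_x / sigma_x    # standardization written as aX+b

x_sample = gen.normal(mu_x, sigma_x, 1_000_000)
y_sample = a_coef * x_sample + b_coef

mean_rule = a_coef * mu_x + b_coef               # aE(X)+b
var_rule = a_coef ** 2 * sigma_x ** 2            # a^2 Var(X)

print(y_sample.mean(), mean_rule)
print(y_sample.var(), var_rule)
```

The simulated mean and variance should be close to 0 and 1, the parameters of the standard normal distribution.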
Example 4)
The daily change rates X and Y of the stock prices of two companies are random variables that follow the normal distributions below.
$$X \sim N(0.18, 4.43), \quad Y \sim N(0.33, 11.5)$$
Under the assumption that they are independent of each other, calculate the following probabilities.
1) P(Z > 0.7) for the sum Z=X+Y of the two random variables
$$\begin{align} Z&=X+Y\\ E(Z)&=E(X)+E(Y)\\&=0.18+0.33\\&=0.51\\ Var(Z)&=Var(X)+Var(Y)\\&=4.43+11.5\\&=15.93\\ Z &\sim N(0.51, 15.93) \end{align}$$
muz=0.18+0.33  # E(Z)
varz=4.43+11.5  # Var(Z), valid because X and Y are independent
pthan07=stats.norm.sf(0.7, muz, np.sqrt(varz))  # P(Z > 0.7)
round(pthan07, 4)
0.481
2) The probability that Y-X is greater than 0.3. It is expressed as P(Y-X > 0.3) and is calculated for a new random variable Z=Y-X. Because X and Y are independent, the variance of the difference is the sum of the variances, Var(Y-X)=Var(Y)+Var(X).
mux=0.18; muy=0.33
varx=4.43; vary=11.5
pthan03=stats.norm.sf(0.3, muy-mux, np.sqrt(vary+varx))  # Z=Y-X ~ N(0.15, 15.93)
round(pthan03, 4)
0.485
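For independent X and Y, the variance of the difference Y-X is Var(Y)+Var(X), not Var(Y)-Var(X). A simulation makes this easy to see. The sketch below draws independent samples with the means and variances used in Example 4 and compares the simulated P(Y-X > 0.3) with the value from `stats.norm.sf`; the seed and the sample size are arbitrary choices.

```python
import numpy as np
from scipy import stats

gen = np.random.default_rng(0)
n = 1_000_000
x_sim = gen.normal(0.18, np.sqrt(4.43), n)   # X ~ N(0.18, 4.43)
y_sim = gen.normal(0.33, np.sqrt(11.5), n)   # Y ~ N(0.33, 11.5), independent of X

diff = y_sim - x_sim
p_sim = np.mean(diff > 0.3)                                       # simulated P(Y-X > 0.3)
p_exact = stats.norm.sf(0.3, 0.33 - 0.18, np.sqrt(11.5 + 4.43))   # from N(0.15, 15.93)

print(round(diff.var(), 2))                  # close to 4.43 + 11.5 = 15.93
print(round(p_sim, 3), round(p_exact, 3))
```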