Contents
- Geometric distribution
- Negative Binomial Distribution
- Hypergeometric distribution
- Poisson distribution
Discrete probability distribution II: Geometric, Negative Binomial, Hypergeometric, and Poisson
Geometric distribution
The distribution of the number of Bernoulli trials needed to obtain the first success is called the geometric distribution. For example, with the random variable X defined as the trial on which the first success occurs, each trial being a Bernoulli trial with success probability p, the probability mass function develops as follows.
$$\begin{align}&R_x=\{1,\, 2,\, \cdots \}\\&f(1)=P(X=1)=p\\ &f(2)=P(X=2)=(1-p)p \\& \qquad \vdots\end{align}$$
Generalizing the above results, the probability mass function of the geometric distribution can be formulated as Equation 1.
$$\begin{equation}\tag{1} f(x)=P(X=x)=(1-p)^{x-1}p \end{equation}$$
As Equation 1 shows, the probability mass function depends only on the parameter p. Therefore, it is called a geometric distribution with parameter p and is expressed as follows.
$$X \sim \text{Geometric}(p)$$
import numpy as np
import pandas as pd
from sympy import *
from scipy import stats
import matplotlib.pyplot as plt
plt.figure(figsize=(6, 4))
for i in [0.1, 0.3, 0.5, 0.7]:
    p=[stats.geom.pmf(j, i) for j in range(10)]
    plt.plot(range(10), p, label='Geometric('+str(i)+')')
plt.xlabel("x", size=13, weight="bold")
plt.ylabel("PMF", size=13, weight="bold")
plt.legend(loc='best')
plt.show()
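As a quick cross-check that Equation 1 agrees with scipy's parameterization of geom (a minimal sketch; the values p = 0.3 and x = 4 are chosen arbitrarily for illustration):

# verify Equation 1 against scipy.stats.geom for one arbitrary point
p0, x0 = 0.3, 4
manual = (1 - p0)**(x0 - 1) * p0            # Equation 1
print(round(manual, 6))                     # 0.1029
print(round(stats.geom.pmf(x0, p0), 6))     # should match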
Let's try to derive the expected value and variance of the X ∼ Geometric(p) distribution from the moment generating function (MGF).
$$\begin{align}\tag{2} M_x(t)&=E(e^{tX})\\&=\sum^\infty _{x=1}e^{tx}(1-p)^{x-1}p\\&=\frac{p}{q}\sum^\infty _{x=1}\left(qe^t\right)^x\\&=\frac{p}{q} \cdot \frac{qe^t}{1-qe^t}\\&=\frac{pe^t}{1-qe^t}, \quad qe^t<1,\; q=1-p \end{align}$$
In Equation 2, the geometric series is summed as follows.
$$\begin{aligned}&S\quad =qe^t+\left(qe^t\right)^2+\left(qe^t\right)^3+\cdots\\ &\underline{qe^tS=\left(qe^t\right)^2+\left(qe^t\right)^3+\left(qe^t\right)^4+\cdots}\\ &S-qe^tS=qe^t\\&\therefore S=\frac{qe^t}{1-qe^t}\end{aligned}$$
From the above MGF, the expected value and variance are respectively calculated as in Equation 3.
$$\begin{align}\tag{3} &E(X)=M_x^\prime(0)=\frac{1}{p}\\ &Var(X)=M_x^{\prime\prime}(0)-\left(M_x^\prime(0) \right)^2=\frac{q}{p^2} \end{align}$$
x, t, p, q=symbols('x t p q')
M=(p*exp(t))/(1-q*exp(t))
M
$\quad \color{blue}{\displaystyle \frac{p e^{t}}{- q e^{t} + 1}}$
Md1=M.diff(t)
Md1=simplify(Md1)
Md1
$\quad \color{blue}{\displaystyle \frac{p e^{t}}{\left(q e^{t} - 1\right)^{2}}}$
E=together(Md1.subs(t, 0)).subs(1-q, p)
E  # expected value
$\quad \color{blue}{\displaystyle \frac{1}{p}}$
Md2=diff(Md1, t)
Md2=Md2.subs(1-q, p)
Md2=simplify(Md2)
Md2
$\quad \color{blue}{\displaystyle - \frac{p \left(q e^{t} + 1\right) e^{t}}{\left(q e^{t} - 1\right)^{3}}}$
Md20=Md2.subs(t, 0).subs(1-q, p)
Md20
$\quad \color{blue}{\displaystyle \frac{2 - p}{p^{2}}}$
var=simplify(Md20-E**2)
var  # variance
$\quad \color{blue}{\displaystyle \frac{1 - p}{p^{2}}}$
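These results match Equation 3: E(X) = 1/p and Var(X) = q/p². As a numerical cross-check (a minimal sketch, taking p = 0.25 for illustration), scipy's geom.stats should return the same values:

# mean and variance of Geometric(0.25): expect 1/p = 4 and q/p**2 = 12
m, v = stats.geom.stats(0.25, moments='mv')
print(m, v)   # 4.0 12.0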
Negative Binomial Distribution
Like the geometric distribution, the negative binomial distribution (also called the Pascal distribution) describes repeated Bernoulli trials, but continued until the rth success. For example, if you keep tossing a coin until heads has appeared r times, the experiment ends on the toss that yields the rth head, after r-1 heads have occurred among the earlier tosses. The total number of tosses is the random variable. The distribution is therefore determined by two parameters: the required number of successes r and the probability p of success in a single trial. It is expressed as Equation 4.
$$\begin{align}\tag{4}X &\sim NB(r, p)\\X&: \text{Total number of trials until the rth success occurs}\\f(x) &= P(X=x)\\&=\binom{x-1}{r-1}p^r(1-p)^{x-r}\\x&=r,\, r+1, \,\cdots \end{align}$$
From the above expression, if the number of successes is r = 1, the negative binomial distribution reduces to the geometric distribution.
Example 1)
Calculate the probability of the event A that a sequence of coin tosses ends with the 4th head on the 10th toss, where the probability of heads on one toss is p. (The numerical evaluation below uses p = 1/4.)
This example can be decomposed into two independent events: B, the event that exactly 3 heads occur in the first 9 tosses, and C, the event that the 10th toss is a head, so that A = B ∩ C.
In this decomposition P(C) = p, and B follows a binomial distribution, so it can be expressed as follows.
$$P(B)=\binom{9}{3}p^3(1-p)^6$$
Since B and C are independent, the probability of A is the product P(B)·P(C):
$$\begin{align}P(A)&=\binom{9}{3}p^3(1-p)^6\cdot p\\&=\binom{9}{3}p^4(1-p)^6 \end{align}$$
The following code computes this directly and with nbinom.pmf(x-r, r, p) from the scipy.stats module. Note that scipy's nbinom counts the number of failures before the rth success, so the total number of trials x corresponds to the argument x - r.
from scipy.special import comb
p=Rational(1, 4)
pmfA=comb(9, 3)*p**4*(1-p)**6
round(pmfA, 3)
0.058
round(stats.nbinom.pmf(6, 4, 1/4), 3)
0.058
Figure 2 shows the change in the negative binomial distribution according to the parameter p.
plt.figure(figsize=(6, 4))
for i in [0.1, 0.3, 0.5, 0.7]:
    p=[stats.nbinom.pmf(j-3, 3, i) for j in range(3, 14)]
    plt.plot(range(3, 14), p, label="NB(3, "+str(i)+")")
plt.xlabel("x", size=13, weight="bold")
plt.ylabel("PMF", size=13, weight="bold")
plt.legend(loc='best')
plt.show()
Since a negative binomial random variable can be expressed as the sum of r independent geometric random variables, the expected value and variance can be calculated as in Equation 5.
$$\begin{align}\tag{5} &X=X_1+X_2+X_3+\cdots+X_r\\ &\begin{aligned}E(X)&=E(X_1)+E(X_2)+E(X_3)+\cdots+E(X_r)\\&=r\frac{1}{p}\end{aligned}\\ &\begin{aligned}Var(X)&=Var(X_1)+Var(X_2)+\cdots+Var(X_r)\\&=r\frac{1-p}{p^2}\end{aligned} \end{align}$$
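As a numerical cross-check of Equation 5 (a minimal sketch, taking r = 3 and p = 0.5 for illustration): scipy's nbinom.stats describes the number of failures rather than the total number of trials, so the mean r/p is recovered by adding r to the reported mean, while the variance r(1-p)/p² is reported directly.

# Equation 5 predicts E(X) = r/p = 6 and Var(X) = r*(1-p)/p**2 = 6
r0, p0 = 3, 0.5
m, v = stats.nbinom.stats(r0, p0, moments='mv')
print(m + r0, v)   # shifting by the constant r changes the mean, not the variance

Hypergeometric distribution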
The hypergeometric distribution describes the probability of obtaining a given number of "success" items when a sample is drawn from a finite population without replacement. For example, if 2 balls are drawn from a bag containing 5 blue balls and 3 red balls, the probability that both are red can be calculated as follows:
$$\frac{\binom{5}{0}\binom{3}{2}}{\binom{8}{2}}$$
p=comb(3, 2)*comb(5, 0)/comb(8, 2)
round(p, 3)
0.107
This trial is characterized by two fixed quantities: the number of items drawn from the population and the number of "success" items (here, red balls) that the population contains. Given these two conditions, the number of red balls among the draws becomes the random variable X.
$$R_x=\{0, \; 1,\; 2\}$$
The probability attached to each value of this variable follows the hypergeometric distribution, whose probability mass function is expressed as Equation 6.
$$ \begin{align}\tag{6} &X \sim \text{hypergeom}(M, n, N)\\ &M:\text{Size of the population} \\ &n:\text{Number of success items in the population} \\ &N :\text{Number of items drawn from the population}\\\\ &\begin{aligned} f(x)&=P(X=x)\\&=\frac{\binom{n}{x} \binom{M-n}{N-x}}{\binom{M}{N}} \end{aligned} \end{align}$$
The statistics of the hypergeometric distribution can be calculated with the hypergeom class of the scipy.stats module, which uses this same (M, n, N) parameter order. For this example:
k=2; M=8; n=3; N=2
round(stats.hypergeom.pmf(k, M, n, N), 4)
0.1071
For every value in the range $R_x$, the PMF can be calculated as
M=8; n=3; N=2
p=[stats.hypergeom.pmf(k, M, n, N) for k in range(3)]
np.around(p, 3)
array([0.357, 0.536, 0.107])
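Since these three values exhaust $R_x$, the probabilities should sum to 1; a quick check:

# the PMF over the full range R_x must total 1
print(round(np.sum(p), 6))   # 1.0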
Example 2)
Twenty balls are drawn from a bag of 50 blue balls and 30 red balls. Visualize the probability distribution of the number of blue balls drawn (0 to 20).
M=80; n=50; N=20
re=np.array([stats.hypergeom.pmf(k, M, n, N) for k in range(21)])
plt.plot(range(21), re, color="blue", label="hypergeom(80, 50, 20)")
plt.yticks(fontweight="bold")
plt.xticks(fontweight="bold")
plt.xlabel("X", fontsize="13", fontweight="bold")
plt.ylabel("Probability", fontsize="13", fontweight="bold")
plt.legend(loc="best")
plt.show()
The moments of the hypergeometric distribution can also be derived symbolically from the MGF. (Here the number of draws is the symbol k, so the PMF is written with k in place of the numeric N used above.)
k, M, n, x, t=symbols('k M n x t')
f=(binomial(n, x)*binomial(M-n, k-x))/binomial(M, k)
Mom=Sum(exp(t*x)*f, (x, 0, k))
Mom
$\quad \color{blue}{\displaystyle \sum_{x=0}^{k} \frac{e^{t x} {\binom{n}{x}} {\binom{M - n}{k - x}}}{{\binom{M}{k}}}}$
Md10=diff(Mom, t).subs(t, 0)
Md10  # expected value
$\quad \color{blue}{\displaystyle \sum_{x=0}^{k} \frac{x {\binom{n}{x}} {\binom{M - n}{k - x}}}{{\binom{M}{k}}}}$
Md20=diff(Mom, (t, 2)).subs(t, 0)
Md20  # second derivative of the MGF at t = 0
$\quad \color{blue}{\displaystyle \frac{\sum_{x=0}^{k} x^{2} {\binom{n}{x}} {\binom{M - n}{k - x}}}{{\binom{M}{k}}}}$
var=Md20-(Md10)**2
var  # variance
$\quad \color{blue}{\displaystyle - \left(\sum_{x=0}^{k} \frac{x {\binom{n}{x}} {\binom{M - n}{k - x}}}{{\binom{M}{k}}}\right)^{2} + \frac{\sum_{x=0}^{k} x^{2} {\binom{n}{x}} {\binom{M - n}{k - x}}}{{\binom{M}{k}}}}$
To reduce the expected value and variance derived above to closed form, the expressions are rewritten using the following binomial-coefficient identities, giving Equation 7.
$$\begin{align}\tag{7} &x\binom{n}{x} =n\binom{n-1}{x-1}\\ &\binom{M}{k}=\frac{M}{k}\binom{M-1}{k-1}\\\\ &\begin{aligned}E(X)&=M_x^\prime(0)\\&=\frac{kn}{M}\sum_{x=1}^{k} \frac{{\binom{n-1}{x-1}} {\binom{M - n}{k - x}}}{{\binom{M-1}{k-1}}}\\&=\frac{kn}{M}\\ & \because \sum_{x=1}^{k} \frac{{\binom{n-1}{x-1}} {\binom{M - n}{k - x}}}{{\binom{M-1}{k-1}}} = 1\end{aligned}\\ &\begin{aligned}Var(X)&=M_x^{\prime\prime}(0)-(E(X))^2\\&=\frac{kn}{M}\left(\frac{(M-n)(M-k)}{M(M-1)}\right)\end{aligned} \end{align}$$
Example 3)
For the distribution of Example 2, that is, hypergeom(M=80, n=50, N=20), calculate the expected value and variance.
E=Md10.subs({M:80, n:50, k:20}).doit()
E.evalf(4)
12.5
V=Md20.subs({M:80, n:50, k:20}).doit()
var=V-E**2
var.evalf(4)
$\quad \color{blue}{\text{3.560}}$
# by scipy.stats.hypergeom
m, v=stats.hypergeom.stats(80, 50, 20, moments='mv')
print('expected value:{:.2f}, variance:{:.4f}'.format(m, v))
expected value:12.50, variance:3.5601
Example 4)
Assume that the random variable X follows a hypergeometric distribution with a population of 52 items, of which 20 are success items, and 8 items drawn. Calculate the expected value and variance of this distribution.
N, m, n=52, 20, 8
E=(n*m)/N
round(E, 3)
3.077
Var=E*((N-m)*(N-n)/(N*(N-1)))
round(Var, 3)
1.634
mean, v=stats.hypergeom.stats(N, m, n, moments='mv')
print("expected value:{:.3f}, variance:{:.4f}".format(mean, v))
expected value:3.077, variance:1.6336
Example 5)
If 5 products are randomly selected from 50 products containing 4 defective ones, the number of defective products selected is the random variable X. Determine the mean and variance of X.
N, m, n=50, 4, 5
mean, v=stats.hypergeom.stats(N, m, n, moments='mv')
print("expected value:{:.3f}, variance:{:.4f}".format(mean, v))
expected value:0.400, variance:0.3380
Poisson distribution
The Poisson distribution is a discrete distribution for the number of events occurring in a fixed interval of time or space, for example the number of calls arriving at a call center per minute. It is one of the most widely used distributions, with the random variable X being the count of events. Because there is no fixed total number of trials, the probability cannot be computed as the ratio of occurrences to a total, the elementary notion of probability. Instead, if the average number of occurrences in the time or space of interest is known, probabilities are calculated with the probability mass function in Equation 8.
$$\begin{align}\tag{8} &P(X=x)=\frac{e^{-\lambda}\lambda^x}{x!}\\ &\lambda: \text{average number of events} \end{align}$$
The Poisson distribution with the probability mass function of Equation 8 rests on the following prerequisites.
- An event occurring in a particular interval of time or space must be independent of outcomes in other intervals. (independence)
- The average number of events per unit of time or space must be constant. (homogeneity)
- The probability of two or more events occurring simultaneously in a very short time or very small space must be essentially zero. (non-clustering)
The probability mass function of the Poisson distribution must sum to 1 over its entire range. That is, the following must hold:
$$\begin{align}\sum^\infty_{x=0}\frac{e^{-\lambda}\lambda^x}{x!}&=e^{-\lambda}\sum^\infty_{x=0}\frac{\lambda^x}{x!}\\&=e^{-\lambda}e^{\lambda}\\&=1 \end{align}$$
The second step applies the Taylor series of $e^\lambda$:
$$\begin{align}\sum^\infty_{x=0}\frac{\lambda^x}{x!}&=1+\lambda+\frac{\lambda^2}{2!}+\frac{\lambda^3}{3!}+\cdots\\&=e^\lambda \end{align}$$
The Taylor series applied in the expansion above can be checked using sympy's series() and Sum() functions.
a, x=symbols('lambda x')
f=a**x/factorial(x)
series(exp(a), a, 0, 6, '+')
$\quad \color{blue}{\displaystyle 1 + \lambda + \frac{\lambda^{2}}{2} + \frac{\lambda^{3}}{6} + \frac{\lambda^{4}}{24} + \frac{\lambda^{5}}{120} + O\left(\lambda^{6}\right)}$
Sum(f, (x, 0, oo)).doit()
$\quad \color{blue}{\displaystyle e^{\lambda}}$
In summary, if a random variable X depends on the parameter $\lambda$, then it follows a Poisson distribution and is expressed as
$$X \sim \text{Poisson}(\lambda)$$The sample space produced by this distribution is
$$R_x=\{0,\,1, \,2,\, \cdots\}$$
For $\lambda=5$, the probability given by the probability mass function is calculated for each value of the random variable as follows. The same calculation can be done with the pmf(k, $\lambda$) method of the poisson class in the scipy.stats module.
a, x=symbols('lambda x')
f=exp(-a)*a**x/factorial(x)
f
$\quad \color{blue}{\displaystyle \frac{\lambda^{x} e^{- \lambda}}{x!}}$
p=[f.subs({a:5, x:j}).evalf(4) for j in range(21)]
p
[0.006738, 0.03369, 0.08422, ... 1.056e-6, 2.641e-7]
np.around([stats.poisson.pmf(i, 5) for i in range(21)], 4)
array([0.0067, 0.0337, 0.0842, 0.1404, 0.1755, 0.1755, 0.1462, 0.1044, 0.0653, 0.0363, 0.0181, 0.0082, 0.0034, 0.0013, 0.0005, 0.0002, 0. , 0. , 0. , 0. , 0. ])
plt.figure(figsize=(7, 4))
for i in [1, 5, 10]:
    p=[f.subs({a:i, x:j}) for j in range(21)]
    plt.plot(range(21), p, label="Poisson("+str(i)+")")
plt.xlabel("X", fontsize="13", fontweight="bold")
plt.ylabel("Probability", fontsize="13", fontweight="bold")
plt.legend(loc="best")
plt.show()
The expected value and variance of the Poisson distribution can be calculated from the moment generating function shown in Equation 9.
$$\begin{align}\tag{9} &\text{MGF}\\ &\begin{aligned}M_x(t)&=E(e^{tX})\\&=\sum^\infty_{x=0}e^{tx}f(x)\\&=\sum_{x=0}^{\infty} \frac{\lambda^{x} e^{- \lambda} e^{t x}}{x!}\\&=e^{- \lambda}\sum_{x=0}^{\infty} \frac{\left(\lambda e^{t}\right)^{x}}{x!}\\&=e^{- \lambda} e^{\lambda e^{t}}\end{aligned} \end{align}$$
The expected value and variance derived from this MGF are given in Equation 10.
$$\begin{align}\tag{10} &E(X)=M_x^\prime(0)\\ &\sigma^2_x=M_x^{\prime\prime}(0)-(E(X))^2 \end{align}$$
a, x, t=symbols('lambda x t')
f=exp(-a)*a**x/factorial(x)
f
$\quad \color{blue}{\displaystyle \frac{\lambda^{x} e^{- \lambda}}{x!}}$
M_o=Sum(exp(t*x)*f, (x, 0, oo))
M_o
$\quad \color{blue}{\displaystyle \sum_{x=0}^{\infty} \frac{\lambda^{x} e^{- \lambda} e^{t x}}{x!}}$
M=exp(-a)*Sum((a*exp(t))**x/factorial(x), (x, 0, oo)).doit()
M
$\quad \color{blue}{\displaystyle e^{- \lambda} e^{\lambda e^{t}}}$
Md1=M.diff(t)
Md1
$\quad \color{blue}{\displaystyle \lambda e^{- \lambda} e^{t} e^{\lambda e^{t}}}$
E=Md1.subs(t, 0)
E
$\quad \color{blue}{\displaystyle \lambda}$
Md2=diff(M, (t, 2))
Md2
$\quad \color{blue}{\displaystyle \lambda \left(\lambda e^{t} + 1\right) e^{- \lambda} e^{t} e^{\lambda e^{t}}}$
var=Md2.subs(t, 0)-E**2
simplify(var)
$\quad \color{blue}{\displaystyle \lambda}$
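Both the expected value and the variance equal $\lambda$. As a quick numerical cross-check (a minimal sketch, taking $\lambda = 5$ for illustration), scipy should report the same:

# for a Poisson distribution, mean and variance are both lambda
m, v = stats.poisson.stats(5, moments='mv')
print(m, v)   # 5.0 5.0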
Example 6)
Suppose that, on average, a stock's price changes 50 times in 30 minutes, and that you model the number of changes with a Poisson distribution. Under this model, what is the probability of no change during a 5-minute interval? What are the expected value and variance of this distribution?
In this problem, the random variable is the number of changes over 5 minutes, so P(X=0) must be determined from the Poisson distribution. Since the average number of changes in 30 minutes is 50, the average over 5 minutes is $\displaystyle 50\cdot\frac{5}{30}=\frac{25}{3}$. That is, the parameter becomes $\displaystyle \lambda =\frac{25}{3}$.
p=stats.poisson.pmf(0, 25/3)
round(p, 4)
0.0002
# apply Equation 8 directly
la=Rational(25, 3)
(exp(-la)*la**0/factorial(0)).evalf(1)
0.0002
In the above problem, $R_x=\{0, \, 1, \, 2, \, \cdots \}$ , and the probability distribution is as follows.
plt.figure(figsize=(7, 4))
p=[stats.poisson.pmf(i, 25/3) for i in range(0, 51)]
plt.plot(range(len(p)), p, label="Poisson(25/3)")
plt.yticks(fontweight="bold")
plt.xticks(fontweight="bold")
plt.xlabel("X", fontsize="13", fontweight="bold")
plt.ylabel("Probability", fontsize="13", fontweight="bold")
plt.legend(loc="best")
plt.show()
The expected value and variance of the above distribution are:
m, v=stats.poisson.stats(25/3, moments='mv')
print("Expected Value:{:.3f}, Variance:{:.4f}".format(m, v))
Expected Value:8.333, Variance:8.3333
Example 7)
The pmf of a discrete random variable X is given by the table below.
Determine P(X ≤ 50), P(25 < X < 75), and P(X=20 | X<60).
The range $R_x$ of this distribution is $\{20,\,30,\,40,\,50,\,70\}$, with the following probabilities.
p=[0.1, 0.2, 0.2, 0.3, 0.2]
x=[20, 30, 40, 50, 70]
d=pd.DataFrame([p, x]).T
d.columns=["P(x)", "x"]
d
|   | P(x) | x    |
|---|------|------|
| 0 | 0.1  | 20.0 |
| 1 | 0.2  | 30.0 |
| 2 | 0.2  | 40.0 |
| 3 | 0.3  | 50.0 |
| 4 | 0.2  | 70.0 |
# P(X <= 50): sum the probabilities for x = 20, 30, 40, 50
np.sum(d.iloc[:4, 0])
$\quad \color{blue}{\text{0.8}}$
# P(X=20 | X<60) = P(X=20)/P(X<60)
d.iloc[0, 0]/np.sum(d.iloc[:4, 0])
0.125
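The remaining probability asked for, P(25 < X < 75), covers the values 30, 40, 50, and 70; a minimal sketch:

# P(25 < X < 75): sum the probabilities for x = 30, 40, 50, 70
print(np.sum(d.iloc[1:, 0]))   # 0.9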
Example 8)
Consider the following case where two dice are rolled and the number shown on each die is the random variable X and Y, respectively.
- The ranges are $R_X=R_Y=\{1,\,2,\,\cdots,\,6\}$, and each die has the probability mass function $P(X=x)=P(Y=y)=\frac{1}{6}$; a check by enumeration follows the next item.
- $\displaystyle P(X=2, Y=6) =\frac{1}{6} \cdot \frac{1}{6}=\frac{1}{36} $
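As a quick check of this joint probability (a minimal sketch assuming two independent fair dice), the 36 equally likely outcomes can be enumerated:

from itertools import product

# enumerate all 36 outcomes of two fair dice and count (X=2, Y=6)
outcomes = list(product(range(1, 7), repeat=2))
p = sum(1 for o in outcomes if o == (2, 6))/len(outcomes)
print(p)   # 0.0277... = 1/36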