Contents
- Probability Mass Function (PMF)
- Cumulative Distribution Function (CDF)
- Bernoulli and the binomial probability distribution
Discrete probability distribution
A probability distribution assigns a probability to each point or interval of the sample space. This assignment can be written as a function: if the random variable is discrete, the function is called the probability mass function (PMF), and if it is continuous, it is called the probability density function (PDF). In both cases, the probability accumulated up to a given value of the variable is called the cumulative distribution function (CDF). A probability distribution can be visualized by plotting each value of the random variable against the value of one of these functions. The shapes of distributions observed in data often resemble those generated by a few well-known functions. Therefore, various statistical methods can be applied by assuming an appropriate probability distribution when analyzing data. For this reason, understanding the characteristics of these distributions provides the basis for statistical analysis of data. This post introduces discrete probability distributions.
Organizing a distribution in terms of its probability mass function (PMF) and cumulative distribution function (CDF) makes it much easier to understand. The probability density function (PDF) is covered when continuous variables are introduced.
Probability Mass Function(PMF)
If the range $R_x$ of the random variable X is a countable set, it can be written as follows.
$$R_x=\{x_1,\; x_2, \;x_3, \;\cdots\} $$
A random variable is a function that maps each outcome s of the sample space S to a number. That is, $x_1, x_2, x_3, \cdots$ are the values the random variable can take. The function that gives the probability of each of these values is the probability mass function. If the event of interest is A, it is expressed as follows.
$$\text{A}=\{\text{s } \in \text{S} | \text{X(s)}=\text{x}_i\} $$
The probability of event A is given by the probability mass function as
$$ f(x_i)=P(X=x_i) , \quad i = 1,\, 2,\, \cdots$$
Example 1)
A coin is tossed twice, and the number of heads is the random variable. Determine the value of the random variable for each outcome in S and its probability mass function (PMF).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sympy import *
X=['H','T']
S=np.array([])
for i in X:
    for j in X:
        # count the heads in each outcome (i, j)
        S=np.append(S, (i, j).count("H"))
S
array([2., 1., 1., 0.])
uni, fre=np.unique(S, return_counts=True)
uni, fre
(array([0., 1., 2.]), array([1, 2, 1]))
p=fre/np.sum(fre)
p
array([0.25, 0.5 , 0.25])
re=pd.DataFrame([uni, fre, p], index=["# of H", "Frequency", "Probability(p)"]).T
re
| | # of H | Frequency | Probability(p) |
|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.25 |
| 1 | 1.0 | 2.0 | 0.50 |
| 2 | 2.0 | 1.0 | 0.25 |
plt.figure(figsize=(6, 4))
# plot the probabilities rather than the raw counts so that the y-axis really is the PMF
plt.bar(uni, p, width=0.8)
plt.xticks(uni)
plt.xlabel("# of H", size=13, weight="bold")
plt.ylabel("PMF", size=13, weight="bold")
plt.show()
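The enumeration in Example 1 can also be carried out with exact fractions. The following is a minimal sketch rather than part of the original post: the helper name `pmf_from_outcomes` is hypothetical, and it assumes that every listed outcome is equally likely.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def pmf_from_outcomes(values):
    """Return {value: probability}, assuming each listed outcome is equally likely."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: Fraction(count, total) for value, count in sorted(counts.items())}


# Number of heads in every outcome of two tosses: HH, HT, TH, TT
heads = [outcome.count("H") for outcome in product("HT", repeat=2)]
coin_pmf = pmf_from_outcomes(heads)

for value, prob in coin_pmf.items():
    print(value, prob)
```

Because the probabilities are stored as `Fraction` objects, the check that a PMF sums to exactly 1 does not suffer from floating-point rounding.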
Example 2)
Two dice are rolled, and the sum of the two values is the random variable. What are the range and the probability mass function of this random variable?
x=[1,2,3,4,5,6]
s=np.array([])
for i in x:
    for j in x:
        # sum of the two dice for the outcome (i, j)
        s=np.append(s, i+j)
s
array([ 2., 3., 4., 5., 6., 7., 3., 4., 5., 6., 7., 8., 4., 5., 6., 7., 8., 9., 5., 6., 7., 8., 9., 10., 6., 7., 8., 9., 10., 11., 7., 8., 9., 10., 11., 12.])
uni, fre=np.unique(s, return_counts=True)
P=fre*Rational(1, 36)
re=pd.DataFrame([uni, fre, P], index=["Sum", "Frequency", "Probability(p)"]).T
re
| | Sum | Frequency | Probability(p) |
|---|---|---|---|
| 0 | 2.0 | 1 | 1/36 |
| 1 | 3.0 | 2 | 1/18 |
| 2 | 4.0 | 3 | 1/12 |
| 3 | 5.0 | 4 | 1/9 |
| 4 | 6.0 | 5 | 5/36 |
| 5 | 7.0 | 6 | 1/6 |
| 6 | 8.0 | 5 | 5/36 |
| 7 | 9.0 | 4 | 1/9 |
| 8 | 10.0 | 3 | 1/12 |
| 9 | 11.0 | 2 | 1/18 |
| 10 | 12.0 | 1 | 1/36 |
plt.figure(figsize=(6, 4))
# one bar per possible sum (2 to 12); fre/36 is the probability of each sum
plt.bar(uni, fre/36, width=0.8)
plt.xticks(uni)
plt.xlabel("Sum of two dice", size=13, weight="bold")
plt.ylabel("PMF", size=13, weight="bold")
plt.show()
Example 3)
Consider a coin-toss experiment with the following conditions:
- The probability of getting heads is p
- Random variable Y: the number of tosses needed until heads comes up for the first time
Determine the probability distribution of Y and compute $P(2 \le Y \le 5)$ if $p=\frac{1}{2}$.
$$\begin{align}&f(1)=P(Y=1)=P(H)=p\\&f(2) = P(Y=2)=P(TH)=(1-p)p\\ &f(3) = P(Y=3)=P(TTH)=(1-p)^2p\\ & \qquad \cdots\\ &f(y) = P(Y=y)=P(TT \cdots H)=(1-p)^{y-1}p\\ &P(2 \le Y \le 5)=\sum^5_{y=2} p(1-p)^{y-1} \end{align}$$
p=Rational(1, 2)
Px=np.array([(1-p)**(i-1)*p for i in range(2, 6)])
Px
array([1/4, 1/8, 1/16, 1/32], dtype=object)
P=np.sum(Px)
P
$\displaystyle \color{blue}{\frac{15}{32}}$
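The distribution in Example 3 is commonly known as the geometric distribution. As a hedged cross-check, the sketch below recomputes the interval probability with exact fractions and compares it with the closed form $(1-p)^{a-1}-(1-p)^{b}$ for $P(a \le Y \le b)$. The function names are hypothetical and are not taken from the original post.

```python
from fractions import Fraction


def first_head_pmf(y, p):
    """P(Y = y): y-1 tails followed by one head."""
    return (1 - p) ** (y - 1) * p


def interval_prob(a, b, p):
    """P(a <= Y <= b) from the closed form (1-p)^(a-1) - (1-p)^b."""
    return (1 - p) ** (a - 1) - (1 - p) ** b


p = Fraction(1, 2)
by_sum = sum(first_head_pmf(y, p) for y in range(2, 6))
by_formula = interval_prob(2, 5, p)
print(by_sum, by_formula)

# The mass of the first 50 values is already within (1/2)**50 of 1
partial_total = sum(first_head_pmf(y, p) for y in range(1, 51))
```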
Cumulative Distribution Function(CDF)
A probability mass function gives the probability at a single point. This is possible for a discrete variable because its values are countable. For a continuous variable, which is introduced in the next chapter, the probability of any single point is zero, so a probability mass function cannot be defined, and the probability density function is used instead. Probabilities over a range, however, can be calculated for both discrete and continuous variables: they are the probability accumulated over several points or over an interval. The function that accumulates probability up to a given value is called the cumulative distribution function (CDF).
The cumulative distribution function (CDF) of the random variable X is defined as follows.
$$F(x)=P(X \le x), \quad \forall x \in \mathbb{R} $$
Example 4)
If the random variable X is the number of heads in the trial of tossing two coins, then $R_X=\{0, 1, 2\}$. Determine the cumulative distribution function in this case.
To enumerate all outcomes of this trial, the product() function of the itertools module can be used in place of nested for loops.
import itertools as itt
case=list(itt.product(range(2), repeat=2))
case
[(0, 0), (0, 1), (1, 0), (1, 1)]
Rx=np.array([i[0]+i[1] for i in case])
Rx
array([0, 1, 1, 2])
uni, fre=np.unique(Rx, return_counts=True)
uni
array([0, 1, 2])
pmf=fre*Rational(1, 4)
pmf
array([1/4, 1/2, 1/4], dtype=object)
The above results are summarized as follows.
$$f(x)=\begin{cases}P(X=0)&=\frac{1}{4}\\P(X=1)&=\frac{1}{2}\\P(X=2)&=\frac{1}{4}\end{cases}$$
Accumulating the probability mass function up to each value of the range gives the cumulative distribution function, as follows.
$$F(x)=\begin{cases}P(X \le 0)&=\frac{1}{4}\\P(X \le 1)&=\frac{1}{4}+\frac{1}{2}=\frac{3}{4}\\P(X \le 2)&=\frac{1}{4}+\frac{1}{2}+\frac{1}{4}=1\end{cases}$$
In the following code, the cumsum() function of the numpy module is used to calculate the cumulative sum of the array.
cdf=np.cumsum(pmf)
cdf
array([1/4, 3/4, 1], dtype=object)
# where="post" holds each CDF value constant until the next point of Rx
plt.step(uni, cdf, where="post", label="CDF")
plt.scatter(uni, pmf, label="PMF", color="red")
plt.xlabel("Rx", size=13, weight="bold")
plt.ylabel("Probability", size=13, weight="bold")
plt.legend(loc="best")
plt.show()
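Since $F(x)=P(X \le x)$ is a step function, it can be evaluated at values that lie between or outside the points of $R_X$. The sketch below illustrates this for the two-coin example. The function name `cdf_at` is hypothetical, and the PMF is entered by hand from the result of Example 4.

```python
from fractions import Fraction

# PMF of the number of heads in two coin tosses (from Example 4)
two_coin_pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}


def cdf_at(x, pmf):
    """F(x) = P(X <= x): add the mass of every support point not exceeding x."""
    return sum((prob for value, prob in pmf.items() if value <= x), Fraction(0))


for x in (-1, 0, 0.5, 1, 2.7):
    print(x, cdf_at(x, two_coin_pmf))
```

The value stays at 1/4 for any x from 0 up to (but not including) 1, then jumps to 3/4 at x = 1, which is the staircase shape drawn by a step plot.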
Bernoulli and the binomial probability distribution
A probability distribution with only two outcomes in a single trial, success or failure (1 or 0), is called the Bernoulli distribution. That is, the random variable takes only two values. The probability mass function of this distribution can be expressed as Equation 1.
$$\begin{equation}\tag{1} f(x)=P(X=x)=\begin{cases}p&\quad \text{for}\, x=1\\1-p&\quad \text{for}\, x=0\\ 0&\quad \text{otherwise} \end{cases} \end{equation}$$
The above probability mass function (PMF) is determined by one parameter, the probability p. Therefore, this distribution can be expressed as
$$X \sim \text{Bernoulli(p)} $$
Example 5)
If one die is rolled, let the random variable X be 1 when a chosen event with probability $\frac{1}{3}$ occurs (two of the six faces) and 0 otherwise.
What are the probability mass function (PMF) and E(X) of this distribution?
$$\begin{aligned} &f(x)=\left(\frac{1}{3}\right)^x\left(\frac{2}{3}\right)^{1-x}\\ &\begin{aligned}E(X)&=1 \cdot \frac{1}{3}+0 \cdot \frac{2}{3}\\&=\frac{1}{3}\end{aligned} \end{aligned}$$
Statistics such as the PMF, CDF, expected value (mean), and variance of the Bernoulli distribution can be calculated using the bernoulli class of the scipy.stats module.
from scipy import stats
np.around(stats.bernoulli.pmf(1, 1/3), 3)
0.333
np.around(stats.bernoulli.stats(1/3, moments="mv"), 3)
array([0.333, 0.222])
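The mean and variance reported above can be reproduced directly from the definitions $E(X)=\sum x f(x)$ and $Var(X)=\sum (x-E(X))^2 f(x)$. The following is a small sketch with exact fractions; the function name `bernoulli_pmf` is hypothetical and simply mirrors Equation 1.

```python
from fractions import Fraction


def bernoulli_pmf(x, p):
    """PMF of Bernoulli(p): p at x=1, 1-p at x=0, and 0 elsewhere."""
    if x == 1:
        return p
    if x == 0:
        return 1 - p
    return 0


p = Fraction(1, 3)
mean = sum(x * bernoulli_pmf(x, p) for x in (0, 1))
variance = sum((x - mean) ** 2 * bernoulli_pmf(x, p) for x in (0, 1))
print(mean, variance)
```

The results are p and p(1-p), which round to the values 0.333 and 0.222 returned by scipy.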
The binomial distribution is the probability distribution of the number of successes when the above Bernoulli trial is repeated independently several times. For example, if a coin is tossed 3 times and the number of heads is the random variable X, the range of the random variable is $R_x=\{0,\, 1,\, 2, \, 3\}$. In this case, if p is the probability of coming up heads, the probability for each value can be expressed as
$$\begin{aligned} &f(0)=P(X=0)=P( \text{TTT})=(1-p)^3\\ &f(1)=P(X=1)=P( \text{TTH or THT or HTT})=3(1-p)^2 p\\ &f(2)=P(X=2)=P( \text{THH or HHT or HTH})=3p^2(1-p)\\ &f(3)=P(X=3)=P(\text{HHH})=p^3 \end{aligned}$$
Using the combination formula to count the arrangements in the above process, the probability mass function of the binomial distribution is defined as Equation 2.
$$\begin{align}\tag{2} &f(x)=P(X=x)=\binom{n}{x}p^x(1-p)^{n-x}\\ & x: 0,\, 1, \,2, \,\cdots,\, n \end{align}$$
For n trials, the probability of x successes is calculated in the same way as above. In other words, the binomial distribution is characterized by the number of trials n and the probability p. This distribution, along with its parameters, is written briefly as:
$$X \sim \text{B(n, p)} $$
The probability mass function of the binomial distribution can be calculated using the pmf method of scipy.stats.binom, called as stats.binom.pmf(x, n, p).
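To connect the formula for the binomial PMF with the library call, the sketch below evaluates the formula using `math.comb` and compares it with `scipy.stats.binom.pmf`. The helper name `binom_pmf` and the choice of B(10, 0.3) are illustrative assumptions.

```python
from math import comb

import numpy as np
from scipy import stats


def binom_pmf(x, n, p):
    """Binomial PMF written out: C(n, x) * p**x * (1-p)**(n-x)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


n, p = 10, 0.3
manual = np.array([binom_pmf(x, n, p) for x in range(n + 1)])
library = stats.binom.pmf(np.arange(n + 1), n, p)

print(np.allclose(manual, library))
print(manual.sum())
```

Passing an array as the first argument of `pmf` evaluates all values of x in one call, which avoids the list comprehension.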
Example 6)
Visualize the binomial distribution of B(10, 0.3) and B(20, 0.6).
px=[stats.binom.pmf(i, 10, 0.3) for i in range(21)]
py=[stats.binom.pmf(i, 20, 0.6) for i in range(21)]
# Px is tabulated for x=0..10 and Py for y=10..20
data=pd.DataFrame([range(11), px[:11], range(10, 21), py[10:21]]).T
data.columns=["x", "Px", "y", "Py"]
data
| | x | Px | y | Py |
|---|---|---|---|---|
| 0 | 0.0 | 0.028248 | 10.0 | 0.117142 |
| 1 | 1.0 | 0.121061 | 11.0 | 0.159738 |
| 2 | 2.0 | 0.233474 | 12.0 | 0.179706 |
| 3 | 3.0 | 0.266828 | 13.0 | 0.165882 |
| 4 | 4.0 | 0.200121 | 14.0 | 0.124412 |
| 5 | 5.0 | 0.102919 | 15.0 | 0.074647 |
| 6 | 6.0 | 0.036757 | 16.0 | 0.034991 |
| 7 | 7.0 | 0.009002 | 17.0 | 0.012350 |
| 8 | 8.0 | 0.001447 | 18.0 | 0.003087 |
| 9 | 9.0 | 0.000138 | 19.0 | 0.000487 |
| 10 | 10.0 | 0.000006 | 20.0 | 0.000037 |
plt.figure(figsize=(10, 5))
plt.subplots_adjust(wspace=0.5)
# left panel: B(10, 0.3)
plt.subplot(1,2,1)
plt.bar(range(21), px, label='B(10, 0.3)')
plt.xlabel("x", size=13, weight="bold")
plt.ylabel("PMF", size=13, weight="bold")
plt.legend(loc='best')
# right panel: B(20, 0.6)
plt.subplot(1,2,2)
plt.bar(range(21), py, label='B(20, 0.6)')
plt.xlabel("x", size=13, weight="bold")
plt.ylabel("PMF", size=13, weight="bold")
plt.legend(loc='best')
plt.show()
As Figure 4 shows, the shape of the binomial distribution changes with the parameters n and p. A binomial random variable is the sum of repeated Bernoulli trials. In other words, a single coin toss follows a Bernoulli distribution with Bernoulli(p) as the probability mass function. If this trial is repeated independently n times, the sum of the outcomes follows a binomial distribution, that is, B(n, p).
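The statement that a sum of Bernoulli trials forms a binomial distribution can be checked numerically: the PMF of a sum of independent variables is the convolution of their PMFs. The sketch below assumes independent trials and convolves the Bernoulli PMF with itself n times. The function name `sum_of_bernoulli_pmf` and the values n=3, p=0.5 are illustrative choices.

```python
from math import comb

import numpy as np


def sum_of_bernoulli_pmf(n, p):
    """PMF of X1 + ... + Xn for independent Bernoulli(p) variables, by repeated convolution."""
    single = np.array([1 - p, p])  # P(X=0), P(X=1)
    pmf = np.array([1.0])          # sum of zero trials is 0 with probability 1
    for _ in range(n):
        pmf = np.convolve(pmf, single)
    return pmf


n, p = 3, 0.5
by_convolution = sum_of_bernoulli_pmf(n, p)
by_formula = np.array([comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1)])

print(by_convolution)
print(np.allclose(by_convolution, by_formula))
```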
If $X_1, X_2, \cdots, X_n$ are independent Bernoulli(p) random variables, then their sum $X=X_1+X_2+ \cdots +X_n$ follows the binomial distribution B(n, p). By the same reasoning, adding two independent binomial random variables with the same probability p yields a new binomial random variable.
$$\begin{align}&X \sim B(n, p) \\ &Y \sim B(m, p)\end{align}$$
The sum of the two binomial variables, Z = X + Y, is distributed as follows.
$$Z \sim B(n+m, p)$$
Example 7)
If the defect rate of a factory's products is 0.01, what is the probability that at least 2 defective products are included when 30 products are randomly selected?
The range of the random variable is $R_x=\{0,\, 1, \,2, \, \cdots, \, 30\}$, and it follows the binomial distribution:
$$X \sim B(30, 0.01)$$
The probability that two or more defective items are included is calculated as follows.
$$P(X \ge 2) = 1 -P(X \le 1)$$
This can be computed with either the probability mass function or the cumulative distribution function, as shown in the following code.
pthan2=1-sum([stats.binom.pmf(i, 30, 0.01) for i in range(2)])
round(pthan2, 4)
0.0361
round(1-stats.binom.cdf(1, 30, 0.01), 4)
0.0361
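Tail probabilities of the form $P(X \ge k)$ are easy to get off by one, because the complement of "at least k" is "at most k-1", so $P(X \ge k)=1-P(X \le k-1)$. As a hedged sketch, the survival function `sf` of `scipy.stats.binom`, which returns $P(X > k)$, can express this directly. The wrapper name `prob_at_least` is hypothetical.

```python
from scipy import stats


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ B(n, p), written as the survival function at k-1."""
    return stats.binom.sf(k - 1, n, p)


p_at_least_2 = prob_at_least(2, 30, 0.01)
# the same quantity through the complement of the CDF at k-1
p_complement = 1 - stats.binom.cdf(1, 30, 0.01)

print(round(p_at_least_2, 4), round(p_complement, 4))
```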
Example 8)
A die is rolled three times, and the prize money is determined by the number of rolls on which a specified number appears.
Number of appearances | 0 | 1 | 2 | 3 |
---|---|---|---|---|
prize | -1000 | 1000 | 2000 | 3000 |
Determine the expected value of this game.
This trial can be expressed as a binomial distribution as
$$\begin{align} &X \sim B\left(3, \frac{1}{6} \right) \\ & f(x)=P(X=x)=\binom{3}{x}\left(\frac{1}{6}\right)^x\left(\frac{5}{6} \right)^{3-x}, \quad x=0,\;1,\;2,\;3 \end{align}$$
P=np.array([-1000, 1000, 2000, 3000])
pmf=[stats.binom.pmf(i, 3, 1/6) for i in range(4)]
np.around(pmf, 3)
array([0.579, 0.347, 0.069, 0.005])
Ex=np.sum(P*pmf)
round(Ex,3)
-78.704
As a result, a loss of approximately \$79 per game is expected.
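The expected value can also be approximated by simulation. The sketch below assumes that X counts how many of the three rolls show the specified number, so that X ~ B(3, 1/6) as in the formula above. The seed and the number of simulated games are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

prizes = np.array([-1000, 1000, 2000, 3000])  # prize for 0, 1, 2, 3 appearances
n_games = 200_000

# number of appearances of the specified face in three rolls, for each game
hits = rng.binomial(3, 1 / 6, size=n_games)
simulated_mean = prizes[hits].mean()

print(round(float(simulated_mean), 1))
```

The simulated average should land near the exact expected value, with the gap shrinking as `n_games` grows.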
Example 9)
In a daily trade of a stock, the probability that the closing price rises by more than 1% relative to the opening price is 0.45. Suppose you trade the stock 10 times under an investment plan of buying at the opening price and selling at the closing price when it has risen 1%. How many times will this trade be completed on average? Also, what is the probability that the trade is completed at least 5 times out of 10?
The number of completed trades X follows B(10, 0.45). This distribution is shown in Figure 5.
plt.figure(figsize=(6, 4))
p=[stats.binom.pmf(i, 10, 0.45) for i in range(11)]
plt.bar(range(11), p, label='B(10, 0.45)')
plt.xlabel("x", size=13, weight="bold")
plt.ylabel("PMF", size=13, weight="bold")
plt.legend(loc='best')
plt.show()
Ex=sum([i*stats.binom.pmf(i, 10, 0.45) for i in range(11)])
round(Ex, 3)
4.5
This trade will be completed 4.5 times out of 10 on average.
p4=stats.binom.cdf(4, 10, 0.45)
pthan5=1-p4
round(pthan5,3)
0.496
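When several quantities of the same distribution are needed, the parameters can be fixed once by calling `stats.binom(n, p)`, which returns a "frozen" distribution object. The sketch below uses it for B(10, 0.45) and also separates "at least 5" from "more than 5", which are different events for a discrete variable. The variable names are illustrative.

```python
from scipy import stats

trade = stats.binom(10, 0.45)  # frozen B(10, 0.45)

mean = trade.mean()            # n*p
variance = trade.var()         # n*p*(1-p)
p_at_least_5 = trade.sf(4)     # P(X >= 5) = P(X > 4)
p_more_than_5 = trade.sf(5)    # P(X >= 6) = P(X > 5)

print(mean, variance)
print(round(p_at_least_5, 3), round(p_more_than_5, 3))
```

The two tail probabilities differ by exactly P(X = 5), so the wording of the question decides which cutoff to pass to `sf` or `cdf`.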
The expected value and variance of the binomial distribution can be derived from the moment generating function $M_x(t)$. The nth moment is equal to the nth derivative of the moment generating function evaluated at t=0. The expected value, i.e., the mean, is the first moment, so it is obtained from the first derivative of the following function.
$$\begin{align}&\begin{aligned}M_x(t)&=E(e^{tx})\\&=\sum^n_{x=0} e^{tx}\binom{n}{x} p^x(1-p)^{n-x}\\&=\sum^n_{x=0} \binom{n}{x} (pe^t)^x(1-p)^{n-x}\\&=(pe^t+1-p)^n\end{aligned}\\&\because\; (a+b)^n=\sum^n_{x=0} \binom{n}{x}a^x b^{n-x} \end{align}$$
The variance is the second moment (the second derivative of $M_x(t)$ at t=0) minus the square of the first moment and is calculated as follows.
$$\sigma^2=\frac{d^2 M_x(t)}{dt^2}\bigg|_{t=0}-[E(X)]^2$$
The derivatives are calculated by applying the diff() function of the sympy module.
t, p, n=symbols("t p n")
M=(p*exp(t)+1-p)**n
Md1=M.diff(t)
Md1
$\quad \displaystyle \color{blue}{\frac{n p \left(p e^{t} - p + 1\right)^{n} e^{t}}{p e^{t} - p + 1}}$
E=Md1.subs(t, 0)
E
$\quad \color{blue}{\text{np}}$
Md2=Md1.diff(t)
simplify(Md2)
$\quad \color{blue}{\displaystyle \frac{n p \left(p \left(n - 1\right) \left(p e^{t} - p + 1\right)^{n + 1} e^{t} + \left(p e^{t} - p + 1\right)^{n + 2}\right) e^{t}}{\left(p e^{t} - p + 1\right)^{3}}}$
Md20=Md2.subs(t, 0)
Md20
$\quad \color{blue}{\displaystyle n^{2} p^{2} - n p^{2} + n p}$
var=Md20-E**2
var
$\quad \color{blue}{\displaystyle - n p^{2} + n p}$
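For comparison, the same derivatives can be worked out by hand with the chain and product rules. This is a sketch of the steps, starting from $M_x(t)=(pe^t+1-p)^n$ and using $pe^0+1-p=1$:
$$\begin{align}M_x'(t)&=npe^t(pe^t+1-p)^{n-1} &&\Rightarrow\; M_x'(0)=np\\ M_x''(t)&=npe^t(pe^t+1-p)^{n-1}+n(n-1)p^2e^{2t}(pe^t+1-p)^{n-2} &&\Rightarrow\; M_x''(0)=np+n(n-1)p^2\\ \sigma^2&=M_x''(0)-[M_x'(0)]^2=np+n(n-1)p^2-n^2p^2=np(1-p)\end{align}$$
The value $M_x''(0)=n^2p^2-np^2+np$ agrees with the sympy result for the second derivative at t=0.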
Summarizing the above results, the expected value and variance of the binomial distribution are as shown in Equation 3.
$$\begin{align}\tag{3}&E(X)=np\\&Var(X)=np(1-p) \end{align}$$
Example 10)
A test consists of 15 questions, each with five choices. If a student selects every answer at random, what is the probability of getting 5 or more correct answers? And what are the expected value and variance?
The number of correct answers follows a binomial distribution with probability p=$\displaystyle \frac{1}{5}$ of answering a single question correctly.
$$\begin{align} &R_x=\{0,\, 1,\, 2,\, \cdots, \, 15 \} \\& f(x)=\binom{15}{x}\left(\frac{1}{5} \right)^x \left(\frac{4}{5} \right)^{15-x}\end{align}$$
Use the sympy module to calculate directly from the above probability mass function.
n, x=symbols("n x")
f=binomial(n, x)*(1/5)**x*(4/5)**(n-x)
# probabilities of 0 to 4 correct answers
Fless4=[f.subs({n:15, x:i}) for i in range(5)]
Fless4
[0.0351843720888320, 0.131941395333120, 0.230897441832960, 0.250138895319040, 0.187604171489280]
Fless4_np=np.array(Fless4, dtype=np.float64)
np.around(Fless4_np, 3)
array([0.035, 0.132, 0.231, 0.25 , 0.188])
Fmore5=1-np.sum(Fless4_np)
Fmore5
0.16423372393676727
The above code uses the sympy package. As a result, the data types of Fless4 and its elements are as follows.
type(Fless4), type(Fless4[0])
(list, sympy.core.numbers.Float)
That is, Fless4 is a Python list, and each element is sympy's own Float type, which differs from the numeric types used in numpy arrays. To apply numpy functions reliably to these values, the elements must be converted to a numpy data type. For this reason, dtype is specified when creating Fless4_np in the code above.
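The difference between sympy and numpy element types can be seen in a short sketch. Here the probabilities are built from `Rational` values, so they are exact fractions rather than sympy Floats. This is a variation on the code above, not the original computation, and the variable names are illustrative.

```python
import numpy as np
from sympy import Rational, binomial

# exact probabilities of 0 to 4 correct answers out of 15, with p = 1/5
exact = [binomial(15, i) * Rational(1, 5) ** i * Rational(4, 5) ** (15 - i) for i in range(5)]

as_object = np.array(exact)                  # elements stay sympy objects
as_float = np.array(exact, dtype=np.float64) # elements converted to numpy floats

tail_exact = 1 - sum(exact)                  # exact P(X >= 5) as a sympy Rational
tail_float = 1 - as_float.sum()

print(as_object.dtype, as_float.dtype)
print(float(tail_exact), tail_float)
```

Without an explicit `dtype`, numpy keeps the sympy objects in an array of dtype `object`; with `dtype=np.float64` each element is converted once, and later numpy operations work on ordinary floats.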
In addition to the direct calculation in the code above, the scipy.stats.binom.cdf() function can be applied.
Fthan5=1-stats.binom.cdf(4, 15, 1/5)
round(Fthan5, 3)
0.164
The mean and variance of this distribution are
mu=15*1/5
var=15*(1/5)*(1-1/5)
print("mean:{:.3f}, variance:{:.3f}".format(mu, var))
mean:3.000, variance:2.400
mu, var=stats.binom.stats(15, 1/5, moments="mv")
print("mean:{:.3f}, variance:{:.3f}".format(mu, var))
mean:3.000, variance:2.400