Discrete probability distribution I : Bernoulli and Binomial

probability mass function
Cumulative Distribution Function(CDF)
Bernoulli and the binomial probability distribution

Discrete probability distribution

A probability distribution assigns a probability to each point or interval of the sample space. These probabilities can be written as a function: if the random variable is discrete, the function is called the probability mass function (PMF); if it is continuous, it is called the probability density function (PDF). In both cases, the probability accumulated up to a given value of the variable is called the cumulative distribution function (CDF). A probability distribution can be visualized by plotting each value of the random variable against the corresponding function value. The shapes of observed distributions often resemble those generated by particular functions. Therefore, various statistical methods can be applied by assuming an appropriate probability distribution when analyzing data. Understanding the characteristics of these distributions thus provides the basis for statistical analysis of data. This post introduces discrete probability distributions.

Organizing a distribution by its probability mass function (PMF) and cumulative distribution function (CDF) makes it much easier to understand. The probability density function (PDF) is covered when continuous variables are introduced.

Probability Mass Function(PMF)

If the range R_x of the random variable X is a countable set, it can be written as follows. $$R_x=\{x_1,\; x_2, \;x_3, \;\cdots\} $$

A random variable is a function that maps each outcome s of the sample space S to a value. That is, x₁, x₂, x₃, etc. are the values that X assigns to the outcomes in S. The function that gives the probability of each of these values is the probability mass function. If the event of interest is A, it is expressed as follows.

$$\text{A}=\{\text{s } \in \text{S} | \text{X(s)}=\text{x}_i\} $$

The probability of event A is formulated by the probability mass function as

$$ f(x_i)=P(X=x_i) , \quad i = 1,\, 2,\, \cdots$$
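
The definition implies two properties that any PMF must satisfy: every probability is non-negative, and the probabilities over R_x sum to 1. The following is a minimal sketch of that check, assuming the PMF is stored as a dictionary from values to exact fractions; the helper name `is_valid_pmf` and the fair-die example are illustrative and not part of the original post.

```python
from fractions import Fraction

def is_valid_pmf(pmf):
    """Return True if all probabilities are non-negative and sum to exactly 1."""
    values = list(pmf.values())
    return all(v >= 0 for v in values) and sum(values) == 1

# Illustrative PMF: one roll of a fair die
fair_die = {face: Fraction(1, 6) for face in range(1, 7)}
print(is_valid_pmf(fair_die))

# A table whose probabilities do not add up to 1 is rejected
print(is_valid_pmf({0: Fraction(1, 2), 1: Fraction(1, 3)}))
```

Exact fractions are used here so that the sum can be compared with 1 without floating-point tolerance.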

Example 1)
If a coin is tossed twice, determine S and the probability mass function (PMF) in a trial using the number of heads as a random variable.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sympy import *

X=['H','T']
S=np.array([])
for i in X:
    for j in X:
        S=np.append(S,(i, j).count("H"))
S

array([2., 1., 1., 0.])

uni, fre=np.unique(S, return_counts=True)
uni, fre

(array([0., 1., 2.]), array([1, 2, 1]))

p=fre/np.sum(fre)
p

array([0.25, 0.5 , 0.25])

re=pd.DataFrame([uni, fre, p], index=["# of H", "Frequency", "Probability(p)"]).T
re

	# of H	Frequency	Probability(p)
0	0.0	1.0	0.25
1	1.0	2.0	0.50
2	2.0	1.0	0.25

plt.figure(figsize=(6, 4))
plt.bar(uni, p, width=0.8)  # bar heights are probabilities, not raw counts
plt.xticks(uni)
plt.xlabel("# of H", size="13", weight="bold")
plt.ylabel("PMF", size="13", weight="bold")
plt.show()

Figure 1. PMF for the number of heads in two coin tosses.

Example 2)
If two dice are rolled and the sum of the two values is the random variable, determine the range of the random variable and its probability mass function.

x=[1,2,3,4,5,6]
s=np.array([])
for i in x:
    for j in x:
        s=np.append(s, i+j)
s

array([ 2.,  3.,  4.,  5.,  6.,  7.,  3.,  4.,  5.,  6.,  7.,  8.,  4.,
        5.,  6.,  7.,  8.,  9.,  5.,  6.,  7.,  8.,  9., 10.,  6.,  7.,
        8.,  9., 10., 11.,  7.,  8.,  9., 10., 11., 12.])

uni, fre=np.unique(s, return_counts=True)
P=fre*Rational(1, 36)
re=pd.DataFrame([uni, fre, P], index=["Sum", "Frequency", "Probability(p)"]).T
re

	Sum	Frequency	Probability(p)
0	2.0	1	1/36
1	3.0	2	1/18
2	4.0	3	1/12
3	5.0	4	1/9
4	6.0	5	5/36
5	7.0	6	1/6
6	8.0	5	5/36
7	9.0	4	1/9
8	10.0	3	1/12
9	11.0	2	1/18
10	12.0	1	1/36

plt.figure(figsize=(6, 4))
plt.bar(uni, fre/np.sum(fre), width=0.8)  # one bar per sum; heights are probabilities
plt.xticks(uni)
plt.xlabel("Sum of two dice", size="13", weight="bold")
plt.ylabel("PMF", size="13", weight="bold")
plt.show()

Figure 2. PMF for the sum of the values when two dice are rolled in Example 2.

Example 3)
Consider a coin-toss experiment with the following conditions:

The probability of getting heads is p
Random variable Y: the number of tosses needed until heads comes up for the first time

Determine the probability distribution of Y and compute $P(2 \le Y \le 5)$ if $p=\frac{1}{2}$.

p=Rational(1, 2)
Px=np.array([(1-p)**(i-1)*p for i in range(2, 6)])
Px

array([1/4, 1/8, 1/16, 1/32], dtype=object)

P=np.sum(Px)
P
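
The four probabilities in the array above add up to 15/32. Heads first appears on toss k only if the first k-1 tosses are tails, so the distribution of Y is P(Y=k)=(1-p)^(k-1)·p for k=1, 2, …, which is commonly called the geometric distribution. The following is a minimal sketch of the same calculation using only the standard library; the function name `first_head_pmf` is a hypothetical helper, not something defined in the original post.

```python
from fractions import Fraction

def first_head_pmf(k, p):
    """P(Y = k): k - 1 tails followed by one head."""
    return (1 - p) ** (k - 1) * p

p_head = Fraction(1, 2)
prob_2_to_5 = sum(first_head_pmf(k, p_head) for k in range(2, 6))
print(prob_2_to_5)  # 15/32
```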

Cumulative Distribution Function(CDF)

A probability mass function gives the probability at a single point. This is possible for a discrete variable because its values are countable. For a continuous variable, which will be introduced in the next chapter, the probability at any single point is zero, so a probability mass function cannot be defined; the probability density function is used instead. However, probabilities over a range can be calculated for both discrete and continuous variables by accumulating the probabilities up to a given value. This accumulated probability is called the cumulative distribution function (CDF).

The cumulative distribution function (CDF) of the random variable X is defined as follows.

$$F(x)=P(X \le x), \quad \forall x \in \mathbb{R} $$

Example 4)
If the random variable X is the number of heads in the trial of tossing two coins, then $R_X=\{0, 1, 2\}$. Determine the cumulative distribution function in this case.

To enumerate all outcomes of this trial, the product() function of the itertools module can be used in place of nested for loops.

import itertools as itt
case=list(itt.product(range(2), repeat=2))
case

[(0, 0), (0, 1), (1, 0), (1, 1)]

Rx=np.array([i[0]+i[1] for i in case])
Rx

array([0, 1, 1, 2])

uni, fre=np.unique(Rx, return_counts=True)
uni

array([0, 1, 2])

pmf=fre*Rational(1, 4)
pmf

array([1/4, 1/2, 1/4], dtype=object)

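The above results are summarized as the following probability mass function.

$$f(x)=P(X=x)=\begin{cases}\frac{1}{4}&\quad \text{for}\, x=0\\ \frac{1}{2}&\quad \text{for}\, x=1\\ \frac{1}{4}&\quad \text{for}\, x=2\\ 0&\quad \text{otherwise} \end{cases}$$

Accumulating the probability mass function over each range of x gives the cumulative distribution function.

$$F(x)=P(X \le x)=\begin{cases}0&\quad \text{for}\, x<0\\ \frac{1}{4}&\quad \text{for}\, 0 \le x<1\\ \frac{3}{4}&\quad \text{for}\, 1 \le x<2\\ 1&\quad \text{for}\, x \ge 2 \end{cases}$$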

In the following code, the cumsum() function of the numpy module is used to calculate the cumulative sum of objects.

cdf=np.cumsum(pmf)
cdf

array([1/4, 3/4, 1], dtype=object)
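
The cumulative sums above give F(x) only at the points 0, 1 and 2, while a CDF can be evaluated at any real x. The following is a minimal sketch of such an evaluation for the two-coin example, assuming a sorted support; the names `support`, `pmf_vals`, `cdf_vals` and `cdf_at` are illustrative, and `bisect_right` is a swapped-in standard-library lookup rather than the post's numpy approach.

```python
from bisect import bisect_right
from fractions import Fraction
from itertools import accumulate

support = [0, 1, 2]                                        # sorted values of X
pmf_vals = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 4)]
cdf_vals = list(accumulate(pmf_vals))                      # [1/4, 3/4, 1]

def cdf_at(x):
    """F(x) = P(X <= x) for any real x."""
    idx = bisect_right(support, x)   # number of support points <= x
    return cdf_vals[idx - 1] if idx > 0 else Fraction(0)

for x in (-1, 0, 0.5, 1, 1.9, 2, 3):
    print(x, cdf_at(x))
```

The function is constant between support points and jumps at each of them, which is the step shape expected of a discrete CDF.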

plt.step(uni, cdf, where="post", label="CDF")  # a CDF jumps at each value and stays flat until the next
plt.scatter(uni, pmf, label="PMF", color="red")
plt.xlabel("Rx", size='13', weight="bold")
plt.ylabel("Probability", size='13', weight="bold")
plt.legend(loc="best")
plt.show()

Figure 3. PMF and CDF for two coins in example 4.

Bernoulli and the binomial probability distribution

A probability distribution with only two outcomes in one trial, success or failure (1 or 0), is called the **Bernoulli distribution**. That is, the random variable takes only two values. The probability mass function of this distribution can be expressed as Equation 1.
$$\begin{equation}\tag{1} f(x)=P(X=x)=\begin{cases}p&\quad \text{for}\, x=1\\1-p&\quad \text{for}\, x=0\\ 0&\quad \text{otherwise} \end{cases} \end{equation}$$
The above probability mass function (PMF) is determined by one parameter, the probability p. Therefore, this distribution can be expressed as
$$X \sim \text{Bernoulli(p)} $$
**Example 5)**
If one die is rolled, define the random variable as follows:
$$\begin{aligned}&\text{1 or 3 is rolled: x=1}\\ &\text{Any other value: x=0}\end{aligned}$$
What are the probability mass function (PMF) and E(x) of this distribution?
$$\begin{aligned} &f(x)=\left(\frac{1}{3}\right)^x\left(\frac{2}{3}\right)^{1-x}\\ &\begin{aligned}E(x)&=1 \cdot \frac{1}{3}+0 \cdot \frac{2}{3}\\&=\frac{1}{3}\end{aligned} \end{aligned}$$
Statistics such as PMF, CDF, expected value (mean) and variance of the Bernoulli distribution can be calculated using the `bernoulli()` class of the scipy.stats module.

from scipy import stats

np.around(stats.bernoulli.pmf(1, 1/3), 3)

0.333

np.around(stats.bernoulli.stats(1/3, moments="mv"), 3)

array([0.333, 0.222])
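
The two values returned above agree with the closed forms E(X)=p and Var(X)=p(1-p). The following is a minimal sketch that computes both directly from the PMF of Example 5 with exact fractions; the variable names are illustrative only.

```python
from fractions import Fraction

p_success = Fraction(1, 3)                      # P(x = 1) in Example 5
bern_pmf = {1: p_success, 0: 1 - p_success}

# Mean and variance taken directly from their definitions
mean = sum(x * px for x, px in bern_pmf.items())
var = sum((x - mean) ** 2 * px for x, px in bern_pmf.items())
print(mean, var)  # 1/3 2/9
```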

The binomial distribution is the probability distribution of the number of successes when the above Bernoulli trial is repeated independently several times. For example, if a coin is tossed 3 times and the number of heads is the random variable X, the range of the random variable is $R_x=\{0,\, 1,\, 2, \, 3\}$. In this case, if p is the probability of coming up heads, the probability for each value can be expressed as
$$\begin{aligned} &f(0)=P(X=0)=P( \text{TTT})=(1-p)^3\\ &f(1)=P(X=1)=P( \text{TTH or THT or HTT})=3(1-p)^2 p\\ &f(2)=P(X=2)=P( \text{THH or HHT or HTH})=3(p^2(1-p))\\ &f(3)=P(X=3)=P(\text{HHH})=p^3 \end{aligned}$$
Generalizing the above with the combination formula, the probability mass function of the binomial distribution is defined as Equation 2.
$$\begin{align}\tag{2} &f(k)=P(X=k)=\binom{n}{k}p^k(1-p)^{n-k}\\ & k= 0,\, 1, \,2, \,\cdots,\, n \end{align}$$
In case of n trials, the probability of success k times is calculated in the same way as above. In other words, the binomial distribution is characterized by the number of trials n and the probability p. That is, this distribution, along with its parameters, is briefly expressed as:
$$X \sim \text{B(n, p)} $$
The probability mass function of the binomial distribution can be calculated using the `pmf(k, n, p)` method of `scipy.stats.binom`.
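
Equation 2 can also be written out by hand, which makes it easy to compare with the three-toss enumeration above. The following is a minimal sketch using `math.comb` and exact fractions; `binom_pmf` is a hypothetical helper name and is not the scipy implementation.

```python
from fractions import Fraction
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p), written directly from Equation 2."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

p_head = Fraction(1, 2)
three_tosses = [binom_pmf(k, 3, p_head) for k in range(4)]
print(three_tosses)       # probabilities of 0, 1, 2, 3 heads
print(sum(three_tosses))  # the probabilities sum to 1
```

With p=1/2 the values are 1/8, 3/8, 3/8 and 1/8, matching (1-p)³, 3(1-p)²p, 3p²(1-p) and p³.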

**Example 6)**
Visualize the binomial distribution of B(10, 0.3) and B(20, 0.6).

px=[stats.binom.pmf(i, 10, 0.3) for i in range(21)]
py=[stats.binom.pmf(i, 20, 0.6) for i in range(21)]
# Px holds x = 0..10 of B(10, 0.3); Py holds py[10:21], i.e. x = 10..20 of B(20, 0.6)
data=pd.DataFrame([range(11), px[:11], py[10:21]]).T
data.columns=["x","Px","Py"]
data

	x	Px	Py
0	0.0	0.028248	0.117142
1	1.0	0.121061	0.159738
2	2.0	0.233474	0.179706
3	3.0	0.266828	0.165882
4	4.0	0.200121	0.124412
5	5.0	0.102919	0.074647
6	6.0	0.036757	0.034991
7	7.0	0.009002	0.012350
8	8.0	0.001447	0.003087
9	9.0	0.000138	0.000487
10	10.0	0.000006	0.000037

plt.figure(figsize=(10, 5))
plt.subplots_adjust(wspace=0.5)
plt.subplot(1,2,1)
plt.bar(range(21), px, label='B(10, 0.3)')
plt.xlabel("x", size=13, weight="bold")
plt.ylabel("PMF", size=13, weight="bold")
plt.legend(loc='best')
plt.subplot(1,2,2)
plt.bar(range(21), py, label='B(20, 0.6)')
plt.xlabel("x", size=13, weight="bold")
plt.ylabel("PMF", size=13, weight="bold")
plt.legend(loc='best')
plt.show()

Figure 4. Changes in the binomial distribution according to parameters.

As Figure 4 shows, the shape of the binomial distribution changes with the parameters n and p. A binomial variable is the sum of repeated Bernoulli trials. In other words, a single coin toss follows a Bernoulli distribution with Bernoulli(p) as the probability mass function. If this trial is repeated independently n times, the sum of the Bernoulli outcomes follows the binomial distribution B(n, p).

If X₁, X₂, …, Xₙ are independent random variables that each follow Bernoulli(p), then their sum (X=X₁+X₂+ … +Xₙ) follows the binomial distribution B(n, p). It follows from this definition that a new random variable can be created by adding two independent binomial variables with the same probability p.
$$\begin{align}&X \sim B(n, p) \\ &Y \sim B(m, p)\end{align}$$
The sum Z = X + Y of two independent binomial variables (X and Y) is a new random variable with the following distribution.
$$Z \sim B(n+m, p)$$
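
This additivity can be checked numerically by convolving the two probability mass functions. The following is a minimal sketch under the assumption that X and Y are independent; the helper names `binom_pmf` and `pmf_of_sum` and the parameter values n=3, m=4, p=3/10 are illustrative choices, not taken from the post.

```python
from fractions import Fraction
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def pmf_of_sum(n, m, p):
    """PMF of Z = X + Y for independent X ~ B(n, p) and Y ~ B(m, p), by convolution."""
    pmf_z = [Fraction(0)] * (n + m + 1)
    for i in range(n + 1):
        for j in range(m + 1):
            pmf_z[i + j] += binom_pmf(i, n, p) * binom_pmf(j, m, p)
    return pmf_z

n_x, n_y, prob = 3, 4, Fraction(3, 10)
convolved = pmf_of_sum(n_x, n_y, prob)
direct = [binom_pmf(k, n_x + n_y, prob) for k in range(n_x + n_y + 1)]
print(convolved == direct)
```

Because exact fractions are used, the comparison is an equality test rather than an approximate one.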
**Example 7)**
If the defective rate is 0.01 among the products of a certain factory, what is the probability that at least 2 defective products will be included if 30 products are randomly selected from among the products?

The range of the random variable for this event is $R_x=\{0,\, 1, \,2, \, \cdots, \, 30\}$, and X can be expressed as the following binomial distribution:
$$X \sim B(30, 0.01)$$
The probability that two or more defective items are included is calculated with the complement rule as follows.
$$P(X \ge 2) = 1 -P(X \le 1)$$
This can be computed with either the probability mass function or the cumulative distribution function, as shown in the following code.

pthan2=1-sum([stats.binom.pmf(i, 30, 0.01) for i in range(2)])
round(pthan2, 4)

0.0361

round(1-stats.binom.cdf(1, 30, 0.01), 4)

0.0361

**Example 8)**
A die is rolled three times, and the prize money is determined by the number of rolls on which a specified number appears.

Rounds	0	1	2	3
prize	-1000	1000	2000	3000

Determine the expected value of this game.

This trial can be expressed as a binomial distribution as
$$\begin{align} &X \sim B\left(3, \frac{1}{6} \right) \\ & f(x)=P(X=x)=\binom{3}{x}\left(\frac{1}{6}\right)^x\left(\frac{5}{6} \right)^{3-x}, \quad x=0,\;1,\;2,\;3 \end{align}$$
P=np.array([-1000, 1000, 2000, 3000])
pmf=[stats.binom.pmf(i, 3, 1/6) for i in range(4)]
np.around(pmf, 3)

array([0.579, 0.347, 0.069, 0.005])

Ex=np.sum(P*pmf)
round(Ex,3)

-78.704

As a result, a loss of approximately \$79 is expected.
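
The same expected value can be obtained exactly, without floating-point rounding. The following is a minimal sketch using exact fractions in place of scipy; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

p_hit = Fraction(1, 6)                       # chance of the specified number on one roll
prizes = [-1000, 1000, 2000, 3000]           # prize for 0, 1, 2, 3 matching rolls
probs = [comb(3, k) * p_hit**k * (1 - p_hit) ** (3 - k) for k in range(4)]

expected = sum(prize * prob for prize, prob in zip(prizes, probs))
print(expected, float(expected))
```

The exact result is -2125/27, which rounds to the -78.704 shown above.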

**Example 9)**
In daily trading of a stock, the probability that the closing price rises by more than 1% relative to the opening price is 0.45. Suppose you buy the stock at the opening price on 10 days and sell it at the closing price whenever it has risen by more than 1%. How many times will this trade be completed on average? Also, what is the probability that the trade is completed 5 or more times out of 10?
In this case, the random variable X is the number of completed trades out of 10. $$\begin{align}&R_x=\{0,\,1,\,2,\, \cdots, \,10\} \\ &X \sim B(10, 0.45) \end{align}$$
This distribution is shown in Figure 5.

plt.figure(figsize=(6, 4))
p=[stats.binom.pmf(i, 10, 0.45) for i in range(11)]
plt.bar(range(11), p, label='B(10, 0.45)')
plt.xlabel("x", size=13, weight="bold")
plt.ylabel("PMF", size=13, weight="bold")
plt.legend(loc='best')
plt.show()

Figure 5. Binomial distribution with B(10, 0.45).

Ex=sum([i*stats.binom.pmf(i, 10, 0.45) for i in range(11)])
round(Ex, 3)

4.5

On average, this trade is completed 4.5 times out of 10.

p4=stats.binom.cdf(4, 10, 0.45)
pthan5=1-p4
round(pthan5,3)

0.496

The expected value and variance of the binomial distribution can be derived with the **moment generating function ($M_X(t)$)**. The nth moment of a random variable is equal to the nth derivative of its moment generating function evaluated at t=0. The moment generating function of the binomial distribution is derived as follows; the expected value, i.e., the mean, is the first moment, $E(X)=M_X'(0)$.
$$\begin{align}&\begin{aligned}M_X(t)&=E(e^{tX})\\&=\sum^n_{x=0} e^{tx}\binom{n}{x} p^x(1-p)^{n-x}\\&=\sum^n_{x=0} \binom{n}{x} (pe^t)^x(1-p)^{n-x}\\&=(pe^t+1-p)^n\end{aligned}\\&\because\; (a+b)^n=\sum^n_{x=0} \binom{n}{x}a^x b^{n-x} \end{align}$$
The variance is the second moment minus the square of the first moment, where the second moment is the second derivative of the moment generating function at t=0.
$$\sigma^2=\frac{d^2 M_X(t)}{dt^2}\bigg|_{t=0}-\left[E(X)\right]^2$$
It is calculated by applying the `diff()` function of the sympy module.

t, p, n=symbols("t p n")
M=(p*exp(t)+1-p)**n
Md1=M.diff(t)
Md1
$\quad \displaystyle \color{blue}{\frac{n p \left(p e^{t} - p + 1\right)^{n} e^{t}}{p e^{t} - p + 1}}$
E=Md1.subs(t, 0)
E
$\quad \color{blue}{\text{np}}$
Md2=Md1.diff(t)
simplify(Md2)
$\quad \color{blue}{\displaystyle \frac{n p \left(p \left(n - 1\right) \left(p e^{t} - p + 1\right)^{n + 1} e^{t} + \left(p e^{t} - p + 1\right)^{n + 2}\right) e^{t}}{\left(p e^{t} - p + 1\right)^{3}}}$
Md20=Md2.subs(t, 0)
Md20
$\quad \color{blue}{\displaystyle n^{2} p^{2} - n p^{2} + n p}$
var=Md20-E**2
var
$\quad \color{blue}{\displaystyle - n p^{2} + n p}$
Summarizing the above results, the expected value and variance of the binomial distribution are as shown in Equation 3.
$$\begin{align}\tag{3}&E(x)=np\\&Var(X)=np(1-p) \end{align}$$
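
Equation 3 can be cross-checked against a direct summation over the PMF. The following is a minimal sketch for B(10, 0.45), the distribution used in Example 9, with exact fractions; the variable names are illustrative and the check covers this one parameter choice only.

```python
from fractions import Fraction
from math import comb

n_trials, p_success = 10, Fraction(9, 20)    # B(10, 0.45)
probs = [comb(n_trials, k) * p_success**k * (1 - p_success) ** (n_trials - k)
         for k in range(n_trials + 1)]

mean = sum(k * pk for k, pk in enumerate(probs))              # first moment
second_moment = sum(k**2 * pk for k, pk in enumerate(probs))  # E(X^2)
var = second_moment - mean**2

print(mean == n_trials * p_success)                           # np
print(var == n_trials * p_success * (1 - p_success))          # np(1-p)
```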
**Example 10)**
If a student randomly selects an answer to each of 15 questions, each with five choices, what is the probability of getting 5 or more correct answers? Also, what are the expected value and variance?

It is the binomial probability with probability p=$\displaystyle \frac{1}{5}$ of getting a correct answer to a single problem.
$$\begin{align} &R_x=\{0,\, 1,\, 2,\, 3,\, \cdots, \, 15 \} \\& f(x)=\binom{15}{x}\left(\frac{1}{5} \right)^x \left(\frac{4}{5} \right)^{15-x}\end{align}$$
Use the sympy module to calculate directly from the above probability mass function.

n, x=symbols("n x")
f=binomial(n, x)*(1/5)**x*(4/5)**(n-x)
Fless4=[f.subs({n:15, x:i}) for i in range(5)]
Fless4

[0.0351843720888320, 0.131941395333120, 0.230897441832960, 0.250138895319040, 0.187604171489280]

Fless4_np=np.array(Fless4, dtype=np.float64)
np.around(Fless4_np, 3)

array([0.035, 0.132, 0.231, 0.25 , 0.188])

Fmore5=1-np.sum(Fless4_np)
Fmore5

0.16423372393676727

The above code uses the sympy package. As a result, the data types of Fless4 and its elements are as follows.

type(Fless4), type(Fless4[0])

(list, sympy.core.numbers.Float)

That is, Fless4 is a Python list, and each of its elements is sympy's own Float type. This type differs from numpy's native numeric types, so an array built from the list without conversion would have dtype=object, and numpy functions may not work on it as expected. For this reason, dtype=np.float64 is specified when creating Fless4_np in the code above.

In addition to the direct calculation of the code above, the `scipy.stats.binom.cdf()` function can be applied.

Fthan5=1-stats.binom.cdf(4, 15, 1/5)
round(Fthan5, 3)

0.164

The mean and variance of this distribution are

mu=15*1/5
var=15*(1/5)*(1-1/5)
print("mean:{:.3f}, variance:{:.3f}".format(mu, var))

mean:3.000, variance:2.400

mu, var=stats.binom.stats(15, 1/5, moments="mv")
print("mean:{:.3f}, variance:{:.3f}".format(mu, var))

mean:3.000, variance:2.400
