Independence and Conditional Probability
Independence
Events whose intersection is the empty set are called mutually exclusive (disjoint) events. For example, when a single die is rolled, the outcomes 1 and 2 are mutually exclusive because they cannot both occur on the same roll. On the other hand, the events "1" and "an odd number" can occur at the same time because 1 is odd, so they are not mutually exclusive. Mutual exclusivity should not be confused with independence: two events are independent when the occurrence of one does not change the probability of the other, that is, when $P(A \cap B) = P(A)P(B)$.
Calculating the probability of a union of mutually exclusive events is relatively easy. For example, the probability of rolling a 1 or a 2 with a single die is the sum of the two probabilities.
$$P(1 \, \text{or} \, 2) =P(1)+P(2)= \frac{1}{6}+\frac{1}{6}=\frac{1}{3}$$
In contrast, if events A and B are not mutually exclusive, the sum is modified as follows.
$$\begin{aligned}&P(A\; \text{or} \;B) = P(A)+ P(B) - P(A\; \text{and} \;B)\\ & \text{or}\\ &P(A \cup B) = P(A)+P(B) - P(A \cap B) \end{aligned}$$
If two events $E_1$ and $E_2$ are mutually exclusive, the probability that either occurs is simply the sum of the two probabilities.
$$P(E_1 \;\text{or} \; E_2) =P(E_1 \cup E_2)= P(E_1) + P(E_2)$$
Extending the above equation, the probability of a union of two or more mutually exclusive events is calculated as in Equation 1.
$$\begin{equation}\tag{1} \begin{aligned}&P(E_1 \;\text{or}\; E_2 \;\text{or}\; E_3 \cdots \;\text{or}\; E_n)\\&\quad =P(E_1 \cup E_2 \cup E_3 \cdots \cup E_n)\\ &\quad = P(E_1) + P(E_2) +P(E_3)+ \cdots +P(E_n) \end{aligned}\end{equation}$$
If the events overlap, the common parts between the events must be taken into account. Therefore, Equation 1 becomes Equation 2.
$$\begin{equation} \tag{2} \begin{aligned} &P(E_1 \;\text{or}\; E_2 \;\text{or}\; E_3 \cdots \;\text{or}\; E_n) \\&=P(E_1 \cup E_2 \cup E_3 \cdots \cup E_n)\\&= P(E_1) + P(E_2) +P(E_3)+ \cdots +P(E_n)\\&\quad -\sum_{i<j} P(E_i \cap E_j) + \sum_{i<j<k} P(E_i \cap E_j \cap E_k) - \cdots \\& \quad +(-1)^{n+1} P(E_1 \cap E_2 \cap E_3 \cdots \cap E_n) \end{aligned} \end{equation}$$
In probability and statistics, 'or' means 'union' and 'and' means 'intersection'.
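The addition rules above can be checked by enumerating a single die. The following is a minimal sketch, assuming a fair die; the helper `prob` and the events `low` and `mult3` are illustrative names that do not appear in the original text.

```python
from fractions import Fraction

# Sample space of one fair die; every outcome is assumed equally likely.
S = set(range(1, 7))

def prob(event):
    """Probability of an event (a subset of S) under equal likelihood."""
    return Fraction(len(event & S), len(S))

one, two = {1}, {2}
odd = {1, 3, 5}
low = {1, 2, 3}      # hypothetical event: face is 3 or less
mult3 = {3, 6}       # hypothetical event: face is a multiple of 3

# Mutually exclusive events: the intersection is empty, so probabilities add.
p_one_or_two = prob(one) + prob(two)

# Overlapping events: the intersection is subtracted once.
p_one_or_odd = prob(one) + prob(odd) - prob(one & odd)

# Three overlapping events: subtract pairwise intersections,
# then add back the triple intersection.
p_three = (prob(odd) + prob(low) + prob(mult3)
           - prob(odd & low) - prob(odd & mult3) - prob(low & mult3)
           + prob(odd & low & mult3))

print(p_one_or_two, p_one_or_odd, p_three)
```

Each computed value can be compared with the probability of the union obtained directly, for example `prob(odd | low | mult3)`.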
Example 1)
Determine the probability of the outcome (3, 1, 5) when three dice of different colors are rolled.
The three dice are independent of one another. The sample space has 6 × 6 × 6 = 216 elements, which can be generated as follows.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Sample space: every ordered triple of faces for the three dice
rng = range(1, 7)
s = np.array([(i, j, k) for i in rng for j in rng for k in rng])
s[:3, :]
array([[1, 1, 1], [1, 1, 2], [1, 1, 3]])
len(s)
216
In the sample space, the outcome (3, 1, 5) occurs only once.
trg = np.array([[3, 1, 5]])
# Keep the rows that match the target on each die in turn
x = s[np.where(s[:, 0] == trg[0, 0])]
for i in range(1, s.shape[1]):
    x = x[np.where(x[:, i] == trg[0, i])]
x
array([[3, 1, 5]])
The probability of this event is:
from sympy import Rational
# Exact fraction: matching outcomes / size of the sample space
p = Rational(x.shape[0], s.shape[0])
p
The result of the above code agrees with the multiplication rule for independent events:
$$P(3 \cap 1 \cap 5) = \frac{1}{6} \cdot \frac{1}{6} \cdot \frac{1}{6}=\frac{1}{216}$$
Conditional Probability
The following data come from a survey of whether children entered college immediately after high school graduation, classified by whether their parents graduated from college.
C / P | Pyes | Pno | total |
---|---|---|---|
Cyes | 231 | 214 | 445 |
Cno | 49 | 298 | 347 |
total | 280 | 512 | 792 |
P:Parent, C:Children |
# Counts from Table 1, with row and column totals appended
d = pd.DataFrame([[231, 214], [49, 298]])
drsum = d.sum(axis=1)
dT = pd.concat([d, drsum], axis=1)
dTcsum = dT.sum(axis=0)
dT = pd.concat([dT, pd.DataFrame(dTcsum).T], axis=0)
dT.columns = ["Pyes", "Pno", "total"]
dT.index = ['Cyes', 'Cno', 'total']
dT
Pyes | Pno | total | |
---|---|---|---|
Cyes | 231 | 214 | 445 |
Cno | 49 | 298 | 347 |
total | 280 | 512 | 792 |
From the above data, try to determine the following:
- What is the probability that a child whose parents hold a college degree goes to college?
- What is the probability that the parents hold a college degree, given that the student did not go to college immediately after high school graduation?
Table 1 is a cross table showing the parent and student variables together. The last row and last column of this table show the totals for the parent variable alone and the student variable alone, respectively. The probabilities corresponding to these single variables are called marginal probabilities. For example, the probability of yes for the student variable can be calculated as follows.
$$P(\text{C}_{\text{yes}})=\frac{n(\text{C}_{\text{yes}})}{n(\text{total})}=\frac{445}{792}$$
In Table 1, the cells other than the margins involve both the parent and student variables, and the corresponding probabilities are called joint probabilities. For the above example, a joint probability is asked for in problem 3 below.
- What is the probability that a student goes to college right after graduating from high school and has parents without a degree?
In probability and statistics, ',' is often used as shorthand for 'and': $$P(\text{C}_{\text{yes}} \; \text{and} \; \text{P}_{\text{no}}) =P(\text{C}_{\text{yes}}, \text{P}_{\text{no}})$$
All the joint and marginal probabilities can be obtained at once by dividing every frequency in Table 1 by the grand total.
# Divide every count by the grand total (bottom-right cell)
PdT = dT / dT.iloc[2, 2]
np.around(PdT, 2)
Pyes | Pno | total | |
---|---|---|---|
Cyes | 0.29 | 0.27 | 0.56 |
Cno | 0.06 | 0.38 | 0.44 |
total | 0.35 | 0.65 | 1.00 |
From the table above, what information could we use to estimate whether there is a link between a parent's degree and a child's college entrance right after high school?
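- Probability that the parents hold a degree, among students entering college: $$P(\text{P}_\text{yes} | \text{C}_\text{yes})=\frac{P(\text{P}_\text{yes} \cap \text{C}_\text{yes})}{P(\text{C}_\text{yes})}=\frac{0.29}{0.56} \approx 0.52$$
- Probability that a student enters college, among students whose parents hold a degree: $$P(\text{C}_\text{yes} | \text{P}_\text{yes})=\frac{P(\text{C}_\text{yes} \cap \text{P}_\text{yes})}{P(\text{P}_\text{yes})}=\frac{0.29}{0.35} \approx 0.83$$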
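The conditional probabilities in the list above can also be computed directly from the counts in Table 1, which avoids rounding error from the two-decimal table. This is a minimal sketch; the dictionary `counts` and the helpers `marginal` and `conditional` are illustrative names, not part of the original code.

```python
from fractions import Fraction

# Counts copied from Table 1, keyed by (child label, parent label).
counts = {
    ("Cyes", "Pyes"): 231, ("Cyes", "Pno"): 214,
    ("Cno", "Pyes"): 49, ("Cno", "Pno"): 298,
}
total = sum(counts.values())

def marginal(label):
    """Marginal probability of one label, e.g. P(Cyes)."""
    return Fraction(sum(n for key, n in counts.items() if label in key), total)

def conditional(target, condition):
    """P(target | condition) = P(target and condition) / P(condition)."""
    joint = Fraction(
        sum(n for key, n in counts.items() if target in key and condition in key),
        total,
    )
    return joint / marginal(condition)

p_parent_given_child = conditional("Pyes", "Cyes")   # degree among college entrants
p_child_given_parent = conditional("Cyes", "Pyes")   # college among degree holders
p_parent_given_no = conditional("Pyes", "Cno")       # second question posed earlier

print(float(p_parent_given_child), float(p_child_given_parent), float(p_parent_given_no))
```

The exact fractions may differ slightly in the second decimal place from values obtained with the rounded table entries.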
A probability computed under a given condition, as above, is called a conditional probability. In the second case above, the condition that the parent has a degree is the basis for the calculation, so its probability goes in the denominator. The conditional probability is written with "|", as in Equation 3.
$$\begin{equation} \tag{3} P(\text{target} | \text{condition}) = \frac{P(\text{target} \,\cap\, \text{condition})}{P(\text{condition})} \end{equation}$$
Therefore, in the above case,
$$P(\text{C}_\text{yes} | \text{P}_\text{yes}) = \frac{P(\text{C}_\text{yes} \cap \text{P}_\text{yes})}{P(\text{P}_\text{yes})}$$
The above conditional probability calculation can be generalized as in Equation 4.
$$\begin{equation}\tag{4} \begin{aligned}&P(A \,|\, B)=\frac{P(A\, \cap\, B)}{P(B)}\\ & P(A \,\cap\, B)=P(A\, |\, B)P(B) \end{aligned} \end{equation}$$
Example 2)
Table 2 shows the number of cancer diagnoses by sex in two cities, A and B. Determine the probability that a person diagnosed in city A is male.
Sex / City | A | B | total |
---|---|---|---|
Male | 23876 | 739 | 24615 |
Female | 58302 | 5558 | 63860 |
total | 82178 | 6297 | 88475 |
To calculate the probability corresponding to each value in the table above, the data are entered as a DataFrame.
# Counts from Table 2, with row and column totals appended
d = pd.DataFrame([[23876, 739], [58302, 5558]])
drsum = d.sum(axis=1)
dT = pd.concat([d, drsum], axis=1)
dTcsum = dT.sum(axis=0)
dT = pd.concat([dT, pd.DataFrame(dTcsum).T], axis=0)
dT.columns = ["A", "B", "Rtotal"]
dT.index = ['M', 'F', 'Ctotal']
dT
A | B | Rtotal | |
---|---|---|---|
M | 23876 | 739 | 24615 |
F | 58302 | 5558 | 63860 |
Ctotal | 82178 | 6297 | 88475 |
# Joint and marginal probabilities, rounded to two decimals
PdT = np.around(dT / dT.iloc[2, 2], 2)
PdT
A | B | Rtotal | |
---|---|---|---|
M | 0.27 | 0.01 | 0.28 |
F | 0.66 | 0.06 | 0.72 |
Ctotal | 0.93 | 0.07 | 1.00 |
What is the probability that a diagnosed person is male given city A, P(M|A), in the table above?
$$P(M|A)=\frac{P(M \cap A)}{P(A)}$$
# P(M and A) divided by the marginal P(A)
P_MA = PdT.iloc[0, 0] / PdT.iloc[2, 0]
np.around(P_MA, 2)
0.29
If A1, A2, …, Ak are mutually exclusive outcomes that together cover every possible result of a trial, and B is the condition, then the conditional probabilities of the outcomes given B sum to 1, as in Equation 5.
$$\begin{equation}\tag{5} P(A_1|B)+P(A_2|B)+\cdots+P(A_k|B)=1 \end{equation}$$
A conditional probability in one direction can be used to calculate the conditional probability in the other direction.
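Equation 5 can be illustrated with the counts in Table 1, taking "parent has a degree" as the condition B and the two student outcomes as A1 and A2. This is a small sketch with illustrative variable names.

```python
from fractions import Fraction

# Counts from the Pyes column of Table 1 (condition: parent has a degree).
c_yes_and_p_yes = 231
c_no_and_p_yes = 49
p_yes_total = c_yes_and_p_yes + c_no_and_p_yes

# Conditional probabilities of the two mutually exclusive student outcomes.
p_cyes_given_pyes = Fraction(c_yes_and_p_yes, p_yes_total)
p_cno_given_pyes = Fraction(c_no_and_p_yes, p_yes_total)

# Their sum is 1 because the two outcomes cover every case under the condition.
total = p_cyes_given_pyes + p_cno_given_pyes
print(p_cyes_given_pyes, p_cno_given_pyes, total)  # 33/40 7/40 1
```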
Example 3)
Table 3 shows the production rate and defective product rate of factories A, B, and C, which produce lamps for a certain company.
factory | production rate, P(factory) | defective product rate, P(D given factory) |
---|---|---|
A | 0.35 | 0.015 |
B | 0.35 | 0.01 |
C | 0.3 | 0.02 |
P(factory): share of total production, D: defective product |
From Table 3, the probability that a randomly selected defective product was produced in factory C is
$$P(C|D)=\frac{P(C \cap D)}{P(D)}$$
Since $P(C \cap D)$ is not given directly, it is obtained from the defective product rate in the table using the general formula for conditional probability.
$$P(D|C)=\frac{P(D \cap C)}{P(C)}$$
Intersection is commutative, that is, $P(D \cap C)=P(C \cap D)$, so it is calculated as follows.
$$P(C \cap D)=P(D|C)P(C)=0.020 \cdot 0.3=0.006$$
P(D) is the sum of the probabilities of a defective product coming from each factory.
$$\begin{aligned}P(D)&=P(D \cap A) +P(D \cap B)+P(D \cap C)\\ &=P(D|A)P(A)+P(D|B)P(B)+P(D|C)P(C)\\ &=0.015 \cdot 0.35 +0.01 \cdot 0.35 +0.02 \cdot 0.30 \\ &=0.01475 \end{aligned}$$
Therefore, P(C|D) is calculated as:
$$\begin{aligned}P(C|D)&=\frac{P(C \cap D)}{P(D)}\\&=\frac{0.006}{0.01475}\\&=0.407 \end{aligned}$$
Applying the same process to the other factories gives:
$$\begin{aligned}P(A|D)&=\frac{P(A \cap D)}{P(D)}\\& =\frac{P(D|A)P(A)}{P(D)}\\&=\frac{0.015 \cdot 0.35}{0.01475}\\&=0.356\\ P(B|D)&=\frac{P(B \cap D)}{P(D)} \\&=\frac{P(D|B)P(B)}{P(D)}\\&= \frac{0.01 \cdot 0.35}{0.01475}\\&=0.237 \end{aligned}$$
In the example above, P(A), P(B), and P(C) are probabilities known before any additional information is obtained. They are called prior probabilities. The conditional probabilities P(A|D), P(B|D), and P(C|D), calculated from these priors, are called posterior probabilities. In other words, a posterior probability is the probability of an event updated from its prior after additional information, here that the product is defective, has been obtained. The generalization of this process is called Bayes' theorem.
If several events (B1, B2, …, Bk) in the sample space S are mutually exclusive and together cover S, then $$S=B_1 \cup B_2 \cup \cdots \cup B_k$$ The probability of every event lies in [0, 1]. In this case, the probability of an event A equals the sum of its joint probabilities with each $B_i$, and each joint probability can be written with a conditional probability. Of course, this requires the $B_i$ to be mutually exclusive. $$\begin{aligned}&A=(A \cap B_1) \cup(A \cap B_2) \cup \cdots \cup(A \cap B_k)\\&\begin{aligned}P(A)&= P(A \cap B_1) + P(A \cap B_2) + \cdots + P(A \cap B_k)\\&=\sum^ k_{i=1} P(A \cap B_i)\\& = \sum^k_{i=1} P(A|B_i)P(B_i) \end{aligned} \end{aligned}$$ From the above relationship, the posterior probability of $B_k$ is calculated as Equation 6.
$$\begin{equation}\tag{6} \begin{aligned} P(B_k|A) &=\frac{P(B_k \cap A)}{P(A)}\\&=\frac{P(A|B_k)P(B_k)}{\sum^{k}_{i=1} P(A|B_i)P(B_i)} \end{aligned} \end{equation}$$
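Equation 6 translates almost directly into code. The sketch below applies it to the factory data of Example 3; the function name `posterior` and the dictionary layout are assumptions made for illustration.

```python
def posterior(priors, likelihoods):
    """Return P(B_i | A) for every i, given P(B_i) and P(A | B_i).

    Both arguments are dictionaries with the same keys.
    """
    # Denominator of Equation 6: total probability of A.
    evidence = sum(priors[k] * likelihoods[k] for k in priors)
    return {k: priors[k] * likelihoods[k] / evidence for k in priors}

# Values from Table 3: production rates and defective product rates.
priors = {"A": 0.35, "B": 0.35, "C": 0.30}
likelihoods = {"A": 0.015, "B": 0.010, "C": 0.020}

post = posterior(priors, likelihoods)
for factory, p in post.items():
    print(factory, round(p, 3))
```

The posterior probabilities sum to 1, which is consistent with Equation 5.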
Example 4)
You want to choose one fuse from a toolbox containing 40 fuses. Of those fuses, 5 are completely defective (D), 10 are partially defective (pD) and last only 1 hour, and the remaining 25 are normal (G). If one fuse is selected, what is the probability that it is normal, given that it is not completely defective?
If the fuses are classified as completely defective (D) and not completely defective (ND), G is a subset of ND. In this classification, the intersection of G and ND is G.
Therefore, the answer in this example is calculated as follows.
$$\begin{aligned}P(G | ND)&=\frac{P(G \cap ND)}{P(ND)}\\ &=\frac{P(G)}{P(ND)}\\ &=\frac{25/40}{35/40}\\&=\frac{5}{7} \end{aligned}$$
Example 5)
Person A has two children and must attend a meeting accompanied by a son. Given that A can attend the meeting, that is, A has at least one son, what is the probability that both children are sons?
- Sample space: S={(b,b), (b,g), (g,b), (g,g)}, b: boy, g: girl
- Event that meets the condition of attendance: A={(b,b), (b,g), (g,b)}
- Target event: B={(b,b)}, so that $P(B|A)=\frac{P(A \cap B)}{P(A)}=\frac{1/4}{3/4}=\frac{1}{3}$
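The sets listed above can be enumerated to evaluate Equation 4 directly. This is a minimal sketch, assuming the four outcomes in S are equally likely; the variable names mirror the list but are otherwise illustrative.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered pairs (first child, second child).
S = set(product("bg", repeat=2))

# Condition: at least one boy, so the parent can attend.
A = {outcome for outcome in S if "b" in outcome}

# Target event: both children are boys.
B = {("b", "b")}

# P(B | A) = P(A and B) / P(A); the common factor 1/len(S) cancels.
p_B_given_A = Fraction(len(A & B), len(A))
print(p_B_given_A)  # 1/3
```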
Example 6)
The following data show the day-over-day change in the closing values of NASDAQ (na) and the Chicago Board Options Exchange Volatility Index (vix) over a period of time, coding an increase as 1 and a decrease as 0. Are the movements of the two indices independent?
Date | na | vix |
---|---|---|
2020-03-03 | 0 | 1 |
2020-03-04 | 1 | 0 |
⁝ | ⁝ | ⁝ |
2021-06-17 | 1 | 0 |
2021-06-18 | 0 | 1 |
The independence of two events can be determined from their intersection. For independent events, the probability of the intersection is the product of the two probabilities. In general, the probability of the intersection is given by the conditional probability (Equation 4), so if the two events are independent, the product of the two probabilities equals the result obtained from the conditional probability, as shown in the following equation.
$$\begin{aligned}&P(A)P(B)=P(A|B)P(B)\\& \rightarrow\; P(\text{na}=1)P(\text{vix}=1)=P(\text{na}=1|\text{vix}=1)P(\text{vix}=1)\end{aligned}$$
Create a cross table from the data above using the pd.crosstab() function.
import FinanceDataReader as fdr
st = pd.Timestamp(2020, 3, 2)
et = pd.Timestamp(2021, 11, 29)
na = fdr.DataReader('IXIC', st, et)['Close']
vix = fdr.DataReader('VIX', st, et)['Close']
# Daily rate of change; a zero change is replaced by the previous value
na1 = na.pct_change()
na1 = na1.replace(0, method="ffill")
na1 = na1.dropna()
na1.head(2)
Date | |
---|---|
2020-03-03 | -0.029948 |
2020-03-04 | 0.038461 |
Name: Close, dtype: float64 |
vix1 = vix.pct_change()
vix1 = vix1.replace(0, method="ffill")
vix1 = vix1.dropna()
vix1.head(2)
Date | |
---|---|
2020-03-03 | 0.101735 |
2020-03-04 | -0.131179 |
Name: Close, dtype: float64 |
# Code a decrease as 0 and an increase as 1
na2 = pd.cut(na1, bins=[-1, 0, 1], labels=[0, 1])
na2[:2]
Date | |
---|---|
2020-03-03 | 0 |
2020-03-04 | 1 |
Categories (2, int64): [0 < 1] |
vix2 = pd.cut(vix1, bins=[-1, 0, 1], labels=(0, 1))
vix2[:2]
Date | |
---|---|
2020-03-03 | 1 |
2020-03-04 | 0 |
Categories (2, int64): [0 < 1] |
# Relative frequencies with marginal totals in the 'All' row and column
ct = pd.crosstab(na2, vix2, rownames=['nasdaq'], colnames=['vix'],
                 margins=True, normalize=True)
ct
vix | 0 | 1 | All |
---|---|---|---|
nasdaq | |||
0 | 0.097506 | 0.308390 | 0.405896 |
1 | 0.462585 | 0.131519 | 0.594104 |
All | 0.560091 | 0.439909 | 1.000000 |
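Compare the product of the marginal probabilities with the joint probability from the cross table above.
# Marginal probabilities of a decrease: P(vix=0) and P(nasdaq=0)
p1vix = ct.iloc[2, 0]
p1na = ct.iloc[0, 2]
# Product of the marginals, the value expected under independence
p1 = np.around(p1na*p1vix, 2)
p1
0.23
# Joint probability P(nasdaq=0, vix=0) read from the cross table
p1naVix = ct.iloc[0, 0]
p1naVix
0.09750566893424037
# The same joint probability written as P(nasdaq=0 | vix=0) P(vix=0)
p1_2 = (p1naVix/p1vix)*p1vix
np.around(p1_2, 2)
0.1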
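The comparison can also be made for all four cells at once by checking each joint probability against the product of its marginals. The sketch below copies the joint probabilities from the printed cross table, so it runs without downloading data; the array names are illustrative.

```python
import numpy as np

# Joint probabilities copied from the printed cross table
# (rows: nasdaq = 0, 1; columns: vix = 0, 1).
joint = np.array([[0.097506, 0.308390],
                  [0.462585, 0.131519]])

p_na = joint.sum(axis=1)    # marginal P(nasdaq = 0), P(nasdaq = 1)
p_vix = joint.sum(axis=0)   # marginal P(vix = 0), P(vix = 1)

# Under independence, every cell would equal the product of its marginals.
expected = np.outer(p_na, p_vix)
diff = joint - expected

print(np.around(expected, 2))
print(np.around(diff, 2))
```

Differences far from zero in every cell indicate that the joint probabilities do not factor into the marginals.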
The fact that the two results p1 and p1_2 differ means that the movements of the two indices are not independent. When two series are dependent in this way, their correlation coefficient is generally not zero (see correlation analysis). The correlation coefficient between the two indices can be calculated using the DataFrame.corr() method.
# Combine the two closing-price series column-wise
data = pd.concat([na, vix], axis=1)
data.columns = ['na', 'vix']
data.head()
na | vix | |
---|---|---|
Date | ||
2020-03-02 | 8952.2 | 33.42 |
2020-03-03 | 8684.1 | 36.82 |
2020-03-04 | 9018.1 | 31.99 |
2020-03-05 | 8738.6 | 39.62 |
2020-03-06 | 8575.6 | 41.94 |
data.corr()
na | vix | |
---|---|---|
na | 1.000000 | -0.820456 |
vix | -0.820456 | 1.000000 |
The object data used in the code above combines the na and vix data column-wise with the pd.concat() function. The correlation analysis of the two variables in data indicates a very strong inverse relationship between them. In other words, since the two variables are not independent, the probability of their intersection must be calculated with the conditional probability, as in Equation 4 and Bayes' theorem, rather than as a simple product of the marginal probabilities.