

Independence and Conditional Probability

contents

  1. Independence and Conditional Probability
    1. Independence
    2. Conditional Probability

Independence and Conditional Probability

Independence

Events whose intersection is the empty set are called mutually exclusive (disjoint) events. For example, when a single die is rolled, the outcomes 1 and 2 are mutually exclusive because they cannot occur together. On the other hand, the outcomes "1" and "an odd number" can occur at the same time, since 1 is itself odd; these two events are therefore not mutually exclusive.

Calculating the probability of a union of mutually exclusive events is relatively easy: since the outcomes 1 and 2 of a single die roll cannot occur together, the probability of rolling a 1 or a 2 is simply the sum of the two probabilities.

$$P(1 \, \text{or} \, 2) =P(1)+P(2)= \frac{1}{6}+\frac{1}{6}=\frac{1}{3}$$

Contrary to the above, if events A and B are not mutually exclusive, the sum is modified as follows.

$$\begin{aligned}&P(A\; \text{or} \;B) = P(A)+ P(B) - P(A\; \text{and} \;B)\\ & \text{or}\\ &P(A \cup B) = P(A)+P(B) - P(A \cap B) \end{aligned}$$
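For example, with a single die the events "rolling a 1" and "rolling an odd number" are not mutually exclusive, so the overlap is subtracted once:

$$P(1 \,\text{or}\, \text{odd}) = P(1) + P(\text{odd}) - P(1 \,\text{and}\, \text{odd}) = \frac{1}{6} + \frac{3}{6} - \frac{1}{6} = \frac{1}{2}$$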
Sum of events
If two events $E_1$ and $E_2$ are mutually exclusive, the probability that either one occurs is simply the sum of the two probabilities. $$P(E_1 \;\text{or} \; E_2) =P(E_1 \cup E_2)= P(E_1) + P(E_2)$$

Extending this, the probability of the union of two or more mutually exclusive events is calculated as in Equation 1.

$$\begin{equation}\tag{1} \begin{aligned}&P(E_1 \;\text{or}\; E_2 \;\text{or}\; E_3 \cdots \;\text{or}\; E_n)\\&\quad =P(E_1 \cup E_2 \cup E_3 \cdots \cup E_n)\\ &\quad = P(E_1) + P(E_2) +P(E_3)+ \cdots +P(E_n) \end{aligned}\end{equation}$$

When the events are not mutually exclusive, the overlaps between them must be accounted for, and Equation 1 becomes Equation 2.

$$\begin{equation} \tag{2} \begin{aligned} &P(E_1 \;\text{or}\; E_2 \;\text{or}\; \cdots \;\text{or}\; E_n) \\&=P(E_1 \cup E_2 \cup \cdots \cup E_n)\\&= \sum_{i=1}^{n} P(E_i) - \sum_{i<j} P(E_i \cap E_j) + \sum_{i<j<k} P(E_i \cap E_j \cap E_k)\\&\quad - \cdots + (-1)^{n+1}P(E_1 \cap E_2 \cap \cdots \cap E_n) \end{aligned} \end{equation}$$
Note
  In probability and statistics, 'or' means 'union' and 'and' means 'intersection'

Example 1)
  Determine the probability of obtaining the outcome (3, 1, 5) when rolling three dice of different colors.
The three dice are independent of one another, so the sample space contains 6 × 6 × 6 = 216 equally likely outcomes, enumerated as follows.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
rng=range(1, 7)
# enumerate the sample space: every ordered outcome (i, j, k) of the three dice
s=np.array([(i, j, k) for i in rng for j in rng for k in rng])
s[:3,:]
array([[1, 1, 1],
       [1, 1, 2],
       [1, 1, 3]])
len(s)
216

In the sample space, the outcome (3, 1, 5) occurs only once.

trg=np.array([[3,1,5]])
# filter the sample space column by column, keeping only rows that match trg
x=s[np.where(s[:,0]==trg[0,0])]
for i in range(1, s.shape[1]):
    x=x[np.where(x[:,i]==trg[0, i])]
x
array([[3, 1, 5]])

The probability of this event is:

from sympy import *
p=Rational(x.shape[0], s.shape[0])
p
$\displaystyle \frac{1}{216}$

The result of the above code agrees with the multiplication rule for independent events:

$$P(3 \cap 1 \cap 5) = \frac{1}{6} \cdot \frac{1}{6} \cdot \frac{1}{6}=\frac{1}{216}$$
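The same value can be obtained directly from the multiplication rule with sympy; a minimal check:

from sympy import Rational
# probability of one specified face on a fair die
p_face=Rational(1, 6)
# the three dice are independent, so the probabilities multiply
p_face**3    # 1/216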

Conditional Probability

The following data come from a survey of whether children enter college immediately after high school graduation, classified by whether or not their parents graduated from college.

Table 1. Parents and college freshmen
C / P   Pyes  Pno  total
Cyes     231  214    445
Cno       49  298    347
total    280  512    792
P: Parent, C: Children
d=pd.DataFrame([[231, 214],[49, 298]])
drsum=d.sum(axis=1)                          # row totals
dT=pd.concat([d, drsum], axis=1)
dTcsum=dT.sum(axis=0)                        # column totals
dT=pd.concat([dT, pd.DataFrame(dTcsum).T], axis=0)
dT.columns=["Pyes","Pno","total"]
dT.index=['Cyes','Cno','total']
dT
      Pyes  Pno  total
Cyes   231  214    445
Cno     49  298    347
total  280  512    792

From the above data, try to determine:

  1. What is the probability that children from parents with degrees will go to college?
$$\begin{aligned}P(\text{C}_{\text{yes}}\; \text{in}\; \text{P}_{\text{yes}}) &= \frac{\text{C}_{\text{yes}}}{\text{P}_{\text{yes}}}\\&=\frac{231}{280}\\&=0.825 \end{aligned} $$
  1. What is the probability that the parents hold college degrees, for students who did not go to college immediately after high school graduation?
$$P(\text{P}_{\text{yes}}\, \text{in}\, \text{C}_{\text{no}}) = \frac{\text{P}_{\text{yes}}}{\text{C}_{\text{no}}}=\frac{49}{347}$$
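Both values can also be read off the dT frame constructed above; a minimal check (the loc labels follow the index and column names assigned earlier):

from sympy import Rational
# college entrants among parents with degrees: 231/280
Rational(int(dT.loc['Cyes','Pyes']), int(dT.loc['total','Pyes']))
# degree-holding parents among students who did not enter college: 49/347
Rational(int(dT.loc['Cno','Pyes']), int(dT.loc['Cno','total']))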

Table 1 is a cross table showing the parent and student variables together. The last row and the last column of this table summarize the parent variable and the student variable alone, respectively. The probabilities corresponding to these single variables are called marginal probabilities. For example, the marginal probability of 'yes' for the freshman variable is calculated as follows.

$$P(\text{C}_{\text{yes}})=\frac{\text{C}_{\text{yes}}}{\text{C}_{\text{total}}}=\frac{445}{792}$$
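More generally, the marginal probabilities of the student variable are the 'total' column divided by the grand total; a short sketch:

# marginal distribution of the student variable C
dT['total']/dT.loc['total','total']    # Cyes ≈ 0.56, Cno ≈ 0.44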

The probabilities in Table 1 that involve both the parent and the student variable at once (every cell except the margins) are called joint probabilities. For this example, a joint probability is calculated as in problem 3 below.

  1. Probability of going to college right after graduating from high school and parents without a degree?
$$P(\text{C}_{\text{yes}} \; \text{and} \; \text{P}_{\text{no}}) = \frac{214}{792} = 0.27$$
Note
In probability and statistics, ',' is commonly used as shorthand for 'and'. $$P(\text{C}_{\text{yes}} \; \text{and} \; \text{P}_{\text{no}}) =P(\text{C}_{\text{yes}}, \text{P}_{\text{no}})$$
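This joint probability can be checked from dT in the same way:

# joint probability P(C_yes, P_no) = 214/792 ≈ 0.27
np.around(dT.loc['Cyes','Pno']/dT.loc['total','total'], 2)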

The entire table can be expressed in terms of probabilities by dividing every cell of Table 1 by the grand total.

PdT=dT/dT.iloc[2,2]
np.around(PdT,2)
Pyes Pno total
Cyes 0.29 0.27 0.56
Cno 0.06 0.38 0.44
total 0.35 0.65 1.00

From the table above, what information could we use to estimate whether there is a link between a parent's degree and a child's college entrance right after high school?

  1. Probability of a parent with a degree among students entering college:
$$\frac{P(\text{P}_\text{yes} \cap \text{C}_\text{yes})}{P(\text{C}_\text{yes})}=\frac{0.29}{0.56}=0.52$$
  2. Probability of students attending college from parents with degrees:
$$\frac{P(\text{C}_\text{yes} \cap \text{P}_\text{yes})}{P(\text{P}_\text{yes})}=\frac{0.29}{0.35}=0.82$$
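The same two ratios can be computed from the normalized table PdT (the exact values are 231/445 and 231/280, so any small differences come only from the rounded display above):

# P(P_yes | C_yes): joint cell over the C_yes margin
p1=PdT.loc['Cyes','Pyes']/PdT.loc['Cyes','total']
# P(C_yes | P_yes): joint cell over the P_yes margin
p2=PdT.loc['Cyes','Pyes']/PdT.loc['total','Pyes']
np.around(p1, 2), np.around(p2, 2)    # approximately (0.52, 0.82)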

A probability calculated under a given condition, as above, is called a conditional probability. The condition, here the event that the parent holds a degree, supplies the denominator of the calculation. Conditional probability is written with "|" as in Equation 3.

$$\begin{equation} \tag{3} P(\text{target} | \text{condition}) = \frac{P(\text{target} \,\cap\, \text{condition})}{P(\text{condition})} \end{equation}$$

Therefore, in the above case

$$P(\text{C}_\text{yes} | \text{P}_\text{yes}) = \frac{P(\text{C}_\text{yes} \cap \text{P}_\text{yes})}{P(\text{P}_\text{yes})}$$

The above conditional probability calculation process can be generalized as in Equation 4.

$$\begin{equation}\tag{4} \begin{aligned}&P(A \,|\, B)=\frac{P(A\, \cap\, B)}{P(B)}\\ & P(A \,\cap\, B)=P(A\, |\, B)P(B) \end{aligned} \end{equation}$$

Example 2)
  Table 2 shows the results of cancer diagnoses by sex in two cities. Determine the probability that a diagnosis from city A is for a male.

Table 2. Cancer diagnosis results
Sex / City      A      B   total
Male        23876    739   24615
Female      58302   5558   63860
total       82178   6297   88475

To calculate the probability corresponding to each cell of the table above, the data are first organized as a DataFrame.

d=pd.DataFrame([[23876,739],[58302,5558]])
drsum=d.sum(axis=1)
dT=pd.concat([d, drsum], axis=1)
dTcsum=dT.sum(axis=0)
dT=pd.concat([dT,pd.DataFrame(dTcsum).T], axis=0)
dT.columns=["A","B","Rtotal"]
dT.index=['M','F','Ctotal']
dT
A B Rtotal
M 23876 739 24615
F 58302 5558 63860
Ctotal 82178 6297 88475
PdT=np.around(dT/dT.iloc[2,2], 2)
np.around(PdT,2)
A B Rtotal
M 0.27 0.01 0.28
F 0.66 0.06 0.72
Ctotal 0.93 0.07 1.00

From the table above, what is the probability P(M|A) that a diagnosis from city A is for a male?

$$P(M|A)=\frac{P(M \cap A)}{P(A)}$$
P_MA=PdT.iloc[0,0]/PdT.iloc[2,0]   # P(M ∩ A)/P(A)
np.around(P_MA,2)
0.29

If A1, A2, …, Ak are mutually exclusive events that together cover all outcomes of a trial, and B denotes a conditioning event, then the conditional probabilities of the Ai given B sum to one, as in Equation 5.

$$\begin{equation}\tag{5} P(A_1|B)+P(A_2|B)+\cdots+P(A_k|B)=1 \end{equation}$$
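As a quick check with the Example 2 table, the conditional probabilities of the two sexes given city A sum to one:

# P(M|A) + P(F|A) = 1 (up to floating-point rounding)
p_MA=dT.iloc[0,0]/dT.iloc[2,0]
p_FA=dT.iloc[1,0]/dT.iloc[2,0]
p_MA + p_FA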

A specific joint probability can also be recovered from conditional probabilities, as the following example shows.
Example 3)
Table 3 gives the production share and the defective rate of factories A, B, and C, which produce lamps for a certain company.

Table 3. Data on production by plant
factory   P(factory)   P(D | factory)
A         0.35         0.015
B         0.35         0.01
C         0.30         0.02
P(factory): production share, D: defective product

From Table 3, the probability that a randomly selected defective product was produced in factory C is

$$P(C|D)=\frac{P(C \cap D)}{P(D)}$$

Since $P(C \cap D)$ is not given directly, it is obtained from the defective rate in the table using the general multiplication rule of conditional probability.

$$P(D|C)=\frac{P(D \cap C)}{P(C)}$$

Intersection is commutative, that is, $P(D \cap C)=P(C \cap D)$, so it is calculated as follows.

$$P(C \cap D)=P(D|C)P(C)=0.020 \cdot 0.3=0.006$$

P(D) is equal to the sum of the probabilities of defective products at each plant.

$$\begin{aligned}P(D)&=P(D \cap A) +P(D \cap B)+P(D \cap C)\\ &=P(D|A)P(A)+P(D|B)P(B)+P(D|C)P(C)\\ &=0.015 \cdot 0.35 +0.01 \cdot 0.35 +0.02 \cdot 0.30 \\ &=0.01475 \end{aligned}$$

Therefore, it is calculated as:

$$\begin{aligned}P(C|D)&=\frac{P(C \cap D)}{P(D)}\\&=\frac{0.006}{0.01475}\\&=0.407 \end{aligned}$$

Applying the same process to the other factories gives:

$$\begin{aligned}P(A|D)&=\frac{P(A \cap D)}{P(D)}\\& =\frac{P(D|A)P(A)}{P(D)}\\&=\frac{0.015 \cdot 0.35}{0.01475}\\&=0.356\\ P(B|D)&=\frac{P(B \cap D)}{P(D)} \\&=\frac{P(D|B)P(B)}{P(D)}\\&= \frac{0.01 \cdot 0.35}{0.01475}\\&=0.237 \end{aligned}$$
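The whole calculation can be reproduced with a few lines of Python; the names below (prior, p_def, post) are only illustrative, and the numbers are those of Table 3:

# production shares (priors) and defective rates from Table 3
prior={'A': 0.35, 'B': 0.35, 'C': 0.30}
p_def={'A': 0.015, 'B': 0.01, 'C': 0.02}
# total probability of a defective lamp: P(D) = sum of P(D|factory)P(factory)
p_D=sum(p_def[k]*prior[k] for k in prior)
# posterior P(factory|D) by Bayes' theorem
post={k: round(p_def[k]*prior[k]/p_D, 3) for k in prior}
p_D, post    # approximately (0.01475, {'A': 0.356, 'B': 0.237, 'C': 0.407})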

In the example above, P(A), P(B), and P(C) are probabilities available before any additional information is obtained; they are called prior probabilities. The conditional probabilities P(A|D), P(B|D), and P(C|D) computed from them are called posterior probabilities: the probabilities of the events after additional information, such as the defective rates, has been taken into account. The generalization of this process is Bayes' theorem.

Bayes' theorem

If the events $B_1, B_2, \cdots, B_k$ are mutually exclusive and together cover the sample space S (that is, they form a partition of S), then $$S=B_1 \cup B_2 \cup \cdots \cup B_k$$ and every probability lies in [0, 1]. Any event A can then be decomposed over the partition, and its total probability is the sum of its joint probabilities with each $B_i$ (the law of total probability). $$\begin{aligned}&A=(A \cap B_1) \cup(A \cap B_2) \cup \cdots \cup(A \cap B_k)\\&\begin{aligned}P(A)&= P(A \cap B_1) + P(A \cap B_2) + \cdots + P(A \cap B_k)\\&=\sum^ k_{i=1} P(A \cap B_i)\\& = \sum^k_{i=1} P(A|B_i)P(B_i) \end{aligned} \end{aligned}$$ From this relationship, the posterior probability of $B_k$ given A is calculated as Equation 6.

$$\begin{align}\tag{6} P(B_k|A) &=\frac{P(B_k \cap A)}{P(A)}\\&=\frac{P(A|B_k)P(B_k)}{\sum^{k}_{i=1} P(A|B_i)P(B_i)} \end{align}$$

Example 4)
  You want to choose one fuse from a toolbox containing 40 fuses. Of these, 5 are completely defective (D), 10 are partially defective (pD), lasting only one hour, and the remaining 25 (G) are normal. If one fuse is selected, what is the probability that it is a normal one, given that it is not completely defective?
If the fuses are classified as completely defective (D) or not completely defective (ND), then G is contained in ND, so the intersection of G and ND is simply G.

$$P(G \cap ND) =P(G)$$

Therefore, the answer in this example is calculated as follows.

$$\begin{aligned}P(G | ND)&=\frac{P(G \cap ND)}{P(ND)}\\ &=\frac{P(G)}{P(ND)}\\ &=\frac{25/40}{35/40}\\&=\frac{5}{7} \end{aligned}$$
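A quick check with sympy:

from sympy import Rational
# P(G|ND) = P(G)/P(ND)
Rational(25, 40)/Rational(35, 40)    # 5/7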

Example 5)
  A has two children, and a son must accompany him to a meeting. Given that he is able to attend the meeting, that is, at least one of the children is a boy, what is the probability that both children are boys?

  1. Sample Space: S={(b,b), (b,g), (g, b), (g,g)}, b:boy, g: girl
  2. Any event that meets the conditions of attendance: A={(b,b),(b,g), (g, b)}
  3. target event: B={(b,b)}
$$\begin{align}P(B|A)&=\frac{P(B \cap A)}{P(A)}\\&=\frac{\frac{1}{4}}{\frac{3}{4}}\\&=\frac{1}{3} \end{align}$$
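The same answer follows from enumerating the sample space directly; a minimal sketch:

# two-child sample space; condition on at least one boy (can attend the meeting)
S=[(c1, c2) for c1 in 'bg' for c2 in 'bg']
A=[e for e in S if 'b' in e]            # at least one boy
B=[e for e in S if e==('b', 'b')]       # both boys (B is a subset of A)
len(B)/len(A)                           # 1/3 ≈ 0.333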

Example 6)
  The following data record the day-to-day changes in the closing prices of the NASDAQ composite index (na) and the Chicago Board Options Exchange Volatility Index (vix) over a period of time, with an increase coded as 1 and a decrease as 0. Are the movements of the two series independent?

Date         na  vix
2020-03-03    0    1
2020-03-04    1    0
2021-06-17    1    0
2021-06-18    0    1

Whether the two series are independent can be judged from the joint probability of their intersection: in general the joint probability equals a conditional probability times a marginal probability, and for independent events it also equals the simple product of the two marginal probabilities. Independence therefore requires the following equality to hold.

$$\begin{align}&P(A \cap B)=P(A|B)P(B), \qquad P(A \cap B)=P(A)P(B) \;\;\text{if } A \text{ and } B \text{ are independent}\\& \rightarrow\; P(\text{na}=0 \,\cap\, \text{vix}=0)=P(\text{na}=0)\,P(\text{vix}=0)\,? \end{align}$$

A crosstabulation of the two coded series is created with the pd.crosstab() function.

import FinanceDataReader as fdr
st=pd.Timestamp(2020, 3, 2)
et=pd.Timestamp(2021, 11,29)
na=fdr.DataReader('IXIC', st, et)['Close']
vix=fdr.DataReader('VIX', st, et)['Close']
na1=na.pct_change()
na1=na1.replace(0, method="ffill")
na1=na1.dropna()
na1.head(2)
Date
2020-03-03   -0.029948
2020-03-04    0.038461
Name: Close, dtype: float64
vix1=vix.pct_change()
vix1=vix1.replace(0, method="ffill")
vix1=vix1.dropna()
vix1.head(2)
Date
2020-03-03    0.101735
2020-03-04   -0.131179
Name: Close, dtype: float64
na2=pd.cut(na1, bins=[-1, 0, 1], labels=[0, 1])   # decrease → 0, increase → 1
na2[:2]
Date
2020-03-03    0
2020-03-04    1
Categories (2, int64): [0 < 1]
vix2=pd.cut(vix1, bins=[-1, 0, 1], labels=(0, 1))
vix2[:2]
Date
2020-03-03    1
2020-03-04    0
Categories (2, int64): [0 < 1]
ct=pd.crosstab(na2, vix2, rownames=['nasdaq'], colnames=['vix'], margins=True, normalize=True)
ct
vix 0 1 All
nasdaq
0 0.097506 0.308390 0.405896
1 0.462585 0.131519 0.594104
All 0.560091 0.439909 1.000000

Now compute the relevant probabilities from the cross table above.

# marginal probabilities of a decrease (category 0) in each series
p1na=ct.iloc[0, 2]    # P(nasdaq=0), from the 'All' column
p1vix=ct.iloc[2, 0]   # P(vix=0), from the 'All' row
p1=np.around(p1na*p1vix, 2)   # product of the two marginals
p1
0.23
# joint probability P(nasdaq=0 and vix=0), read directly from the crosstab
p1_2=ct.iloc[0, 0]
p1_2
0.09750566893424037
np.around(p1_2, 2)
0.1

The fact that the two results p1 and p1_2 differ means that the two series are not independent. Since they are not independent, the correlation coefficient between the two data sets is also expected to be nonzero (see correlation analysis). The correlation coefficient can be calculated with the DataFrame object's corr() method.

data=pd.concat([na, vix], axis=1)
data.columns=['na', 'vix']
data.head()
na vix
Date
2020-03-02 8952.2 33.42
2020-03-03 8684.1 36.82
2020-03-04 9018.1 31.99
2020-03-05 8738.6 39.62
2020-03-06 8575.6 41.94
data.corr()
na vix
na 1.000000 -0.820456
vix -0.820456 1.000000

The object data used in the code above combines the na and vix series with the pd.concat() function. The correlation analysis of the two variables in data indicates a very strong inverse relationship between them. In other words, since the two variables are not independent, the probability of their intersection must be calculated with the multiplication rule of conditional probability (and, more generally, Bayes' theorem) rather than as a simple product of marginal probabilities.
