

Independence and Conditional Probability

contents

  1. Independence and Conditional Probability
    1. Independence
    2. Conditional Probability

Independence and Conditional Probability

Independence

Events whose intersection is the empty set are called mutually exclusive (disjoint) events. For example, when a single die is rolled, the outcomes 1 and 2 are mutually exclusive because they cannot occur together. On the other hand, the outcomes "1" and "an odd number" can occur at the same time, since 1 is itself odd; these two events are therefore not mutually exclusive.

Calculating the probability of a union of mutually exclusive events is relatively easy: since the outcomes 1 and 2 of a single die roll cannot occur together, the probability of rolling a 1 or a 2 is simply the sum of the two probabilities.

$$P(1 \, \text{or} \, 2) =P(1)+P(2)= \frac{1}{6}+\frac{1}{6}=\frac{1}{3}$$

Contrary to the above, if events A and B are not mutually exclusive, the sum is modified as follows.

$$\begin{aligned}&P(A\; \text{or} \;B) = P(A)+ P(B) - P(A\; \text{and} \;B)\\ & \text{or}\\ &P(A \cup B) = P(A)+P(B) - P(A \cap B) \end{aligned}$$
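For example, with a single die the events "rolling a 1" and "rolling an odd number" are not mutually exclusive, so the overlap is subtracted once:

$$P(1 \,\text{or}\, \text{odd}) = P(1) + P(\text{odd}) - P(1 \,\text{and}\, \text{odd}) = \frac{1}{6} + \frac{3}{6} - \frac{1}{6} = \frac{1}{2}$$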
Sum of events
If two events $E_1$ and $E_2$ are mutually exclusive, the probability that either one occurs is simply the sum of the two probabilities. $$P(E_1 \;\text{or} \; E_2) =P(E_1 \cup E_2)= P(E_1) + P(E_2)$$

Extending this, the probability of the union of two or more mutually exclusive events is calculated as in Equation 1.

$$\begin{equation}\tag{1} \begin{aligned}&P(E_1 \;\text{or}\; E_2 \;\text{or}\; E_3 \cdots \;\text{or}\; E_n)\\&\quad =P(E_1 \cup E_2 \cup E_3 \cdots \cup E_n)\\ &\quad = P(E_1) + P(E_2) +P(E_3)+ \cdots +P(E_n) \end{aligned}\end{equation}$$

When the events are not mutually exclusive, the overlaps between them must be accounted for, and Equation 1 becomes Equation 2.

$$\begin{equation} \tag{2} \begin{aligned} &P(E_1 \;\text{or}\; E_2 \;\text{or}\; \cdots \;\text{or}\; E_n) \\&=P(E_1 \cup E_2 \cup \cdots \cup E_n)\\&= \sum_{i=1}^{n} P(E_i) - \sum_{i<j} P(E_i \cap E_j) + \sum_{i<j<k} P(E_i \cap E_j \cap E_k)\\&\quad - \cdots + (-1)^{n+1}P(E_1 \cap E_2 \cap \cdots \cap E_n) \end{aligned} \end{equation}$$
Note
  In probability and statistics, 'or' means 'union' and 'and' means 'intersection'

Example 1)
  Determine the probability of obtaining the outcome (3, 1, 5) when rolling three dice of different colors.
The three dice are independent of one another, so the sample space contains 6 × 6 × 6 = 216 equally likely outcomes, enumerated as follows.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
rng=range(1, 7)
# enumerate the sample space: every ordered outcome (i, j, k) of the three dice
s=np.array([(i, j, k) for i in rng for j in rng for k in rng])
s[:3,:]
array([[1, 1, 1],
       [1, 1, 2],
       [1, 1, 3]])
len(s)
216

In the sample space, the outcome (3, 1, 5) occurs only once.

trg=np.array([[3,1,5]])
# filter the sample space column by column, keeping only rows that match trg
x=s[np.where(s[:,0]==trg[0,0])]
for i in range(1, s.shape[1]):
    x=x[np.where(x[:,i]==trg[0, i])]
x
array([[3, 1, 5]])

The probability of this event is:

from sympy import *
p=Rational(x.shape[0], s.shape[0])
p
$\displaystyle \frac{1}{216}$

The result of the above code agrees with the multiplication rule for independent events:

$$P(3 \cap 1 \cap 5) = \frac{1}{6} \cdot \frac{1}{6} \cdot \frac{1}{6}=\frac{1}{216}$$
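The same value can be obtained directly from the multiplication rule with sympy; a minimal check:

from sympy import Rational
# probability of one specified face on a fair die
p_face=Rational(1, 6)
# the three dice are independent, so the probabilities multiply
p_face**3    # 1/216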

Conditional Probability

The following data come from a survey of whether children enter college immediately after high school graduation, classified by whether or not their parents graduated from college.

Table 1. Parents and college freshmen
C / P   Pyes  Pno  total
Cyes     231  214    445
Cno       49  298    347
total    280  512    792
P: Parent, C: Children
d=pd.DataFrame([[231, 214],[49, 298]])
drsum=d.sum(axis=1)                          # row totals
dT=pd.concat([d, drsum], axis=1)
dTcsum=dT.sum(axis=0)                        # column totals
dT=pd.concat([dT, pd.DataFrame(dTcsum).T], axis=0)
dT.columns=["Pyes","Pno","total"]
dT.index=['Cyes','Cno','total']
dT
      Pyes  Pno  total
Cyes   231  214    445
Cno     49  298    347
total  280  512    792

From the above data, try to determine:

  1. What is the probability that children from parents with degrees will go to college?
$$\begin{aligned}P(\text{C}_{\text{yes}}\; \text{in}\; \text{P}_{\text{yes}}) &= \frac{\text{C}_{\text{yes}}}{\text{P}_{\text{yes}}}\\&=\frac{231}{280}\\&=0.825 \end{aligned} $$
  1. What is the probability that the parents hold college degrees, for students who did not go to college immediately after high school graduation?
$$P(\text{P}_{\text{yes}}\, \text{in}\, \text{C}_{\text{no}}) = \frac{\text{P}_{\text{yes}}}{\text{C}_{\text{no}}}=\frac{49}{347}$$
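Both values can also be read off the dT frame constructed above; a minimal check (the loc labels follow the index and column names assigned earlier):

from sympy import Rational
# college entrants among parents with degrees: 231/280
Rational(int(dT.loc['Cyes','Pyes']), int(dT.loc['total','Pyes']))
# degree-holding parents among students who did not enter college: 49/347
Rational(int(dT.loc['Cno','Pyes']), int(dT.loc['Cno','total']))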

Table 1 is a cross table showing the parent and student variables together. The last row and the last column of this table summarize the parent variable and the student variable alone, respectively. The probabilities corresponding to these single variables are called marginal probabilities. For example, the marginal probability of 'yes' for the freshman variable is calculated as follows.

$$P(\text{C}_{\text{yes}})=\frac{\text{C}_{\text{yes}}}{\text{C}_{\text{total}}}=\frac{445}{792}$$
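More generally, the marginal probabilities of the student variable are the 'total' column divided by the grand total; a short sketch:

# marginal distribution of the student variable C
dT['total']/dT.loc['total','total']    # Cyes ≈ 0.56, Cno ≈ 0.44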

The probabilities in Table 1 that involve both the parent and the student variable at once (every cell except the margins) are called joint probabilities. For this example, a joint probability is calculated as in problem 3 below.

  1. Probability of going to college right after graduating from high school and parents without a degree?
$$P(\text{C}_{\text{yes}} \; \text{and} \; \text{P}_{\text{no}}) = \frac{214}{792} = 0.27$$
Note
In probability and statistics, ',' is commonly used as shorthand for 'and'. $$P(\text{C}_{\text{yes}} \; \text{and} \; \text{P}_{\text{no}}) =P(\text{C}_{\text{yes}}, \text{P}_{\text{no}})$$
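This joint probability can be checked from dT in the same way:

# joint probability P(C_yes, P_no) = 214/792 ≈ 0.27
np.around(dT.loc['Cyes','Pno']/dT.loc['total','total'], 2)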

The entire table can be expressed in terms of probabilities by dividing every cell of Table 1 by the grand total.

PdT=dT/dT.iloc[2,2]
np.around(PdT,2)
Pyes Pno total
Cyes 0.29 0.27 0.56
Cno 0.06 0.38 0.44
total 0.35 0.65 1.00

From the table above, what information could we use to estimate whether there is a link between a parent's degree and a child's college entrance right after high school?

  1. Probability of a parent with a degree among students entering college:
$$\frac{P(\text{P}_\text{yes} \cap \text{C}_\text{yes})}{P(\text{C}_\text{yes})}=\frac{0.29}{0.56}=0.52$$
  2. Probability of students attending college from parents with degrees:
$$\frac{P(\text{C}_\text{yes} \cap \text{P}_\text{yes})}{P(\text{P}_\text{yes})}=\frac{0.29}{0.35}=0.82$$
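The same two ratios can be computed from the normalized table PdT (the exact values are 231/445 and 231/280, so any small differences come only from the rounded display above):

# P(P_yes | C_yes): joint cell over the C_yes margin
p1=PdT.loc['Cyes','Pyes']/PdT.loc['Cyes','total']
# P(C_yes | P_yes): joint cell over the P_yes margin
p2=PdT.loc['Cyes','Pyes']/PdT.loc['total','Pyes']
np.around(p1, 2), np.around(p2, 2)    # approximately (0.52, 0.82)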

A probability calculated under a given condition, as above, is called a conditional probability. The condition, here the event that the parent holds a degree, supplies the denominator of the calculation. Conditional probability is written with "|" as in Equation 3.

$$\begin{equation} \tag{3} P(\text{target} | \text{condition}) = \frac{P(\text{target} \,\cap\, \text{condition})}{P(\text{condition})} \end{equation}$$

Therefore, in the above case

$$P(\text{C}_\text{yes} | \text{P}_\text{yes}) = \frac{P(\text{C}_\text{yes} \cap \text{P}_\text{yes})}{P(\text{P}_\text{yes})}$$

The above conditional probability calculation process can be generalized as in Equation 4.

$$\begin{equation}\tag{4} \begin{aligned}&P(A \,|\, B)=\frac{P(A\, \cap\, B)}{P(B)}\\ & P(A \,\cap\, B)=P(A\, |\, B)P(B) \end{aligned} \end{equation}$$

Example 2)
  Table 2 shows the results of cancer diagnoses by sex in two cities. Determine the probability that a diagnosis from city A is for a male.

Table 2. Cancer diagnosis results
Sex / City      A      B   total
Male        23876    739   24615
Female      58302   5558   63860
total       82178   6297   88475

To calculate the probability corresponding to each cell of the table above, the data are first organized as a DataFrame.

d=pd.DataFrame([[23876,739],[58302,5558]])
drsum=d.sum(axis=1)
dT=pd.concat([d, drsum], axis=1)
dTcsum=dT.sum(axis=0)
dT=pd.concat([dT,pd.DataFrame(dTcsum).T], axis=0)
dT.columns=["A","B","Rtotal"]
dT.index=['M','F','Ctotal']
dT
A B Rtotal
M 23876 739 24615
F 58302 5558 63860
Ctotal 82178 6297 88475
PdT=np.around(dT/dT.iloc[2,2], 2)
np.around(PdT,2)
A B Rtotal
M 0.27 0.01 0.28
F 0.66 0.06 0.72
Ctotal 0.93 0.07 1.00

From the table above, what is the probability P(M|A) that a diagnosis from city A is for a male?

$$P(M|A)=\frac{P(M \cap A)}{P(A)}$$
P_MA=PdT.iloc[0,0]/PdT.iloc[2,0]   # P(M ∩ A)/P(A)
np.around(P_MA,2)
0.29

If A1, A2, …, Ak are mutually exclusive events that together cover all outcomes of a trial, and B denotes a conditioning event, then the conditional probabilities of the Ai given B sum to one, as in Equation 5.

$$\begin{equation}\tag{5} P(A_1|B)+P(A_2|B)+\cdots+P(A_k|B)=1 \end{equation}$$
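As a quick check with the Example 2 table, the conditional probabilities of the two sexes given city A sum to one:

# P(M|A) + P(F|A) = 1 (up to floating-point rounding)
p_MA=dT.iloc[0,0]/dT.iloc[2,0]
p_FA=dT.iloc[1,0]/dT.iloc[2,0]
p_MA + p_FA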

A specific joint probability can also be recovered from conditional probabilities, as the following example shows.
Example 3)
Table 3 gives the production share and the defective rate of factories A, B, and C, which produce lamps for a certain company.

Table 3. Data on production by plant
factory   P(factory)   P(D | factory)
A         0.35         0.015
B         0.35         0.01
C         0.30         0.02
P(factory): production share, D: defective product

From Table 3, the probability that a randomly selected defective product was produced in factory C is

$$P(C|D)=\frac{P(C \cap D)}{P(D)}$$

Since $P(C \cap D)$ is not given directly, it is obtained from the defective rate in the table using the general multiplication rule of conditional probability.

$$P(D|C)=\frac{P(D \cap C)}{P(C)}$$

Intersection is commutative, that is, $P(D \cap C)=P(C \cap D)$, so it is calculated as follows.

$$P(C \cap D)=P(D|C)P(C)=0.020 \cdot 0.3=0.006$$

P(D) is equal to the sum of the probabilities of defective products at each plant.

$$\begin{aligned}P(D)&=P(D \cap A) +P(D \cap B)+P(D \cap C)\\ &=P(D|A)P(A)+P(D|B)P(B)+P(D|C)P(C)\\ &=0.015 \cdot 0.35 +0.01 \cdot 0.35 +0.02 \cdot 0.30 \\ &=0.01475 \end{aligned}$$

Therefore, it is calculated as:

$$\begin{aligned}P(C|D)&=\frac{P(C \cap D)}{P(D)}\\&=\frac{0.006}{0.01475}\\&=0.407 \end{aligned}$$

Applying the same process to the other factories gives:

$$\begin{aligned}P(A|D)&=\frac{P(A \cap D)}{P(D)}\\& =\frac{P(D|A)P(A)}{P(D)}\\&=\frac{0.015 \cdot 0.35}{0.01475}\\&=0.356\\ P(B|D)&=\frac{P(B \cap D)}{P(D)} \\&=\frac{P(D|B)P(B)}{P(D)}\\&= \frac{0.01 \cdot 0.35}{0.01475}\\&=0.237 \end{aligned}$$
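The whole calculation can be reproduced with a few lines of Python; the names below (prior, p_def, post) are only illustrative, and the numbers are those of Table 3:

# production shares (priors) and defective rates from Table 3
prior={'A': 0.35, 'B': 0.35, 'C': 0.30}
p_def={'A': 0.015, 'B': 0.01, 'C': 0.02}
# total probability of a defective lamp: P(D) = sum of P(D|factory)P(factory)
p_D=sum(p_def[k]*prior[k] for k in prior)
# posterior P(factory|D) by Bayes' theorem
post={k: round(p_def[k]*prior[k]/p_D, 3) for k in prior}
p_D, post    # approximately (0.01475, {'A': 0.356, 'B': 0.237, 'C': 0.407})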

In the example above, P(A), P(B), and P(C) are probabilities available before any additional information is obtained; they are called prior probabilities. The conditional probabilities P(A|D), P(B|D), and P(C|D) computed from them are called posterior probabilities: the probabilities of the events after additional information, such as the defective rates, has been taken into account. The generalization of this process is Bayes' theorem.

Bayes' theorem

If the events $B_1, B_2, \cdots, B_k$ are mutually exclusive and together cover the sample space S (that is, they form a partition of S), then $$S=B_1 \cup B_2 \cup \cdots \cup B_k$$ and every probability lies in [0, 1]. Any event A can then be decomposed over the partition, and its total probability is the sum of its joint probabilities with each $B_i$ (the law of total probability). $$\begin{aligned}&A=(A \cap B_1) \cup(A \cap B_2) \cup \cdots \cup(A \cap B_k)\\&\begin{aligned}P(A)&= P(A \cap B_1) + P(A \cap B_2) + \cdots + P(A \cap B_k)\\&=\sum^ k_{i=1} P(A \cap B_i)\\& = \sum^k_{i=1} P(A|B_i)P(B_i) \end{aligned} \end{aligned}$$ From this relationship, the posterior probability of $B_k$ given A is calculated as Equation 6.

$$\begin{align}\tag{6} P(B_k|A) &=\frac{P(B_k \cap A)}{P(A)}\\&=\frac{P(A|B_k)P(B_k)}{\sum^{k}_{i=1} P(A|B_i)P(B_i)} \end{align}$$

Example 4)
  You want to choose one fuse from a toolbox containing 40 fuses. Of these, 5 are completely defective (D), 10 are partially defective (pD), lasting only one hour, and the remaining 25 (G) are normal. If one fuse is selected, what is the probability that it is a normal one, given that it is not completely defective?
If the fuses are classified as completely defective (D) or not completely defective (ND), then G is contained in ND, so the intersection of G and ND is simply G.

$$P(G \cap ND) =P(G)$$

Therefore, the answer in this example is calculated as follows.

$$\begin{aligned}P(G | ND)&=\frac{P(G \cap ND)}{P(ND)}\\ &=\frac{P(G)}{P(ND)}\\ &=\frac{25/40}{35/40}\\&=\frac{5}{7} \end{aligned}$$
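A quick check with sympy:

from sympy import Rational
# P(G|ND) = P(G)/P(ND)
Rational(25, 40)/Rational(35, 40)    # 5/7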

Example 5)
  A has two children, and a son must accompany him to a meeting. Given that he is able to attend the meeting, that is, at least one of the children is a boy, what is the probability that both children are boys?

  1. Sample Space: S={(b,b), (b,g), (g, b), (g,g)}, b:boy, g: girl
  2. Any event that meets the conditions of attendance: A={(b,b),(b,g), (g, b)}
  3. target event: B={(b,b)}
$$\begin{align}P(B|A)&=\frac{P(B \cap A)}{P(A)}\\&=\frac{\frac{1}{4}}{\frac{3}{4}}\\&=\frac{1}{3} \end{align}$$
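The same answer follows from enumerating the sample space directly; a minimal sketch:

# two-child sample space; condition on at least one boy (can attend the meeting)
S=[(c1, c2) for c1 in 'bg' for c2 in 'bg']
A=[e for e in S if 'b' in e]            # at least one boy
B=[e for e in S if e==('b', 'b')]       # both boys (B is a subset of A)
len(B)/len(A)                           # 1/3 ≈ 0.333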

Example 6)
  The following data record the day-to-day changes in the closing prices of the NASDAQ composite index (na) and the Chicago Board Options Exchange Volatility Index (vix) over a period of time, with an increase coded as 1 and a decrease as 0. Are the movements of the two series independent?

Date         na  vix
2020-03-03    0    1
2020-03-04    1    0
2021-06-17    1    0
2021-06-18    0    1

Whether the two series are independent can be judged from the joint probability of their intersection: in general the joint probability equals a conditional probability times a marginal probability, and for independent events it also equals the simple product of the two marginal probabilities. Independence therefore requires the following equality to hold.

$$\begin{align}&P(A \cap B)=P(A|B)P(B), \qquad P(A \cap B)=P(A)P(B) \;\;\text{if } A \text{ and } B \text{ are independent}\\& \rightarrow\; P(\text{na}=0 \,\cap\, \text{vix}=0)=P(\text{na}=0)\,P(\text{vix}=0)\,? \end{align}$$

A crosstabulation of the two coded series is created with the pd.crosstab() function.

import FinanceDataReader as fdr
st=pd.Timestamp(2020, 3, 2)
et=pd.Timestamp(2021, 11,29)
na=fdr.DataReader('IXIC', st, et)['Close']
vix=fdr.DataReader('VIX', st, et)['Close']
na1=na.pct_change()
na1=na1.replace(0, method="ffill")
na1=na1.dropna()
na1.head(2)
Date
2020-03-03   -0.029948
2020-03-04    0.038461
Name: Close, dtype: float64
vix1=vix.pct_change()
vix1=vix1.replace(0, method="ffill")
vix1=vix1.dropna()
vix1.head(2)
Date
2020-03-03    0.101735
2020-03-04   -0.131179
Name: Close, dtype: float64
na2=pd.cut(na1, bins=[-1, 0, 1], labels=[0, 1])   # decrease → 0, increase → 1
na2[:2]
Date
2020-03-03    0
2020-03-04    1
Categories (2, int64): [0 < 1]
vix2=pd.cut(vix1, bins=[-1, 0, 1], labels=(0, 1))
vix2[:2]
Date
2020-03-03    1
2020-03-04    0
Categories (2, int64): [0 < 1]
ct=pd.crosstab(na2, vix2, rownames=['nasdaq'], colnames=['vix'], margins=True, normalize=True)
ct
vix 0 1 All
nasdaq
0 0.097506 0.308390 0.405896
1 0.462585 0.131519 0.594104
All 0.560091 0.439909 1.000000

Now compute the relevant probabilities from the cross table above.

# marginal probabilities of a decrease (category 0) in each series
p1na=ct.iloc[0, 2]    # P(nasdaq=0), from the 'All' column
p1vix=ct.iloc[2, 0]   # P(vix=0), from the 'All' row
p1=np.around(p1na*p1vix, 2)   # product of the two marginals
p1
0.23
# joint probability P(nasdaq=0 and vix=0), read directly from the crosstab
p1_2=ct.iloc[0, 0]
p1_2
0.09750566893424037
np.around(p1_2, 2)
0.1

The fact that the two results p1 and p1_2 differ means that the two series are not independent. Since they are not independent, the correlation coefficient between the two data sets is also expected to be nonzero (see correlation analysis). The correlation coefficient can be calculated with the DataFrame object's corr() method.

data=pd.concat([na, vix], axis=1)
data.columns=['na', 'vix']
data.head()
na vix
Date
2020-03-02 8952.2 33.42
2020-03-03 8684.1 36.82
2020-03-04 9018.1 31.99
2020-03-05 8738.6 39.62
2020-03-06 8575.6 41.94
data.corr()
na vix
na 1.000000 -0.820456
vix -0.820456 1.000000

The object data used in the code above combines the na and vix series with the pd.concat() function. The correlation analysis of the two variables in data indicates a very strong inverse relationship between them. In other words, since the two variables are not independent, the probability of their intersection must be calculated with the multiplication rule of conditional probability (and, more generally, Bayes' theorem) rather than as a simple product of marginal probabilities.
