Independence and Conditional Probability

contents

  1. Independence and Conditional Probability
    1. Independence
    2. Conditional Probability

Independence and Conditional Probability

Independence

Events whose intersection is the empty set are called mutually exclusive (disjoint) events. For example, when a single die is rolled, the outcomes 1 and 2 are mutually exclusive because they cannot both occur in the same roll. In contrast, the events "1" and "odd number" can occur at the same time because 1 is itself odd, so these events are not mutually exclusive. Note that mutual exclusivity is not the same as statistical independence, which requires P(A∩B) = P(A)P(B); the addition rules below apply to mutually exclusive events.

Calculating the probability of a union of mutually exclusive events is straightforward: it is the sum of the individual probabilities. For example, the probability of rolling a 1 or a 2 with a single die is

$$P(1 \text{ or } 2) = P(1) + P(2) = \frac{1}{6} + \frac{1}{6} = \frac{1}{3}$$

In contrast, if events A and B are not mutually exclusive, their intersection is counted twice in the sum and must be subtracted once.

$$P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B) \quad\text{or}\quad P(A \cup B) = P(A) + P(B) - P(A \cap B)$$
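
The addition rule can be checked by enumerating the faces of a single die. The following is a minimal sketch using only the standard library; the helper name `prob` and the event sets are illustrative assumptions, not part of the original post.

```python
from fractions import Fraction

# Sample space of one fair die
S = set(range(1, 7))

def prob(event):
    """Probability of an event (a subset of S) when all outcomes are equally likely."""
    return Fraction(len(event & S), len(S))

one, two = {1}, {2}
odd = {1, 3, 5}

# Mutually exclusive events: the intersection is empty, so the probabilities add
p_1_or_2 = prob(one | two)
assert p_1_or_2 == prob(one) + prob(two)

# Overlapping events: subtract the intersection that was counted twice
p_1_or_odd = prob(one | odd)
assert p_1_or_odd == prob(one) + prob(odd) - prob(one & odd)

print(p_1_or_2, p_1_or_odd)  # 1/3 1/2
```

Using `Fraction` keeps the results exact, so the two sides of each rule can be compared with `==` rather than with a tolerance.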
Sum of events
If two events E1 and E2 are mutually exclusive, the probability that either one occurs is simply the sum of the two probabilities.

$$P(E_1 \text{ or } E_2) = P(E_1 \cup E_2) = P(E_1) + P(E_2)$$

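Extending the above equation, the probability of the union of two or more mutually exclusive events is calculated as in Equation 1.

$$P(E_1 \text{ or } E_2 \text{ or } \cdots \text{ or } E_n) = P(E_1 \cup E_2 \cup \cdots \cup E_n) = P(E_1) + P(E_2) + P(E_3) + \cdots + P(E_n) \tag{1}$$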

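If the events overlap, the parts they have in common must be taken into account. Equation 1 then becomes Equation 2, in which the intersections are subtracted and added back with alternating signs (inclusion-exclusion).

$$P(E_1 \cup E_2 \cup \cdots \cup E_n) = \sum_{i} P(E_i) - \sum_{i<j} P(E_i \cap E_j) + \sum_{i<j<l} P(E_i \cap E_j \cap E_l) - \cdots + (-1)^{n+1} P(E_1 \cap E_2 \cap \cdots \cap E_n) \tag{2}$$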
Note
  In probability and statistics, 'or' denotes the union (∪) and 'and' denotes the intersection (∩).
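
The inclusion-exclusion idea behind Equation 2 can be sketched for any number of events on a small sample space. The function name `union_prob` and the three example events below are hypothetical choices made for illustration.

```python
from fractions import Fraction
from itertools import combinations

S = set(range(1, 7))  # one fair die

def prob(event):
    """Probability of a subset of S under equally likely outcomes."""
    return Fraction(len(event), len(S))

def union_prob(events):
    """P(E1 ∪ ... ∪ En) by inclusion-exclusion with alternating signs."""
    total = Fraction(0)
    for r in range(1, len(events) + 1):
        sign = 1 if r % 2 == 1 else -1
        for combo in combinations(events, r):
            total += sign * prob(set.intersection(*combo))
    return total

# odd number, at most 3, multiple of 3 (overlapping events)
events = [{1, 3, 5}, {1, 2, 3}, {3, 6}]

direct = prob(set.union(*events))   # probability of the union, counted directly
print(union_prob(events), direct)   # 5/6 5/6
```

For mutually exclusive events every intersection is empty, so all terms after the first sum vanish and the result reduces to Equation 1.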

Example 1)
  Determine the probability of the outcome (3,1,5) when three dice of different colors are rolled.
The three dice are independent of one another. The sample space contains 6 × 6 × 6 = 216 elements, which can be enumerated as follows.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
rng=range(1, 7)
s=np.array([(i, j, k) for i in rng for j in rng for k in rng])
s[:3,:]
array([[1, 1, 1],
       [1, 1, 2],
       [1, 1, 3]])
len(s)
216

In the sample space, the outcome (3,1,5) occurs only once.

trg=np.array([[3,1,5]])
x=s[np.where(s[:,0]==trg[0,0])]
for i in range(1, s.shape[1]):
    x=x[np.where(x[:,i]==trg[0, i])]
x
array([[3, 1, 5]])

The probability of this event is:

from sympy import Rational
p=Rational(x.shape[0], s.shape[0])
p
1/216

This result agrees with the multiplication rule for independent events:

$$P(3 \cap 1 \cap 5) = \frac{1}{6} \cdot \frac{1}{6} \cdot \frac{1}{6} = \frac{1}{216}$$
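
As a cross-check that does not depend on NumPy or SymPy, the multiplication rule and a direct count can be compared with the standard library alone. This is a small sketch; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

faces = range(1, 7)
target = (3, 1, 5)

# Multiplication rule: each die shows its target face with probability 1/6
p_rule = Fraction(1, 6) ** len(target)

# Enumeration: count matching outcomes among all 6**3 ordered triples
outcomes = list(product(faces, repeat=3))
p_enum = Fraction(outcomes.count(target), len(outcomes))

print(p_rule, p_enum)  # 1/216 1/216
```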

Conditional Probability

The following data come from a survey of whether children entered college immediately after high school graduation, broken down by whether their parents graduated from college.

Table 1. Parents and college freshmen
C / P    Pyes   Pno   total
Cyes     231    214   445
Cno      49     298   347
total    280    512   792
P: Parent, C: Children
d=pd.DataFrame([[231, 214],[49, 298]])
drsum=d.sum(axis=1)
dT=pd.concat([d, drsum], axis=1)
dTcsum=dT.sum(axis=0)
dT=pd.concat([dT,pd.DataFrame(dTcsum).T], axis=0)
dT.columns=["Pyes","Pno","total"]
dT.index=['Cyes','Cno','total']
dT
       Pyes  Pno  total
Cyes    231  214    445
Cno      49  298    347
total   280  512    792

From the above data, try to determine:

  1. What is the probability that children of parents with degrees go to college?
$$P(C_{yes} \text{ in } P_{yes}) = \frac{C_{yes} \cap P_{yes}}{P_{yes}} = \frac{231}{280} = 0.825$$
  2. What is the probability that a parent holds a college degree, among students who did not go to college immediately after high school graduation?
$$P(P_{yes} \text{ in } C_{no}) = \frac{P_{yes} \cap C_{no}}{C_{no}} = \frac{49}{347} \approx 0.141$$

Table 1 is a cross table showing the parent and student variables together. The last row and the last column of the table show the totals for the parent variable alone and the student variable alone, respectively. The probabilities corresponding to these single variables are called marginal probabilities. For example, the probability of "yes" for the student variable is calculated as follows.

$$P(C_{yes}) = \frac{C_{yes}}{C_{total}} = \frac{445}{792} \approx 0.56$$

The cells of Table 1 other than the margins involve both the parent and the student variables, and the corresponding probabilities are called joint probabilities. For the above example, a joint probability is calculated as in problem 3.

  3. What is the probability that a student goes to college right after graduating from high school and has parents without a degree?
$$P(C_{yes} \text{ and } P_{no}) = \frac{214}{792} = 0.27$$
Note
In probability and statistics, ',' is used as shorthand for 'and':
$$P(C_{yes} \text{ and } P_{no}) = P(C_{yes}, P_{no})$$

Dividing every cell of Table 1 by the grand total gives the probability for each cell, that is, the relative frequency of every entry.

PdT=dT/dT.iloc[2,2]
np.around(PdT,2)
Pyes Pno total
Cyes 0.29 0.27 0.56
Cno 0.06 0.38 0.44
total 0.35 0.65 1.00

From the table above, what information could we use to assess whether a parent's degree is linked to a child's entering college right after high school?

  1. Probability that a parent has a degree, among students entering college:
$$\frac{P_{yes} \cap C_{yes}}{C_{yes}} = \frac{0.29}{0.56} \approx 0.52$$
  2. Probability that a student attends college, among parents with degrees:
$$\frac{C_{yes} \cap P_{yes}}{P_{yes}} = \frac{0.29}{0.35} \approx 0.83 \quad (0.825 \text{ from the unrounded counts } 231/280)$$

A probability computed under a given condition, as above, is called a conditional probability. In the second case, the condition that the parent has a degree defines the base of the calculation and therefore appears in the denominator. The conditional probability is written with "|", as in Equation 3.

$$P(\text{target} \mid \text{condition}) = \frac{P(\text{target} \cap \text{condition})}{P(\text{condition})} \tag{3}$$

Therefore, in the above case

$$P(C_{yes} \mid P_{yes}) = \frac{P(C_{yes} \cap P_{yes})}{P(P_{yes})}$$

The above conditional probability calculation process can be generalized as in Equation 4.

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} \quad\Longleftrightarrow\quad P(A \cap B) = P(A \mid B)P(B) \tag{4}$$
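
Equation 4 can be applied to every cell of Table 1 at once by dividing the joint probabilities by the appropriate marginal. The sketch below assumes the counts of Table 1; the variable names are illustrative and differ from the `dT` object used earlier.

```python
import pandas as pd

# Counts from Table 1 (rows: child, columns: parent)
counts = pd.DataFrame([[231, 214], [49, 298]],
                      index=["Cyes", "Cno"], columns=["Pyes", "Pno"])

joint = counts / counts.to_numpy().sum()   # joint probabilities P(C ∩ P)
p_parent = joint.sum(axis=0)               # marginal P(P), one value per column
p_child = joint.sum(axis=1)                # marginal P(C), one value per row

# Equation 4: divide each joint probability by the probability of its condition
child_given_parent = joint.div(p_parent, axis=1)   # P(C | P), columns sum to 1
parent_given_child = joint.div(p_child, axis=0)    # P(P | C), rows sum to 1

print(child_given_parent.round(3))
print(parent_given_child.round(3))
```

Here `child_given_parent.loc["Cyes", "Pyes"]` equals 231/280 and `parent_given_child.loc["Cno", "Pyes"]` equals 49/347, matching problems 1 and 2.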

Example 2)
  Table 2 shows cancer diagnosis counts by sex in two cities, A and B. Determine the probability that a person diagnosed in city A is male.

Table 2. Cancer diagnosis results
Sex / City   A       B      total
Male         23876   739    24615
Female       58302   5558   63860
total        82178   6297   88475

To calculate the probability corresponding to each cell, the table is first built as a DataFrame.

d=pd.DataFrame([[23876,739],[58302,5558]])
drsum=d.sum(axis=1)
dT=pd.concat([d, drsum], axis=1)
dTcsum=dT.sum(axis=0)
dT=pd.concat([dT,pd.DataFrame(dTcsum).T], axis=0)
dT.columns=["A","B","Rtotal"]
dT.index=['M','F','Ctotal']
dT
A B Rtotal
M 23876 739 24615
F 58302 5558 63860
Ctotal 82178 6297 88475
PdT=dT/dT.iloc[2,2]  # keep full precision here; round only for display
np.around(PdT,2)
A B Rtotal
M 0.27 0.01 0.28
F 0.66 0.06 0.72
Ctotal 0.93 0.07 1.00

What is the probability that the patient is male given city A, P(M|A)?

$$P(M \mid A) = \frac{P(M \cap A)}{P(A)}$$
P_MA=PdT.iloc[0,0]/PdT.iloc[2,0]
np.around(P_MA,2)
0.29

If A1, A2, …, Ak are mutually exclusive outcomes that together cover all possible results of a trial, and B is a given condition, then the conditional probabilities of the outcomes under that condition sum to 1, as stated in Equation 5.

$$P(A_1 \mid B) + P(A_2 \mid B) + \cdots + P(A_k \mid B) = 1 \tag{5}$$
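
Equation 5 can be checked on Table 2, where Male and Female are the mutually exclusive outcomes and the city is the condition. This sketch works from the raw counts rather than rounded probabilities; the array name `p_sex_given_city` is an illustrative choice.

```python
import numpy as np

# Counts from Table 2 (rows: Male, Female; columns: city A, city B)
counts = np.array([[23876, 739],
                   [58302, 5558]])

# P(sex | city): divide each column by its column total
p_sex_given_city = counts / counts.sum(axis=0)

print(np.round(p_sex_given_city, 4))
print(p_sex_given_city.sum(axis=0))   # each column sums to 1
```

The first entry, P(M|A) = 23876/82178, rounds to 0.29, in agreement with the value obtained above.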

A specific probability can be calculated from probabilities given under a condition.
Example 3)
Table 3 gives the production share and the defect rate of factories A, B, and C, which produce lamps for a certain company.

Table 3. Data on production by plant
factory   P(factory)   P(D|factory)
A         0.35         0.015
B         0.35         0.01
C         0.30         0.02
P(factory): production share, D: defective product

The probability that a randomly selected defective product was produced in factory C is calculated from Table 3, starting from

$$P(C \mid D) = \frac{P(C \cap D)}{P(D)}$$

P(C∩D) is not given directly, so it is obtained from the defect rate in the table using the general formula for conditional probability.

$$P(D \mid C) = \frac{P(D \cap C)}{P(C)}$$

Intersection is commutative, that is, P(D∩C) = P(C∩D), so it is calculated as follows.

$$P(C \cap D) = P(D \mid C)P(C) = 0.020 \cdot 0.3 = 0.006$$

P(D) is equal to the sum of the probabilities of a defective product from each plant.

$$\begin{aligned}
P(D) &= P(D \cap A) + P(D \cap B) + P(D \cap C) \\
&= P(D \mid A)P(A) + P(D \mid B)P(B) + P(D \mid C)P(C) \\
&= 0.015 \cdot 0.35 + 0.01 \cdot 0.35 + 0.02 \cdot 0.30 \\
&= 0.01475
\end{aligned}$$

Therefore, it is calculated as:

$$P(C \mid D) = \frac{P(C \cap D)}{P(D)} = \frac{0.006}{0.01475} = 0.407$$

Applying the same process to the other factories gives:

$$\begin{aligned}
P(A \mid D) &= \frac{P(A \cap D)}{P(D)} = \frac{P(D \mid A)P(A)}{P(D)} = \frac{0.015 \cdot 0.35}{0.01475} = 0.356 \\
P(B \mid D) &= \frac{P(B \cap D)}{P(D)} = \frac{P(D \mid B)P(B)}{P(D)} = \frac{0.01 \cdot 0.35}{0.01475} = 0.237
\end{aligned}$$
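
The three posterior probabilities can be computed together with array arithmetic. This is a minimal sketch using the values of Table 3; the names `prior`, `defect_rate`, and `posterior` are illustrative.

```python
import numpy as np

factories = ["A", "B", "C"]
prior = np.array([0.35, 0.35, 0.30])          # P(factory), from Table 3
defect_rate = np.array([0.015, 0.01, 0.02])   # P(D | factory), from Table 3

joint = defect_rate * prior    # P(D ∩ factory) for each factory
p_defect = joint.sum()         # P(D) by summing over the factories
posterior = joint / p_defect   # P(factory | D)

for name, p in zip(factories, posterior):
    print(name, round(float(p), 3))
```

Because the three factories account for all production, the posterior probabilities sum to 1, which is a convenient sanity check on the arithmetic.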

In the example above, P(A), P(B), and P(C) are probabilities that are known before any additional information is obtained; they are called prior probabilities. The conditional probabilities P(A|D), P(B|D), and P(C|D), calculated from these priors, are called posterior probabilities. In other words, a posterior probability is the probability of an event updated after additional information, here the fact that the product is defective, has been taken into account. The generalization of this process is called Bayes' theorem.

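Bayes' theorem

Suppose the subsets B1, B2, …, Bk of the sample space S are mutually exclusive and together cover S:

$$S = B_1 \cup B_2 \cup \cdots \cup B_k$$

The probability of every event lies in [0, 1]. Any event A can then be split into the disjoint pieces that fall in each Bi, so the total probability of A equals the sum of the conditional probabilities of A under each condition, weighted by the probability of that condition. This requires that the Bi be mutually exclusive.

$$A = (A \cap B_1) \cup (A \cap B_2) \cup \cdots \cup (A \cap B_k)$$

$$P(A) = P(A \cap B_1) + P(A \cap B_2) + \cdots + P(A \cap B_k) = \sum_{i=1}^{k} P(A \cap B_i) = \sum_{i=1}^{k} P(A \mid B_i)P(B_i)$$

From the above relationship, the posterior probability of any Bj (j = 1, …, k) is calculated as in Equation 6.

$$P(B_j \mid A) = \frac{P(B_j \cap A)}{P(A)} = \frac{P(A \mid B_j)P(B_j)}{\sum_{i=1}^{k} P(A \mid B_i)P(B_i)} \tag{6}$$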

Example 4)
  One fuse is to be chosen from a toolbox containing 40 fuses. Of these, 5 are completely defective (D), 10 are partially defective (pD) and last only 1 hour, and the remaining 25 are normal (G). If the selected fuse is not completely defective, what is the probability that it is normal?
If the fuses are classified as completely defective (D) or not completely defective (ND), G is contained in ND. In this classification, the intersection of G and ND is G.

$$P(G \cap ND) = P(G)$$

Therefore, the answer in this example is calculated as follows.

$$P(G \mid ND) = \frac{P(G \cap ND)}{P(ND)} = \frac{P(G)}{P(ND)} = \frac{25/40}{35/40} = \frac{5}{7}$$
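
The same answer can be obtained by restricting the toolbox to the fuses that satisfy the condition and counting within that reduced set. This is a small sketch; the list representation of the toolbox is an assumption made for illustration.

```python
from fractions import Fraction

# Toolbox contents from Example 4
fuses = ["D"] * 5 + ["pD"] * 10 + ["G"] * 25

# Condition: the fuse is not completely defective (ND = pD or G)
nd = [f for f in fuses if f != "D"]
# Target within the condition: the fuse is normal
g_and_nd = [f for f in nd if f == "G"]

p_g_given_nd = Fraction(len(g_and_nd), len(nd))
print(p_g_given_nd)  # 5/7
```

Conditioning therefore amounts to shrinking the sample space from 40 fuses to the 35 that are not completely defective.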

Example 5)
  A has two children and attends a meeting accompanied by a son, so at least one of the children is a boy. Given this, what is the probability that both children are sons?

  1. Sample space: S={(b,b), (b,g), (g,b), (g,g)}, b: boy, g: girl
  2. Event satisfying the attendance condition: A={(b,b), (b,g), (g,b)}
  3. Target event: B={(b,b)}
$$P(B \mid A) = \frac{P(B \cap A)}{P(A)} = \frac{1/4}{3/4} = \frac{1}{3}$$
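
The event A listed above corresponds to the condition "at least one child is a son". A short enumeration, sketched below with illustrative names, confirms the value 1/3 and shows how the answer changes if the condition is instead read as "the younger child is a son".

```python
from fractions import Fraction
from itertools import product

# All equally likely (older, younger) combinations
S = list(product("bg", repeat=2))
both_sons = ("b", "b")

# Condition used in Example 5: at least one son
A = [s for s in S if "b" in s]
p_at_least_one = Fraction(A.count(both_sons), len(A))

# Alternative reading: the younger child specifically is a son
A_younger = [s for s in S if s[1] == "b"]
p_younger = Fraction(A_younger.count(both_sons), len(A_younger))

print(p_at_least_one, p_younger)  # 1/3 1/2
```

The two readings give different answers because they select different subsets of the sample space as the condition.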

Example 6)
  The following data show the day-to-day change in the closing values of the NASDAQ Composite (na) and the Chicago Board Options Exchange Volatility Index (vix) over a period of time, coding an increase as 1 and a decrease as 0. Are the movements of the two indices independent?

Date         na   vix
2020-03-03   0    1
2020-03-04   1    0
2021-06-17   1    0
2021-06-18   0    1

The independence of two events can be checked through their intersection: for independent events, the probability of the intersection equals the product of the two probabilities. In general the intersection satisfies P(A∩B) = P(A|B)P(B), so if the two events are independent, the product of the two marginal probabilities equals this quantity, as shown in the following equation.

$$P(A)P(B) = P(A \mid B)P(B), \qquad P(na=1)P(vix=1) = P(na=1 \mid vix=1)P(vix=1)$$

Create a cross tabulation of the two coded series using the pd.crosstab() function.

import FinanceDataReader as fdr
st=pd.Timestamp(2020, 3, 2)
et=pd.Timestamp(2021, 11,29)
na=fdr.DataReader('IXIC', st, et)['Close']
vix=fdr.DataReader('VIX', st, et)['Close']
na1=na.pct_change()
# Replace zero changes with the previous day's change
# (the method argument of Series.replace is deprecated in recent pandas versions)
na1=na1.mask(na1==0).ffill()
na1=na1.dropna()
na1.head(2)
Date
2020-03-03   -0.029948
2020-03-04    0.038461
Name: Close, dtype: float64
vix1=vix.pct_change()
vix1=vix1.mask(vix1==0).ffill()
vix1=vix1.dropna()
vix1.head(2)
Date
2020-03-03    0.101735
2020-03-04   -0.131179
Name: Close, dtype: float64
# Code a change in (-1, 0] as 0 (decrease) and a change in (0, 1] as 1 (increase)
na2=pd.cut(na1, bins=[-1, 0, 1],labels=[0, 1])
na2[:2]
Date
2020-03-03    0
2020-03-04    1
Categories (2, int64): [0 < 1]
vix2=pd.cut(vix1, bins=[-1, 0, 1], labels=(0, 1))
vix2[:2]
Date
2020-03-03    1
2020-03-04    0
Categories (2, int64): [0 < 1]
ct=pd.crosstab(na2, vix2, rownames=['nasdaq'], colnames=['vix'], margins=True, normalize=True)
ct
vix 0 1 All
nasdaq
0 0.097506 0.308390 0.405896
1 0.462585 0.131519 0.594104
All 0.560091 0.439909 1.000000

Using the cross table above, compare the product of the marginal probabilities with the probability obtained from the conditional probability.

# Marginal probabilities P(vix=0) and P(nasdaq=0)
p_vix0=ct.iloc[2,0]
p_na0=ct.iloc[0,2]
# Product of the marginals, P(nasdaq=0)P(vix=0)
p1=np.around(p_na0*p_vix0,2)
p1
0.23
# Conditional probability P(nasdaq=0 | vix=0)
p_na0_vix0=ct.iloc[0,0]/p_vix0
# P(nasdaq=0 | vix=0)P(vix=0), which equals the joint probability P(nasdaq=0, vix=0)
p1_2=p_na0_vix0*p_vix0
np.around(p1_2, 2)
0.1

The two results p1 and p1_2 differ, which means that the movements of the two indices are not independent. Dependent variables often, though not necessarily, have a nonzero correlation coefficient (see correlation analysis). The correlation coefficient between the two series can be calculated with the DataFrame.corr() method.
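
The comparison can be extended to every cell of the cross table at once: under independence each joint probability would equal the product of its row and column marginals. The sketch below copies the four joint probabilities from the cross table shown above; the names `expected` and `diff` are illustrative. With real data the two tables never match exactly, so this is an informal look at the size of the gap rather than a formal test.

```python
import numpy as np

# Joint probabilities copied from the cross table above
# rows: nasdaq = 0, 1; columns: vix = 0, 1
joint = np.array([[0.097506, 0.308390],
                  [0.462585, 0.131519]])

p_na = joint.sum(axis=1)    # marginal P(nasdaq)
p_vix = joint.sum(axis=0)   # marginal P(vix)

# Under independence every joint cell would equal the product of its marginals
expected = np.outer(p_na, p_vix)
diff = joint - expected

print(np.round(expected, 3))
print(np.round(diff, 3))
```

Every cell deviates from the independence value by about 0.13 in absolute terms, with the same-direction cells (both down or both up) occurring less often than independence would predict.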

data=pd.concat([na, vix], axis=1)
data.columns=['na', 'vix']
data.head()
na vix
Date
2020-03-02 8952.2 33.42
2020-03-03 8684.1 36.82
2020-03-04 9018.1 31.99
2020-03-05 8738.6 39.62
2020-03-06 8575.6 41.94
data.corr()
na vix
na 1.000000 -0.820456
vix -0.820456 1.000000

The object data used in the code above combines the na and vix series with the pd.concat() function. The correlation result indicates a very strong inverse relationship between the two variables. In other words, because the two variables are not independent, the probability of their intersection cannot be obtained as the product of the marginal probabilities; it must be calculated from the conditional probability, P(A∩B) = P(A|B)P(B), the relationship underlying Bayes' theorem.
