Variance
As introduced in descriptive statistics, **variance** measures the variability of data and is calculated as in Equation 1; the square root of the variance is the standard deviation (σ).
$$\begin{equation}\tag{1} \begin{aligned}\sigma^2&=E[(X-\mu)^2]\\&=(x_1-\mu)^2P(X=x_1)+ \cdots+(x_k-\mu)^2P(X=x_k)\\&=\sum^k_{i=1} (x_i-\mu)^2P(X=x_i) \end{aligned} \end{equation}$$Variance, a measure of the spread of a distribution, is the weighted average of the squared deviations between each data value and the mean. Equation 1 simplifies to:
$$\begin{aligned}&\begin{aligned}\sigma^2&=\sum (x-\mu)^2P(X=x)\\&=\sum(x^2-2x\mu+\mu^2)f(x)\\&=\sum x^2f(x) -2\mu \sum xf(x)+ \mu^2\sum f(x)\\&=\sum x^2f(x)-\mu^2\\&=E(X^2)-(E(X))^2 \end{aligned}\\ & \because \; \sum xf(x)=\mu, \quad \sum f(x)=1 \end{aligned}$$As the above expression shows, the variance is built from the expected value of the square of the variable and the square of the mean. The expected value of the kth power of a variable is called the kth moment: E(X) is the first moment and E(X²) is the second moment. Therefore, the variance is the difference between the second moment and the square of the first moment, and since moments are expected values, linear combinations such as Equation 2 hold.
$$\begin{equation}\tag{2} \begin{aligned} Var(aX+b)&=\sigma^2_{aX+b}\\&=E[((aX+b)-\mu_{aX+b})^2]\\ &=E[((aX+b)-E(aX+b))^2]\\&=E[((aX+b)-(aE(X)+b))^2]\\&=E[(a(X-\mu))^2]\\&=a^2E[(X-\mu)^2]\\&=a^2\sigma^2_X \end{aligned} \end{equation}$$As Equation 2 shows, a constant added to a variable does not affect its variance, while multiplying by a constant scales the variance by the square of that constant.
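Both identities can be checked numerically. The following sketch uses a fair six-sided die as the test distribution (a choice made here purely for illustration): it computes the variance as the weighted squared deviation and as E(X²) − (E(X))², then verifies Var(aX + b) = a²Var(X) for sample constants.

```python
import numpy as np

# A fair six-sided die: values 1..6, each with probability 1/6
x = np.arange(1, 7)
p = np.repeat(1/6, 6)

mu = np.sum(x * p)                  # E(X) = 3.5
var_def = np.sum((x - mu)**2 * p)   # definition: weighted squared deviations
var_mom = np.sum(x**2 * p) - mu**2  # second moment minus squared first moment
print(var_def, var_mom)             # both ≈ 2.9167

# Var(aX + b) = a^2 Var(X): shifting by b changes nothing,
# scaling by a multiplies the variance by a^2
a, b = 3, 10
y = a * x + b                       # same probabilities, transformed values
mu_y = np.sum(y * p)
var_y = np.sum((y - mu_y)**2 * p)
print(var_y, a**2 * var_def)        # both 26.25
```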
Example 1)
The probability mass function of the random variable X is:
$$f(x)=\frac{x}{8}, \quad x=1,2,5$$
Determine E(X) and Var(X).
```python
import numpy as np
import pandas as pd
from sympy import *
import matplotlib.pyplot as plt
```
```python
x = np.array([1, 2, 5])
f = x / 8                        # PMF values: 1/8, 2/8, 5/8
Ex = np.sum(x * f)               # E(X) = 3.75
Var = np.sum(x**2 * f) - Ex**2   # Var(X) = E(X^2) - E(X)^2 = 2.6875
```
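The result can be cross-checked with scipy's generic discrete distribution class. scipy is an extra dependency not imported above; this is only an independent verification, not part of the original example.

```python
from scipy import stats

# Build the same distribution as a scipy discrete random variable
rv = stats.rv_discrete(values=([1, 2, 5], [1/8, 2/8, 5/8]))
rv.mean(), rv.var()   # (3.75, 2.6875), matching Ex and Var above
```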
Example 2)
The probability density function of a continuous random variable X is:
$$f(x)=\frac{x+1}{8}, \quad 2 < x < 4$$
Determine E(X) and Var(X).
The mean and variance are calculated by integrating the PDF. The integration applies the integrate() function of the sympy module.
x=symbols("x") f=(x+1)/8 Ex=integrate(x*f, (x, 2, 4)) Ex
Var=integrate(x**2*f,(x, 2, 4))-Ex**2 Var
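Before trusting the result, it is worth confirming that f(x) is a valid density, i.e., that it integrates to 1 over (2, 4). This quick check, reusing the symbols defined above, is a verification step added here rather than part of the original example.

```python
# Total probability over the support must equal 1
integrate(f, (x, 2, 4))   # 1, so f is a valid PDF on (2, 4)
```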
Example 3)
Calculate the variance of a random variable X with the probability density function
$$f(x)=1-|x|, \quad -1 < x < 1$$
x=symbols("x") f=1-abs(x) Ex=integrate(x*f, (x, -1,1)) Ex
Var=integrate(x**2*f,(x, -1,1))-Ex**2 Var
Example 4)
Two games are played based on the rule that one die is rolled and points are awarded according to the face shown:
| Die face | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Game 1 (X) | 1 | 2 | 3 | 4 | 5 | 6 |
| Game 2 (Y) | 3 | 0 | 6 | 0 | 0 | 12 |
| P(X or Y) | $\frac{1}{6}$ | $\frac{1}{6}$ | $\frac{1}{6}$ | $\frac{1}{6}$ | $\frac{1}{6}$ | $\frac{1}{6}$ |
Determine the expected value and variance for each game.
```python
game = pd.DataFrame([np.arange(1, 7),
                     np.arange(1, 7),
                     [3, 0, 6, 0, 0, 12],
                     np.repeat(Rational(1, 6), 6)],
                    index=["Die face", "game1(X)", "game2(Y)", "P(X or Y)"])
X = game.iloc[1, :]
Y = game.iloc[2, :]

# game 1
EX = (X * game.iloc[3, :]).sum()                # E(X) = 7/2
VarX = (X**2 * game.iloc[3, :]).sum() - EX**2   # Var(X) = 35/12

# game 2
EY = (Y * game.iloc[3, :]).sum()                # E(Y) = 7/2
VarY = (Y**2 * game.iloc[3, :]).sum() - EY**2   # Var(Y) = 77/4
```
Both games have the same expected value, 7/2, but Game 2's variance is much larger. Now combine the two games to create a new random variable Z = X + Y and calculate the mean and variance of its probability distribution.
```python
X = game.iloc[1, :]
Y = game.iloc[2, :]
Z = X + Y
Z
```
```
0     4
1     2
2     9
3     4
4     5
5    18
dtype: object
```
```python
EZ = np.sum(Z * game.iloc[3, :])
EZ        # 7
```

```python
EX + EY   # also 7: E(X + Y) = E(X) + E(Y)
```
As shown in the above results, the expected value of the combined variable equals the sum of the individual expected values. However, the variance of the combined variable is not equal to the sum of the individual variances. Because every outcome here has the same probability (1/6), the variance can also be computed directly with ``np.var()``, which weights the values equally; note that pandas' ``Series.var()`` would give a different value, since it defaults to the sample variance (ddof=1).
```python
VarZ = np.sum(Z**2 * game.iloc[3, :]) - EZ**2
VarZ          # 86/3 ≈ 28.67
```

```python
np.var(Z)     # 28.666..., the population variance, matching VarZ
```

```python
VarX + VarY   # 133/6 ≈ 22.17, which does not match VarZ
```
As the above results show, the variance of the combined variable does not equal the sum of the individual variances. The difference is explained by deriving the variance of a combined variable, as shown in Equation 3.
$$\begin{equation}\tag{3} \begin{aligned} Var(aX+bY)&=E[((aX+bY)-(a\mu_X+b\mu_Y))^2]\\&=E[(a(X-\mu_X)+b(Y-\mu_Y))^2]\\ &=E[a^2(X-\mu_X)^2+2ab(X-\mu_X)(Y-\mu_Y)+b^2(Y-\mu_Y)^2] \\ &=a^2E[(X-\mu_X)^2]+2abE[(X-\mu_X)(Y-\mu_Y)]+b^2E[(Y-\mu_Y)^2] \\ &=a^2Var(X)+b^2Var(Y)\\ & \because \; E[(X-\mu_X)(Y-\mu_Y)]=0 \;\text{ when } X \text{ and } Y \text{ are independent} \end{aligned} \end{equation}$$In Equation 3, E[(X−μ_X)(Y−μ_Y)] is the covariance of the two variables, which measures their interaction. If the two variables are independent, this term is zero. Therefore, the gap between Var(Z) and Var(X) + Var(Y) in the example tells us that the two games are not independent.
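Computing the covariance for this example (a verification sketch using the game DataFrame defined above) recovers exactly the gap: with the covariance included, the general identity Var(X + Y) = Var(X) + Var(Y) + 2·Cov(X, Y) reproduces Var(Z).

```python
# Covariance term E[(X - mu_X)(Y - mu_Y)] for the two games
p = game.iloc[3, :]
CovXY = ((X - EX) * (Y - EY) * p).sum()
CovXY                       # 13/4, not zero: X and Y are dependent

# Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)
VarX + VarY + 2 * CovXY     # 86/3, equal to VarZ
```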