
Descriptive statistics

Contents

  1. Location information
    1. Mode
    2. Mean
    3. Median
  2. Variation
    1. Range
    2. Quantile
    3. Mean Absolute Deviation(MAD)
    4. Variance
    5. Standard Deviation
    6. Variation Coefficient

Location information

Mode

Statistics often uses information about where most of the data are concentrated. Such summary values are called measures of central tendency. For example, suppose the manager of a restaurant with several menu items wants to focus on a single item after a renovation. In this case, choosing the item with the most sales is a reasonable decision. The most frequently observed value is called the mode.

mode
  • The value with the highest frequency, based on the number of occurrences of each value in the data set, is called the mode.
  • In a data set, there can be more than one mode.

The mode marks the peak of the frequency distribution, that is, the value with the highest frequency. It can be checked with the np.unique() and scipy.stats.mode() functions and the pd_object.mode() method. The frequency of each value in the data can be calculated with the np.unique() function and the pd_object.value_counts() method.

Next, determine the mode of data generated with randint(low, high, size) from the numpy.random module, a function that draws the specified number of random integers from the half-open interval [low, high).

import numpy as np
import pandas as pd
data=np.random.randint(1, 10, 50)
print(data)
[4 6 5 2 8 8 2 2 2 4 5 6 3 4 3 2 1 5 5 3 2 1 7 2 2 3 2 7 9 9 4 7 2 2 3 2 9 6 5 2 9 2 3 8 7 3 4 4 9 2]
new, cnt=np.unique(data, return_counts=True)
print(new)
print(cnt)
[1 2 3 4 5 6 7 8 9]
[ 2 15 7 6 5 3 4 3 5]
da1=pd.DataFrame(data)
da1.T
0 1 2 ... 47 48 49
0 4 6 5 ... 4 9 2

1 rows × 50 columns

da1.value_counts()
2     15
3     7
4     6
5     5
9     5
7     4
6     3
8     3
1     2
dtype: int64
da1.mode()
0
0 2

The np.unique() function returns the frequency of every value rather than the mode itself, so the user must still pick out the mode. In contrast, the pd_object.mode() method and the stats.mode() function return the mode directly, as the following code shows.

from scipy import stats
stats.mode(data)
ModeResult(mode=array([2]), count=array([15]))

Example 1)
  Determine the mode of the following color object.

color=["green", "blue", "red", "green", "green", "red", "blue", "black", "white", "yellow", "green", "yellow", "green", "green", "green"]
pd.Series(color).mode()
0 green
dtype: object
stats.mode(color)
ModeResult(mode=array(['green'], dtype='<U6'), count=array([7]))
Data type

In programming languages, a data type is the standard by which memory is allocated for storing or operating on data. Python provides five basic data types: numeric, character (string), list, tuple, and dictionary. New data types for specific purposes can be created on top of these primitives. Packages such as numpy, pandas, and scipy used above each define their own data types, built on the list. A list groups multiple values into one using square brackets and has the form ['a', 'list', 2, 3]. Converting this form to the numpy array type and the pandas Series type works as follows.
x= ['a','list', 2, 3]
x
['a', 'list', 2, 3]
np.array(x)
array(['a', 'list', '2', '3'], dtype='<U21')
pd.Series(x)
0 a
1 list
2 2
3 3

dtype: object

The pandas mode() method can compute the mode of a nominal (categorical) variable like the list above. However, to calculate various statistics beyond the mode, a qualitative variable must be converted into a quantitative (numerical) one. For this conversion, you can apply LabelEncoder, a class from the preprocessing module of the sklearn package. In Python, a class is a construct that bundles several functions and objects; applying a class differs slightly from calling an ordinary function.

Package

Python consists of a core program and various packages. A package is built on the core program for a specific domain, and the user attaches it to the core program by importing it. The numpy, pandas, and scipy packages mentioned above are representative packages for data cleaning, analysis, statistics, and mathematical operations, and are attached as follows.
import numpy as np
import pandas as pd
from scipy import stats

There are several ways to import a package.
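For example, the following forms all make numpy's mean() function available; the alias form (import ... as) is the convention used throughout this post.

import numpy                    # full name required: numpy.mean([1, 2, 3])
import numpy as np              # alias: np.mean([1, 2, 3])
from numpy import mean          # single name only: mean([1, 2, 3])
from scipy import stats         # import a submodule: stats.mode(...)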

Class

Python is made up of objects. In the simplest terms, an object is a name bound to a storage space. An object may be a simple value, or a function that returns a result when data in the proper format is passed in. Objects have properties unique to them, called attributes. A class defines the working scope of its objects: it contains functions and attributes that can only be used through the class. Such functions are called methods or member functions to distinguish them from ordinary functions. For example, to use the LabelEncoder class mentioned above, create an object that will share all the methods and attributes of the class.
x=LabelEncoder()

Object x can use all the methods and attributes included in LabelEncoder(). In other words, those methods and attributes are accessed only through the object x.

from sklearn.preprocessing import LabelEncoder
color=['white', 'red', 'white', 'red', 'black', 'red', 'black', 'red', 'yellow', 'white', 'yellow', 'yellow', 'red', 'yellow', 'black']
col=LabelEncoder()
col.fit(color)
LabelEncoder()
list(col.classes_)
['black', 'red', 'white', 'yellow']
color1=col.transform(color)
color1
array([2, 1, 2, 1, 0, 1, 0, 1, 3, 2, 3, 3, 1, 3, 0])
col.inverse_transform(color1)
array(['white', 'red', 'white', 'red', 'black', 'red', 'black', 'red', 'yellow', 'white', 'yellow', 'yellow', 'red', 'yellow', 'black'], dtype='<U6')

Determine the mode for color1 in the above result.

val, count=stats.mode(color1)
print(f'mode: {val}, counts: {count}')
mode: [1], counts: [5]
col.inverse_transform([1])
array(['red'], dtype='<U6')

Below are the records generated in the running race.

1:00:05, 1:00:04, 0:53:53, 0:51:32, 1:00:09
0:51:39, 1:00:07, 1:00:10, 1:00:42, 1:00:48

These records show that most of the times are close to one hour. However, the mode cannot be determined because no value appears more than once; in general, it is difficult to determine the mode of a continuous variable like this one. As a result, the mode is primarily used for categorical data. The np.digitize() function can be applied to discretize a continuous variable: it divides the variable into specified intervals and returns the interval to which each value belongs.

Example 2)
  The following code divides 20 random numbers into intervals using 10 equally spaced boundaries and, as shown in the sketch after the code, finds the interval in which each value falls.

np.random.seed(seed=0)
df=pd.DataFrame(np.around(np.random.randn(20), 3))
df.columns=['da']
df.head()
da
0 1.764
1 0.400
2 0.979
3 2.241
4 1.868
bins=np.around(np.linspace(df.da.min(), df.da.max(), 10), 3)
print(bins)
[-0.977 -0.619 -0.262 0.096 0.453 0.811 1.168 1.526 1.883 2.241]
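Applying np.digitize() with these boundaries returns, for each value, the index of the interval it falls into. A minimal sketch continuing the example (the exact indices follow from the seeded random values above):

grp=np.digitize(df.da, bins)    # interval index of each value
df['group']=grp
df.head()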

Example 3)
  What is the mode of the following object?

x=np.array([8,8,2,3,1,5,0])
stats.mode(x)
ModeResult(mode=array([8]), count=array([2]))

Example 4)
  Find the mode of the following object and the frequency of each of its values.

x=np.array([5, 5, 3, 6, 2, 4, 5, 9, 5, 5, 2, 5])
np.unique(x, return_counts=True)
(array([2, 3, 4, 5, 6, 9]), array([2, 1, 1, 6, 1, 1]))
x1=pd.DataFrame(x)
x1.head()
0
0 5
1 5
2 3
3 6
4 2
x1.mode()
0
0 5
x1.value_counts()
5 6
2 2
3 1
4 1
6 1
9 1

dtype: int64

stats.mode(x1)
ModeResult(mode=array([[5]]), count=array([[6]]))

Mean

As mentioned above, for continuous variables such as ratio-scale data, the mode of the data set often cannot be determined. Instead, as in Equation 1, the average of the values is used as the measure of center. This value is called the mean.

$$\begin{align}\tag{1} \mu&=\frac{\sum_{i=1}^N x_i}{N}\\ \mu&:\; \text{mean}\\ x&:\; \text{data}\\ N&:\; \text{size} \end{align}$$

For example, the following are a student's midterm and final grades for the first and second semesters. Their average is:



$$ \begin{aligned} \text{grade}&=[6, 8, 9, 5]\\ \mu&=\frac{6+8+9+5}{4} \end{aligned}$$
The mean can be calculated with a loop as in the following code, or directly with the np.mean() function.
grade=[6, 8, 9, 5]
total=0
for i in grade:
    total=total+i
total
  
28
mu=total/len(grade)
mu
7.0
np.mean(grade)
7.0

Typically, the mean of the data $x_i$ in a data set X is denoted by $\mu_x$ or $\bar{x}$. The definition above can therefore be written as:

$$\begin{align} \text{Mean}&=\frac{\text{sum of all values}}{\text{number of values}}\\\\ \overline{x}&=\frac{\sum^n_{i=1}x_i}{n}\\ &=\mu \end{align}$$

Example 5)
  Calculate the mean from the following frequency table.

value   frequency
  2         4
  5         8
  8         6

The frequency of each value is given, so the sum of the data is the sum of value × frequency. The mean is calculated as:

$$\mu=\frac{2 \cdot 4 + 5 \cdot 8 + 8 \cdot 6}{4 + 8 + 6}=\frac{96}{18}\approx 5.33$$

It is more convenient to apply matrix operations when there are many values or variables. The following code uses the np.dot() function to apply the matrix product shown in Equation 2.

$$\begin{equation}\tag{2} \begin{bmatrix}x_1&x_2&\cdots&x_n\end{bmatrix} \begin{bmatrix}f_1\\f_2\\ \vdots \\ f_n\end{bmatrix} =x_1f_1+x_2f_2+\cdots+x_nf_n \end{equation}$$
value=np.array([2,5,8])
frequency=np.array([[4],[8],[6]])
total=np.dot(value, frequency)
total
array([96])
mu=total/np.sum(frequency)
np.round(mu, 2)
array([5.33])

Example 6)   Determine the data size if the mean is 15 and the sum is 315.

$$\begin{align} 15 &=\frac{315}{n} \\ n&=\frac{315}{15} \end{align}$$
mu=15
total =315
n=total/mu
n
 21.0

The following data set contains one value that differs significantly from the others; such values are called outliers. Because the mean incorporates the sum of all the values, as shown in Equation 1, it is inherently very sensitive to outliers.

np.random.seed(0)
data=np.random.randint(1, 5, 10)
data
array([1, 4, 2, 1, 4, 4, 4, 4, 2, 4])
data.mean()
3.0
data1=np.append([100], [data[1:]])
data1
 array([100,   4,   2,   1,   4,   4,   4,   4,   2,   4])
data1.mean()
12.9
data[1:].mean()  #mean excluding outliers
 3.2222222222222223

The mean provides an excellent measure of the central location of a data set, provided outliers are handled. However, it is important to recognize that the mean alone is not enough to judge all the information in the data set. To increase accuracy, the size of the data must be increased.

Median

The center of the data is the point where much of the data is concentrated. The mean is one way to determine that point, but outliers can distort the center of the whole. The median is another central measure that compensates for this weakness of the mean.

For example, suppose a diet prescription for a group of 9 people is classified as either weak or strong, depending on whether the group's average weight exceeds 43 kg. Basing the classification on the mean can lead to a questionable decision, as the following shows.

weight=np.array([38, 35, 45, 30, 48, 33, 42, 39,100])
weight
array([ 38,  35,  45,  30,  48,  33,  42,  39, 100])
weight.mean()
45.55555555555556
weight[:-1].mean()
   38.75

As the code shows, the weight of one member differs significantly from the rest; that is, the data contain an outlier. With the outlier included, the average weight is about 45.6 kg, which exceeds the 43 kg threshold and suggests the strong prescription. With the outlier excluded, the mean is 38.75 kg, below the threshold, and the weak prescription would be issued. When such an outlier exists, the mean is very sensitive to it and can become the basis for a poor judgment. Instead of the mean, the value located in the middle of the data can be used as the position representing the center. This measure is defined as the median.

Median
  • Sort all values in ascending or descending order.
  • Determine the total number of data points (n).
  • Determine the middle value:
$$\begin{aligned} &n=\text{odd} \rightarrow \text{median index}=\left\lceil\frac{n}{2}\right\rceil\\ &n=\text{even} \rightarrow \text{median}=\text{average of the values at positions } \frac{n}{2} \text{ and } \frac{n}{2}+1 \end{aligned}$$

Sorting the data is done with the np.sort() function. In this example the number of data points is 9, so n/2 = 4.5; since n is odd, round up and take the 5th value as the median. Because Python indexing starts at 0, the following code retrieves the value at index 4.

weightSort=np.sort(weight)
weightSort
array([ 30,  33,  35,  38,  39,  42,  45,  48, 100])
weightSort[4]
39

The median can be calculated directly using np.median().

np.median(weight)
39.0

Example 7)
  Determine the median of 34, 12, 5, 42, 7, 55.

d=np.array([34, 12, 5, 42, 7, 55])
dsort=np.sort(d)
dsort
array([ 5,  7, 12, 34, 42, 55])
#n=6, even
medIdx=len(d)/2
medIdx
3.0
med=(dsort[2]+dsort[3])/2
med
23.0
np.median(d)
23.0
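The odd/even rule above can be wrapped in a small helper function. This is a minimal sketch for illustration; np.median() is the practical choice.

def median(arr):
    s=np.sort(arr)                  # step 1: sort
    n=len(s)                        # step 2: data size
    mid=n//2
    if n%2==1:                      # odd: the single middle value
        return s[mid]
    return (s[mid-1]+s[mid])/2      # even: mean of the two middle values

median(weight), median(d)           # (39, 23.0), matching the results above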

Variation

Variation, or spread, indicates the degree to which the data are dispersed and is basic information for describing the characteristics of data. Variation can be used together with location information such as the mean to describe the distribution of the data. For example, the following data are Dow Jones index values over a period of time. Although the data are continuous, they are converted into a categorical variable by dividing the values into the intervals shown in the following table.

group   Lower   Upper
1       29978   30515
2       30515   31048
3       31048   31581
4       31581   32114
5       32114   32647
6       32647   33179
7       33179   33712
8       33712   34245
9       34245   34778
10      34778   ~

Various financial data can be obtained with the Python package FinanceDataReader. The following code retrieves the Dow Jones index for a specified period using the package's DataReader() function. Since the data are continuous, the pd.cut() function is used to convert them to a categorical type. This function is more convenient than np.digitize() because specifying the number of intervals returns the labeled result directly. The np.histogram() function can also be used to count the frequency of each interval.

import FinanceDataReader as fdr
st=pd.Timestamp(2021,1, 1)
et=pd.Timestamp(2021, 11,23)
da=fdr.DataReader('DJI', st, et)['Close']
da.head(2)
Date
2021-01-04 30223.89
2021-01-05 30391.60

Name: Close, dtype: float64

da1=pd.cut(da, bins=9, labels=range(1,10), retbins=True)
da1[0].head(3)
Date
2021-01-04 1
2021-01-05 1
2021-01-06 2

Name: Close, dtype: category
Categories (9, int64): [1 < 2 < 3 < 4 ... 6 < 7 < 8 < 9]

np.around(da1[1], 0)
array([29976., 30699., 31416., 32132., 32849., 33565., 34282., 34998., 35715., 36431.])
np.histogram(da, bins=9)
(array([ 7, 24, 14,  9, 13, 38, 66, 36, 19]),
     array([29982.62, 30699.15, 31415.68, 32132.21, 32848.74, 33565.27,
  34281.8 , 34998.33, 35714.86, 36431.39]))

The frequency of each interval in the result above can be computed directly with the pandas value_counts() method, and the pandas mode() method gives the mode of the data.

fre=da1[0].value_counts()
fre
7 66
6 38
8 36
2 24
9 19
3 14
5 13
4 9
1 7

Name: Close, dtype: int64

da1[0].mode()
0    7

Name: Close, dtype: category
Categories (9, int64): [1 < 2 < 3 < 4 ... 6 < 7 < 8 < 9]

The mode for this data is 7. The value_counts() result above shows the frequency of each interval, from which the dot plot in Figure 1 can be created. The figure indicates the degree of dispersion of the data and gives an approximate shape of the distribution.

import matplotlib.pyplot as plt
for i, j in zip(fre.index, fre.values):
    plt.scatter(np.repeat(i, j), range(1, j+1), s=100)
plt.xlabel("Group", size="13", weight="bold")
plt.ylabel("Frequency", size="13", weight="bold")
plt.text(1, -20, 'Figure 1. Dotplot.', size="15", weight="bold")
plt.show()
[Figure 1. Dotplot.]

Figure 2 is a dotplot of different data. Compared with Figure 1, the difference between the minimum and maximum values (the spread of the data) is larger. Since this difference is the variation of the data, the variation in Figure 2 is larger than in Figure 1.

st=pd.Timestamp(2010,1, 1)
et=pd.Timestamp(2021, 11,23)
da2=fdr.DataReader('DJI', st, et)['Close']
da2Cat=pd.cut(da2, bins=9, labels=range(1,10), retbins=True)
fre2=da2Cat[0].value_counts()
fre2[:3]
3 758
1 541
6 477

Name: Close, dtype: int64

for i, j in zip(fre2.index, fre2.values):
    plt.scatter(np.repeat(i, j), range(1, j+1), s=100)
plt.xlabel("Group", size="13", weight="bold")
plt.ylabel("Frequency", size="13", weight="bold")
plt.text(1, -300, 'Figure 2. Dotplot.', size="15", weight="bold")
plt.show()
[Figure 2. Dotplot.]
Spread (Variation)
  • Represents the difference between the maximum and minimum values of the data, or the differences of the values from the center.
  • The larger the differences between values, the greater the spread.

Figures 1 and 2 are dotplots showing variation, but as the size of the data increases, the distinction can become blurry in a figure. In such cases, numerical measures of variation are used. The main ones are:

  • Range
  • Mean Absolute Deviation (MAD)
  • Variance
  • Standard Deviation

Range

The range of the data set is one way to measure spread. As in Equation 3, the range is the difference between the maximum and minimum values.

$$\begin{equation}\tag{3} \text{range}=\text{maximum}-\text{minimum} \end{equation}$$

This value can be calculated as the difference between the maximum and minimum of the data set, found with the numpy functions np.max() and np.min(). The following computes the range of 50 random integers drawn from [1, 100).

d=np.random.randint(1,100, 50)
d
array([22, 37, …, 37, 54])
dmax, dmin=np.max(d), np.min(d)
dmax, dmin
(89, 1)
rng=dmax-dmin
rng
88

Example 8)
  Calculate the range of the following data.

A=set([4, 6, 2, 4, 6, -4, -7, 45])
B=set([4, 6, 2, 4, 6, -4, -7, 145])
print(f'A:{A}\nB:{B}')
A:{2, 4, 6, 45, -7, -4}
    B:{2, 4, 6, 145, -7, -4}
rng_A=max(A)-min(A)
rng_B=max(B)-min(B)
print(f'range of A:{rng_A}\nrange of B:{rng_B}')
range of A:52
    range of B:152

In Example 8, groups A and B are identical except for one value. That value in B is an outlier, differing greatly from the rest, and it alone makes the ranges of the two groups very different. In other words, the range is simple to compute but very sensitive to outliers.

Quantile

A quantile is one of many ways to measure variation. For example, the following data contain a significantly large value that can be considered an outlier. In such cases, the range cannot accurately represent the variability of the data set.

x=np.array([1,2, 4,5,6,8,8,9,105])
x
array([  1,   2,   4,   5,   6,   8,   8,   9, 105])
rng=x.max()-x.min() #range
rng
104

There are cases where it is advantageous to describe the variability of data using quantiles, which divide the data into several small groups, instead of using the overall range. Quartiles divide the data into four roughly equal parts and are calculated by the following process.

  • Step 1) Sort in ascending order.
  • Step 2) Determine the median (Q2).
  • Step 3) Determine the median of the values below Q2 (this is Q1) and the median of the values above Q2 (this is Q3).

With the above process, Q1, Q2, and Q3 for the example data {1, 2, 4, 5, 6, 8, 8, 9, 105} are 4, 6, and 8, respectively. These quantiles can be determined with numpy's quantile() function; a manual sketch follows the code.

np.quantile(x, [0.25, 0.5, 0.75])
array([4., 6., 8.])
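The three steps can also be reproduced manually with np.median(). One common convention for odd n includes the middle value in both halves; under that assumption the result matches np.quantile() here, although np.quantile() interpolates linearly and may differ on other data.

xs=np.sort(x)
n=len(xs)                     # n=9, odd
q2=np.median(xs)              # the median, Q2
q1=np.median(xs[:n//2+1])     # median of the lower half
q3=np.median(xs[n//2:])       # median of the upper half
q1, q2, q3
(4.0, 6.0, 8.0)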

Apply a boxplot (Figure 3) to visualize the above results. This graph uses the boxplot() function of the Python library matplotlib.

plt.boxplot((1,2, 4,5,6,8,8,9,105))
plt.title("BoxPlot based on quantile")
plt.text(0.5,-30,'Figure.3 Quantile and boxplot.', size="15", weight="bold")
plt.show()
[Figure 3. Quantile and boxplot.]

As shown in Figure 3, the bottom and top of the box represent Q1 and Q3, respectively, and the middle 50% of the data lies within the box. This span is called the interquartile range, denoted IQR.

IQR=Q3-Q1

The line within the box represents Q2, the median. The lower and upper whiskers extend to the minimum and maximum values within the whisker limits (in matplotlib, 1.5 × IQR beyond the box by default); points outside the whiskers are considered outliers.

Example 9)
  Create a quartile, IQR, and boxplot of two groups of random numbers.

da=np.random.randint(1, 100, size=(2,30))
da
array([[ 6, 39, …, 48,  4, 77],
           [53, 79, 16, …, 78, 22, 74]])
q=np.quantile(da, [0.25, 0.5, 0.75], axis=1)
q
array([[13.5 , 35.25],
           [43.  , 50.5 ],
           [68.25, 77.  ]])

In the code above, axis is set to 1 in the np.quantile() function, so the quantiles are computed along each row; that is, each row is treated as one group. The result of the function is:

      Group 1 (row 1)   Group 2 (row 2)
Q1    13.5              35.25
Q2    43.0              50.5
Q3    68.25             77.0

The IQR is the difference between the third and first rows of the result above (Q3 − Q1). It can also be computed directly from the data with the iqr() function in the stats module of the scipy package.
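A sketch of both calculations, using the q array above and scipy's iqr() (axis=1 again treats each row as a group):

from scipy import stats
iqrManual=q[2]-q[0]              # Q3 - Q1 from the quantile result
iqrScipy=stats.iqr(da, axis=1)   # the same values computed directly

A boxplot of these results is shown in Figure 4.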

plt.boxplot(da.T)
plt.title("BoxPlot of two group")
plt.text(0.5, -30, "Figure 4. Two groups of boxplots.", size="15", weight="bold")
plt.show()
[Figure 4. Two groups of boxplots.]

Mean Absolute Deviation(MAD)

The range is the difference between the maximum and minimum values, so it is very sensitive to outliers. Instead, the differences of the values from a reference point such as the mean can be used to measure the spread of the data. The difference between the mean and each value is called the **deviation**, and the mean of all deviations in the data could be used as a measure of spread.

The deviation is the difference between each observation and the mean, so the mean deviation is calculated as follows:

$$\begin{align} \text{mean deviation}&=\frac{\sum^n_{i=1}(x_i-\overline{x})}{n}\\ n&: \text{data size}\end{align}$$

Example 10)
  Calculate the mean deviation of the following data.

d=np.array([4, 5, 6, 8, 10, 11, 12])
d
array([ 4,  5,  6,  8, 10, 11, 12])
mu=np.mean(d)
mu
8.0
dev=np.array([i-mu for i in d])
dev
array([-4., -3., -2.,  0.,  2.,  3.,  4.])
devTotal=dev.sum()
devTotal
0.0
devMean=devTotal/len(dev)
devMean
0.0

Since the mean deviation sums the differences of the values around the mean, it is always 0, as the result above shows. It therefore provides no clue about the variability of the data. Instead, the **Mean Absolute Deviation** (MAD), calculated by converting each deviation to a positive number as in Equation 4, is a useful indicator.

$$\begin{align} \tag{4} \text{MAD}&=\frac{\sum^n_{i=1}\vert x_i-\overline{x}\vert}{n}\\ n&: \text{data size}\end{align}$$

MAD can be computed without intermediate calculations using DataFrame.mad() method in the pandas package.

ad=np.array([abs(i-mu) for i in d])
ad
array([4., 3., 2., 0., 2., 3., 4.])
adTotal=ad.sum()
adTotal
18.0
adMean=adTotal/len(ad)
adMean
2.5714285714285716
mad=pd.DataFrame(d).mad()
mad
0    2.571429
    dtype: float64

However, MAD has the disadvantage of being very insensitive to outliers.

Variance

To measure the spread of data, the square of the deviation can be applied instead of the absolute value, a concept similar to MAD. The result, calculated as in Equation 5, is called the mean squared deviation (MSD) or variance. The variance is usually denoted $\sigma^2$.

$$\begin{align} \tag{5} \sigma^2&=\frac{\sum^n_{i=1}(x_i-\overline{x})^2}{n}\\ &\sigma^2:\text{variance}\\ &n: \text{sample size} \end{align}$$

Figure 5 plots the absolute deviation and the squared deviation as functions of x for data with mean μ.

mu=0
x=np.linspace(-1, 1, 100)
plt.plot((x-mu)**2, label=r"squared deviation, $(x-\mu)^2$")
plt.plot(abs(x-mu), label=r"absolute deviation, $|x-\mu|$")
plt.legend(loc="best")
plt.xticks([])
plt.yticks([])
plt.text(0, -0.3, "Figure 5. Absolute vs. squared deviation.", size=15, weight="bold")
plt.show()
[Figure 5. Absolute vs. squared deviation.]

As Figure 5 shows, the absolute deviation is a V-shaped piecewise-linear curve while the squared deviation is a quadratic curve. Deviation is an important quantity for statistical estimation based on probability: models that estimate new values from existing data are built around the point where the deviation is minimal. The function measuring deviation should therefore make that minimum easy to determine, and the quadratic form is advantageous because it is differentiable everywhere, so the instantaneous rate of change, and hence the minimum, can be computed mathematically.
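This can be checked numerically: over a grid of candidate centers, the sum of squared deviations is smallest near the mean, while the sum of absolute deviations is smallest at the median (a minimal sketch with hypothetical data).

d=np.array([1, 2, 4, 5, 6, 8, 8, 9, 105])      # data with an outlier
centers=np.linspace(d.min(), d.max(), 2081)    # candidate centers, step 0.05
ssd=[np.sum((d-c)**2) for c in centers]        # sum of squared deviations
sad=[np.sum(np.abs(d-c)) for c in centers]     # sum of absolute deviations
centers[np.argmin(ssd)]                        # about 16.45, near the mean (16.44)
centers[np.argmin(sad)]                        # 6.0, the median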
Example 11)
  The mean and standard deviation of the selling prices of two products of a company are given below. Compare the variation of the two products.

            Mean    Standard deviation
Product A   11950   587.8
Product B   35879   985.8

Product B has the larger standard deviation, so its selling price shows the wider variation.

Example 12)
  The following data are the scores of 6 randomly selected students from a class. Determine the variance of this data.

x=np.array([90, 65, 95 ,75, 70, 85])
x
array([90, 65, 95, 75, 70, 85])
mu=x.mean()
mu
80.0

  Calculate the deviation and squared deviation of each value from the mean.

Student   Score   Deviation $x-\bar{x}$   Squared deviation $(x-\bar{x})^2$
A         90.0    10.0                    100.0
B         65.0    -15.0                   225.0
C         95.0    15.0                    225.0
D         75.0    -5.0                    25.0
E         70.0    -10.0                   100.0
F         85.0    5.0                     25.0

From the squared deviations in the table above, the variance can be calculated as follows:

$$\begin{align} \sigma^2&=\frac{100.0+225.0+225.0+25.0+100.0+25.0}{6}\\ &=116.7 \end{align}$$

The variance can be calculated by applying the np.var() function or the object's var() method.

np.around(x.var(), 1)
116.7

The process of calculating variance as in Example 12 can be summarized as follows (see the sketch after the list).

  1. Determine the mean.
  2. Calculate the difference between each value and the mean (the deviation).
  3. Square each deviation (the squared deviation); the sum of squared deviations is commonly denoted SS.
  4. Divide the sum of squared deviations by the size of the data:
      $\displaystyle \frac{\text{SS}}{\text{sample size}}$
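These steps translate directly into a small function; this is a sketch for illustration, since np.var() performs the same calculation in one call.

def variance(arr):
    mu=np.mean(arr)             # 1. mean
    dev=arr-mu                  # 2. deviations
    SS=np.sum(dev**2)           # 3. sum of squared deviations
    return SS/len(arr)          # 4. SS divided by the data size

variance(np.array([90, 65, 95, 75, 70, 85]))    # 116.66..., as in Example 12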

Example 13)
  Determine the variance of the data with SS=486, n=120 (number of samples).

$$\sigma^2=\frac{486}{120}=4.05$$

Standard Deviation

Variance has the disadvantage that, as the square of the deviation, it is not in the unit of the raw data. For example, the variance of time data measured in seconds has units of sec². This makes it difficult to interpret the data directly in terms of variance. To restore the original unit, apply the square root of the variance as in Equation 6. The result is called the **standard deviation** and is denoted by σ.

$$\begin{align}\tag{6} \sigma= \sqrt{\frac{\sum^n_{i=1}(x_i-\bar{x})^2}{n}} \end{align}$$

The standard deviation can be calculated with the np.std() function or the Pandas object.std() method.
Example 14)
  Determine the variance and standard deviation of the following objects.

x=np.array([615, 949, 1173, 172, 940])
mu=x.mean()
SS=np.sum((x-mu)**2)
vari=SS/len(x)
std1=np.sqrt(vari)
print(f'mean:{mu}, Squared Sum:{SS}\nVariance:{round(vari,2)}, Standard Deviation:{round(std1,2)}')
mean:769.8, Squared Sum:604978.8
    Variance:120995.76, Standard Deviation:347.84
var2, std2=x.var(), x.std()
print(f'variance: {np.around(var2, 2)}\nStandard Deviation: {np.around(std2, 2)}')
variance: 120995.76
    Standard Deviation: 347.84

Example 15)
A data set has 20 values and its sum of squared deviations is 95220. What is the standard deviation?

np.sqrt(95220/20)
69.0

Example 16)
The standard deviation of a data set with 24 values is 33. What is the sum of squared deviations?

$$24 \cdot 33^2=26136$$
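The same arithmetic in code:

24*33**2
26136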

Example 17)
A scored 78 points on an exam whose average is 72 points. Which of the following standard deviations is most favorable to A?

(1) s=2   (2) s=3   (3) s=4

  A small variance means the values are concentrated around a specific point. Since A's score is above the mean, the smaller the standard deviation, the higher A's relative standing. Therefore case (1) is the most advantageous.
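A small sketch makes this concrete: the number of standard deviations by which A's score exceeds the mean, (78 − 72)/s, grows as s shrinks.

for s in [2, 3, 4]:
    print(f's={s}: (78-72)/s={(78-72)/s}')
s=2: (78-72)/s=3.0
s=3: (78-72)/s=2.0
s=4: (78-72)/s=1.5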
  In statistical analysis, the entire set of interest is called a population, and a part of the population is called a sample. It is generally difficult to obtain a whole population, so the purpose of statistical analysis is to estimate the characteristics of the population from a sample. Because the sample is finite, statistics such as its sum and mean can be computed, and as a consequence not all of its values remain free to vary. For example, the sum of the data [1, 3, 5, 1] is 10. With this sum fixed, knowing 3 of the values automatically determines the remaining one, so only 3 of the 4 values can act as random variables. This count is called the degrees of freedom; in this example it is 3. In general, for a finite sample of n values with a fixed statistic such as the mean, the degrees of freedom are n − 1.

Degree of freedom

The number of independent pieces of information used in the process of calculating a statistical estimate is called the degrees of freedom.

In the case of a population, degrees of freedom need not be considered, because population statistics such as the size and mean are the objects of estimation and are not generally known. However, since a sample has a finite size, the degrees of freedom must be considered. If the sample is very large, the difference between n and n − 1 is negligible, but for a small sample it can change the results considerably. The degrees of freedom are therefore taken into account in the sample variance and standard deviation. The statistics for a population and a sample, with and without the degrees-of-freedom correction, are:

                     Population                                          Sample
Mean                 $\displaystyle \mu=\frac{\sum x}{N}$                $\displaystyle \overline{x}=\frac{\sum x}{n}$
Variance             $\displaystyle \sigma^2=\frac{\sum (x-\mu)^2}{N}$   $\displaystyle s^2=\frac{\sum (x-\overline{x})^2}{n-1}$
Standard Deviation   $\displaystyle \sigma=\sqrt{\sigma^2}$              $\displaystyle s=\sqrt{s^2}$
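numpy and pandas expose this choice through the ddof (delta degrees of freedom) parameter: np.var() and np.std() default to ddof=0 (the population form), while the pandas var() and std() methods default to ddof=1 (the sample form). A quick check with the scores from Example 12:

x=np.array([90, 65, 95, 75, 70, 85])
np.var(x), np.var(x, ddof=1)     # population: 116.67, sample: 140.0
pd.Series(x).var()               # pandas defaults to ddof=1 -> 140.0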

Variation Coefficient

The coefficient of variation is used to compare data sets measured in different units. For example, suppose the mean and standard deviation of a group's height and weight are:

Item     Mean     Standard deviation
Height   165 cm   19 cm
Weight   70 kg    12.9 kg

Numerically, the height shows the larger standard deviation of the two items. However, since the units differ, a clearer comparison is obtained by removing the units rather than comparing the raw numbers. As the following expressions show, dividing the standard deviation by the mean removes the unit, producing an index that can be compared across items.

$$\begin{align} &\text{height}: \frac{19\, \text{cm}}{165\, \text{cm}}=0.115\\ &\text{weight}: \frac{12.9\, \text{kg}}{70\, \text{kg}}=0.184 \end{align} $$

From the above result, the relative variation of the weight is larger than that of the height, even though the raw standard deviation of the height is larger. This measure is called the **coefficient of variation** (Cv) and is calculated as in Equation 7.

$$ \begin{equation}\tag{7} \text{Cv}=\frac{\sigma}{\mu}\quad \text{or}\quad \frac{s}{\overline{x}} \end{equation}$$
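For the table above, Cv is a one-line computation, and for raw data scipy's stats.variation() computes std/mean directly (shown here with hypothetical values).

cvHeight=19/165                  # 0.115...
cvWeight=12.9/70                 # 0.184...
from scipy import stats
stats.variation(np.array([4, 5, 6, 8, 10, 11, 12]))   # std/mean of hypothetical data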

Example 18)
  Calculate:
  1. What is the standard deviation of a population with Cv = 0.872 and μ = 7.2?

 0.872*7.2
6.2784

  2. What is the mean of a population with Cv = 0.52 and σ = 18?

round(18/0.52,3)
34.615
