Son's Data story

Posts

Data encoding: labeling and one-hot encoding

Contents: Binarization, Data labeling, Class indicator matrix, Multiclass indicator matrix, Label encoding, Ordinal encoding, One-hot encoding

Binarization

Binarization converts each value to 0 or 1 relative to a specified threshold. This conversion is useful when the data is to be treated probabilistically. The class sklearn.preprocessing.Binarizer(*, threshold=0.0, copy=True) can be used for this. Values less than or equal to the threshold parameter become 0, and values greater than it become 1. The following example uses 50 as the threshold, so values of 50 or less become 0 and values above 50 become 1.

    import numpy as np
    import pandas as pd
    import sklearn.preprocessing as sklpre

    np.random.seed(0)
    x=np.random.randint(0, 100, size=(5,3))
    x

    array([[44, 47, 64],
           [67, 67,  9],
           [83, 21, 36],
           [87, 70, 88],
           [88, 12, 58]])

    xBinary=sklpre.Binarizer(threshold=50).fit(x)
    xBinary.transform(x)

    array([[0, 0, 1],
           [1, 1, 0],
           [1, 0, 0],
           [1, 1, 1],
           [1, 0, 1]])

Data labeling

Class indicator matrix

This step builds a label indicator matrix. In a label indicator matrix:
the number of classes is the dimension of the square matrix
the elements of the classes are sorted in ascending order
for each class, the row and column index of that matrix ...
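
The threshold rule in the binarization excerpt can be checked without scikit-learn. The following is a minimal NumPy sketch, not the post's own code: it copies the 5x3 array from the excerpt literally and applies a strict "greater than" comparison, which is the rule described for Binarizer. The name `boundary` is introduced here only for illustration.

```python
import numpy as np

# Same 5x3 array as in the excerpt, written out literally so no seed is needed
x = np.array([[44, 47, 64],
              [67, 67,  9],
              [83, 21, 36],
              [87, 70, 88],
              [88, 12, 58]])

threshold = 50

# Strictly greater than the threshold -> 1; equal or smaller -> 0
x_binary = (x > threshold).astype(int)

# Illustrative boundary check: a value exactly equal to the threshold maps to 0
boundary = (np.array([49, 50, 51]) > threshold).astype(int)

print(x_binary)
print(boundary)
```

The boundary case is worth checking because "50 or more" and "more than 50" give different results for a value of exactly 50.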

Read more

Analysis of variance

Contents: ANOVA, Two-Way ANOVA

Analysis of variance

Analysis of variance (ANOVA) is a statistical method that tests the null hypothesis that all groups have the same mean by comparing the variation within groups with the variation between groups, for two or more groups. Tests for two groups use a normal or t distribution, but comparing more groups uses an F distribution, which compares the degree of variability between groups. The data for ANOVA consist of the nominal variables (factors) being compared and the values observed for each factor, i.e., the response variable. Each factor can be divided into several sub-groups, and these sub-groups are called treatments (factor levels). The analysis of a single response corresponding to the factor levels is called ANOVA, while the analysis of multiple responses is called MANOVA. Multivariate analysis is beyond the scope of this book, but one-way and two-way analysis of variance can be the foundation for that analysis. The null hyp
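
The comparison of between-group and within-group variation mentioned in this excerpt can be sketched as a hand-computed F statistic. This is a minimal illustration using only the standard library, not the post's own code; the function name `one_way_anova_f` and the three small groups are hypothetical.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k = len(groups)        # number of groups (factor levels)
    n = len(all_values)    # total number of observations

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Hypothetical data: three treatments with three observations each
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat = one_way_anova_f(groups)
print(f_stat)
```

A large F value means the group means differ by more than the within-group scatter would suggest; the value is then compared with an F distribution with k - 1 and n - k degrees of freedom.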

Read more

Normality Test

Contents: Q-Q plot, Shapiro-Wilk test, Kolmogorov-Smirnov test

Normality Test

According to the central limit theorem, the distribution of sample means approaches a normal distribution as the sample size increases. However, for raw data rather than means, it is sometimes important to check whether the data themselves follow a normal distribution. For example, in regression, the difference between the observed values and the values predicted by the regression model is called the residual, and the analysis is performed on the assumption that the residuals follow a normal distribution. Whether this assumption holds is a basis for judging the fit of the established model. The normality test uses the following methods:

Quantile-Quantile plot: determination by visual analysis
Shapiro-Wilk test: primarily used when the number of samples is small (n < 2000)
Kolmogorov-Smirnov test: used when n > 2000

Q-Q plot

The Q-Q (quantile-quantile) plot i
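
The idea behind a normal Q-Q plot can be sketched without a plotting library by pairing each ordered, standardized observation with the standard normal quantile at the same rank. This is an illustration under stated assumptions, not the post's method: the plotting position (i - 0.5) / n is one common convention among several, and the function name `qq_pairs` and the residual values are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def qq_pairs(sample):
    """Return (theoretical, observed) quantile pairs for a normal Q-Q plot."""
    n = len(sample)
    ordered = sorted(sample)
    m, s = mean(sample), stdev(sample)
    std_normal = NormalDist()  # standard normal: mean 0, sd 1
    pairs = []
    for i, value in enumerate(ordered, start=1):
        p = (i - 0.5) / n                    # assumed plotting position
        theoretical = std_normal.inv_cdf(p)  # standard normal quantile
        observed = (value - m) / s           # standardized sample value
        pairs.append((theoretical, observed))
    return pairs

# Hypothetical residuals, for illustration only
residuals = [-1.2, 0.3, 0.8, -0.5, 1.9, -0.1, 0.4, -1.6, 0.9, 0.2]
pairs = qq_pairs(residuals)
for theoretical, observed in pairs:
    print(f"{theoretical:7.3f} {observed:7.3f}")
```

If the data are close to normal, the pairs fall near the line observed = theoretical; systematic curvature suggests skewness or heavy tails.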

Read more

Finding missing and infinite values

Contents: Creating missing and infinite values, Identifying missing and infinite values, Identifying infinite values, Identifying NaN

Finding missing and infinite values

During data processing, values may be missing or a calculation may produce an infinite value, and such values often cause errors in subsequent calculations. Handling them before or during analysis is therefore important, and as a first step this post looks at how to detect them.

Creating missing and infinite values

A missing value (NaN, Not a Number) can be created artificially with the NumPy constant np.nan. An infinite value can be created with the built-in function float().

    import math
    import numpy as np
    import pandas as pd

    x=np.random.rand(20)
    x

    array([0.44245459, 0.8178457 , 0.19687037, 0.54456459, 0.2971784 ,
           0.9188978 , 0.37880048, 0.10845443, 0.84552398, 0.73500799,
           0.8996776 , 0.38032666, 0.15925506, 0.70421241, 0.46348431,
           0.76245393, 0.25619259, 0.92892586, 0.11489276, 0.51422256])

    # Replace selected elements with nan or infinity
    x[3]=np.nan  # nan
    x[9]=np.nan
    x[7]=float('inf')  # infinity
    x[16]=float('-inf')  # -infinity
    np.around(x, 4)

    array([0.4425, 0.8178, 0.1969, nan , 0.2972, 0.9189, 0.3788, inf , 0.8455, nan , 0.8997, 0.3803, 0.1593, 0.7042, 0.4635, 0.7625,
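
The excerpt stops before the detection step, so the following is only a sketch of one common way to detect such values with NumPy, not necessarily the approach the full post takes. It uses `np.isnan`, `np.isinf`, and `np.isfinite` on a small hypothetical array.

```python
import numpy as np

# Hypothetical array containing NaN and both infinities
x = np.array([0.44, np.nan, 0.20, float('inf'), 0.30, float('-inf')])

nan_mask = np.isnan(x)        # True where the value is NaN
inf_mask = np.isinf(x)        # True where the value is +inf or -inf
finite_mask = np.isfinite(x)  # True where the value is neither NaN nor infinite

# Keep only the values that are safe for further calculation
clean = x[finite_mask]

print(np.where(nan_mask)[0])  # positions of NaN
print(np.where(inf_mask)[0])  # positions of infinite values
print(clean)
```

Comparing with `==` does not work for NaN, because NaN is not equal to itself; that is why a dedicated function such as `np.isnan` is needed.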

Read more

Comparison of two independent groups

Contents: Equal Variances in Small Samples, Different Variances in Small Samples, Large sample

Comparison of two independent groups

The means of two independent random variables, X and Y, each of which follows a normal distribution, can be compared by applying a hypothesis test:

$$\begin{align}\bar{X}&=\frac{\sum^{n_X}_{i=1} X_i}{n_X} \sim N\left(\mu_X, \frac{s_X^2}{n_X}\right)\\ \bar{Y}&=\frac{\sum^{n_Y}_{i=1} Y_i}{n_Y} \sim N\left(\mu_Y, \frac{s_Y^2}{n_Y}\right)\end{align}$$

Even if the groups themselves are not normally distributed, the sample means are approximately normally distributed according to the central limit theorem. To compare the two groups, set the following null hypothesis for the difference between the means:

$$\text{H}_0 : \mu_X -\mu_Y =0$$

The test statistic is calculated from the combined distribution of the two groups. That is, the mean and standard deviation of the combined probability distributions of X and Y are calculated as
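
The excerpt is cut off before the combined mean and standard deviation are given. As a rough sketch under an assumption, the code below uses the standard unequal-variance form, in which the difference of means is divided by the square root of the summed per-group variances of the mean; the full post may use a different form (for example, a pooled variance for small samples). The function name and the data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def mean_difference_statistic(x, y):
    """Standardized difference of two sample means under H0: mu_X - mu_Y = 0.

    Assumes the unequal-variance standard error
    sqrt(s_X^2 / n_X + s_Y^2 / n_Y), which is not confirmed by the excerpt.
    """
    standard_error = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / standard_error

# Hypothetical samples, for illustration only
x = [5.0, 6.0, 7.0, 8.0, 9.0]
y = [3.0, 4.0, 5.0, 6.0, 7.0]
stat = mean_difference_statistic(x, y)
print(stat)
```

For large samples the statistic is compared with a standard normal distribution; for small samples a t distribution is the usual reference.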

Read more

Hypothesis test

Contents: Null and alternative hypotheses, One-sided and two-sided tests

Hypothesis

Statistical inference consists of establishing tentative hypotheses about the parameters of a population based on statistics calculated from a sample, and then testing whether to accept or reject those hypotheses. In the testing stage, the sample statistic that serves as the basis for judgment is called the test statistic. The probability of observing a statistic at least as extreme as that test statistic is called the p-value. By comparing the p-value with the significance level, rejection or non-rejection of the hypothesis is determined.

p-value < significance level: reject the hypothesis assumed to be true
p-value > significance level: fail to reject the hypothesis assumed to be true

Power and Sample size

Power is the probability of rejecting a false hypothesis. For example, a power of 90% indicates that there is a 10% chance of accepting an incorrect hypothesis. This is a type 2 error shown in Table
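
The decision rule in this excerpt can be sketched in a few lines. This is a minimal illustration, assuming a two-sided test with a test statistic that follows the standard normal distribution; the function names, the z value, and the default significance level of 0.05 are hypothetical choices, not taken from the post.

```python
from statistics import NormalDist

def two_sided_p_value(z):
    """Two-sided p-value for a z statistic under the standard normal."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def decide(p_value, alpha=0.05):
    """Reject the null hypothesis when the p-value is below alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

z = 2.0  # hypothetical test statistic
p = two_sided_p_value(z)
decision = decide(p)
print(round(p, 4), decision)
```

Note that "fail to reject" is not the same as accepting the hypothesis; it only means the sample does not provide enough evidence against it at the chosen significance level.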

Read more
