Multiclass Classification

Contents

Reuters dataset
Data preparation
Building the network
Training, validation, and prediction

This post covers section 3.5 of 'Deep Learning with Python'.

Multiclass classification

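We build a network that classifies Reuters newswires into 46 mutually exclusive topics. Because there are many classes, this problem is an instance of multiclass classification. Because each data point must be assigned to exactly one category, it is more specifically single-label, multiclass classification. If each data point could belong to several categories (topics, in this case), it would instead be a multilabel, multiclass classification problem.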
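To make the distinction concrete, here is a small sketch of how the target vectors differ in the two settings. The four topic names and the helper `encode_labels` are hypothetical and exist only for illustration; the real dataset has 46 topics.

```python
import numpy as np

# Hypothetical topic list (the real dataset has 46 topics).
topics = ["grain", "crude", "trade", "earn"]

def encode_labels(active, n_classes):
    """Return a 0/1 vector with ones at the given class indices."""
    vec = np.zeros(n_classes)
    vec[list(active)] = 1.0
    return vec

# Single-label: exactly one topic per sample, so the vector sums to 1.
single = encode_labels([3], len(topics))
# Multilabel: a sample may carry several topics at once.
multi = encode_labels([0, 2], len(topics))
print(single, multi)
```

The Reuters task in this post is the single-label case, which is why a softmax output and one-hot targets are appropriate.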

We will work with the Reuters dataset, a set of short newswires and their topics published by Reuters in 1986. It is a simple, widely used toy dataset for text classification. There are 46 different topics; some are more represented than others, but each topic has at least 10 examples in the training set. Like IMDB and MNIST, the Reuters dataset comes packaged as part of Keras.

Reuters dataset

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.datasets import reuters
from tensorflow.keras import models, layers

(datr, latr),(date, late)=reuters.load_data(num_words=10000)

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/reuters.npz
2113536/2110848 [==============================] - 0s 0us/step
2121728/2110848 [==============================] - 0s 0us/step

The argument num_words=10000 keeps only the 10,000 most frequent words in the training data.
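As a rough illustration of what this cap does: to my understanding, Keras replaces every index at or above `num_words` with the out-of-vocabulary index 2. The helper `cap_vocabulary` below is a hypothetical re-implementation of that idea on a toy sequence, not the Keras function itself.

```python
def cap_vocabulary(sequence, num_words, oov_index=2):
    """Replace indices >= num_words with the out-of-vocabulary index."""
    return [i if i < num_words else oov_index for i in sequence]

# Toy sequence; 12000 and 10000 fall outside a 10,000-word vocabulary.
seq = [1, 45, 12000, 9999, 10000]
print(cap_vocabulary(seq, num_words=10000))  # [1, 45, 2, 9999, 2]
```

A consequence is that no index exceeds 9999, so a 10,000-dimensional vector is enough to represent each sample.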

The training and test sets contain the following numbers of samples.

len(datr), len(date)

(8982, 2246)

print(datr[0])

[1, 2, 2, … 15, 17, 12]

As shown above, each sample is encoded as a list of integers (word indices). To see what they mean, we decode the first sample back into words.

wordIndex=reuters.get_word_index()
reverseWordIndex=dict([(value, key) for (key, value) in wordIndex.items()])
# Indices are offset by 3 because 0, 1, and 2 are reserved for padding, start of sequence, and unknown.
decodeNews=' '.join([reverseWordIndex.get(i-3, '?') for i in datr[0]])
decodeNews

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/reuters_word_index.json
557056/550378 [==============================] - 0s 0us/step
565248/550378 [==============================] - 0s 0us/step
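The role of the index offset is easier to see on a toy example. In the encoded data, indices 0, 1, and 2 are reserved (padding, start of sequence, unknown), so a word's rank in the word index is its encoded value minus 3. The miniature `word_index` and the helper `decode` below are hypothetical stand-ins, not the real dictionary.

```python
# Hypothetical miniature word index (word -> rank), standing in for
# the dictionary returned by reuters.get_word_index().
word_index = {"the": 1, "of": 2, "to": 3, "in": 4, "said": 5}
reverse_index = {rank: word for word, rank in word_index.items()}

def decode(sequence, offset=3):
    """Map encoded indices back to words; reserved indices become '?'."""
    return " ".join(reverse_index.get(i - offset, "?") for i in sequence)

# 1 is the start marker and 2 the unknown marker, so both decode to '?'.
print(decode([1, 2, 8, 4, 7]))  # ? ? said the in
```

Using a smaller offset would shift every lookup and produce fluent-looking but wrong words, which is an easy mistake to miss.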

latr[0]

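Data preparation

The features are encoded as multi-hot vectors: each sample becomes a 10,000-dimensional vector with a 1 at every word index it contains. The same function is then reused to one-hot encode the labels.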

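def oneHotVector(da, dims=10000):
  # One row per sample, one column per possible index
  result=np.zeros((len(da), dims))
  for i, sequence in enumerate(da):
    # Set every index that occurs in this sample to 1
    result[i, sequence]=1
  return result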

xtr=oneHotVector(datr)
xte=oneHotVector(date)

xtr.shape

(8982, 10000)
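A toy run makes the encoding easier to inspect than the full 8982 x 10000 matrix. The function `multi_hot` below is a renamed sketch of the same idea, applied to two made-up "documents" over an 8-word vocabulary.

```python
import numpy as np

def multi_hot(sequences, dims):
    """One row per sequence, with 1.0 at every index the sequence contains."""
    result = np.zeros((len(sequences), dims))
    for row, seq in enumerate(sequences):
        result[row, seq] = 1.0
    return result

# Two toy "documents" over an 8-word vocabulary; index 3 repeats in the first.
toy = [[1, 3, 3, 5], [0, 7]]
encoded = multi_hot(toy, dims=8)
print(encoded)
```

Note that this representation discards both word order and word counts: the repeated index 3 still yields a single 1.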

laDim=len(np.unique(latr))
laDim

ytr=oneHotVector(latr, dims=laDim)
yte=oneHotVector(late, dims=laDim)

ytr[0]

array([0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

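Building the network

In a stack of Dense layers, each layer can access only the information present in the output of the previous layer. If one layer drops information relevant to the classification problem, later layers can never recover it.

Here the features are classified into one of 46 classes, so the final layer has 46 outputs. The intermediate layers should therefore have more units than that, so they do not become an information bottleneck. We use 64 units.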

model=models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(64, activation="relu"))
model.add(layers.Dense(46, activation='softmax'))
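The layer sizes chosen above also determine how many trainable parameters the network has. A fully connected layer has one weight per input-output pair plus one bias per output; the helper `dense_params` is a hypothetical name for that arithmetic.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

layer_sizes = [10000, 64, 64, 46]  # input followed by the three Dense layers
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(counts, sum(counts))  # [640064, 4160, 2990] 647214
```

Almost all parameters sit in the first layer, because the input is 10,000-dimensional. Calling `model.summary()` on the model should show the same per-layer counts.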

For every input sample, the network produces a 46-dimensional output vector, where output[i] is the probability that the sample belongs to class i. The 46 scores sum to 1.
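The sum-to-one property comes from the softmax activation on the last layer. The following is a minimal NumPy sketch of softmax on made-up scores for four classes; it illustrates the formula and is not the Keras implementation.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.1, -1.0])  # hypothetical raw scores for 4 classes
probs = softmax(logits)
print(probs, probs.sum())
```

Subtracting the maximum before exponentiating does not change the result but avoids overflow for large scores.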

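The best loss function to use here is categorical_crossentropy. It measures the distance between two probability distributions, in this case the distribution output by the network and the true distribution of the labels. By minimizing this distance, the network is trained to output something as close as possible to the true labels.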
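For one-hot targets, this loss reduces to the negative log of the probability assigned to the true class. The sketch below computes it by hand on made-up predictions; the function is a hypothetical stand-in for the Keras loss, and the clipping constant `eps` is an assumption added to avoid log(0).

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -sum(y_true * log(y_pred)) over the samples."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

y_true = np.array([[0.0, 1.0, 0.0]])        # true class is 1
confident = np.array([[0.05, 0.90, 0.05]])  # most mass on the true class
unsure = np.array([[0.40, 0.30, 0.30]])     # most mass on a wrong class

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)
print(loss_confident, loss_unsure)
```

The loss is small when the network puts high probability on the correct class and grows as that probability shrinks.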

model.compile(optimizer='rmsprop',loss='categorical_crossentropy',metrics=['accuracy'])

Training, validation, and prediction

We set aside 1,000 of the training samples for validation.

xtr[:1000].shape

(1000, 10000)

xtr_val=xtr[:1000]
xtr_part=xtr[1000:]
ytr_val=ytr[:1000]
ytr_part=ytr[1000:]

list(map(lambda x: x.shape, [xtr_val, ytr_val, xtr_part, ytr_part]))

[(1000, 10000), (1000, 46), (7982, 10000), (7982, 46)]

hist=model.fit(xtr_part, ytr_part, epochs=20, batch_size=512, validation_data=(xtr_val, ytr_val))

Epoch 1/20
16/16 [==============================] - 1s 56ms/step - loss: 2.5102 - accuracy: 0.5036 - val_loss: 1.6760 - val_accuracy: 0.6640
…
Epoch 19/20
16/16 [==============================] - 1s 43ms/step - loss: 0.1158 - accuracy: 0.9560 - val_loss: 1.1234 - val_accuracy: 0.7950
Epoch 20/20
16/16 [==============================] - 1s 44ms/step - loss: 0.1155 - accuracy: 0.9567 - val_loss: 1.0559 - val_accuracy: 0.8090

hist.history.keys()

dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])

plt.figure(dpi=100)
epochs=range(1, 21)
plt.plot(epochs, hist.history['loss'], label='loss')
plt.plot(epochs, hist.history['val_loss'], label='val_loss', alpha=0.3)
plt.legend(loc='best', prop={'weight':'bold'})
plt.xlabel("Epoch", weight='bold') 
plt.ylabel("loss", weight='bold') 
plt.show()

plt.figure(dpi=100)
epochs=range(1, 21)
plt.plot(epochs, hist.history['accuracy'], label='accuracy')
plt.plot(epochs, hist.history['val_accuracy'], alpha=0.3, label='val_accuracy')
plt.legend(loc='best', prop={'weight':'bold'})
plt.xlabel("Epoch", weight='bold') 
plt.ylabel("accuracy", weight='bold') 
plt.show()
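Besides reading the curves by eye, the epoch with the lowest validation loss can be picked out numerically. The values below are hypothetical, not the ones from this run; with a real history object one would pass `hist.history['val_loss']` instead.

```python
# Hypothetical validation losses, one per epoch (not the values from this run).
val_loss = [1.68, 1.31, 1.15, 1.05, 0.98, 0.95, 0.93, 0.92, 0.91, 0.94, 0.97, 1.02]

# Epochs are 1-based, list indices are 0-based.
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1
print(best_epoch)  # 9
```

Validation loss is noisy from run to run, so the exact epoch is only a rough guide.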

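According to these results, the network begins to overfit after about 9 epochs. We therefore train a new model from scratch for 9 epochs (this run uses batch_size=64) and evaluate it on the test set.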

model=models.Sequential()
model.add(layers.Dense(64, activation='relu', input_shape=(10000,)))
model.add(layers.Dense(64, activation="relu"))
model.add(layers.Dense(46, activation='softmax'))
model.compile(optimizer='rmsprop',loss='categorical_crossentropy',metrics=['accuracy'])
model.fit(xtr_part, ytr_part, epochs=9, batch_size=64, validation_data=(xtr_val, ytr_val))
result=model.evaluate(xte, yte)

Epoch 1/9
125/125 [==============================] - 2s 14ms/step - loss: 1.6471 - accuracy: 0.6490 - val_loss: 1.1359 - val_accuracy: 0.7440
…
Epoch 9/9
125/125 [==============================] - 2s 13ms/step - loss: 0.1552 - accuracy: 0.9565 - val_loss: 1.2883 - val_accuracy: 0.7870
71/71 [==============================] - 0s 3ms/step - loss: 1.4367 - accuracy: 0.7725

result

[1.4366565942764282, 0.7724844217300415]

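The result above holds the loss and accuracy on the test set. Next, we compare the model's predictions on the training set with the true labels and compute the accuracy.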

latr

array([ 3,  4,  3, ..., 25,  3, 25])

retr=model.predict(xtr)

retr

array([[8.9551882e-05, 1.8371086e-03, 5.7139728e-06, ..., 3.4539548e-07,
        1.4424126e-08, 1.9418226e-06],
        …
        [1.2145936e-05, 1.1056991e-06, 1.9900976e-09, ..., 5.8124328e-09,
        1.1078960e-11, 2.6042404e-05]], dtype=float32)

retrMax=np.argmax(retr, axis=1)
retrMax

array([ 3,  4,  3, ..., 25,  3, 25])

np.sum(np.equal(latr, retrMax))/len(latr)

0.94333110665776

# Accuracy on the test set
rete=model.predict(xte)
reteMax=np.argmax(rete, axis=1)
np.sum(np.equal(reteMax, late)/len(late))

0.7724844167408726
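The accuracy computation used above can be checked on a tiny made-up example: take the most probable class per row with argmax, then measure the fraction of rows where it matches the integer label. The probabilities and labels below are hypothetical.

```python
import numpy as np

# Hypothetical predicted probabilities for 4 samples and 3 classes.
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.3, 0.3, 0.4],
    [0.5, 0.4, 0.1],
])
labels = np.array([0, 1, 2, 1])  # true integer labels

predicted = np.argmax(probs, axis=1)     # most probable class per row
accuracy = np.mean(predicted == labels)  # fraction of correct predictions
print(predicted, accuracy)  # [0 1 2 0] 0.75
```

Taking the mean of the boolean comparison is equivalent to dividing the number of matches by the number of samples.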
