Natural Language Processing (NLP) Day 27 (RNN: many to one)


2019.07.31

Source: https://www.edwith.org/boostcourse-dl-tensorflow/lecture/43752/

Keywords

RNN
Stacking
Dropout
Padding

RNN : Many-to-One

Today we work through the Many-to-One problem using the RNN covered yesterday. Many-to-One RNNs are most widely used for Sentiment Classification, where a sequence of words is classified into a sentiment.

Among these, RNNs that stack several hidden layers outperform a single-layer Simple RNN on many tasks.

A multi-layer RNN has the following property.

https://www.edwith.org/boostcourse-dl-tensorflow/lecture/43749/

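In a stacked RNN, the layer closest to the input layer tends to encode syntactic information better than semantic information, while the layer closest to the output layer tends to encode semantic information better than syntactic information. The lecture does not explain why; it is presented as a property observed from the experience of RNN practitioners.

How the loss is computed depends on the task. In the Many-to-One setting used here, only the output from the final time step is compared with the label.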

Many-to-One Stacked RNN using TensorFlow

1. Import the required packages and enable TensorFlow's eager mode.

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import Sequential, Model
from tensorflow.keras.preprocessing.sequence import pad_sequences
from pprint import pprint
%matplotlib inline

print(tf.__version__)
1.12.0

tf.enable_eager_execution()

2. Prepare the dataset. In this example the task is to classify whether a text is by Richard Feynman or Albert Einstein.

# example data
sentences = ['What I cannot create, I do not understand.',
             'Intellecuals solve problems, geniuses prevent them.',
             'A person who never made a mistake never tied anything new.',
             'The same equations have the same solutions.']
y_data = [1, 0, 0, 1] # 1 : Richard Feynman, 0 : Albert Einstein

# creating a token dictionary
char_set = ['<pad>'] + sorted(list(set(''.join(sentences))))
idx2char = {idx : char for idx, char in enumerate(char_set)}
char2idx = {char : idx for idx, char in enumerate(char_set)}

print(char_set)
print(idx2char)
print(char2idx)
['<pad>', ' ', ',', '.', 'A', 'I', 'T', 'W', 'a', 'b', 'c', 'd', 'e', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'y']
{0: '<pad>', 1: ' ', 2: ',', 3: '.', 4: 'A', 5: 'I', 6: 'T', 7: 'W', 8: 'a', 9: 'b', 10: 'c', 11: 'd', 12: 'e', 13: 'g', 14: 'h', 15: 'i', 16: 'k', 17: 'l', 18: 'm', 19: 'n', 20: 'o', 21: 'p', 22: 'q', 23: 'r', 24: 's', 25: 't', 26: 'u', 27: 'v', 28: 'w', 29: 'y'}
{'<pad>': 0, ' ': 1, ',': 2, '.': 3, 'A': 4, 'I': 5, 'T': 6, 'W': 7, 'a': 8, 'b': 9, 'c': 10, 'd': 11, 'e': 12, 'g': 13, 'h': 14, 'i': 15, 'k': 16, 'l': 17, 'm': 18, 'n': 19, 'o': 20, 'p': 21, 'q': 22, 'r': 23, 's': 24, 't': 25, 'u': 26, 'v': 27, 'w': 28, 'y': 29}

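As shown above, the sentences are tokenized at the character level, the characters are sorted, and each one is assigned an index. Python's sorted orders characters by code point, so the space and punctuation come first, then uppercase letters, then lowercase letters, and <pad> is placed at index 0.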
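To see why the space and punctuation end up with the smallest indices, here is a minimal sketch on a toy string. The string `toy` and the `toy_` names are made up for illustration and are not part of the lecture data; the construction itself mirrors the dictionary code above.

```python
# Toy example: Python sorts characters by code point,
# so ' ' < ',' < '.' < uppercase letters < lowercase letters.
toy = 'b A,a.'  # hypothetical input, not the lecture data

toy_char_set = ['<pad>'] + sorted(set(toy))
toy_char2idx = {char: idx for idx, char in enumerate(toy_char_set)}

print(toy_char_set)           # ['<pad>', ' ', ',', '.', 'A', 'a', 'b']
print(toy_char2idx['<pad>'])  # 0
```

Putting `<pad>` first is what reserves index 0 for padding later on.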

3. Convert each sentence into a sequence of integers by looking up the index of every character.

# converting sequence of tokens to sequence of indices
x_data = list(map(lambda sentence : [char2idx.get(char) for char in sentence], sentences))
x_data_len = list(map(lambda sentence : len(sentence), sentences))

print(x_data)
print('')
print(x_data_len)
print('')
print(y_data)
[[7, 14, 8, 25, 1, 5, 1, 10, 8, 19, 19, 20, 25, 1, 10, 23, 12, 8, 25, 12, 2, 1, 5, 1, 11, 20, 1, 19, 20, 25, 1, 26, 19, 11, 12, 23, 24, 25, 8, 19, 11, 3], [5, 19, 25, 12, 17, 17, 12, 10, 26, 8, 17, 24, 1, 24, 20, 17, 27, 12, 1, 21, 23, 20, 9, 17, 12, 18, 24, 2, 1, 13, 12, 19, 15, 26, 24, 12, 24, 1, 21, 23, 12, 27, 12, 19, 25, 1, 25, 14, 12, 18, 3], [4, 1, 21, 12, 23, 24, 20, 19, 1, 28, 14, 20, 1, 19, 12, 27, 12, 23, 1, 18, 8, 11, 12, 1, 8, 1, 18, 15, 24, 25, 8, 16, 12, 1, 19, 12, 27, 12, 23, 1, 25, 15, 12, 11, 1, 8, 19, 29, 25, 14, 15, 19, 13, 1, 19, 12, 28, 3], [6, 14, 12, 1, 24, 8, 18, 12, 1, 12, 22, 26, 8, 25, 15, 20, 19, 24, 1, 14, 8, 27, 12, 1, 25, 14, 12, 1, 24, 8, 18, 12, 1, 24, 20, 17, 26, 25, 15, 20, 19, 24, 3]]

[42, 51, 58, 43]

[1, 0, 0, 1]
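A quick way to sanity-check this conversion is to map the indices back to characters. The sketch below does a round trip on a tiny made-up vocabulary; `sample`, `encode`, and `decode` are hypothetical names for illustration, not part of the lecture code.

```python
# Round-trip check on a tiny vocabulary (hypothetical data, same scheme as above).
sample = ['ab a', 'ba']
vocab = ['<pad>'] + sorted(set(''.join(sample)))
c2i = {c: i for i, c in enumerate(vocab)}
i2c = {i: c for i, c in enumerate(vocab)}

def encode(sentence):
    """Map each character to its integer index."""
    return [c2i[c] for c in sentence]

def decode(indices):
    """Map indices back to characters, skipping the <pad> index 0."""
    return ''.join(i2c[i] for i in indices if i != 0)

encoded = [encode(s) for s in sample]
print(encoded)                        # [[2, 3, 1, 2], [3, 2]]
print([decode(e) for e in encoded])   # ['ab a', 'ba']
```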

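4. Pad the sequences so that every sentence vector has the same length. The max padding length is 55, and both padding and truncating use the post option: if a sentence is shorter, 0s are appended at the end, and if it is longer, the end is cut off. Here only the third sentence (58 characters) is longer than 55, so its last three characters are dropped.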

# padding the sequence of indices
max_sequence = 55
x_data = pad_sequences(sequences=x_data, maxlen=max_sequence,
                       padding='post', truncating='post')

print(x_data)
print('')
print(x_data_len)
print('')
print(y_data)
[[ 7 14  8 25  1  5  1 10  8 19 19 20 25  1 10 23 12  8 25 12  2  1  5  1
  11 20  1 19 20 25  1 26 19 11 12 23 24 25  8 19 11  3  0  0  0  0  0  0
   0  0  0  0  0  0  0]
 [ 5 19 25 12 17 17 12 10 26  8 17 24  1 24 20 17 27 12  1 21 23 20  9 17
  12 18 24  2  1 13 12 19 15 26 24 12 24  1 21 23 12 27 12 19 25  1 25 14
  12 18  3  0  0  0  0]
 [ 4  1 21 12 23 24 20 19  1 28 14 20  1 19 12 27 12 23  1 18  8 11 12  1
   8  1 18 15 24 25  8 16 12  1 19 12 27 12 23  1 25 15 12 11  1  8 19 29
  25 14 15 19 13  1 19]
 [ 6 14 12  1 24  8 18 12  1 12 22 26  8 25 15 20 19 24  1 14  8 27 12  1
  25 14 12  1 24  8 18 12  1 24 20 17 26 25 15 20 19 24  3  0  0  0  0  0
   0  0  0  0  0  0  0]]

[42, 51, 58, 43]

[1, 0, 0, 1]
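The effect of the two post options can be reproduced in a few lines of plain Python. This is only a sketch of the behavior described in step 4, not the actual implementation of pad_sequences; `pad_post` is a hypothetical helper name.

```python
def pad_post(seq, maxlen, value=0):
    """Hypothetical helper mimicking padding='post', truncating='post'."""
    truncated = seq[:maxlen]                                 # cut the tail if too long
    return truncated + [value] * (maxlen - len(truncated))   # append zeros if too short

print(pad_post([7, 14, 8], maxlen=5))              # [7, 14, 8, 0, 0]
print(pad_post([4, 1, 21, 12, 23, 24], maxlen=5))  # [4, 1, 21, 12, 23]
```

Padding at the end keeps the real characters at the start of each row, which is why the zeros in the printed array all appear on the right.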

5. Create the model with the Sequential API. First declare the number of classes and the hidden, input, and output dimensions, then stack the layers in this order: embedding layer → simple rnn layer → time distributed (dropout) → simple rnn layer → dropout → dense layer.

# creating stacked rnn for "many to one" classification with dropout
num_classes = 2
hidden_dims = [10, 10]

input_dim = len(char2idx)
output_dim = len(char2idx)
one_hot = np.eye(len(char2idx))

model = Sequential()
model.add(layers.Embedding(input_dim=input_dim, output_dim=output_dim,
                          trainable=False, mask_zero=True, input_length=max_sequence,
                          embeddings_initializer=keras.initializers.Constant(one_hot)))
model.add(layers.SimpleRNN(units=hidden_dims[0], return_sequences=True))
model.add(layers.TimeDistributed(layers.Dropout(rate = .2)))
model.add(layers.SimpleRNN(units=hidden_dims[1]))
model.add(layers.Dropout(rate = .2))
model.add(layers.Dense(units=num_classes))

model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding (Embedding)        (None, 55, 30)            900       
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, 55, 10)            410       
_________________________________________________________________
time_distributed (TimeDistri (None, 55, 10)            0         
_________________________________________________________________
simple_rnn_1 (SimpleRNN)     (None, 10)                210       
_________________________________________________________________
dropout_1 (Dropout)          (None, 10)                0         
_________________________________________________________________
dense (Dense)                (None, 2)                 22        
=================================================================
Total params: 1,542
Trainable params: 642
Non-trainable params: 900
_________________________________________________________________
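The Param # column can be checked by hand. The sketch below assumes the usual parameterization of each layer (an embedding matrix; an input kernel, a recurrent kernel, and a bias for SimpleRNN; a kernel and a bias for Dense). The function names are hypothetical and only serve the arithmetic.

```python
def embedding_params(vocab_size, embed_dim):
    # one row of size embed_dim per vocabulary entry
    return vocab_size * embed_dim

def simple_rnn_params(input_dim, units):
    # input-to-hidden weights + hidden-to-hidden weights + bias
    return input_dim * units + units * units + units

def dense_params(input_dim, units):
    # weights + bias
    return input_dim * units + units

emb = embedding_params(30, 30)     # 900, frozen because trainable=False
rnn1 = simple_rnn_params(30, 10)   # 410
rnn2 = simple_rnn_params(10, 10)   # 210
out = dense_params(10, 2)          # 22

print(emb, rnn1, rnn2, out)                # 900 410 210 22
print('total:', emb + rnn1 + rnn2 + out)   # total: 1542
print('trainable:', rnn1 + rnn2 + out)     # trainable: 642
```

The 900 non-trainable parameters are exactly the 30 x 30 one-hot embedding matrix, and the two Dropout layers add no parameters.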

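The Embedding layer above uses trainable=False because its weights are fixed one-hot vectors that should not be updated during training, and mask_zero=True so that the zero-padded positions are masked and skipped by the following layers.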
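What this embedding does can be illustrated without TensorFlow. The sketch below assumes a tiny made-up vocabulary of 4 tokens; it shows that looking up rows of an identity matrix yields one-hot vectors, and it builds the kind of boolean mask that mask_zero=True derives from the padded indices. It is an illustration of the idea, not Keras's internal code.

```python
vocab_size = 4  # hypothetical tiny vocabulary; index 0 is <pad>

# Identity matrix: row i is the one-hot vector for index i.
one_hot = [[1.0 if r == c else 0.0 for c in range(vocab_size)]
           for r in range(vocab_size)]

padded = [2, 3, 1, 0, 0]  # a sequence padded with two trailing zeros

embedded = [one_hot[idx] for idx in padded]  # embedding lookup = row selection
mask = [idx != 0 for idx in padded]          # True for real tokens, False for padding

print(embedded[0])  # [0.0, 0.0, 1.0, 0.0]
print(mask)         # [True, True, True, False, False]
```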

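The first Simple RNN layer sets return_sequences=True so that it returns its output for every time step, in the shape [batch size, max sequence length, hidden dimension] ((None, 55, 10) in the summary above), which is then fed to the next RNN layer as input.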
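The role of return_sequences in a stacked Many-to-One model can be sketched in plain Python. The code below is a simplified illustration under stated assumptions: a single example (no batch dimension), no masking, a tanh activation, random weights, and toy sizes. The names `simple_rnn` and `random_matrix` are hypothetical.

```python
import math
import random

def simple_rnn(inputs, w_x, w_h, b, return_sequences=False):
    """Plain-Python sketch of h_t = tanh(x_t W_x + h_{t-1} W_h + b)."""
    units = len(b)
    h = [0.0] * units  # initial hidden state
    states = []
    for x_t in inputs:
        # the comprehension reads the previous h before h is reassigned
        h = [math.tanh(sum(x_t[i] * w_x[i][j] for i in range(len(x_t)))
                       + sum(h[k] * w_h[k][j] for k in range(units))
                       + b[j])
             for j in range(units)]
        states.append(h)
    return states if return_sequences else states[-1]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
seq_len, input_dim, units = 5, 3, 4  # hypothetical toy sizes
x = random_matrix(seq_len, input_dim, rng)

# Layer 1 returns one hidden state per time step ...
h1 = simple_rnn(x, random_matrix(input_dim, units, rng),
                random_matrix(units, units, rng), [0.0] * units,
                return_sequences=True)
# ... which layer 2 consumes as its input sequence, keeping only the last state.
h2 = simple_rnn(h1, random_matrix(units, units, rng),
                random_matrix(units, units, rng), [0.0] * units)

print(len(h1), len(h1[0]))  # 5 4 -> (time steps, units)
print(len(h2))              # 4   -> (units,)
```

Without return_sequences=True the first layer would hand over a single vector, and the second layer would have no sequence to iterate over.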

Dropout wrapped in TimeDistributed is placed between the two RNN layers because the deeper (more stacked) a neural network gets, the more likely it is to overfit, so a dropout rate of 0.2 is applied.
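As a rough picture of what a dropout rate of 0.2 means, here is a minimal sketch of inverted dropout on one hidden vector. It assumes the common formulation in which surviving values are rescaled by 1 / (1 - rate) during training and nothing happens at inference; the function `dropout` and the vector `h` are hypothetical and stand in for one time step of the RNN output.

```python
import random

def dropout(values, rate, training, rng):
    """Inverted dropout: zero each value with probability `rate` during
    training and rescale survivors by 1 / (1 - rate); identity otherwise."""
    if not training:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]

rng = random.Random(0)
h = [1.0] * 10  # hypothetical hidden state for one time step

print(dropout(h, rate=0.2, training=True, rng=rng))   # mix of 0.0 and 1.25
print(dropout(h, rate=0.2, training=False, rng=rng))  # unchanged
```

Wrapping the layer in TimeDistributed simply applies this same operation to the hidden vector at every time step.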

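6. Train the model created above. First define the loss function. Dropout should only be applied during training, so the loss function takes a training flag and passes it to the model, which lets dropout be switched off at test time. Then declare the hyperparameters. Adam is used as the optimizer, and the training dataset is shuffled.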

# creating loss function
def loss_fn(model, x, y, training):
    return tf.losses.sparse_softmax_cross_entropy(labels=y, logits=model(x, training))

# creating an optimizer
lr = .01
epochs = 30
batch_size = 2
opt = tf.train.AdamOptimizer(learning_rate=lr)

# generating data pipeline
tr_dataset = tf.data.Dataset.from_tensor_slices((x_data, y_data))
tr_dataset = tr_dataset.shuffle(buffer_size=4)
tr_dataset = tr_dataset.batch(batch_size=batch_size)

print(tr_dataset)
<BatchDataset shapes: ((?, 55), (?,)), types: (tf.int32, tf.int32)>

# training
tr_loss_hist = []

for epoch in range(epochs):
    avg_tr_loss = 0
    tr_step = 0
    
    for x_mb, y_mb in tr_dataset:
        with tf.GradientTape() as tape:
            tr_loss = loss_fn(model, x=x_mb, y=y_mb, training=True)
        grads = tape.gradient(target=tr_loss, sources=model.variables)
        opt.apply_gradients(grads_and_vars=zip(grads, model.variables))
        avg_tr_loss += tr_loss
        tr_step += 1
    else:
        avg_tr_loss /= tr_step
        tr_loss_hist.append(avg_tr_loss)
    
    if (epoch + 1) % 5 == 0:
        print('epoch : {:3}, tr_loss : {:.3f}'.format(epoch + 1, avg_tr_loss))
epoch :   5, tr_loss : 0.136
epoch :  10, tr_loss : 0.031
epoch :  15, tr_loss : 0.012
epoch :  20, tr_loss : 0.017
epoch :  25, tr_loss : 0.003
epoch :  30, tr_loss : 0.003

The epoch number and the loss are printed every 5 epochs.
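To make the tr_loss numbers easier to read, here is a plain-Python sketch of what a sparse softmax cross-entropy computes: the negative log-probability of the correct class, taken from raw logits and integer labels. Averaging over the batch is my assumption about the default reduction, and the logits below are made up; this is not TensorFlow's implementation.

```python
import math

def sparse_softmax_cross_entropy(labels, logits):
    """Mean of -log softmax(logits)[label] over the batch (plain-Python sketch)."""
    losses = []
    for label, row in zip(labels, logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        losses.append(log_sum_exp - row[label])  # equals -log p(label)
    return sum(losses) / len(losses)

# Hypothetical logits for a batch of two examples and two classes.
undecided = sparse_softmax_cross_entropy([1, 0], [[0.0, 0.0], [0.0, 0.0]])
confident = sparse_softmax_cross_entropy([1, 0], [[-3.0, 3.0], [3.0, -3.0]])

print(undecided)  # log(2), about 0.693: the model cannot tell the classes apart
print(confident)  # close to 0: correct and confident predictions
```

Seen this way, a loss of 0.003 at epoch 30 means the model assigns almost all probability to the correct author for the training sentences.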

7. Check the training accuracy and how the loss changed.

yhat = model.predict(x_data)
yhat = np.argmax(yhat, axis=-1)

print('accuracy : {:2%}'.format(np.mean(yhat == y_data)))
accuracy : 100.000000%

plt.plot(tr_loss_hist)
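The accuracy step boils down to taking the argmax of each row of logits and comparing it with the label. A minimal sketch with made-up logits follows; `argmax`, `logits`, and `y_true` are hypothetical names and values, not the model's actual output.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=lambda i: row[i])

logits = [[-2.0, 1.5], [0.7, -0.3], [2.1, 0.4], [-1.0, 3.0]]  # hypothetical model outputs
y_true = [1, 0, 0, 1]

y_pred = [argmax(row) for row in logits]
accuracy = sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)

print(y_pred)                                # [1, 0, 0, 1]
print('accuracy : {:.2%}'.format(accuracy))  # accuracy : 100.00%
```

Note that `{:2%}` in the code above sets a minimum width of 2 rather than a precision, which is why six decimals are printed; `{:.2%}` limits the output to two decimals.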


Written by 정민수
