Natural Language Processing (NLP) Day 28 (RNN: many to many)

정민수
16 min read · Aug 1, 2019


19.08.01

Source: https://www.edwith.org/boostcourse-dl-tensorflow/lecture/43752/

Key keywords

  • RNN
  • Part of speech
  • Masking
  • Embedding

RNN : Many to many

Following yesterday's RNN: Many to one, this time we look at the many-to-many approach.

Many-to-many is used for tasks such as Named Entity Recognition and Morphological Analysis.

(Figure source: http://cs231n.stanford.edu/slides/2018/cs231n_2018_lecture10.pdf)

The difference between many-to-many and many-to-one: in the many-to-one case, an output is not produced at every time step; it is produced only once the last token has been fed in.

In contrast, many-to-many produces an output at every time step, so a loss is computed for each step and these losses are averaged over the mini-batch. This average is called the sequence loss. The sequence loss is backpropagated to compute gradients, and the network is trained on them.

(Figure source: http://cs231n.stanford.edu/slides/2018/cs231n_2018_lecture10.pdf)
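
To see the difference in code, here is a minimal tf.keras sketch (using the same TF 1.x eager setup as the example below, not part of the lecture): with return_sequences=False the RNN returns only the last hidden state (many to one), while return_sequences=True returns a hidden state for every time step (many to many).

import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

tf.enable_eager_execution()

# a dummy batch: 1 sequence, 5 time steps, 4 features per step
x = tf.constant(np.random.rand(1, 5, 4), dtype=tf.float32)

# many to one: only the last hidden state -> shape (1, 3)
print(layers.SimpleRNN(units=3, return_sequences=False)(x).shape)

# many to many: one hidden state per time step -> shape (1, 5, 3)
print(layers.SimpleRNN(units=3, return_sequences=True)(x).shape)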

Many-to-many RNN using TensorFlow

Now let's study a many-to-many RNN using TensorFlow.

This example is part-of-speech (POS) tagging, a representative many-to-many task.

  1. Importing libraries
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import Sequential, Model
from tensorflow.keras.preprocessing.sequence import pad_sequences
from pprint import pprint
%matplotlib inline
print(tf.__version__)
1.12.0
tf.enable_eager_execution()

2. Preparing dataset

sentences = [['I', 'feel', 'hungry'],
['tensorflow', 'is', 'very', 'difficult'],
['tensorflow', 'is', 'a', 'framework', 'for', 'deep', 'learning'],
['tensorflow', 'is', 'very', 'fast', 'changing']]
pos = [['pronoun', 'verb', 'adjective'],
['noun', 'verb', 'adverb', 'adjective'],
['noun', 'verb', 'determiner', 'noun' ,'preposition', 'adjective', 'noun'],
['noun', 'verb', 'adverb', 'adjective', 'verb']]

3. Preprocessing dataset: build a word-to-index dictionary, and likewise a pos (part-of-speech)-to-index dictionary for the y-data.

# creating a token dictionary for word
word_list = sum(sentences, [])
word_list = sorted(set(word_list))
word_list = ['<pad>'] + word_list
word2idx = {word : idx for idx, word in enumerate(word_list)}
idx2word = {idx : word for idx, word in enumerate(word_list)}
print(word2idx, end='\n\n')
print(idx2word, end='\n\n')
print(len(idx2word))
{'<pad>': 0, 'I': 1, 'a': 2, 'changing': 3, 'deep': 4, 'difficult': 5, 'fast': 6, 'feel': 7, 'for': 8, 'framework': 9, 'hungry': 10, 'is': 11, 'learning': 12, 'tensorflow': 13, 'very': 14}

{0: '<pad>', 1: 'I', 2: 'a', 3: 'changing', 4: 'deep', 5: 'difficult', 6: 'fast', 7: 'feel', 8: 'for', 9: 'framework', 10: 'hungry', 11: 'is', 12: 'learning', 13: 'tensorflow', 14: 'very'}

15
# creating a token dictionary for part of speech
pos_list = sum(pos, [])
pos_list = sorted(set(pos_list))
pos_list = ['<pad>'] + pos_list
pos2idx = {pos : idx for idx, pos in enumerate(pos_list)}
idx2pos = {idx : pos for idx, pos in enumerate(pos_list)}
print(pos2idx, end='\n\n')
print(idx2pos, end='\n\n')
print(len(pos2idx))
{'<pad>': 0, 'adjective': 1, 'adverb': 2, 'determiner': 3, 'noun': 4, 'preposition': 5, 'pronoun': 6, 'verb': 7}

{0: '<pad>', 1: 'adjective', 2: 'adverb', 3: 'determiner', 4: 'noun', 5: 'preposition', 6: 'pronoun', 7: 'verb'}

8

4. Convert the preprocessed data above into index sequences and pad them (the embedding itself is applied inside the model). The <pad> token is where the concept of masking comes in: so that padded tokens do not contribute to the loss during training, a mask is created that removes this unnecessary computation.

For example, suppose the batch size is 2 and one of the sequences in the mini-batch is [1, 2, 0], where the trailing 0 is padding. If the per-time-step losses for the model's outputs [[y_1, y_2, y_3]] are [[L₁, L₂, L₃]], masking turns them into [[L₁, L₂, 0]], so the sequence loss for that sequence becomes (L₁ + L₂) / 2, where 2 is the number of valid (non-padded) time steps; these per-sequence losses are then averaged over the mini-batch.
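
A minimal NumPy sketch of this calculation with made-up loss values:

import numpy as np

step_losses = np.array([[0.7, 0.3, 0.5]])  # [[L1, L2, L3]] for the padded sequence [1, 2, 0]
mask        = np.array([[1., 1., 0.]])     # the <pad> position is zeroed out
valid_len   = np.array([2.])               # number of non-<pad> tokens

masked_losses = step_losses * mask                 # [[L1, L2, 0.]]
seq_loss = masked_losses.sum(axis=-1) / valid_len  # (L1 + L2) / 2 per sequence
print(seq_loss.mean())                             # averaged over the mini-batch -> 0.5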

# converting sequence of tokens to sequence of indices
max_sequence = 10
x_data = list(map(lambda sentence : [word2idx.get(token) for token in sentence], sentences))
y_data = list(map(lambda sentence : [pos2idx.get(token) for token in sentence], pos))
# padding the sequence of indices
x_data = pad_sequences(sequences=x_data, maxlen=max_sequence, padding='post')
x_data_mask = ((x_data != 0) * 1).astype(np.float32)
x_data_len = list(map(lambda sentence : len(sentence), sentences))
y_data = pad_sequences(sequences=y_data, maxlen=max_sequence, padding='post')
# checking data
print(x_data, end='\n\n')
print(x_data_len, end='\n\n')
print(x_data_mask, end='\n\n')
print(y_data)
[[ 1 7 10 0 0 0 0 0 0 0]
[13 11 14 5 0 0 0 0 0 0]
[13 11 2 9 8 4 12 0 0 0]
[13 11 14 6 3 0 0 0 0 0]]

[3, 4, 7, 5]

[[1. 1. 1. 0. 0. 0. 0. 0. 0. 0.]
[1. 1. 1. 1. 0. 0. 0. 0. 0. 0.]
[1. 1. 1. 1. 1. 1. 1. 0. 0. 0.]
[1. 1. 1. 1. 1. 0. 0. 0. 0. 0.]]

[[6 7 1 0 0 0 0 0 0 0]
[4 7 2 1 0 0 0 0 0 0]
[4 7 3 4 5 1 4 0 0 0]
[4 7 2 1 7 0 0 0 0 0]]
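
As a quick check, the padded index sequences can be decoded back into tokens with the dictionaries built above:

# decoding the first padded sequence back into tokens
print([idx2word[int(idx)] for idx in x_data[0]])
# ['I', 'feel', 'hungry', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']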

5. Create the model using Keras's Sequential API.

# creating rnn for "many to many" sequence tagging
num_classes = len(pos2idx)
hidden_dim = 10
input_dim = len(word2idx)
output_dim = len(word2idx)
one_hot = np.eye(len(word2idx))
model = Sequential()
model.add(layers.Embedding(input_dim=input_dim, output_dim=output_dim, mask_zero=True,
trainable=False, input_length=max_sequence,
embeddings_initializer=keras.initializers.Constant(one_hot)))
model.add(layers.SimpleRNN(units=hidden_dim, return_sequences=True))
model.add(layers.TimeDistributed(layers.Dense(units=num_classes)))
model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
embedding (Embedding) (None, 10, 15) 225
_________________________________________________________________
simple_rnn (SimpleRNN) (None, 10, 10) 260
_________________________________________________________________
time_distributed (TimeDistri (None, 10, 8) 88
=================================================================
Total params: 573
Trainable params: 348
Non-trainable params: 225
_________________________________________________________________
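
The parameter counts in the summary can be verified by hand; a quick sanity check using the variables defined above:

# Embedding: one-hot table of input_dim * output_dim = 15 * 15 = 225 (frozen, hence non-trainable)
print(input_dim * output_dim)
# SimpleRNN: hidden_dim * (output_dim + hidden_dim) + hidden_dim = 10 * (15 + 10) + 10 = 260
print(hidden_dim * (output_dim + hidden_dim) + hidden_dim)
# TimeDistributed Dense: num_classes * hidden_dim + num_classes = 8 * 10 + 8 = 88
print(num_classes * hidden_dim + num_classes)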

6. Defining the loss function: masking is applied by multiplying the per-time-step losses by the mask (the trailing * masking in sequence_loss).

# creating loss function
def loss_fn(model, x, y, x_len, max_sequence):
    masking = tf.sequence_mask(x_len, maxlen=max_sequence, dtype=tf.float32)
    valid_time_step = tf.cast(x_len, dtype=tf.float32)
    sequence_loss = tf.losses.sparse_softmax_cross_entropy(labels=y, logits=model(x),
                                                           reduction='none') * masking
    sequence_loss = tf.reduce_sum(sequence_loss, axis=-1) / valid_time_step
    sequence_loss = tf.reduce_mean(sequence_loss)

    return sequence_loss
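
As a sanity check, the untrained model's loss can be evaluated on the whole dataset; with small initial logits it should come out near ln(8) ≈ 2.08, the cross-entropy of a uniform guess over the 8 POS classes:

# loss of the untrained model over all four sequences
print(loss_fn(model, x=x_data, y=y_data, x_len=x_data_len, max_sequence=max_sequence))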

7. Setting the hyperparameters and optimizer

# hyperparameters
lr = 0.1
epochs = 30
batch_size = 2
opt = tf.train.AdamOptimizer(learning_rate=lr)

8. Setting up the data pipeline

# generating data pipeline
tr_dataset = tf.data.Dataset.from_tensor_slices((x_data, y_data, x_data_len))
tr_dataset = tr_dataset.shuffle(buffer_size = 4)
tr_dataset = tr_dataset.batch(batch_size = 2)
print(tr_dataset)
<BatchDataset shapes: ((?, 10), (?, 10), (?,)), types: (tf.int32, tf.int32, tf.int32)>
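
Before training, one mini-batch can be pulled out to check the shapes (a quick check; with eager execution the dataset is directly iterable):

# peeking at a single mini-batch of size 2
for x_mb, y_mb, x_mb_len in tr_dataset.take(1):
    print(x_mb.shape, y_mb.shape, x_mb_len.shape)  # (2, 10) (2, 10) (2,)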

9. Training the model

# training
tr_loss_hist = []
for epoch in range(epochs):
    avg_tr_loss = 0
    tr_step = 0

    for x_mb, y_mb, x_mb_len in tr_dataset:
        with tf.GradientTape() as tape:
            tr_loss = loss_fn(model, x=x_mb, y=y_mb, x_len=x_mb_len, max_sequence=max_sequence)
        grads = tape.gradient(target=tr_loss, sources=model.variables)
        opt.apply_gradients(grads_and_vars=zip(grads, model.variables))
        avg_tr_loss += tr_loss
        tr_step += 1
    else:
        avg_tr_loss /= tr_step
        tr_loss_hist.append(avg_tr_loss)

    if (epoch + 1) % 5 == 0:
        print('epoch : {:3}, tr_loss : {:.3f}'.format(epoch + 1, avg_tr_loss))
epoch : 5, tr_loss : 0.273
epoch : 10, tr_loss : 0.027
epoch : 15, tr_loss : 0.006
epoch : 20, tr_loss : 0.002
epoch : 25, tr_loss : 0.001
epoch : 30, tr_loss : 0.001

10. Checking Performance

yhat = model.predict(x_data)
yhat = np.argmax(yhat, axis=-1) * x_data_mask
pprint(list(map(lambda row : [idx2pos.get(elm) for elm in row], yhat.astype(np.int32).tolist())), width=120)
pprint(pos)
[['pronoun', 'verb', 'adjective', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>'],
['noun', 'verb', 'adverb', 'adjective', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>'],
['noun', 'verb', 'determiner', 'noun', 'preposition', 'adjective', 'noun', '<pad>', '<pad>', '<pad>'],
['noun', 'verb', 'adverb', 'adjective', 'verb', '<pad>', '<pad>', '<pad>', '<pad>', '<pad>']]
[['pronoun', 'verb', 'adjective'],
['noun', 'verb', 'adverb', 'adjective'],
['noun', 'verb', 'determiner', 'noun', 'preposition', 'adjective', 'noun'],
['noun', 'verb', 'adverb', 'adjective', 'verb']]
plt.plot(tr_loss_hist)
plt.show()
