Attention (4)

1. Image Captioning with Seq2Seq + Attention

■ In Sequence-to-Sequence (seq2seq) (4), image captioning was implemented with a plain Seq2Seq architecture. This example adds an attention mechanism to that implementation.

■ The figure below illustrates how a sentence describing an image (a caption) is generated with an attention mechanism.

[Source] https://www.tensorflow.org/text/tutorials/image_captioning

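■ In the figure above, treat the part that receives the image as the Encoder and the part that receives the caption as the Decoder.

■ The image is fed to the Encoder and passes through a CNN, which converts it into a feature map (or a set of feature vectors).

■ In a Seq2Seq + Attention model for a task such as machine translation, the Encoder processes the input sentence and produces one hidden state per time step. Each time the Decoder generates a word of the translation, the attention mechanism looks over the Encoder hidden states of all time steps, assigns larger weights to the parts most relevant to the word being generated, and pulls information from them.

■ In image captioning, the Encoder input is an image and the Encoder output is a set of image feature vectors. Combining those feature vectors with the Decoder hidden state through attention is essentially the same as attending over the Encoder hidden states of all time steps in translation.

■ In other words, the Encoder only needs to return the image feature map (feature vectors). With Bahdanau attention, for example, the Query is still the Decoder hidden state at time step \( t-1 \), and the Encoder's image feature vectors serve as both Key and Value.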
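The Query/Key/Value roles described above can be written out as three equations. This is only a notational sketch: \( h_i \) (the \( i \)-th image feature vector, a row of \( H \)), \( L \) (the number of feature vectors), \( e_{t,i} \), \( \alpha_{t,i} \) and \( c_t \) are symbols assumed here for illustration, while \( W_a, W_b, W_c \) and \( s_{t-1} \) follow the score formula used in section 2.2.1.

```latex
\begin{aligned}
e_{t,i} &= W_a^{T} \tanh\left( W_b \cdot s_{t-1} + W_c \cdot h_i \right), \quad i = 1, \dots, L \\
\alpha_{t,i} &= \frac{\exp\left( e_{t,i} \right)}{\sum_{j=1}^{L} \exp\left( e_{t,j} \right)} \\
c_t &= \sum_{i=1}^{L} \alpha_{t,i} \, h_i
\end{aligned}
```

The first line scores each image region against the Query, the second turns the scores into weights that sum to 1, and the third is the weighted sum of the Values, i.e. the context vector.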

2. Implementing Image Captioning with Seq2Seq + Attention

■ Sequence-to-Sequence (seq2seq) (4) implemented image captioning with a plain Seq2Seq architecture. This example adds an attention mechanism to it. Data preprocessing and data loading are unchanged. Note that batch_first = True.

2.1 The Encoder of Seq2Seq + Attention

■ The EncoderCNN class below is a CNN encoder that extracts image features using a pre-trained ResNet50 model.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models
import torchvision.transforms as T

transforms = T.Compose([
    T.Resize(226),
    T.RandomCrop(224),
    T.ToTensor(),
    # ImageNet channel-wise mean and std
    T.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))
])

class EncoderCNN(nn.Module):
    def __init__(self):
        super(EncoderCNN, self).__init__()

        # Recent torchvision releases deprecate pretrained=True in favor of the weights= argument
        resnet = models.resnet50(pretrained=True)
        for param in resnet.parameters():
            param.requires_grad_(False) # freeze

        modules = list(resnet.children())[:-2] # drop the final avgpool and fc layers of the original resnet
        self.resnet = nn.Sequential(*modules)

    def forward(self, images):
        features = self.resnet(images) # features.shape: (batch_size, 2048, 7, 7)
        features = features.permute(0, 2, 3, 1) # features.shape: (batch_size, 7, 7, 2048)
        features = features.view(features.size(0), -1, features.size(-1)) # features.shape: (batch_size, 49, 2048)
        return features

■ EncoderCNN loads a pre-trained resnet50 model and freezes its parameters (weights) so that they are not trained.

■ That is, the Encoder's weights are not updated during training; its already-learned feature extraction ability is used as is.

■ Next, the last two layers of the model, the average pooling layer (avgpool) and the fully connected layer (fc), which were originally used for image classification, are removed. This yields the convolutional feature map that sits just before classification.

■ The remaining layers are bundled together and stored as self.resnet.

■ In the Encoder's forward pass, the input images are passed through the modified ResNet (self.resnet) to obtain a feature map.

- images of shape (batch size, channels, height, width) go through self.resnet() and become a feature map (features) of shape (batch size, channels, height, width) = (batch_size, 2048, 7, 7).

- The flickr8k images used in this example are color images. So features = self.resnet(images) takes a 4D tensor images of size (batch_size, 3, 224, 224) and, through ResNet50's stack of convolutional layers (not a single nn.Conv2d()), turns the 3 input channels into 2048 output channels while shrinking the spatial size from 224 x 224 to 7 x 7.

■ For the attention mechanism, the feature map is then reshaped with permute and view.

- features = features.permute(0, 2, 3, 1) reorders the dimensions of the feature map, so features has shape (batch_size, 7, 7, 2048).

- This moves the spatial dimensions (7 x 7) in front of the channel dimension (2048).

- features = features.view(features.size(0), -1, features.size(-1)) flattens the spatial dimensions (7 x 7) into one.

- In view(), the first dimension is features.size(0) = batch_size, the last (third) dimension is features.size(-1) = 2048, and the second dimension is given as -1, so it is inferred as \( 7 \times 7 = 49 \).

■ The features returned by EncoderCNN therefore have shape (batch_size, 49, 2048).

■ Each image in the batch is represented by 49 feature vectors of dimension 2048, one per location of the 7 x 7 grid. These 49 feature vectors per image are what the attention mechanism attends over.
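The permute and view steps can be checked without downloading ResNet50. The sketch below uses a random tensor as a stand-in for the output of self.resnet(images); the names feature_map, r and c are illustrative only.

```python
import torch

batch_size, channels, height, width = 2, 2048, 7, 7
# Random stand-in for self.resnet(images): (batch_size, 2048, 7, 7)
feature_map = torch.randn(batch_size, channels, height, width)

features = feature_map.permute(0, 2, 3, 1)                          # (batch_size, 7, 7, 2048)
features = features.view(features.size(0), -1, features.size(-1))   # (batch_size, 49, 2048)

# Grid cell (row r, column c) ends up at flattened index r * 7 + c.
r, c = 3, 5
same = torch.equal(features[:, r * width + c], feature_map[:, :, r, c])

print(features.shape)
print(same)
```

Each of the 49 rows is therefore the 2048-channel vector of one grid cell, which is what lets attention weight image regions individually.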

2.2 Bahdanau Attention

■ This example uses Bahdanau attention.

class Attention(nn.Module):
    def __init__(self, encoder_dim, decoder_dim, attention_dim):
        super(Attention, self).__init__()

        self.Wb = nn.Linear(decoder_dim, attention_dim)
        self.Wc = nn.Linear(encoder_dim, attention_dim)
        self.Wa = nn.Linear(attention_dim, 1)

    def forward(self, features, hidden_state):
        # features.shape: (batch_size, 49, encoder_dim=2048)
        
        Wc_H = self.Wc(features) # (batch_size, 49, attention_dim)
        Wb_s = self.Wb(hidden_state) # (batch_size, attention_dim)

        # broadcast add: Wb_s becomes (batch_size, 1, attention_dim) and is added to each of the 49 rows of Wc_H
        combined_states = torch.tanh(Wb_s.unsqueeze(1) + Wc_H)
        # combined_states.shape: (batch_size, 49, attention_dim)

        attention_scores = self.Wa(combined_states)
        # attention_scores.shape: (batch_size, 49, 1)
        attention_scores = attention_scores.squeeze(2)
        # attention_scores.shape: (batch_size, 49)

        alpha = F.softmax(attention_scores, dim=1)
        # alpha.shape: (batch_size, 49)

        # expand alpha to (batch_size, 49, 1) so it can be multiplied element-wise with the values (encoder features)
        attention_weights = features * alpha.unsqueeze(2)
        # attention_weights.shape: (batch_size, 49, encoder_dim)

        attention_value = attention_weights.sum(dim=1)
        # attention_value.shape: (batch_size, encoder_dim)

        return attention_value

2.2.1 Computing the Attention Score

■ The attention score of Bahdanau attention is computed as \( \text{Attention score} \left( s_{t-1}, H \right) = W^T_a \tanh \left( W_b \cdot s_{t-1} + W_c \cdot H \right) \).

■ Here \( W_a, W_b, W_c \) are weight parameters, which can be implemented with PyTorch's nn.Linear() layers as follows.

self.Wb = nn.Linear(decoder_dim, attention_dim)

self.Wc = nn.Linear(encoder_dim, attention_dim)

self.Wa = nn.Linear(attention_dim, 1)

■ The computation of \( \text{Attention score} \left( s_{t-1}, H \right) = W^T_a \tanh \left( W_b \cdot s_{t-1} + W_c \cdot H \right) \) proceeds in the following order.

- ① \( W_b \cdot s_{t-1} + W_c \cdot H \)

- \( W_b \) and \( W_c \) are weight parameters to be learned (updated). They can be created with nn.Linear().

- \( s_{t-1} \) is the Decoder hidden state and \( H \) is the feature map (features) returned by the Encoder.

- Therefore, to compute \( W_b \cdot s_{t-1} + W_c \cdot H \),

- the Decoder hidden state of dimension decoder_dim is projected to attention_dim by \( W_b \), an nn.Linear(), and

- the Encoder feature map (feature vectors) of dimension encoder_dim is projected to attention_dim by \( W_c \), also an nn.Linear(). Only then can \( W_b \cdot s_{t-1} \) and \( W_c \cdot H \) be added.

- The corresponding code is as follows.

# features.shape: (batch_size, 49, encoder_dim=2048)

Wc_H = self.Wc(features) # (batch_size, 49, attention_dim)

Wb_s = self.Wb(hidden_state) # (batch_size, attention_dim)

- The \( + \) between \( W_b \cdot s_{t-1} \) and \( W_c \cdot H \) is a broadcast (column-wise) addition: the single vector \( W_b \cdot s_{t-1} \) is added to every row of \( W_c \cdot H \).

- Ignoring the batch dimension, \( W_c \cdot H \) has shape (49, attention_dim) and \( W_b \cdot s_{t-1} \) has shape (attention_dim).

- To add the attention_dim-dimensional vector \( W_b \cdot s_{t-1} \) to all 49 attention_dim-dimensional image feature vectors in \( W_c \cdot H \),

- unsqueeze() is used to expand the dimensions of \( W_b \cdot s_{t-1} \), and then the broadcast addition is performed, as follows.

Wb_s.unsqueeze(1) + Wc_H

import torch

a = torch.zeros(2, 3, 5)
b = torch.ones(2, 5)

print(a); print('--' * 30); print(b)
```#Result#```
tensor([[[0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0.]]])
------------------------------------------------------------
tensor([[1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.]])
````````````

a.shape, b.shape
```#Result#```
(torch.Size([2, 3, 5]), torch.Size([2, 5]))
````````````

b = b.unsqueeze(1)
b
```#Result#```
tensor([[[1., 1., 1., 1., 1.]],

        [[1., 1., 1., 1., 1.]]])
````````````

b.shape
```#Result#```
torch.Size([2, 1, 5])
````````````

c = a + b
c.shape
```#Result#```
torch.Size([2, 3, 5])
````````````

c
```#Result#```
tensor([[[1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1.]],

        [[1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1.],
         [1., 1., 1., 1., 1.]]])
````````````
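To make explicit what the broadcast in the example above does, the sketch below compares it with a version that first copies the vector along the middle dimension, and shows that skipping unsqueeze fails for these shapes. The variable names broadcast_sum and explicit_sum are illustrative.

```python
import torch

a = torch.zeros(2, 3, 5)   # stands in for Wc_H: (batch_size, 49, attention_dim)
b = torch.ones(2, 5)       # stands in for Wb_s: (batch_size, attention_dim)

# Implicit broadcasting: (2, 3, 5) + (2, 1, 5)
broadcast_sum = a + b.unsqueeze(1)

# Explicit version: repeat b along the new middle dimension, then add.
explicit_sum = a + b.unsqueeze(1).expand(-1, a.size(1), -1)

print(torch.equal(broadcast_sum, explicit_sum))

# Without unsqueeze, (2, 3, 5) and (2, 5) are not broadcast-compatible.
try:
    a + b
    failed = False
except RuntimeError:
    failed = True
print(failed)
```

The unsqueeze(1) also guards against a silent error: with a batch of exactly 49 images, adding (49, 49, attention_dim) and (49, attention_dim) directly would run, but it would pair each hidden state with a region index instead of with its own image.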

- ② \( \tanh \left( W_b \cdot s_{t-1} + W_c \cdot H \right) \)

- Next, the sum from step ① is passed to the \( \tanh \) function.

combined_states = torch.tanh(Wb_s.unsqueeze(1) + Wc_H)
# combined_states.shape: (batch_size, 49, attention_dim)

- ③ \( W^T_a \cdot \tanh \left( W_b \cdot s_{t-1} + W_c \cdot H \right) \)

- Finally, the weight parameter \( W^T_a \) is multiplied with \( \tanh \left( W_b \cdot s_{t-1} + W_c \cdot H \right) \) to obtain the attention score.

attention_scores = self.Wa(combined_states)
# attention_scores.shape: (batch_size, 49, 1)

■ This completes the computation of the similarity (attention score) between the Query and the Keys.
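Steps ① to ③ can be run end to end on random data. The sketch below is a minimal check, assuming decoder_dim = 512 and attention_dim = 256 (values not given in this post) and using random tensors in place of real features and hidden states. It also confirms that the batched score for one region equals the score computed for that region alone.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, num_regions = 2, 49
encoder_dim, decoder_dim, attention_dim = 2048, 512, 256  # decoder_dim and attention_dim are assumed values

Wb = nn.Linear(decoder_dim, attention_dim)
Wc = nn.Linear(encoder_dim, attention_dim)
Wa = nn.Linear(attention_dim, 1)

features = torch.randn(batch_size, num_regions, encoder_dim)  # H (Keys)
hidden_state = torch.randn(batch_size, decoder_dim)           # s_{t-1} (Query)

# Steps 1-3 for all 49 regions at once: (batch_size, 49, 1)
scores = Wa(torch.tanh(Wb(hidden_state).unsqueeze(1) + Wc(features)))

# The same formula applied to region 7 only: (batch_size, 1)
single = Wa(torch.tanh(Wb(hidden_state) + Wc(features[:, 7])))

print(scores.shape)
print(torch.allclose(scores[:, 7], single, atol=1e-4))
```

Note that nn.Linear adds a bias term by default, which the score formula does not write out.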

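2.2.2 Computing the Attention Distribution (Attention Weights)

■ Next, the attention scores are passed through Softmax to obtain the attention distribution (attention weights).

attention_scores = attention_scores.squeeze(2)
# attention_scores.shape: (batch_size, 49)

- squeeze(2) removes the unneeded last dimension.

alpha = F.softmax(attention_scores, dim=1) 
# alpha.shape: (batch_size, 49)

- Softmax over dim=1 normalizes across the 49 image regions, so the weights of each image sum to 1.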
2.2.3 Computing the Attention Value (= context vector)

■ The attention value, that is, the context vector, is the weighted sum of the Values using the attention distribution (attention weights) alpha computed in 2.2.2. The Value here is the \( H \) in the attention score formula of 2.2.1, i.e. the feature map (features) returned by the EncoderCNN class.

# expand alpha to (batch_size, 49, 1) so it can be multiplied element-wise with the values (encoder features)
attention_weights = features * alpha.unsqueeze(2)
# (batch_size, 49, encoder_dim) * (batch_size, 49, 1) = (batch_size, 49, encoder_dim)

attention_value = attention_weights.sum(dim=1)
# attention_value.shape: (batch_size, encoder_dim)

■ Finally, the Attention class returns the attention value (context vector) of shape (batch_size, encoder_dim).
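The multiply-then-sum above is a weighted sum, so it can also be written as a single batched matrix product. The sketch below compares the two forms on random stand-in data; torch.bmm is an alternative shown for illustration, not something the Attention class uses.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, num_regions, encoder_dim = 2, 49, 2048
features = torch.randn(batch_size, num_regions, encoder_dim)     # Values (H)
alpha = F.softmax(torch.randn(batch_size, num_regions), dim=1)   # stand-in attention weights

# Multiply-then-sum, as in the Attention class: (batch_size, encoder_dim)
context_sum = (features * alpha.unsqueeze(2)).sum(dim=1)

# Same weighted sum as one batched matmul: (B, 1, 49) @ (B, 49, D) -> (B, 1, D)
context_bmm = torch.bmm(alpha.unsqueeze(1), features).squeeze(1)

print(context_sum.shape)
print(torch.allclose(context_sum, context_bmm, atol=1e-5))
```

Both forms give one 2048-dimensional context vector per image; the batched product just avoids materializing the intermediate (batch_size, 49, encoder_dim) tensor.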

2.3 The Decoder of Seq2Seq + Attention

■ The DecoderRNN class below is a language model that uses the attention mechanism to predict (generate) words from the image features extracted by the CNN encoder.

class DecoderRNN(nn.Module):
    def __init__(self,embedding_size, vocab_size, attention_dim, encoder_dim, decoder_dim,drop_prob=0.3):
        super(DecoderRNN, self).__init__()

        self.vocab_size = vocab_size

        self.embedding = nn.Embedding(vocab_size,embedding_size)
        self.attention = Attention(encoder_dim,decoder_dim,attention_dim)

        self.init_h = nn.Linear(encoder_dim, decoder_dim)
        self.init_c = nn.Linear(encoder_dim, decoder_dim)
        ## LSTM Cell, features + embedding
        self.lstm_cell = nn.LSTMCell(embedding_size+encoder_dim,decoder_dim,bias=True)

        self.fc = nn.Linear(decoder_dim,vocab_size)
        self.drop = nn.Dropout(drop_prob)

    def forward(self, features, captions):
        embedded = self.embedding(captions)  # (batch_size, caption_length, embedding_size)

        # Initialize LSTM state
        hidden, cell = self.init_hidden_state(features)  # (batch_size, decoder_dim)

        seq_length = len(captions[0])-1  # number of decoding steps; the last token is never fed as input
        batch_size = captions.size(0)
        num_features = features.size(1)

        # allocate the output on the same device as the encoder features
        preds = torch.zeros(batch_size, seq_length, self.vocab_size).to(features.device)

        for i in range(seq_length):
            context = self.attention(features, hidden)
            lstm_input = torch.cat((embedded[:, i], context), dim=1)
            hidden, cell = self.lstm_cell(lstm_input, (hidden, cell))
            output = self.fc(self.drop(hidden))
            preds[:,i] = output
        return preds

    def init_hidden_state(self, encoder_out):
        mean_encoder_out = encoder_out.mean(dim=1)
        hidden = self.init_h(mean_encoder_out)  # (batch_size, decoder_dim)
        cell = self.init_c(mean_encoder_out)
        return hidden, cell

■ In this example, the mean of the Encoder output features = encoder_out (49 feature vectors) is used to obtain the initial hidden state and the initial cell (memory) state of the Decoder LSTM.

self.init_h = nn.Linear(encoder_dim, decoder_dim)  # encoder_dim = 2048
self.init_c = nn.Linear(encoder_dim, decoder_dim)  # encoder_dim = 2048

def init_hidden_state(self, encoder_out):
    mean_encoder_out = encoder_out.mean(dim=1)
    hidden = self.init_h(mean_encoder_out)  # (batch_size, decoder_dim)
    cell = self.init_c(mean_encoder_out)
    return hidden, cell
    
# features.shape: (batch_size, 49, encoder_dim=2048)

# Initialize LSTM state
hidden, cell = self.init_hidden_state(features)  # (batch_size, decoder_dim)

■ features.mean(dim=1) averages, along dim=1, the 49 image feature vectors of dimension 2048. This is the element-wise mean of those 2048-dimensional vectors.

■ Through this operation, the 49 feature vectors of each image are averaged into a single 2048-dimensional feature vector that summarizes the image.

■ Because a batch is used, one such 2048-dimensional mean vector is produced per image in the batch.

features = torch.rand(2, 3, 5) # (batch_size, L, D)
features
```#Result#```
tensor([[[0.4765, 0.0261, 0.2957, 0.4766, 0.0569],
         [0.9290, 0.0382, 0.4784, 0.8878, 0.2390],
         [0.4195, 0.4202, 0.5542, 0.5192, 0.8342]],

        [[0.1639, 0.2569, 0.4947, 0.8305, 0.1147],
         [0.3769, 0.4744, 0.3315, 0.6004, 0.3078],
         [0.0474, 0.5703, 0.7736, 0.4492, 0.6599]]])
````````````

features.mean(dim=1)
```#Result#```
tensor([[0.6083, 0.1615, 0.4428, 0.6279, 0.3767],
        [0.1961, 0.4339, 0.5333, 0.6267, 0.3608]])
````````````

features.mean(dim=1).shape
```#Result#```
torch.Size([2, 5]) # (batch_size, D)
````````````

■ Next, this mean vector is passed through self.init_h = nn.Linear(encoder_dim, decoder_dim) and self.init_c = nn.Linear(encoder_dim, decoder_dim), with encoder_dim = 2048, to produce the initial LSTM hidden state and cell state, each of shape (batch_size, decoder_dim).

init_h = nn.Linear(5, 3) # (D=encoder_dim, decoder_dim)
init_c = nn.Linear(5, 3) # (D=encoder_dim, decoder_dim)

mean = features.mean(dim=1)
h = init_h(mean)
c = init_c(mean)

h.shape, c.shape
```#Result#```
(torch.Size([2, 3]), torch.Size([2, 3])) # (batch_size, decoder_dim)
````````````

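■ In Bahdanau attention, as the following figure shows, the concatenation of the input \( x \) and the context vector is used as the new input to the LSTM layer at the current time step \( t \).

■ If the input \( x \) is an embedding vector, as in the figure above, the process can be implemented as follows.

■ First, the captions are passed through the embedding layer to convert them into embeddings.

embedded = self.embedding(captions)

■ Then the initial LSTM hidden state and cell state are created.

# Initialize LSTM state
hidden, cell = self.init_hidden_state(features)  # (batch_size, decoder_dim)

■ seq_length below is the length of the input captions with the <EOS> token excluded; the last position is dropped because it is never used as a Decoder input. This is the total number of time steps. preds is the tensor that will hold the predictions (generated outputs).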

seq_length = len(captions[0])-1 
batch_size = captions.size(0)
num_features = features.size(1)

# allocate the output on the same device as the encoder features
preds = torch.zeros(batch_size, seq_length, self.vocab_size).to(features.device)

for i in range(seq_length):
    context = self.attention(features, hidden)
    lstm_input = torch.cat((embedded[:, i], context), dim=1)
    hidden, cell = self.lstm_cell(lstm_input, (hidden, cell))
    output = self.fc(self.drop(hidden))
    preds[:,i] = output
return preds

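■ As the code above shows, the for loop walks through all time steps in order.

■ First, features and hidden are passed to the attention class to compute the attention value (context vector). Here features is the image feature map (feature vectors). At the first step, hidden is the Decoder's initial hidden state obtained from hidden, cell = self.init_hidden_state(features); at later steps it is the hidden state returned by the LSTM cell at the previous step.

context = self.attention(features, hidden)

■ Next, the embedding vector of the current caption word (embedded[:, i]) and the context vector (context) are concatenated and used as the LSTM input. The LSTM cell takes this input together with the previous step's hidden and cell states and returns the hidden and cell states of the current step. This is repeated for every time step (seq_length).

lstm_input = torch.cat((embedded[:, i], context), dim=1)

hidden, cell = self.lstm_cell(lstm_input, (hidden, cell))

■ In other words, at the first step the initial hidden state built from the mean of the image feature vectors plays the role of the Bahdanau attention Query, the Decoder hidden state at time step \( t - 1 \).

■ Inside the for i in range(seq_length): loop, the steps from context = self.attention(features, hidden) through hidden, cell = self.lstm_cell(lstm_input, (hidden, cell)) do the following for each word: compute the attention weights from the Decoder hidden state and the image features with Bahdanau attention, use them to obtain the context vector, then concatenate the word embedding of that single time step with the context vector and feed the result to the LSTM.

■ After all steps (seq_length) are done, the predictions for the whole sequence (preds) are returned.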

    output = self.fc(self.drop(hidden))
    preds[:,i] = output
return preds
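The decoding loop can be smoke-tested without images. The sketch below is a condensed version of the Attention and DecoderRNN classes, with the loop bound written as seq_length and the output allocated on features.device. The dimensions (encoder_dim = 32, vocab_size = 20, and so on) are small made-up values chosen only to keep the run fast, and random tensors stand in for encoder features and captions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Attention(nn.Module):
    def __init__(self, encoder_dim, decoder_dim, attention_dim):
        super().__init__()
        self.Wb = nn.Linear(decoder_dim, attention_dim)
        self.Wc = nn.Linear(encoder_dim, attention_dim)
        self.Wa = nn.Linear(attention_dim, 1)

    def forward(self, features, hidden_state):
        scores = self.Wa(torch.tanh(self.Wb(hidden_state).unsqueeze(1) + self.Wc(features)))
        alpha = F.softmax(scores.squeeze(2), dim=1)           # (batch_size, num_regions)
        return (features * alpha.unsqueeze(2)).sum(dim=1)     # (batch_size, encoder_dim)


class DecoderRNN(nn.Module):
    def __init__(self, embedding_size, vocab_size, attention_dim, encoder_dim, decoder_dim, drop_prob=0.3):
        super().__init__()
        self.vocab_size = vocab_size
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        self.attention = Attention(encoder_dim, decoder_dim, attention_dim)
        self.init_h = nn.Linear(encoder_dim, decoder_dim)
        self.init_c = nn.Linear(encoder_dim, decoder_dim)
        self.lstm_cell = nn.LSTMCell(embedding_size + encoder_dim, decoder_dim)
        self.fc = nn.Linear(decoder_dim, vocab_size)
        self.drop = nn.Dropout(drop_prob)

    def forward(self, features, captions):
        embedded = self.embedding(captions)
        mean_features = features.mean(dim=1)
        hidden, cell = self.init_h(mean_features), self.init_c(mean_features)
        seq_length = captions.size(1) - 1                      # last token is never an input
        preds = torch.zeros(captions.size(0), seq_length, self.vocab_size, device=features.device)
        for i in range(seq_length):
            context = self.attention(features, hidden)         # query = previous hidden state
            lstm_input = torch.cat((embedded[:, i], context), dim=1)
            hidden, cell = self.lstm_cell(lstm_input, (hidden, cell))
            preds[:, i] = self.fc(self.drop(hidden))
        return preds


torch.manual_seed(0)
decoder = DecoderRNN(embedding_size=16, vocab_size=20, attention_dim=8, encoder_dim=32, decoder_dim=12)
features = torch.randn(2, 49, 32)            # stand-in for encoder output
captions = torch.randint(0, 20, (2, 6))      # stand-in token ids, caption length 6
preds = decoder(features, captions)
print(preds.shape)
```

With captions of length 6, the decoder takes 5 steps and returns one vocabulary-sized score vector per step.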

Note) torch.nn.LSTMCell()

■ This example uses nn.LSTMCell() rather than nn.LSTM(), as follows.

torch.nn.LSTMCell(input_size, hidden_size, bias=True, device=None, dtype=None)

self.lstm_cell = nn.LSTMCell(embedding_size+encoder_dim,decoder_dim,bias=True)

■ Classes with a -Cell suffix, such as LSTMCell, RNNCell, and GRUCell, implement a single cell that processes one time step, with its own input, output, and hidden state.

■ In the following figure, the red boxes at the bottom are the inputs, the blue boxes at the top are the outputs, and the green boxes in the middle represent a 3-layer LSTM model. A single green box outlined with a red border corresponds to one LSTMCell.

[Source] https://discuss.pytorch.kr/t/nn-rnn-nn-rnncell/214/2

■ Accordingly, nn.LSTMCell is initialized with the sizes of the input and the hidden state used at each time step.

■ When building an LSTM model by hand with nn.LSTMCell, you feed the hidden state and cell state produced by the same nn.LSTMCell at the previous step back in as inputs at each time step, following the order of the sequence.

■ Because this example uses Bahdanau attention, the input_size of nn.LSTMCell() is embedding_size plus encoder_dim.

■ The attention_value = context returned by the Attention class has shape (batch_size, encoder_dim=2048), and embedded[:, i] has shape (batch_size, embedding_size).

■ The two vectors (embedded[:, i] and context) are concatenated along dim = 1 into lstm_input of shape (batch_size, embedding_size + encoder_dim), which is the input to the Decoder's LSTM. That is why the input_size of nn.LSTMCell() is set to embedding_size + encoder_dim.
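The relationship between nn.LSTMCell and nn.LSTM can be checked directly. The sketch below copies the weights of a single-layer nn.LSTM into an nn.LSTMCell and steps the cell manually over the sequence; the sizes are arbitrary small values chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len, input_size, hidden_size = 2, 4, 5, 3
x = torch.randn(batch_size, seq_len, input_size)

lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
cell = nn.LSTMCell(input_size, hidden_size)

# Give the cell the same parameters as the single-layer LSTM.
with torch.no_grad():
    cell.weight_ih.copy_(lstm.weight_ih_l0)
    cell.weight_hh.copy_(lstm.weight_hh_l0)
    cell.bias_ih.copy_(lstm.bias_ih_l0)
    cell.bias_hh.copy_(lstm.bias_hh_l0)

# Step the cell by hand, feeding (h, c) from the previous step back in.
h = torch.zeros(batch_size, hidden_size)
c = torch.zeros(batch_size, hidden_size)
step_outputs = []
for t in range(seq_len):
    h, c = cell(x[:, t], (h, c))
    step_outputs.append(h)
manual = torch.stack(step_outputs, dim=1)   # (batch_size, seq_len, hidden_size)

full, _ = lstm(x)                           # (batch_size, seq_len, hidden_size)
print(torch.allclose(manual, full, atol=1e-5))
```

Stepping by hand is what makes attention possible here: the context vector for step t depends on the hidden state from step t-1, so the input of each step cannot be prepared in advance for nn.LSTM to consume in one call.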

2.4 The Seq2Seq Model of Seq2Seq + Attention

■ The EncoderDecoder class, which connects the EncoderCNN class and the DecoderRNN class, is as follows.

class EncoderDecoder(nn.Module):
    def __init__(self, embedding_size, vocab_size, attention_dim, encoder_dim, decoder_dim, drop_prob=0.3):
        super().__init__()
        self.encoder = EncoderCNN()
        self.decoder = DecoderRNN(
            embedding_size = embedding_size,
            vocab_size = vocab_size,
            attention_dim = attention_dim,
            encoder_dim = encoder_dim,
            decoder_dim = decoder_dim,
            drop_prob = drop_prob  # forward the dropout rate instead of silently using the decoder default
        )

    def forward(self, images, captions):
        features = self.encoder(images)
        outputs = self.decoder(features, captions)
        return outputs

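■ Because the EncoderCNN and DecoderRNN classes are used, an instance of each is created in the initializer (__init__).

■ In the forward pass, the input images are passed through the EncoderCNN instance (self.encoder) to obtain features, and then features and the input captions are passed through the DecoderRNN instance (self.decoder), which returns the predictions.

■ In short, the EncoderDecoder class connects the Encoder and the Decoder into a single model that takes an image as input and generates a caption.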
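Since the decoder consumes positions 0 to seq_length - 1 of the caption and predicts one token per step, its output lines up with the caption shifted by one position. The sketch below shows how a training loss might be computed under that alignment. It is an assumption about the training step, which this post does not show: pad_idx = 0, nn.CrossEntropyLoss, and the random preds tensor standing in for the model output are all illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, caption_len, vocab_size = 2, 6, 10
pad_idx = 0  # assumed padding index, ignored by the loss

captions = torch.randint(1, vocab_size, (batch_size, caption_len))   # stand-in token ids
preds = torch.randn(batch_size, caption_len - 1, vocab_size)         # stand-in for model(images, captions)

# The prediction at step i is compared with the token at position i + 1.
targets = captions[:, 1:]                                            # (batch_size, caption_len - 1)

criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
loss = criterion(preds.reshape(-1, vocab_size), targets.reshape(-1))
print(targets.shape, loss.item())
```

Flattening the batch and time dimensions lets the loss treat every decoding step as an independent classification over the vocabulary.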

Reference)

GitHub - sankalp1999/Image_Captioning_With_Attention: CaptionBot : Sequence to Sequence Modelling where Encoder is CNN(Resnet-50) and Decoder is LSTMCell with soft attention mechanism (github.com)

Reference)

Image Captioning With Attention - Pytorch (www.kaggle.com)

Reference)

H_02. Seq2seq and attention - Article 1 - Deep Learning Bible - 3. Natural Language Processing - 한글 (wikidocs.net)


Hyun_Jae
