Tacotron 무지성 구현

Tacotron 1

Tacotron 무지성 구현 - 4/N

CHoiHK 2021. 7. 29. 01:03

이번 포스팅에서는 Encoder를 구현해보려고 합니다.

Encoder의 Embedding, PreNet, CBHG 각 Layer들의 차원을 맞추는 과정에서

조금 어려운 부분이 있었습니다.

제가 헤맸던 부분은,

Embeding Layer의 인자 중 하나인 embedding_dim을

고정된 값으로 생각해서 구현하다가

잘못된 것을 깨닫고 처음부터 다시 코드를 작성했습니다 :(

이 부분은 Decoder와 연관이 있을 것이므로 잘 생각해보고

구현해야 할 듯합니다.

전처리 부분은 그럭저럭 말로 설명이 잘 됐는데,

Encoder를 포함한 앞으로 구현할 Decoder, Post CBHG 포스팅에서는

조금 난해한 설명이 있을 수도 있으니 양해 부탁드립니다.

작성되고 나서도 설명 부분은 제가 수정할 수 있으니 양해 바랍니다.

Table & Encoder Structure

아래의 이미지는 논문 5페이지 상단 부분에 있는 Table 1입니다.

앞으로 구현할 때 많이 참고할 것 같습니다.

Hyper Parameters

import os

class Hparams():
    # speaker name
    speaker = 'KSS'
    
    # Audio Pre-processing
    origin_sample_rate = 44100
    sample_rate = 22050
    n_fft = 1024
    hop_length = 256
    win_length = 1024
    n_mels = 80
    reduction = 5
    n_specs = n_fft // 2 + 1
    fmin = 0
    fmax = sample_rate // 2
    min_level_db = -80
    ref_level_db = 0
    
    # Text Pre-processing
    PAD = '_'
    EOS = '~'
    SPACE = ' '
    SPECIAL = '.,!?'
    JAMO_LEADS = "".join([chr(_) for _ in range(0x1100, 0x1113)])
    JAMO_VOWELS = "".join([chr(_) for _ in range(0x1161, 0x1176)])
    JAMO_TAILS = "".join([chr(_) for _ in range(0x11A8, 0x11C3)])
    symbols = PAD + EOS + JAMO_LEADS + JAMO_VOWELS + JAMO_TAILS + SPACE + SPECIAL

    _symbol_to_id = {s: i for i, s in enumerate(symbols)}
    _id_to_symbol = {i: s for i, s in enumerate(symbols)}
    
    # Pre-processing paths (text, mel, spec)
    data_dir = os.path.join('data')
    out_texts_dir = os.path.join(data_dir, 'texts')
    out_mels_dir = os.path.join(data_dir, 'mels')
    out_specs_dir = os.path.join(data_dir, 'specs')
    
    # Embedding Layer
    in_dim = 256
    
    # Encoder Pre-net Layer
    prenet_dropout_ratio = 0.5
    prenet_linear_size = 256

    # CBHG
    cbhg_K = 16
    cbhg_mp_k = 2
    cbhg_mp_s = 1
    cbhg_mp_p = 1
    cbhg_mp_d = 2
    cbhg_conv_proj_size = 128
    cbhg_conv_proj_k = 3
    cbhg_conv_proj_p = 1    
    cbhg_gru_hidden_dim = 128

Hparams 클래스에 #Embedding Layer 주석 부분부터 새롭게 추가된 인자들입니다.

in_dim 인자는 Embedding Layer에서

embedding_dim 인자 값으로 입력될 값입니다.

prenet_dropout_ratio와 prenet_linear_size 인자는 PreNet에 들어갈 인자들이며,

Table 1을 참고했습니다.

아래의 이미지는 논문 3페이지 내용 중 CBHG MODULE 부분입니다.

cbhg_K 인자는 자연어처리에서 N-gram의 효과를 보기 위해서 Conv1d Layer에

적용했다고 합니다.

논문 Table 1에서는 Encoder CBHG K 값을 16으로 기재했으므로,

저는 cbhg_K 라는 변수에 16으로 지정했습니다.

N-gram과 관련된 내용은 아래의 링크를 참고해주세요.

https://wikidocs.net/21692

위키독스

온라인 책을 제작 공유하는 플랫폼 서비스

wikidocs.net

인자 이름 끝 자리에 알파벳 하나처럼 이니셜만 있는 인자들은

MaxPool1d, Conv1d Layer에서

kernel_size, stride, padding, dilation의 앞 단어를 따온 인자들입니다.

cbhg_conv_proj_k 값을 제외하고는 전부 차원을 맞추기 위해

제가 별도로 추가한 인자들입니다.

cbhg_conv_proj_size와 cbhg_gru_hidden_dim 인자 또한 Table 1에 기재된

값을 기준으로 Hparams에 추가했습니다.

CustomLinear & BatchNormConv1d Layer

이미지에서 드래그된 부분을 보면,

"Batch normalization is used for all convolusional layers."

라고 적혀있습니다.

이 말은 CBHG에 들어가는 모든 Conv1d Layer는

BatchNorm1d Layer와 붙어 다닌다는 뜻입니다.

또한, FC라고 적혀있는 부분은 Fully Connected라는 말일 텐데,

아마 pytorch에서는 torch.nn.Linear을 사용하는 것과 같은 의미일 것입니다.

그리고 FC와 ReLU, Dropout이 자주 붙어있는 것을 볼 수 있습니다.

저는 언급한 두 부분의 Layer들을 각각 결합하여

한 번에 사용할 수 있게끔 만들도록 하겠습니다.

class CustomLinear(torch.nn.Module):
    def __init__(self, in_features, out_features, 
                        bias=True, device=None, dtype=None, 
                        activation=None, dropout=None):
        super(CustomLinear, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.bias = bias
        self.device = device
        self.dtype = dtype
        self.dropout = dropout
        self.activation = activation
        self.linear = torch.nn.Linear(in_features=self.in_features, 
                                      out_features=self.out_features,
                                      bias=self.bias,
                                      device=self.device,
                                      dtype=self.dtype)
        
        # activation functions
        if self.activation != None:
            if self.activation == 'relu':
                self.activation = torch.nn.ReLU()
            elif self.activation == 'leakyrelu':
                self.activation = torch.nn.LeakyReLU()
            elif self.activation == 'sigmoid':
                self.activation = torch.nn.Sigmoid()
            elif self.activation == 'tanh':
                self.activation = torch.nn.Tanh()
            else:
                self.activation = torch.nn.ReLU()
        else:
            self.activation = None
            
            
        # dropout function
        if self.dropout != None:
            self.dropout = torch.nn.Dropout(self.dropout)
        else:
            self.dropout = None
        
        
    def forward(self, x):
        x = self.linear(x)
        
        if self.activation != None:
            x = self.activation(x)
            
        if self.dropout != None:
            x = self.dropout(x)
            
        return x

linear = CustomLinear(10, 10, 1, activation='relu', dropout=0.5)

print(linear)


####
CustomLinear(
  (linear): Linear(in_features=10, out_features=10, bias=True)
  (activation): ReLU()
  (dropout): Dropout(p=0.5, inplace=False)
)

class BatchNormConv1D(torch.nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size, 
                        stride=1, padding=0, dilation=1, groups=1, eps=1e-05, momentum=0.1,
                        affine=True, track_running_stats=True, 
                        bias=True, padding_mode='zeros', device=None, dtype=None,
                        activation=None):
        super(BatchNormConv1D, self).__init__()
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.kernel_size = kernel_size
        self.stride = stride
        self.padding = padding
        self.dilation = dilation
        self.groups = groups
        self.eps = eps
        self.momentum = momentum
        self.affine = affine
        self.track_running_stats = track_running_stats
        self.bias = bias
        self.padding_mode = padding_mode
        self.device = device
        self.dtype = dtype
        self.activation = activation
        
        self.conv1d = torch.nn.Conv1d(in_channels=self.in_channels, 
                                      out_channels=self.out_channels, 
                                      kernel_size=self.kernel_size, 
                                      stride=self.stride, 
                                      padding=self.padding, 
                                      dilation=self.dilation, 
                                      groups=self.groups, 
                                      bias=self.bias, 
                                      padding_mode=self.padding_mode, 
                                      device=self.device, 
                                      dtype=self.dtype)
        
        self.batchnorm1d = torch.nn.BatchNorm1d(num_features=self.out_channels,
                                                eps=self.eps,
                                                momentum=self.momentum,
                                                affine=self.affine,
                                                track_running_stats=self.track_running_stats,
                                                device=self.device,
                                                dtype=self.dtype)

        # activation functions
        if self.activation != None:
            if self.activation == 'relu':
                self.activation = torch.nn.ReLU()
            elif self.activation == 'leakyrelu':
                self.activation = torch.nn.LeakyReLU()
            elif self.activation == 'sigmoid':
                self.activation = torch.nn.Sigmoid()
            elif self.activation == 'tanh':
                self.activation = torch.nn.Tanh()
            else:
                self.activation = torch.nn.ReLU()
        else:
            self.activation = None
            
        
    def forward(self, x):
        x = self.conv1d(x)
        x = self.batchnorm1d(x)
        
        if self.activation != None:
            x = self.activation(x)
            
        return x

bnc1d = BatchNormConv1D(10, 10, 1, activation='relu')

print(bnc1d)


####
BatchNormConv1D(
  (conv1d): Conv1d(10, 10, kernel_size=(1,), stride=(1,))
  (batchnorm1d): BatchNorm1d(10, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (activation): ReLU()
)

조금 많이 복잡해 보이지만,

사실 pytorch tutorial에서 기본적인 인자들만 모두 가져와서

그대로 적어놓은 것뿐입니다.

사용 방법은 각 레이어를 정의해놓은 코드블럭 밑에

따로 옮겨두었으니 참고하면 됩니다.

Embedding Layer

사실 Embedding Layer는 별도의 클래스로 만들 필요도 없고,

Encoder에 포함할 필요도 없긴 합니다.

그래도 저는 깔끔한 것을 좋아하기 때문에 따로 만들도록 하겠습니다.

class Embedding(torch.nn.Module):
    def __init__(self, Hparams):
        super(Embedding, self).__init__()
        self.Hparams = Hparams
        layers = [torch.nn.Embedding(num_embeddings=len(self.Hparams.symbols), 
                                     embedding_dim=self.Hparams.in_dim)]

        self.embedding_layer = torch.nn.Sequential(*layers)

    def forward(self, x):
        return self.embedding_layer(x)

Encoder Layer를 통과한 Text의 결과값은 아래와 같습니다.

Hparams에서 in_dim=512를 지정한 것과 같이

가장 마지막 shape가 in_dim과 동일한 것을 볼 수 있습니다.

아래의 이미지에서 볼 수 있듯이

논문에서는 in_dim(embedding_dim) 값을 256으로 지정했습니다.

저는 처음에 언급했던 문제점을 해결하다 보니

in_dim(embedding_dim)을 512로 두고 진행했습니다.

다음 포스팅에서 학습을 진행할 때는 256으로 맞춰놓고 진행할 예정입니다.

그리고 지금은 학습 중이 아니라 실험 중이므로 batch_size를 4로 지정했습니다.

dataset = KSSDataset(hparams)

dataloader = torch.utils.data.DataLoader(dataset, 
                                         batch_size=4, 
                                         shuffle=True, 
                                         collate_fn=collate_fn)

embedding = Embedding(hparams)
for step, values in enumerate(dataloader):
    texts, _, _, _ = values
    embedding_result = embedding(texts)
    
    print('Embedding result: {}'.format(embedding_result.size()))
    if step == 4:
        break
        
        
####
Embedding result: torch.Size([4, 48, 512])
Embedding result: torch.Size([4, 48, 512])
Embedding result: torch.Size([4, 46, 512])
Embedding result: torch.Size([4, 55, 512])
Embedding result: torch.Size([4, 58, 512])

PreNet Layer

PreNet Layer부터가 약간 머리가 아프기 시작합니다.

사실 알고 보면 별거 없긴 한데,

이 부분부터 차원을 잘 맞춰야 마지막까지 수월합니다.

PreNet은 Decoder에서 AttentionRNN의 전 단계에서도 사용되기 때문에

랜덤생성된 Mel shape과 일치하도록 input shape을 조정해줄 필요가 있습니다.

저는 그 부분을 use_decoder=False로 하여금 조절할 수 있게 하려고 합니다.

그리고 중요한 부분이 하나 있는데요,

바로 Reduction Facotr에 대한 부분입니다.

저는 잘 이해가 가지 않는 부분인데, 소리의 연속성을 이용해서

각 Decoder step에서 만들어지는 Frame을 Reduction Facotr의 배수만큼

만들게되면 효과가 우수하다고 하네요.

논문에서는 Reduction Factor를 2로 사용했다고 하는데,

5도 잘 작동한다고 하니 저도 5를 사용하겠습니다.

Reduction Factor에 대해 열심히 구글링해서 찾아본 결과로는 대부분 5를 사용하는 듯 합니다. 하지만 Tacotron2에서는 사용되지 않는다고 하니 참고바랍니다.

Reduction Factor를 사용하게되면, PreNet과 Post CBHG Layer 사용 시

각 Layer에 대한 Input과 Output shape이 달라지게 되는데요,

이 점 유의해야 할 것 같습니다.

<아래 언급된 부분에 대한 포스팅은 수정되었습니다.>

왜냐하면, 앞서 제가 진행했던 Pre-processing 단계에서는

Reduction Factor에 대한 언급이 없었기 때문에

Encoder와 Decoder에만 Reduction Factor를 고려해서 구현하고 출력하게되면

딱 Reduction Factor의 배수만큼 Original과 Prediction이 차이가 나기 때문에

결과적으로 Loss를 계산할 수 없게 됩니다.

따라서 다음 포스팅에서 학습 단계를 다룰 때는

Pre-processing 단계에서 Reduction Factor를 고려한 다음 진행해야할 것 같습니다.

class PreNet(torch.nn.Module):
    def __init__(self, Hparams, use_decoder=False):
        super(PreNet, self).__init__()
        self.Hparams = Hparams
        
        self.prenet_input_dim = self.Hparams.in_dim
        if use_decoder==True:
            self.prenet_input_dim = self.Hparams.n_mels*self.Hparams.reduction
            
        in_sizes = [self.prenet_input_dim] + [self.Hparams.prenet_linear_size]
        out_sizes = [self.Hparams.prenet_linear_size] + [self.Hparams.prenet_linear_size//2]

        layers = []
        for (in_features, out_features) in zip(in_sizes, out_sizes):
            layers += [CustomLinear(in_features=in_features, 
                                    out_features=out_features,
                                    activation='relu',
                                    dropout=self.Hparams.prenet_dropout_ratio)]

        self.prenet = torch.nn.Sequential(*layers)

    def forward(self, x):
        return self.prenet(x)

dataset = KSSDataset(hparams)

dataloader = torch.utils.data.DataLoader(dataset, 
                                         batch_size=4, 
                                         shuffle=True, 
                                         collate_fn=collate_fn)

embedding = Embedding(hparams)
prenet = PreNet(hparams)
dec_prenet = PreNet(hparams, use_decoder=True)
for step, values in enumerate(dataloader):
    texts, _, _, _ = values
    embedding_result = embedding(texts)
    prenet_result = prenet(embedding_result)
    
    print('Embedding result: {} \nPreNet result: {} \n'.format(
        embedding_result.size(), prenet_result.size()))
    if step == 4:
        break
        


####
Embedding result: torch.Size([4, 67, 512]) 
PreNet result: torch.Size([4, 67, 128]) 

Embedding result: torch.Size([4, 38, 512]) 
PreNet result: torch.Size([4, 38, 128]) 

Embedding result: torch.Size([4, 72, 512]) 
PreNet result: torch.Size([4, 72, 128]) 

Embedding result: torch.Size([4, 55, 512]) 
PreNet result: torch.Size([4, 55, 128]) 

Embedding result: torch.Size([4, 49, 512]) 
PreNet result: torch.Size([4, 49, 128])

Hparams에서는 prenet_linear_size를 258로 지정했는데 결과는 128로 나왔습니다.

이유는 다음과 같습니다.

in_sizes = [self.Hparams.in_dim] + [self.Hparams.prenet_linear_size]
out_sizes = [self.Hparams.prenet_linear_size] + [self.Hparams.prenet_linear_size//2]

위의 코드블럭을 보면 out_sizes 리스트의

두 번째 값으로 들어오는 값은 prenet_linear_size//2 이기 때문에

output shape가 256//2=128로 지정되었기 때문입니다.

굳이 이렇게 안 해도 괜찮지만 특정 숫자를 넣게 된다면

다음 Layer인 CBHG Layer의 첫 input shape도 알맞게 수정해야 해서

조금 귀찮은 부분이 있었습니다.

따라서 저는 CBHG의 input shape를 PreNet과 자연스럽게 연결하기 위해

위와 같이 했다고 보시면 되겠습니다.

in_dim(embedding_dim) 값과 연결해볼 수도 있겠네요.

다음에 한번 시도해봐야겠습니다.

다음은 제가 사전에 정의한 Custom Linear Layer입니다.

아래의 이미지를 보면,

FC-256-ReLU -> Dropout(0.5) -> FC-128-ReLU -> Dropout(0.5)

처럼 되어있는 걸 볼 수 있습니다.

torch.nn.Linear() -> torch.nn.ReLU() -> torch.nn.Dropout() 순서이므로

Custom Linear Layer의 activation 인자 값에

원하는 activation function을 입력하시면 됩니다.

하지만 논문에서는 activation function을 ReLU function을 사용했으므로

저는 ReLU function을 입력하기 위해 'relu'를 지정했습니다.

그리고 Custom Linear Layer 코드 내부에서

dropout 인자에 0에서 1 사이의 숫자가 들어오게 되면

torch.nn.Dropout() Layer가 동작하게끔 설계했으므로

원하는 dropout ratio를 입력하시면 됩니다.

또한 논문에서는 dropout ratio 값을 0.5로 지정했으므로

저 또한 0.5로 지정했습니다.

Decoder에서 사용되는 PreNet은 Decoder를 다루는 포스팅에서

출력해보도록 하겠습니다.

CBHG

Encoder 부분에서 가장 코드가 긴 부분입니다.

논문을 보면서 코드를 작성하긴 했지만, 아직도 부족한 부분이 보이는 듯합니다.

Tacotron2에서는 CBHG 부분은 아예 없기도 하고,

Tacotron1과 Vocoder를 Melgan으로 사용하면 Post CBHG는 필요 없다고 합니다.

구현을 하긴 하지만 비중 있게 다루지는 않겠습니다.

Encoder에 사용되는 CBHG와

Decoder 다음에 사용될 Post CBHG는

코드 내부에서 지정되는 인자 값에 차이가 있습니다.

아래 두 개의 이미지를 보시면 알 수 있듯이

Conv1D bank에 사용되는 K

Conv1D projections에 사용되는 features와 target features

위 세 개의 값에 차이가 있습니다.

K = 16, features = 128, target features = 128 Hparams에서 prenet_linear_size 값을 통해 CBHG 내부 코드를 따라가다보면 Encoder CBHG에 해당하는 값들은 논문 Table 1 에서 지정한 값들과 동일합니다. (embedding_dim 제외)

K = 8, features = 256, target features = 80

저는 Post CBHG를 구현하기 위해

아래의 코드블럭과 같이 post=False를 추가했습니다.

def __init__(self, Hparams, use_post=False):

구현된 CBHG 코드를 보면 if self.use_post== True: 부분을 볼 수 있을 텐데,

주석으로 달아놓은 부분을 따라서 보다 보면

Post CBHG 사용 시 Mel shape와 강제로 맞추기 위한 부분임을 알 수 있습니다.

(다음 설명은 코드블럭 아래에 있습니다.)

class CBHG(torch.nn.Module):
    def __init__(self, Hparams, use_post=False):
        super(CBHG, self).__init__()
        self.Hparams = Hparams
        self.use_post = use_post
        
        
        self.conv1d_bank_channels = self.Hparams.prenet_linear_size//2 # Conv1D bank
        self.target_dim = self.conv1d_bank_channels # Conv1D projections
        self.gru_hidden_dim = self.Hparams.cbhg_gru_hidden_dim # Bidirectional GRU
        
        if self.use_post == True:
            self.conv1d_bank_channels = self.Hparams.n_mels # Conv1D bank
            self.Hparams.cbhg_K = self.Hparams.cbhg_K //2 # Conv1D bank
            self.Hparams.cbhg_conv_proj_size = self.Hparams.cbhg_conv_proj_size * 2 # Conv1D projections
            self.gru_hidden_dim = self.Hparams.n_mels # Bidirectional GRU

        # For Residual connection
        residual_layers = [torch.nn.Identity()]
            
        # Conv1D bank
        conv1d_bank_layers = []
        for k in range(1, self.Hparams.cbhg_K+1):
            conv1d_bank_layers += [BatchNormConv1D(in_channels=self.conv1d_bank_channels, 
                                                   out_channels=self.conv1d_bank_channels,
                                                   kernel_size=k,
                                                   padding=self.Hparams.cbhg_K//2,
                                                   activation='relu')]

        # Conv1D projections
        conv1d_proj_layers = []

        conv1d_proj_layers += [torch.nn.MaxPool1d(kernel_size=self.Hparams.cbhg_mp_k, 
                                                  stride=self.Hparams.cbhg_mp_s, 
                                                  padding=self.Hparams.cbhg_mp_p, 
                                                  dilation=self.Hparams.cbhg_mp_d)]

        in_sizes = [self.Hparams.cbhg_K * self.conv1d_bank_channels] + [self.Hparams.cbhg_conv_proj_size]
        out_sizes = [self.Hparams.cbhg_conv_proj_size] + [self.conv1d_bank_channels]
        
        for idx, (in_size, out_size) in enumerate(zip(in_sizes, out_sizes)):
            if idx != len(in_sizes)-1:
                activation = 'relu'
            elif idx == len(in_sizes)-1:
                activation = None
            
            conv1d_proj_layers += [BatchNormConv1D(in_channels=in_size, 
                                                   out_channels=out_size,
                                                   kernel_size=self.Hparams.cbhg_conv_proj_k,
                                                   padding=self.Hparams.cbhg_conv_proj_p,
                                                   activation=activation)]
        
        pre_highway_layers = [CustomLinear(in_features=out_sizes[-1],
                                           out_features=self.target_dim,
                                           activation=None,
                                           dropout=None)]
        
        # Highway Layers
        highway_layers = []
        for _ in range(4):
            highway_layers += [Highway(in_features=self.target_dim,
                                       out_features=self.target_dim,
                                       activations=['relu', 'sigmoid'],
                                       dropout=None)]

        # Bidirectional GRU
        gru_layers = [torch.nn.GRU(input_size=self.target_dim,
                                   hidden_size=self.gru_hidden_dim,
                                   batch_first=True,
                                   bidirectional=True)]

        self.residual = torch.nn.Sequential(*residual_layers)
        self.conv1d_bank = torch.nn.ModuleList(conv1d_bank_layers)
        self.conv1d_proj = torch.nn.Sequential(*conv1d_proj_layers)
        self.pre_highway = torch.nn.Sequential(*pre_highway_layers)
        self.highway = torch.nn.Sequential(*highway_layers)
        self.gru = torch.nn.Sequential(*gru_layers)

    def forward(self, x):
        short_cut = self.residual(x)
        x = x.permute(0,2,1) # convert to new shape (batch_size, in_dim, T_dim)
        T = x.size(-1)
        x = torch.cat([layer(x)[:,:,:T] for layer in self.conv1d_bank], dim=1) # (B, in_dim, T_in)
        x = self.conv1d_proj(x)
        x = x.permute(0,2,1) # return to origin shape (batch_size, T_dim, in_dim)
        x = x + short_cut
        x = self.pre_highway(x)
        x = self.highway(x)
        x, _ = self.gru(x)
        return x

아래의 코드블럭은 CBHG 구조에서 Conv1D bank + stacking에 해당하는 부분과

제가 별도로 지정한 Residual Connection 부분입니다.

Residual Connection은 forward 부분에서 input 이름을 잘 지정하여

forward 단에서 처리해도 상관없지만 저는 그냥 Layer를 정의해서 처리했습니다.

제가 사전에 정의한 BatchNormConv1D를 통해 지정된 cbhg_K 값으로 1부터 cbhg_k까지

각기 다른 kernel_size와 padding 값으로 이루어진 Conv1D Layer가 생성될 것입니다.

Conv1D를 사용할 때는 원하는 shape으로 통과하게 해야 할 텐데,

앞서 제가 만든 데이터의 shape은 Conv1D에 지정한 차원과 일치하지 않습니다.

따라서 permute() 함수를 통해 데이터의 shape을 변경합니다.

아무리 Conv1D의 output shape을 잘 맞춘다고 하더라도

우리가 예상한 것과 다를 수도 있으므로 T 값을 지정하여

Conv1D를 통과하는 데이터의 shape을 맞출 수 있도록 합니다.

그리고 Conv1D Layer를 통과하면서 달라진 shape를 T 값을 통해 걸러내고

모든 값들을 리스트에 담게 됩니다.

마지막으로 torch.cat() 함수를 통해 데이터를

원하는 shape(dim=1)을 기준으로 concatenate 합니다.

__init__(self, Hparams, use_post=False):

    # For Residual connection
    residual_layers = [torch.nn.Identity()]

    # Conv1D bank
    conv1d_bank_layers = []
    for k in range(1, self.Hparams.cbhg_K+1):
        conv1d_bank_layers += [BatchNormConv1D(in_channels=self.conv1d_bank_channels, 
                                               out_channels=self.conv1d_bank_channels,
                                               kernel_size=k,
                                               padding=self.Hparams.cbhg_K//2,
                                               activation='relu')]

    self.conv1d_bank = torch.nn.ModuleList(conv1d_bank_layers)

forward(self, x):
    short_cut = self.residual(x)
    x = x.permute(0,2,1)
    T = x.size(-1)
    x = torch.cat([layer(x)[:,:,:T] for layer in self.conv1d_bank], dim=1) # (B, in_dim, T_in)

그다음 부분은 Conv1D projections 부분입니다.

MaxPool1d 부분은 굳이 padding과 dilation을 해야 하는지 모르겠습니다.

포스팅한 후에 조금 실험을 해봐서 수정을 하던지 해야겠습니다.

in_sizes, out_sizes는 PreNet Layer와 동일한 방법으로 차원을 맞춥니다.

앞 단에서 torch.cat() 함수를 이용해

cbhg_K 개수만큼의 Layer가 dim=1을 기준으로 concatenate 됐기 때문에

마지막 shape는 상당히 큰 값으로 변해있을 겁니다.

그 값은 self.Hparams.cbhg_K * self.target_dim 개수 만큼 될 것이므로

in_sizes 리스트의 첫 값으로 지정해줍니다.

그다음, 아래의 이미지에서 Conv1D projection Layer 사이에 ReLU function은

가운데 부분에만 존재해야 하므로

if elif 문을 사용해서 두 번째 인덱스(두 번째 Layer가 될 때)는

ReLU function이 들어가지 않게 만들어줍니다.

마지막으로 Linear Layer 부분이 있는데,

이 부분은 pre_highway_layers를 이름으로

제가 사전에 정의한 CustomLinear로 하여금 다음 GRU Layer에 입력되는 차원에 맞도록

조절하게 합니다.

Residual Connection 부분은 바로 직전의 코드블럭을 참고해주시고

CBHG 구조에서 Residual Connection이 발생하는 지점 또한 확인하면 되겠습니다.

Max pooling, Conv1D projections, Linear Layer

__init__(self, Hparams, use_post=False):

    # Conv1D projections
    conv1d_proj_layers = []

    conv1d_proj_layers += [torch.nn.MaxPool1d(kernel_size=self.Hparams.cbhg_mp_k, 
                                              stride=self.Hparams.cbhg_mp_s, 
                                              padding=self.Hparams.cbhg_mp_p, 
                                              dilation=self.Hparams.cbhg_mp_d)]

    in_sizes = [self.Hparams.cbhg_K * self.conv1d_bank_channels] + [self.Hparams.cbhg_conv_proj_size]
    out_sizes = [self.Hparams.cbhg_conv_proj_size] + [self.conv1d_bank_channels]

    for idx, (in_size, out_size) in enumerate(zip(in_sizes, out_sizes)):
        if idx != len(in_sizes)-1:
            activation = 'relu'
        elif idx == len(in_sizes)-1:
            activation = None

        conv1d_proj_layers += [BatchNormConv1D(in_channels=in_size, 
                                               out_channels=out_size,
                                               kernel_size=self.Hparams.cbhg_conv_proj_k,
                                               padding=self.Hparams.cbhg_conv_proj_p,
                                               activation=activation)]
        
    pre_highway_layers = [CustomLinear(in_features=out_sizes[-1],
                                       out_features=self.target_dim,
                                       activation=None,
                                       dropout=None)]

    self.conv1d_proj = torch.nn.Sequential(*conv1d_proj_layers)
    self.pre_highway = torch.nn.Sequential(*pre_highway_layers)
    

forward(self, x):
    short_cut = self.residual(x) # 앞 단의 코드 블럭 참고
    x = self.conv1d_proj(x)
    x = x.permute(0,2,1) # return to origin shape (batch_size, T_dim, in_dim)
    x = x + short_cut # 앞 단의 코드 블럭 참고
    x = self.pre_highway(x)

마지막으로 Highway Layer와 GRU Layer입니다.

Highway Layer는 높은 퀄리티의 Features를 추출하기 위해 사용된다고 합니다.

Highway Network에 대해서는 아래의 링크를 참고했습니다.

링크 내부 포스트 마지막 부분에 관련 논문이 있으니

관심 있으신 분은 참고하시면 좋을 듯합니다.

https://lazyer.tistory.com/8

Highway_Network

highway_network Highway Networks _ Lazyer [arXiv:1505.00387v2] - Rupesh Kumar Srivastava - 03.11.15 Abstract 학습 모델의 깊이가 증가함에 따라 성능이 증가한다는 사실은 일반화되었다. 하지만 깊이가 증..

lazyer.tistory.com

Highway Layer는 다음과 같이 구현했습니다.

bias 데이터를 초기화하는 방법은

LayerName.bias.data.zero_(), LayerName.bias.data.fill_(-1)처럼 할 수 있는데,

저는 CustomLinear Layer를 별도로 지정했기 때문에

bias가 직접적으로 있는 부분은 torch.nn.Linear()를 정의한 self.linear입니다.

따라서 제가 구현한 방식에서 bias를 초기화할 때는 아래와 같이 진행하면 됩니다.

LayerName.linear.bias.data.zero_(), LayerName.linear.bias.data.fill_(-1)

class Highway(torch.nn.Module):
    def __init__(self, in_features, out_features, activations=['relu', 'sigmoid'], dropout=None):
        super(Highway, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.relu = activations[0]
        self.sigmoid = activations[1]
        self.dropout = dropout
        self.H = CustomLinear(in_features=self.in_features, 
                              out_features=self.out_features, 
                              activation=self.relu, 
                              dropout=self.dropout)
        self.H.linear.bias.data.zero_()
        self.T = CustomLinear(in_features=self.in_features, 
                              out_features=self.out_features, 
                              activation=self.sigmoid, 
                              dropout=self.dropout)
        self.T.linear.bias.data.fill_(-1)

    def forward(self, x):
        return self.H(x) * self.T(x) + x * (1.0 - self.T(x))

highway = Highway(10, 10, activations=['relu', 'sigmoid'], dropout=None)

print(highway)


####
Highway(
  (H): CustomLinear(
    (linear): Linear(in_features=10, out_features=10, bias=True)
    (activation): ReLU()
  )
  (T): CustomLinear(
    (linear): Linear(in_features=10, out_features=10, bias=True)
    (activation): Sigmoid()
  )
)

아래의 이미지에서 볼 수 있듯이 Highway Layer를 4개 추가합니다.

GRU Layer는 Birectional GRU Layer입니다.

결론적으로 target_dim으로 원하는 값으로 설정한다 한들,

결과값은 target_dim보다 두 배 큰 값이 결과값으로 나온다는 말입니다.

따라서 다음에 구현할 Decoder나

Post CBHG를 지난 Mel이

Linear-scale Spectrogram으로 변환될 때 사용할 Linear Layer에서

각각의 input shape을 잘 살펴보아야 할 것입니다.

__init__(self, Hparams, post=False):
    # Highway Layers
    highway_layers = []
    for _ in range(4):
        highway_layers += [Highway(in_features=self.target_dim,
                                   out_features=self.target_dim,
                                   activations=['relu', 'sigmoid'],
                                   dropout=None)]

    # Bidirectional GRU
    gru_layers = [torch.nn.GRU(input_size=self.target_dim,
                               hidden_size=self.gru_hidden_dim,
                               batch_first=True,
                               bidirectional=True)]

    self.highway = torch.nn.Sequential(*highway_layers)
    self.gru = torch.nn.Sequential(*gru_layers)
    
forward(self, x):
    x = self.highway(x)
    x, _ = self.gru(x)

구현된 CBHG까지 output shape을 확인해봅니다.

dataset = KSSDataset(hparams)

dataloader = torch.utils.data.DataLoader(dataset, 
                                         batch_size=4, 
                                         shuffle=True, 
                                         collate_fn=collate_fn)

embedding = Embedding(hparams)
prenet = PreNet(hparams, use_decoder=False)
cbhg = CBHG(hparams, use_post=False)
for step, values in enumerate(dataloader):
    texts, mels, _, _ = values
    embedding_result = embedding(texts)
    prenet_result = prenet(embedding_result)
    cbhg_result = cbhg(prenet_result)
    
    print('Embedding result: {} \nPreNet result: {} \nCBHG result: {} \n'.format(
        embedding_result.size(), prenet_result.size(), cbhg_result.size()))
    if step == 4:
        break
        
        
        
        
####
Embedding result: torch.Size([4, 42, 512]) 
PreNet result: torch.Size([4, 42, 128]) 
CBHG result: torch.Size([4, 42, 256]) 

Embedding result: torch.Size([4, 50, 512]) 
PreNet result: torch.Size([4, 50, 128]) 
CBHG result: torch.Size([4, 50, 256]) 

Embedding result: torch.Size([4, 52, 512]) 
PreNet result: torch.Size([4, 52, 128]) 
CBHG result: torch.Size([4, 52, 256]) 

Embedding result: torch.Size([4, 48, 512]) 
PreNet result: torch.Size([4, 48, 128]) 
CBHG result: torch.Size([4, 48, 256]) 

Embedding result: torch.Size([4, 51, 512]) 
PreNet result: torch.Size([4, 51, 128]) 
CBHG result: torch.Size([4, 51, 256])

예상한 대로 CBHG의 output shape은

Bidirectional GRU의 영향 때문에 output shape은 2배가 됐습니다.

마지막으로 Encoder를 각 Layer 별로 묶은 뒤 데이터 별 output shape과

Encoder의 모델 구조를 확인해보고 포스팅을 마치겠습니다.

Encoder

class Encoder(torch.nn.Module):
    def __init__(self, Hparams, use_decoder=False, use_post=False):
        super(Encoder, self).__init__()
        self.Hparams = Hparams
        self.use_decoder = use_decoder
        self.use_post = use_post
        
        layers = [Embedding(self.Hparams),
                  PreNet(self.Hparams, use_decoder=self.use_decoder),
                  CBHG(self.Hparams, use_post=self.use_post)]

        self.encoder = torch.nn.Sequential(*layers)

    def forward(self, x):
        return self.encoder(x)

dataset = KSSDataset(hparams)

dataloader = torch.utils.data.DataLoader(dataset, 
                                         batch_size=4, 
                                         shuffle=True, 
                                         collate_fn=collate_fn)

embedding = Embedding(hparams)
prenet = PreNet(hparams, use_decoder=False)
cbhg = CBHG(hparams, use_post=False)
encoder = Encoder(hparams, use_decoder=False, use_post=False)
for step, values in enumerate(dataloader):
    texts, mels, _, _ = values
    
    embedding_result = embedding(texts)
    prenet_result = prenet(embedding_result)
    cbhg_result = cbhg(prenet_result)
    encoded = encoder(texts)
    
    print('Embedding result: {} \nPreNet result: {} \nCBHG result: {} \nEncoder result: {} \n'.format(
        embedding_result.size(), prenet_result.size(), cbhg_result.size(), encoded.size()))
    if step == 4:
        break
        
        
####
Embedding result: torch.Size([4, 50, 512]) 
PreNet result: torch.Size([4, 50, 128]) 
CBHG result: torch.Size([4, 50, 256]) 
Encoder result: torch.Size([4, 50, 256]) 

Embedding result: torch.Size([4, 50, 512]) 
PreNet result: torch.Size([4, 50, 128]) 
CBHG result: torch.Size([4, 50, 256]) 
Encoder result: torch.Size([4, 50, 256]) 

Embedding result: torch.Size([4, 47, 512]) 
PreNet result: torch.Size([4, 47, 128]) 
CBHG result: torch.Size([4, 47, 256]) 
Encoder result: torch.Size([4, 47, 256]) 

Embedding result: torch.Size([4, 72, 512]) 
PreNet result: torch.Size([4, 72, 128]) 
CBHG result: torch.Size([4, 72, 256]) 
Encoder result: torch.Size([4, 72, 256]) 

Embedding result: torch.Size([4, 42, 512]) 
PreNet result: torch.Size([4, 42, 128]) 
CBHG result: torch.Size([4, 42, 256]) 
Encoder result: torch.Size([4, 42, 256])

encoder = Encoder(hparams, use_decoder=False, use_post=False)

print(encoder)


####
Encoder(
  (encoder): Sequential(
    (0): Embedding(
      (embedding_layer): Sequential(
        (0): Embedding(74, 512)
      )
    )
    (1): PreNet(
      (prenet): Sequential(
        (0): CustomLinear(
          (linear): Linear(in_features=512, out_features=256, bias=True)
          (activation): ReLU()
          (dropout): Dropout(p=0.5, inplace=False)
        )
        (1): CustomLinear(
          (linear): Linear(in_features=256, out_features=128, bias=True)
          (activation): ReLU()
          (dropout): Dropout(p=0.5, inplace=False)
        )
      )
    )
    (2): CBHG(
      (residual): Sequential(
        (0): Identity()
      )
      (conv1d_bank): ModuleList(
        (0): BatchNormConv1D(
          (conv1d): Conv1d(128, 128, kernel_size=(1,), stride=(1,), padding=(8,))
          (batchnorm1d): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (activation): ReLU()
        )
        (1): BatchNormConv1D(
          (conv1d): Conv1d(128, 128, kernel_size=(2,), stride=(1,), padding=(8,))
          (batchnorm1d): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (activation): ReLU()
        )
        (2): BatchNormConv1D(
          (conv1d): Conv1d(128, 128, kernel_size=(3,), stride=(1,), padding=(8,))
          (batchnorm1d): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (activation): ReLU()
        )
        (3): BatchNormConv1D(
          (conv1d): Conv1d(128, 128, kernel_size=(4,), stride=(1,), padding=(8,))
          (batchnorm1d): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (activation): ReLU()
        )
        (4): BatchNormConv1D(
          (conv1d): Conv1d(128, 128, kernel_size=(5,), stride=(1,), padding=(8,))
          (batchnorm1d): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (activation): ReLU()
        )
        (5): BatchNormConv1D(
          (conv1d): Conv1d(128, 128, kernel_size=(6,), stride=(1,), padding=(8,))
          (batchnorm1d): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (activation): ReLU()
        )
        (6): BatchNormConv1D(
          (conv1d): Conv1d(128, 128, kernel_size=(7,), stride=(1,), padding=(8,))
          (batchnorm1d): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (activation): ReLU()
        )
        (7): BatchNormConv1D(
          (conv1d): Conv1d(128, 128, kernel_size=(8,), stride=(1,), padding=(8,))
          (batchnorm1d): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (activation): ReLU()
        )
        (8): BatchNormConv1D(
          (conv1d): Conv1d(128, 128, kernel_size=(9,), stride=(1,), padding=(8,))
          (batchnorm1d): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (activation): ReLU()
        )
        (9): BatchNormConv1D(
          (conv1d): Conv1d(128, 128, kernel_size=(10,), stride=(1,), padding=(8,))
          (batchnorm1d): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (activation): ReLU()
        )
        (10): BatchNormConv1D(
          (conv1d): Conv1d(128, 128, kernel_size=(11,), stride=(1,), padding=(8,))
          (batchnorm1d): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (activation): ReLU()
        )
        (11): BatchNormConv1D(
          (conv1d): Conv1d(128, 128, kernel_size=(12,), stride=(1,), padding=(8,))
          (batchnorm1d): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (activation): ReLU()
        )
        (12): BatchNormConv1D(
          (conv1d): Conv1d(128, 128, kernel_size=(13,), stride=(1,), padding=(8,))
          (batchnorm1d): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (activation): ReLU()
        )
        (13): BatchNormConv1D(
          (conv1d): Conv1d(128, 128, kernel_size=(14,), stride=(1,), padding=(8,))
          (batchnorm1d): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (activation): ReLU()
        )
        (14): BatchNormConv1D(
          (conv1d): Conv1d(128, 128, kernel_size=(15,), stride=(1,), padding=(8,))
          (batchnorm1d): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (activation): ReLU()
        )
        (15): BatchNormConv1D(
          (conv1d): Conv1d(128, 128, kernel_size=(16,), stride=(1,), padding=(8,))
          (batchnorm1d): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (activation): ReLU()
        )
      )
      (conv1d_proj): Sequential(
        (0): MaxPool1d(kernel_size=2, stride=1, padding=1, dilation=2, ceil_mode=False)
        (1): BatchNormConv1D(
          (conv1d): Conv1d(2048, 128, kernel_size=(3,), stride=(1,), padding=(1,))
          (batchnorm1d): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (activation): ReLU()
        )
        (2): BatchNormConv1D(
          (conv1d): Conv1d(128, 128, kernel_size=(3,), stride=(1,), padding=(1,))
          (batchnorm1d): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
        )
      )
      (pre_highway): Sequential(
        (0): CustomLinear(
          (linear): Linear(in_features=128, out_features=128, bias=True)
        )
      )
      (highway): Sequential(
        (0): Highway(
          (H): CustomLinear(
            (linear): Linear(in_features=128, out_features=128, bias=True)
            (activation): ReLU()
          )
          (T): CustomLinear(
            (linear): Linear(in_features=128, out_features=128, bias=True)
            (activation): Sigmoid()
          )
        )
        (1): Highway(
          (H): CustomLinear(
            (linear): Linear(in_features=128, out_features=128, bias=True)
            (activation): ReLU()
          )
          (T): CustomLinear(
            (linear): Linear(in_features=128, out_features=128, bias=True)
            (activation): Sigmoid()
          )
        )
        (2): Highway(
          (H): CustomLinear(
            (linear): Linear(in_features=128, out_features=128, bias=True)
            (activation): ReLU()
          )
          (T): CustomLinear(
            (linear): Linear(in_features=128, out_features=128, bias=True)
            (activation): Sigmoid()
          )
        )
        (3): Highway(
          (H): CustomLinear(
            (linear): Linear(in_features=128, out_features=128, bias=True)
            (activation): ReLU()
          )
          (T): CustomLinear(
            (linear): Linear(in_features=128, out_features=128, bias=True)
            (activation): Sigmoid()
          )
        )
      )
      (gru): Sequential(
        (0): GRU(128, 128, batch_first=True, bidirectional=True)
      )
    )
  )
)

상당히 긴 포스팅이었습니다.

제가 느끼기에 Encoder는 Decoder보다 훨씬 쉽다고 여겼는데,

앞으로 구현해야 하는 Decoder를 포스팅할 때 느낄 고통이 간접적으로 느껴지네요.

다음 포스팅에는 Attention과 더불어 Decoder와 Post CBHG

그리고 Linear-scale Spectrogram까지 출력하는 코드를 구현해오겠습니다.

긴 글 읽어주셔서 감사합니다.

제가 작성한 포스트에 오류가 있다면 언제든지 댓글 부탁드립니다 :)