Discrepancy between Eager and Graph mode

Hi,
I’m implementing yet another Tacotron 2 (Text-To-Speech) model using TF 2.8.0.
It consists of an encoder/decoder architecture with an attention module that provides an alignment between the text and the frames of the (mel-)spectrogram we want to produce. Learning a good alignment between mel-spectrogram and text is the key to this model.

I face a discrepancy between eager and graph mode when training it.

Basically, eager mode succeeds at learning the alignment and the output sounds great, while graph mode fails to produce such an alignment. It does compile and run without any error, and in graph mode the model does seem to learn something, but the output is far from intelligible since it never learns a proper alignment. The learning curves in both cases look like those of a typical converging model.

The main problem is that eager mode is roughly 8-10 times slower.

Are there things I should watch out for when switching from eager to graph mode that I’m missing?
LSTMCell? The for loop?
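
To be clear about what I mean by switching modes, here is a minimal sketch (simplified, not my actual training step; the loss here is just a placeholder):

    import tensorflow as tf

    def train_step(model, optimizer, batch):
        with tf.GradientTape() as tape:
            (mels, mels_post, mels_len), gates, alignments = model(batch, training=True)
            # Placeholder loss: the real one combines mel, post-net and gate terms
            loss = tf.reduce_mean(tf.square(mels_post))
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss

    # Eager mode: call train_step directly.
    # Graph mode: trace it once, e.g. graph_train_step = tf.function(train_step)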

I can’t share the whole code since it wouldn’t fit in a post, and I have no idea which part is the most relevant, since the code compiles without any warning or error.

    def call(self, batch, training=False):
        phon, mels, mels_len = batch
        mels = tf.transpose(mels, perm=[0, 2, 1])

        x = self.tokenizer(phon)                    # TextVectorization layer
        y = self.char_embedding(x)                  # Embedding layer
        mask = self.char_embedding.compute_mask(x)

        y = self.encoder(y, mask)                   # convolution layers with batch norm + biLSTM

        # Teacher forcing: LSTMCells and attention
        mels, gates, alignments = self.decoder(y, mels, mask)

        residual = self.decoder.postnet(tf.squeeze(mels, -1))  # convolution layers with batch norm
        mels_post = mels + tf.expand_dims(residual, -1)

        return (mels, mels_post, mels_len), gates, alignments
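
(Side note, in case it matters: Embedding.compute_mask only returns a padding mask when the layer was created with mask_zero=True; otherwise it returns None. A toy illustration with made-up sizes:)

    import tensorflow as tf

    # Toy illustration (hypothetical sizes): mask_zero=True is what makes
    # compute_mask return a padding mask instead of None.
    emb = tf.keras.layers.Embedding(input_dim=100, output_dim=8, mask_zero=True)
    ids = tf.constant([[3, 7, 0, 0]])  # 0 is the padding id
    print(emb.compute_mask(ids))       # [[ True  True False False]]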

The teacher-forcing scheme is implemented with a for loop iterating over the frames of the ground-truth mel-spectrogram:

        mels_size = tf.shape(mel_gt)[0]
        mels_out = tf.TensorArray(tf.float32, size=mels_size)
        gates_out = tf.TensorArray(tf.float32, size=mels_size)
        alignments = tf.TensorArray(tf.float32, size=mels_size)

        self.prepare_decoder(enc_out)

        for i in tf.range(mels_size):
            mel_in = mel_gt[i]
            # One decoder step: LSTMCells and attention module
            mel_out, gate_out, alignment = self.decode(mel_in, enc_out, self.W_enc_out, enc_out_mask)

            mels_out = mels_out.write(i, mel_out)
            gates_out = gates_out.write(i, gate_out)
            alignments = alignments.write(i, alignment)

        mels_out = mels_out.stack()
        gates_out = gates_out.stack()
        alignments = alignments.stack()
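
As far as I understand, inside tf.function AutoGraph rewrites this for loop over tf.range into a tf.while_loop, with the TensorArrays carried as loop variables. A self-contained toy version of the same pattern (hypothetical shapes, a dummy op standing in for self.decode) runs in both modes:

    import tensorflow as tf

    @tf.function
    def toy_teacher_forcing(mel_gt):
        # mel_gt: (T, batch, n_mels) -- hypothetical layout mirroring the loop above
        steps = tf.shape(mel_gt)[0]
        out = tf.TensorArray(tf.float32, size=steps)
        for i in tf.range(steps):              # AutoGraph -> tf.while_loop
            frame = mel_gt[i]
            out = out.write(i, frame * 0.5)    # dummy op standing in for self.decode(...)
        return out.stack()                     # (T, batch, n_mels)

    print(toy_teacher_forcing(tf.random.normal([4, 2, 80])).shape)  # (4, 2, 80)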


In eager mode it learns a nice-looking alignment from the first epoch:

[alignment plot: good]

In graph mode I can wait for days without getting a proper alignment (it is roughly diagonal though, so it still learns a little bit):

[alignment plot: bad]