Applying Dropout

Robert_Pope · July 8, 2021, 4:40pm

When you add a dropout layer to a model (like below), does the dropout only apply to the preceding layer or does it apply to all the hidden layers?

model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(64, activation='relu'),
  tf.keras.layers.Dropout(0.2),
  tf.keras.layers.Dense(10, activation='softmax')
])

Bhack · July 8, 2021, 7:30pm

It really depends by the arch.

E.g. This was the position some years ago for the convolutional layers:
http://mipal.snu.ac.kr/images/1/16/Dropout_ACCV2016.pdf

More in general one interesting recent work co-authored by Google is autodropout but I don’t know why the code isn’t available:

https://arxiv.org/abs/2101.01761

Robert_Pope · July 8, 2021, 7:45pm

Thanks. I guess my question is more specific to tf.keras.layers.Dropout().

If I want to use dropout regularization throughout my model, do I need to add a second Dropout layer after tf.keras.layers.Dense(128, activation=‘relu’)?

Bhack · July 8, 2021, 9:53pm

Generally it is ok but this is also a quite small model.

pgaleone · July 9, 2021, 5:36am

When using the tf.keras.layers.Dropout layer, the Dropout operation is applied only to the preceding layer.

Sergii_Kavun · July 9, 2021, 6:25am

yes, if you want to apply the Dropout layer, it needs to do to each layer separately after it. In addition, the Dropout layer also can be used for the Input layer. Moreover, “drop_level” is a hyperparameter in Dropout(drop_level).

8bitmp3 · July 9, 2021, 2:00pm

Some examples:

The original Transformers paper - Attention is all you need - mentioned the following:

Residual Dropout We apply dropout [27] to the output of each sub-layer, before it is added to the
sub-layer input and normalized. In addition, we apply dropout to the sums of the embeddings and the
positional encodings in both the encoder and decoder stacks.

Here’s an example with code that uses a Transformer-based architecture: Timeseries classification with a Transformer model

Notice how dropout is applied throughout the architecture:

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            [(None, 500, 1)]     0                                            
__________________________________________________________________________________________________
layer_normalization (LayerNorma (None, 500, 1)       2           input_1[0][0]                    
__________________________________________________________________________________________________
multi_head_attention (MultiHead (None, 500, 1)       7169        layer_normalization[0][0]        
                                                                 layer_normalization[0][0]        
__________________________________________________________________________________________________
dropout (Dropout)               (None, 500, 1)       0           multi_head_attention[0][0]       
__________________________________________________________________________________________________
tf.__operators__.add (TFOpLambd (None, 500, 1)       0           dropout[0][0]                    
                                                                 input_1[0][0]                    
__________________________________________________________________________________________________
layer_normalization_1 (LayerNor (None, 500, 1)       2           tf.__operators__.add[0][0]       
__________________________________________________________________________________________________
conv1d (Conv1D)                 (None, 500, 4)       8           layer_normalization_1[0][0]      
__________________________________________________________________________________________________
dropout_1 (Dropout)             (None, 500, 4)       0           conv1d[0][0]                     
__________________________________________________________________________________________________
conv1d_1 (Conv1D)               (None, 500, 1)       5           dropout_1[0][0]                  
__________________________________________________________________________________________________
tf.__operators__.add_1 (TFOpLam (None, 500, 1)       0           conv1d_1[0][0]                   
                                                                 tf.__operators__.add[0][0]       
__________________________________________________________________________________________________
layer_normalization_2 (LayerNor (None, 500, 1)       2           tf.__operators__.add_1[0][0]     
__________________________________________________________________________________________________
multi_head_attention_1 (MultiHe (None, 500, 1)       7169        layer_normalization_2[0][0]      
                                                                 layer_normalization_2[0][0]      
__________________________________________________________________________________________________
dropout_2 (Dropout)             (None, 500, 1)       0           multi_head_attention_1[0][0]     
__________________________________________________________________________________________________
tf.__operators__.add_2 (TFOpLam (None, 500, 1)       0           dropout_2[0][0]                  
                                                                 tf.__operators__.add_1[0][0]     
__________________________________________________________________________________________________
layer_normalization_3 (LayerNor (None, 500, 1)       2           tf.__operators__.add_2[0][0]     
__________________________________________________________________________________________________
conv1d_2 (Conv1D)               (None, 500, 4)       8           layer_normalization_3[0][0]      
__________________________________________________________________________________________________
dropout_3 (Dropout)             (None, 500, 4)       0           conv1d_2[0][0]                   
__________________________________________________________________________________________________
conv1d_3 (Conv1D)               (None, 500, 1)       5           dropout_3[0][0]                  
__________________________________________________________________________________________________
tf.__operators__.add_3 (TFOpLam (None, 500, 1)       0           conv1d_3[0][0]                   
                                                                 tf.__operators__.add_2[0][0]     
__________________________________________________________________________________________________
layer_normalization_4 (LayerNor (None, 500, 1)       2           tf.__operators__.add_3[0][0]     
__________________________________________________________________________________________________
multi_head_attention_2 (MultiHe (None, 500, 1)       7169        layer_normalization_4[0][0]      
                                                                 layer_normalization_4[0][0]      
__________________________________________________________________________________________________
dropout_4 (Dropout)             (None, 500, 1)       0           multi_head_attention_2[0][0]     
__________________________________________________________________________________________________
tf.__operators__.add_4 (TFOpLam (None, 500, 1)       0           dropout_4[0][0]                  
                                                                 tf.__operators__.add_3[0][0]     
__________________________________________________________________________________________________
layer_normalization_5 (LayerNor (None, 500, 1)       2           tf.__operators__.add_4[0][0]     
__________________________________________________________________________________________________
conv1d_4 (Conv1D)               (None, 500, 4)       8           layer_normalization_5[0][0]      
__________________________________________________________________________________________________
dropout_5 (Dropout)             (None, 500, 4)       0           conv1d_4[0][0]                   
__________________________________________________________________________________________________
conv1d_5 (Conv1D)               (None, 500, 1)       5           dropout_5[0][0]                  
__________________________________________________________________________________________________
tf.__operators__.add_5 (TFOpLam (None, 500, 1)       0           conv1d_5[0][0]                   
                                                                 tf.__operators__.add_4[0][0]     
__________________________________________________________________________________________________
layer_normalization_6 (LayerNor (None, 500, 1)       2           tf.__operators__.add_5[0][0]     
__________________________________________________________________________________________________
multi_head_attention_3 (MultiHe (None, 500, 1)       7169        layer_normalization_6[0][0]      
                                                                 layer_normalization_6[0][0]      
__________________________________________________________________________________________________
dropout_6 (Dropout)             (None, 500, 1)       0           multi_head_attention_3[0][0]     
__________________________________________________________________________________________________
tf.__operators__.add_6 (TFOpLam (None, 500, 1)       0           dropout_6[0][0]                  
                                                                 tf.__operators__.add_5[0][0]     
__________________________________________________________________________________________________
layer_normalization_7 (LayerNor (None, 500, 1)       2           tf.__operators__.add_6[0][0]     
__________________________________________________________________________________________________
conv1d_6 (Conv1D)               (None, 500, 4)       8           layer_normalization_7[0][0]      
__________________________________________________________________________________________________
dropout_7 (Dropout)             (None, 500, 4)       0           conv1d_6[0][0]                   
__________________________________________________________________________________________________
conv1d_7 (Conv1D)               (None, 500, 1)       5           dropout_7[0][0]                  
__________________________________________________________________________________________________
tf.__operators__.add_7 (TFOpLam (None, 500, 1)       0           conv1d_7[0][0]                   
                                                                 tf.__operators__.add_6[0][0]     
__________________________________________________________________________________________________
global_average_pooling1d (Globa (None, 500)          0           tf.__operators__.add_7[0][0]     
__________________________________________________________________________________________________
dense (Dense)                   (None, 128)          64128       global_average_pooling1d[0][0]   
__________________________________________________________________________________________________
dropout_8 (Dropout)             (None, 128)          0           dense[0][0]                      
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 2)            258         dropout_8[0][0]                  
==================================================================================================

A discussion on StackOverflow: machine learning - Where Dropout should be inserted.? Fully Connected Layer.? Convolutional Layer.? or Both.? - Stack Overflow

pgaleone · July 9, 2021, 3:55pm

hey, there’s my answer there machine learning - Where Dropout should be inserted.? Fully Connected Layer.? Convolutional Layer.? or Both.? - Stack Overflow

Lance_N · August 3, 2021, 5:37am

Dropout is generally used after a Dense or Convolutional layer. It affects the hidden neurons passed to the following layer.

A Convolutional layer outputs a set of feature maps. In 2D image-processing, these feature maps are gray-scale images that correspond to common shapes in the image set: cat eyes v.s. cat noses, for example. Applying Dropout after Convolutional layers does not do what you would expect, because the values in feature maps are strongly correlated: it is like putting a slice of Swiss cheese over a picture- you can still see the picture through the holes!

Convolutional layers, Dropout and BatchNormalization interact in complex ways. The best discussion that I have found on this topic is right here:
https://stackoverflow.com/questions/59634780/correct-order-for-spatialdropout2d-batchnormalization-and-activation-function