Why is the input scaled in tf.nn.dropout in TensorFlow?

I can't understand why dropout works like this in TensorFlow. The CS231n notes say that "dropout is implemented by only keeping a neuron active with some probability p (a hyperparameter), or setting it to zero otherwise." You can also see this in the figure from the same site.

From the TensorFlow documentation: "With probability keep_prob, outputs the input element scaled up by 1 / keep_prob, otherwise outputs 0."

Now, why is the input element scaled up by 1/keep_prob? Why not just keep the input element as it is with probability keep_prob, without scaling it by 1/keep_prob?

This scaling enables the same network to be used for training (with keep_prob < 1.0) and evaluation (with keep_prob == 1.0). From the Dropout paper:

The idea is to use a single neural net at test time without dropout. The weights of this network are scaled-down versions of the trained weights. If a unit is retained with probability p during training, the outgoing weights of that unit are multiplied by p at test time as shown in Figure 2.

Rather than adding ops to scale down the weights by keep_prob at test time, the TensorFlow implementation adds an op to scale up the weights by 1. / keep_prob at training time. The effect on performance is negligible, and the code is simpler (because we use the same graph and treat keep_prob as a tf.placeholder() that is fed a different value depending on whether we are training or evaluating the network).
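
To make those two options concrete, here is a small numpy sketch (my own illustration, not from the original answer) comparing classic dropout, which scales the weights down by keep_prob at test time, with the inverted dropout that tf.nn.dropout implements, which scales the kept values up by 1/keep_prob at training time. In expectation both give the same activations:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)                    # inputs to a linear unit
w = rng.normal(size=1000)                    # its weights
keep_prob = 0.5
mask = rng.random(1000) < keep_prob          # keep each unit with probability keep_prob

# Classic dropout (the paper): no scaling at training, weights scaled down at test.
train_classic = np.dot(w * mask, x)          # shrinks to ~keep_prob * np.dot(w, x)
test_classic  = np.dot(w * keep_prob, x)     # compensate by scaling the weights down

# Inverted dropout (TensorFlow): scale up at training, leave the test graph untouched.
train_inverted = np.dot(w * mask, x) / keep_prob
test_inverted  = np.dot(w, x)                # no extra ops needed at test time

print(train_classic, test_classic)           # match in expectation
print(train_inverted, test_inverted)         # match in expectation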

Let's say the network has n neurons and we apply a dropout rate of 1/2.

Training phase: on average we are left with n/2 active neurons. So if you were expecting an output of x with all neurons active, you now get roughly x/2, and for every batch the network weights are trained against this x/2.

Testing/inference/validation phase: we don't apply any dropout, so all neurons are active and the output is x, not the x/2 the network was trained on, which would give you an incorrect result. What you could do is scale the output down to x/2 during testing.

Rather than applying that scaling only at test time, what TensorFlow's dropout does is scale the kept activations up by 1/keep_prob during training, so that the expected sum of the outputs stays constant whether dropout is applied (training) or not (testing).
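
A quick numerical check of that "the sum stays constant" point (my own sketch, not part of the original answer):

import numpy as np

x = np.ones(10_000)                    # pretend these are a layer's activations
keep_prob = 0.5
mask = np.random.rand(10_000) < keep_prob

print(x.sum())                         # 10000.0: no dropout
print((x * mask).sum())                # ~5000: unscaled dropout halves the sum
print((x * mask / keep_prob).sum())    # ~10000: inverted scaling restores the expected sum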

tf.nn.dropout consists in randomly setting a fraction rate of input units to 0 at each update during training time, which helps prevent overfitting. The units that are kept are scaled by 1 / (1 - rate), so that their expected sum is unchanged between training time and inference time. In the older keep_prob API: with probability keep_prob, outputs the input element scaled up by 1 / keep_prob, otherwise outputs 0. The scaling is so that the expected sum is unchanged.

If you keep reading in cs231n, the difference between dropout and inverted dropout is explained.

Since we want to leave the forward pass at test time untouched (and tweak our network just during training), tf.nn.dropout directly implements inverted dropout, scaling the values.
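
For reference, here is what that looks like with the current rate-based tf.nn.dropout API (TF 2.x); at test time you simply don't call the op, so the forward pass stays untouched:

import tensorflow as tf

x = tf.ones([1, 10])
y = tf.nn.dropout(x, rate=0.5)   # kept entries are scaled up to 1 / (1 - 0.5) = 2.0
print(y.numpy())                 # roughly half the entries are 2.0, the rest are 0.0
# At inference the call is simply omitted, leaving x unchanged.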

tf.compat.v1.layers.Dropout has the effect of making the training process noisy, forcing nodes within a layer to probabilistically take on more or less responsibility for the inputs. Note: the behavior of dropout has changed between TensorFlow 1.x and 2.x; when converting 1.x code, use named arguments to ensure the behavior stays consistent.

Here is a quick experiment to dispel any remaining confusion.

Statistically, the weights of a NN layer follow a distribution that is usually close to normal (but not necessarily), but even when trying to sample a perfect normal distribution, in practice there is always sampling error.

Then consider the following experiment:

import numpy as np
from collections import defaultdict

DIM = 1_000_000                      # set our dims for weights and input
x = np.ones((DIM, 1))                # our input vector
#x = np.random.rand(DIM, 1)*2 - 1.0  # or could also be a more realistic normalized input
print("x-mean = ", x.mean())

probs = [1.0, 0.7, 0.5, 0.3]         # keep probabilities to try (p = 1.0 means no dropout)

W = np.random.normal(size=(DIM, 1))  # sample normally distributed weights
print("W-mean = ", W.mean())         # note the mean is not exactly 0 --> sampling error!

# DO THE DRILL
h = defaultdict(list)
for i in range(1000):
  for p in probs:
    M = np.random.rand(DIM, 1)
    M = (M < p).astype(int)          # keep mask: 1 with probability p, 0 otherwise
    Wp = W * M                       # drop the masked-out weights
    a = np.dot(Wp.T, x)              # linear activation of the unit
    h[str(p)].append(a)

for k, v in h.items():
  print("For drop-out prob %r the average linear activation is %r (unscaled) and %r (scaled)" % (k, np.mean(v), np.mean(v)/float(k)))

Sample output:

x-mean =  1.0
W-mean =  -0.001003985674840264
For drop-out prob '1.0' the average linear activation is -1003.985674840258 (unscaled) and -1003.985674840258 (scaled)
For drop-out prob '0.7' the average linear activation is -700.6128015029908 (unscaled) and -1000.8754307185584 (scaled)
For drop-out prob '0.5' the average linear activation is -512.1602655283492 (unscaled) and -1024.3205310566984 (scaled)
For drop-out prob '0.3' the average linear activation is -303.21194422742315 (unscaled) and -1010.7064807580772 (scaled)

Notice that the unscaled activations shrink roughly in proportion to the keep probability, while the scaled activations stay close to the full (p = 1.0) activation; the small remaining wobble comes from the statistically imperfect normal sample.

Can you spot an obvious correlation between the W-mean and the average linear activation means?

Inverted dropout: for each element of x, with probability rate, outputs 0, and otherwise scales up the input by 1 / (1 - rate). The scaling is such that the expected sum is unchanged.

tf.compat.v1.nn.dropout: inputs not set to 0 are scaled up by 1 / (1 - rate) such that the expected sum over all inputs is unchanged. Note that the Dropout layer only applies dropout when training is set to True. Yes, you can use tf.nn.dropout to do DropConnect: just wrap your weight matrix with tf.nn.dropout instead of the result of the matrix multiplication, and then undo the built-in rescaling by multiplying by keep_prob, like so: dropConnect = tf.nn.dropout(m1, keep_prob) * keep_prob
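
A slightly fuller sketch of that DropConnect trick, written against the TF 2.x rate-based API (the shapes and variable names here are just placeholders for illustration):

import tensorflow as tf

x = tf.random.normal([32, 128])               # a batch of inputs
W = tf.Variable(tf.random.normal([128, 64]))  # a weight matrix
rate = 0.5                                    # drop probability (keep_prob = 1 - rate)

# Regular (inverted) dropout masks the activations after the matmul.
h_dropout = tf.nn.dropout(tf.matmul(x, W), rate=rate)

# DropConnect masks the weights before the matmul. tf.nn.dropout scales the kept
# entries by 1 / (1 - rate), so multiply by (1 - rate) if you want plain weight
# masking without that rescaling.
W_masked = tf.nn.dropout(W, rate=rate) * (1.0 - rate)
h_dropconnect = tf.matmul(x, W_masked)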

tf.keras.layers.Dropout: with later versions of TensorFlow, tf.layers.dropout should be used instead of tf.nn.dropout. It supports a training argument, which lets your model define the dropout behavior once instead of relying on the feed_dict to manage an external keep_prob parameter. This allows for better refactored code.
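
In TF 2.x the same idea lives in tf.keras.layers.Dropout, which takes a training argument at call time; a minimal sketch:

import tensorflow as tf

drop = tf.keras.layers.Dropout(rate=0.5)
x = tf.ones([1, 10])

print(drop(x, training=True).numpy())   # kept entries are scaled up to 2.0, the rest are 0.0
print(drop(x, training=False).numpy())  # identity: all entries stay 1.0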

Comments
  • That's the so-called weight scaling inference rule.
  • I'm sorry, I'm new to this concept; maybe I'm missing something obvious. Can you give a simpler explanation? I mean, why 1/keep_prob? What would be the difference if I used keep_prob vs 1/keep_prob? BTW, I understand from your explanation why the code gets simpler.
  • The aim is to keep the expected sum of the weights the same (and hence the expected value of the activations the same) regardless of keep_prob. If, when doing dropout, we keep a neuron only with probability keep_prob (i.e. disable it with probability 1 - keep_prob), we need to multiply the surviving weights by 1. / keep_prob to keep this value the same in expectation. Otherwise, for example, the non-linearity would produce a completely different result depending on the value of keep_prob.
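
To spell out that expectation (a quick back-of-the-envelope check, not part of the original comment): for a single element x kept with probability keep_prob and scaled by 1/keep_prob when kept,

E[output] = keep_prob * (x / keep_prob) + (1 - keep_prob) * 0 = x

so the 1/keep_prob factor exactly cancels the expected shrinkage caused by dropping.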