Receptive Fields on ConvNets (Receptive Field size confusion)

resnet receptive field
fully connected layer cnn
pooling layer
receptive field pooling
dilated convolution receptive field
max pooling
feature map size calculation
convolutional neural network

I was reading this from a paper: "Rather than using relatively large receptive fields in the first conv. layers we use very small 3 × 3 receptive fields throughout the whole net, which are convolved with the input at every pixel (with stride 1). It is easy to see that a stack of two 3 × 3 conv. layers (without spatial pooling in between) has an effective receptive field of 5 × 5; three such layers have a 7 × 7 effective receptive field."

How do they end up with a receptive field of 7x7?

This is how I understand it: suppose we have one image that is 100x100.

1st layer: zero-pad the image and convolve it with the 3x3 filter, outputting another 100x100 filtered image.

2nd layer: zero-pad the previous filtered image and convolve it with another 3x3 filter, output another 100x100 filtered image.

3rd layer: zero-pad the previous filtered image and convolve it with another 3x3 filter, output the final 100x100 filtered image.

What am I missing here?

Here's one way to think of it. Consider the following small image, with each pixel numbered as such:

00 01 02 03 04 05 06
10 11 12 13 14 15 16
20 21 22 23 24 25 26
30 31 32 33 34 35 36
40 41 42 43 44 45 46
50 51 52 53 54 55 56
60 61 62 63 64 65 66

Now consider the pixel 33 at the center. With the first 3x3 convolution, the generated value at pixel 33 will incorporate the values of pixels 22, 23, 24, 32, 33, 34, 42, 43, and 44. But notice that each of those pixels will also incorporate their surrounding pixels' values as well.

With the next 3x3 convolution, pixel 33 will again incorporate the values of its surrounding pixels, but now, the value of those pixels incorporates their surrounding pixels from the original image. In effect, this means that the value of pixel 33 is governed by the values reaching out to a 5x5 "square of influence" you could say.

Each additional 3x3 convolution has the effect of stretching the effective receptive field by another pixel in each direction.

I hope that didn't just make it more confusing...
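To sanity-check that growth numerically (a minimal sketch of my own, not from the original post): with stride-1 convolutions, each layer of kernel size k stretches the receptive field by k - 1 pixels, exactly as described above.

```python
def receptive_field(kernel_sizes, strides=None):
    """Receptive field of a stack of conv layers (stride 1 by default)."""
    if strides is None:
        strides = [1] * len(kernel_sizes)
    rf = 1    # a single output pixel
    jump = 1  # distance between adjacent outputs, measured in input pixels
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf

print(receptive_field([3]))        # 3
print(receptive_field([3, 3]))     # 5
print(receptive_field([3, 3, 3]))  # 7
```

Each additional 3x3 entry in the list adds 2 to the result, matching the "one pixel in each direction" intuition.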

A useful special case from the paper "Understanding the Effective Receptive Field in Deep Convolutional Neural Networks": if all kernels are of size 1, the receptive field is also of size 1; and if all strides are 1, the receptive field is simply the sum of \((k_l - 1)\) over all layers, plus 1.

                  16          ------> Layer3
                13 14 15      ------->Layer2
              8 9 10 11 12    ------->Layer1
              1 2 3 4 5 6 7   ------->Input Layer

Let's consider 1D instead of 2D for better clarity. Treat each numerical value as one pixel and each vertical level as a convolution layer. Now fix the receptive field (filter size) F = 3, padding P = 0, and stride S = 1 for each layer, and let W_j be the number of pixels at layer j. Then by the formula:

             W_{j+1} = (W_j - F + 2P)/S + 1

In this case we have 7 pixels at the input layer, so using the formula you can easily calculate the number of pixels at each layer above (7, then 5, then 3, then 1). Now look at the pixel named 16 at Layer 3: it receives its inputs from 13, 14, and 15, since F = 3. Similarly 13, 14, and 15 get their inputs from (8 9 10), (9 10 11), and (10 11 12) respectively, since S = 1 and F = 3.

Similarly, 8 gets its inputs from (1 2 3), 9 from (2 3 4), ..., and 12 from (5 6 7).

So with respect to 16, it receives its input from all 7 pixels at the bottom.
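This can be verified numerically (my own sketch, using an all-ones sum filter so that influence is easy to trace): perturb only input pixel 1 and check whether the single Layer 3 unit changes.

```python
import numpy as np

def conv1d_valid(a, k=3):
    # 'valid' 1D convolution with an all-ones kernel: each output sums k neighbours
    return np.array([a[i:i + k].sum() for i in range(len(a) - k + 1)])

x = np.zeros(7)
x[0] = 1.0                     # perturb only the leftmost input pixel (pixel 1)

layer1 = conv1d_valid(x)       # 5 units (pixels 8..12)
layer2 = conv1d_valid(layer1)  # 3 units (pixels 13..15)
layer3 = conv1d_valid(layer2)  # 1 unit  (pixel 16)

print(len(layer1), len(layer2), len(layer3))  # 5 3 1
print(layer3[0] != 0)  # True: pixel 1 reaches unit 16
```

The layer widths 7, 5, 3, 1 match the formula above, and the leftmost input pixel does influence the top unit, confirming the 7-pixel receptive field.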

Now the main advantages of using small receptive fields are twofold. First, there are fewer parameters than with large receptive fields. Second, stacking small filters puts a non-linearity between each layer, so the combinations of those bottom 7 pixels pass through more non-linear transformations than a single large filter would allow. I would recommend checking out the awesome CS231n course notes (linked below), where all of this is beautifully explained.

Convolutional Neural Networks (CNNs/ConvNets)
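To put a number on the parameter saving (a back-of-the-envelope calculation of my own, assuming C input channels, C output channels, and ignoring biases): three stacked 3x3 layers cost 27C^2 weights, while a single 7x7 layer covering the same 7x7 receptive field costs 49C^2.

```python
C = 64  # channels in and out (an arbitrary example value)

three_3x3 = 3 * (3 * 3 * C * C)  # 27 * C^2 weights
one_7x7 = 7 * 7 * C * C          # 49 * C^2 weights

print(three_3x3, one_7x7)  # 110592 200704
```

So for the same receptive field, the stacked version uses roughly 45% fewer weights while also inserting two extra non-linearities.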


I think a good answer has been provided by @Aenimated1, but the link provided by @chirag gives a good way to frame it. I paste the link here again for anyone else coming across this:


And the specific extract that answers the question is:

Suppose that you stack three 3x3 CONV layers on top of each other (with non-linearities in between, of course). In this arrangement, each neuron on the first CONV layer has a 3x3 view of the input volume. A neuron on the second CONV layer has a 3x3 view of the first CONV layer, and hence by extension a 5x5 view of the input volume. Similarly, a neuron on the third CONV layer has a 3x3 view of the 2nd CONV layer, and hence a 7x7 view of the input volume.

To buttress this answer, I came across a post which might be very useful. It answers most doubts one might have about receptive fields.

I do not understand the value of the receptive field for dilated convolutions. For a regular convolution, 3x3 for example, the receptive field is 3x3, and for two stacked 3x3 convolutions the receptive field is 5x5. What is it when the convolutions are dilated?

Assume we have a network architecture that comprises only convolution layers. For each convolution layer we define a square kernel size and a dilation rate, and assume the stride is 1. You can then compute the receptive field of the network with the following piece of Python code:

K = [3, 3]  # kernel size of each layer
R = [1, 2]  # dilation rate of each layer

RF = 1  # receptive field, grown layer by layer
d = 1   # depth (layer index)
for k, r in zip(K, R):
    support = k + (k - 1) * (r - 1)  # an r-dilated conv. inserts r-1 zeros between coefficients
    RF += support - 1                # with stride 1, each layer adds support-1 pixels
    print('depth=%d, K=%d, R=%d, kernel support=%d' % (d, k, r, support))
    d += 1
print('Receptive Field: %d' % RF)

As an example, let's compute the receptive field (RF) of the well-known DnCNN (denoising convolutional neural network) [1]. Use the above piece of code with the following inputs; you will get RF = 35.

# In DnCNN-S, the network has 17 convolution layers.
K=[3]*17  # Kernel Size
R=[1]*17  # Dilation Rate
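As a cross-check (my own arithmetic, not from the DnCNN paper): for L stride-1, undilated layers of kernel size k, the loop above reduces to the closed form RF = 1 + L(k - 1).

```python
L_layers, k = 17, 3            # DnCNN-S: 17 conv layers, all 3x3, no dilation
RF = 1 + L_layers * (k - 1)    # closed form for stride 1, dilation 1
print(RF)  # 35
```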

[1] Zhang, Kai, et al. "Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising." IEEE Transactions on Image Processing 26.7 (2017): 3142-3155.
