Hot questions for Using Neural networks in cudnn


I've developed a NN Model with Keras, based on the LSTM Layer. In order to increase speed on Paperspace (a GPU Cloud processing infrastructure), I've switched the LSTM Layer with the new CuDNNLSTM Layer. However this is usable only on machines with GPU cuDNN support. PS: CuDNNLSTM is available only on Keras master, not in the latest release.

So I've generated the weights and saved them to hdf5 format on the Cloud, and I'd like to use them locally on my MacBook. Since CuDNNLSTM layer is not available, only for my local installation I've switched back to LSTM.

Reading this tweet about CuDNN from @fchollet I thought it would work just fine, simply reading the weights back into the LSTM model.

However, when I try to import them Keras is throwing this error:

Traceback (most recent call last):
tensorflow.python.framework.errors_impl.InvalidArgumentError: Dimension 0 in both shapes must be equal, but are 2048 and 4096 for 'Assign_2' (op: 'Assign') with input shapes: [2048], [4096].
ValueError: Dimension 0 in both shapes must be equal, but are 2048 and 4096 for 'Assign_2' (op: 'Assign') with input shapes: [2048], [4096]

Analyzing the hdf5 files with h5cat I can see that the two structures are different.


I cannot load weights generated from CuDNNLSTM into an LSTM model. Am I doing something wrong? How can I get them to work seamlessly?

Here is my model:

SelectedLSTM = CuDNNLSTM if is_gpu_enabled() else LSTM
# ...
model = Sequential()
model.add(SelectedLSTM(HIDDEN_DIM, return_sequences=True, input_shape=(SEQ_LENGTH, vocab_size)))
model.add(SelectedLSTM(HIDDEN_DIM, return_sequences=False))

model.compile(loss='categorical_crossentropy', optimizer='rmsprop')


The reason is that the bias vector of the CuDNNLSTM layer is twice as long as that of LSTM. This comes from the underlying implementation of the cuDNN API. You can compare the following equations (copied from the cuDNN user's guide) to the usual LSTM equations:
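For reference, the LSTM formulation in the cuDNN documentation is usually written roughly as follows. This is a reconstruction rather than a verbatim copy, and the symbol names are assumptions: $W$ for input weights, $R$ for recurrent weights, $b_W$ and $b_R$ for the two bias vectors, and $\circ$ for element-wise multiplication.

```latex
\begin{aligned}
i_t  &= \sigma(W_i x_t + R_i h_{t-1} + b_{W_i} + b_{R_i}) \\
f_t  &= \sigma(W_f x_t + R_f h_{t-1} + b_{W_f} + b_{R_f}) \\
o_t  &= \sigma(W_o x_t + R_o h_{t-1} + b_{W_o} + b_{R_o}) \\
c'_t &= \tanh(W_c x_t + R_c h_{t-1} + b_{W_c} + b_{R_c}) \\
c_t  &= f_t \circ c_{t-1} + i_t \circ c'_t \\
h_t  &= o_t \circ \tanh(c_t)
\end{aligned}
```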

CuDNN uses two bias terms, so the number of bias weights is doubled. To convert it back to what LSTM uses, the two bias terms need to be summed.
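A minimal NumPy sketch of that summation. It assumes the cuDNN-style bias is stored as one flat vector with the input-side biases in the first half and the recurrent-side biases in the second half, which is an assumption about the layout. The sizes are chosen to match the error message: 4096 = 8 * 512 and 2048 = 4 * 512, consistent with a hidden size of 512. `convert_cudnn_bias` is a hypothetical helper name, not a Keras function.

```python
import numpy as np

def convert_cudnn_bias(cudnn_bias):
    """Collapse a cuDNN-style double bias into a single LSTM-style bias.

    Hypothetical helper: assumes the layout [input biases | recurrent biases].
    """
    input_bias, recurrent_bias = np.split(cudnn_bias, 2, axis=0)
    return input_bias + recurrent_bias

units = 512  # assumed hidden size, inferred from the shapes in the error
cudnn_bias = np.arange(8 * units, dtype=np.float32)  # stand-in for a loaded bias
lstm_bias = convert_cudnn_bias(cudnn_bias)
print(cudnn_bias.shape, lstm_bias.shape)  # (4096,) (2048,)
```

This covers only the bias vector; it does not touch the kernels.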

I've submitted a PR to do the conversion and it's merged. You can install the latest Keras from GitHub and the problem in weight loading should be solved.


I have a next-step prediction model on time series which is simply a GRU with a fully-connected layer on top of it. When I train it on CPU, I get a loss of 0.10 after 50 epochs, but when I train it on GPU the loss is 0.15 after 50 epochs. Running more epochs doesn't really lower the loss in either case.

Why is performance after training on CPU better than GPU?

I have tried changing the random seeds for both data and model, and these results are independent of the random seeds.

I have:

Python 3.6.2

PyTorch 0.3.0





I also use PyTorch's weight normalization torch.nn.utils.weight_norm on the GRU and on the fully-connected layer.


After trying many things I think I found the problem. Apparently the CUDNN libraries are sub-optimal in PyTorch. I don't know if it is a bug in PyTorch or a bug in CUDNN but doing

torch.backends.cudnn.enabled = False

solves the problem. With the above line, training with GPU or CPU gives the same loss at the same epoch.


It seems that it is the interaction of weight normalization and CUDNN which results in things going wrong. If I remove weight normalization it works. If I remove CUDNN it works. It seems that only in combination they do not work in PyTorch.
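A minimal sketch of the setup described above, assuming a single-layer GRU with a linear head and weight normalization on the GRU's hidden-to-hidden weight and on the linear layer. The layer sizes and the class name `NextStepModel` are made up for illustration. Instead of flipping the global switch, it uses the `torch.backends.cudnn.flags` context manager to disable cuDNN for one forward pass, which makes it easier to compare both code paths on the same model.

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm

class NextStepModel(nn.Module):
    """Hypothetical stand-in for the GRU + fully-connected model."""

    def __init__(self, n_features=4, hidden=8):
        super().__init__()
        # Weight-normalize the recurrent weight of the GRU and the linear weight.
        self.gru = weight_norm(nn.GRU(n_features, hidden, batch_first=True),
                               name="weight_hh_l0")
        self.fc = weight_norm(nn.Linear(hidden, n_features))

    def forward(self, x):
        out, _ = self.gru(x)
        return self.fc(out[:, -1])  # predict the next step from the last state

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = NextStepModel().to(device)
x = torch.randn(2, 5, 4, device=device)  # (batch, sequence length, features)

with torch.no_grad():
    y_default = model(x)
    # Disable cuDNN only inside this block; the global setting is untouched.
    with torch.backends.cudnn.flags(enabled=False):
        y_no_cudnn = model(x)

# Largest absolute difference between the two code paths.
print((y_default - y_no_cudnn).abs().max().item())
```

On a CPU-only machine both passes take the same code path, so the comparison is only informative on a GPU.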


As the title says, I want to find the definition of _cudnn_convolution_full_forward, but I searched the whole PyTorch project and could not find it. I also cannot find any documentation for this function.

Can anyone help me?


All the cudnn convolution functions are defined here:

This function doesn't exist anymore in the latest versions of PyTorch. The closest equivalent is cudnn_convolution_forward. In version 0.1.12, the function is in the same file:

I would recommend against using a non-public API (one starting with _) and using a public method instead, but you probably already know that.

In other words, you should be using

torch.backends.cudnn.enabled = True

and then conv2d or conv3d depending on your use.
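A minimal sketch of that public route, assuming a 2D convolution over a small random input. The tensor shapes are arbitrary illustration values. Whether cuDNN is actually used depends on running on a supported GPU; on CPU the same call simply takes a different backend.

```python
import torch
import torch.nn.functional as F

# True is already the default; it lets PyTorch dispatch to cuDNN where available.
torch.backends.cudnn.enabled = True

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(1, 3, 7, 7, device=device)  # (batch, channels, height, width)
w = torch.randn(2, 3, 3, 3, device=device)  # (out_channels, in_channels, kH, kW)

y = F.conv2d(x, w, stride=2)  # public API; no underscore-prefixed functions needed
print(tuple(y.shape))  # (1, 2, 3, 3)
```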


For example, suppose we have an RGB image with 3 channels (red, green, blue) and we use a convolutional neural network.

Does each convolutional filter always have different coefficients for each of the 3 channels (R, G, B) of the image?

  1. I.e. does filter W1 have 3 different coefficient matrices, W1[:,:,0], W1[:,:,1] and W1[:,:,2], as shown in the picture below?

  2. Or do modern neural networks often use the same coefficients for all channels within one filter (W1[:,:,0] = W1[:,:,1] = W1[:,:,2])?

Taken from this link:


Convolutional Layer


The extent of the connectivity along the depth axis is always equal to the depth of the input volume. It is important to emphasize again this asymmetry in how we treat the spatial dimensions (width and height) and the depth dimension: The connections are local in space (along width and height), but always full along the entire depth of the input volume.


What is shown here is the first hidden layer (here a convolutional layer). Every filter has 3 channels because your input (for this layer, your images) has 3 channels (RGB). This results in 2 feature maps that you concatenate, which explains the output volume of size (3x3)x2.

More generally, for an input of size (1x)WxHxC (taking a batch size of 1 for simplicity), every filter has a size of NxNxC. Assuming a stride of 1 and 'SAME' padding for simplicity (even though your example uses 'VALID' padding), F filters give an output of size (1x)WxHxF.

Hope it is clear enough (for your example W = H = 7, C = 3, N = 3 and F = 2).
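The shapes above can be checked with a naive NumPy sketch. It assumes 'VALID' padding and a stride of 2, which is the stride that turns a 7x7 input into the 3x3 output mentioned above; the stride value is my assumption. `conv2d_valid` is a hypothetical helper, and the bias term is left out for brevity.

```python
import numpy as np

def conv2d_valid(x, filters, stride=1):
    """Naive 'VALID' convolution (cross-correlation, as in most DL libraries).

    x:       (H, W, C) input volume
    filters: (F, N, N, C) bank of F filters, each spanning all C channels
    returns: (H_out, W_out, F)
    """
    H, W, C = x.shape
    F, N, _, Cf = filters.shape
    assert C == Cf, "filter depth must equal input depth"
    H_out = (H - N) // stride + 1
    W_out = (W - N) // stride + 1
    out = np.zeros((H_out, W_out, F))
    for f in range(F):
        for i in range(H_out):
            for j in range(W_out):
                patch = x[i * stride:i * stride + N, j * stride:j * stride + N, :]
                # One scalar per position: sum over height, width and depth.
                out[i, j, f] = np.sum(patch * filters[f])
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((7, 7, 3))           # W = H = 7, C = 3
filters = rng.standard_normal((2, 3, 3, 3))  # F = 2 filters of size N x N x C = 3x3x3
y = conv2d_valid(x, filters, stride=2)
print(y.shape)  # (3, 3, 2)
```

Each filter carries its own 3x3 slice of coefficients per input channel, and the three per-channel products are summed into a single feature map.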

Do not hesitate to comment if it is not clear enough :)


We're trying to build a forward convolutional neural network on an FPGA. The configuration of our build is based on the LeNet-5 architecture.

In the first convolution layer, there is no problem. There is just 1 input (the photo), and it gives 6 outputs (6 feature maps) using 6 (5*5) filters.

By the way, we trained our network on our data using Spyder, TensorFlow, etc.

But in the second convolution layer, there are 6 inputs (the outputs of the first max pooling layer) and 16 outputs with 16 (5*5*6) filters. Our research assistant told us: "You have 6 inputs and a (5*5) filter with a depth of 6. This means every input corresponds to the matching depth slice of the filter. At the end of the convolution, you sum all of the multiplication results so that you have just 1 output per filter."

But at which step do we sum the multiplication results?

In Python/Spyder/TensorFlow, the conv2d function does this for us and we just get the results, but in hardware I must know how it proceeds.

Thank you for your help, and sorry for my English.

Here is the explanation with a picture:


Take a moment and have a look at this:

I found this gif very helpful when learning how convolution is calculated and done in detail. Hopefully, this helps you understand how it proceeds in "hardware".
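As a rough software model of what the hardware has to do, here is a NumPy sketch with explicit loops. It assumes the standard LeNet-5 sizes (six 14x14 maps after the first pooling layer) and full connectivity between all 6 inputs and each of the 16 filters, as described in the question; `conv_layer_accumulate` is a hypothetical name. The sum over the 6 input maps happens inside the same multiply-accumulate that sums over the 5x5 window, so there is one accumulator per output pixel.

```python
import numpy as np

def conv_layer_accumulate(inputs, filters):
    """inputs: (C, H, W) feature maps; filters: (F, C, K, K).

    Returns (F, H-K+1, W-K+1), i.e. 'VALID' padding with stride 1.
    """
    C, H, W = inputs.shape
    F, Cf, K, _ = filters.shape
    assert C == Cf, "filter depth must equal the number of input maps"
    out = np.zeros((F, H - K + 1, W - K + 1))
    for f in range(F):                      # one output map per filter
        for i in range(H - K + 1):
            for j in range(W - K + 1):
                acc = 0.0                   # single accumulator per output pixel
                for c in range(C):          # walk over the 6 input maps
                    for u in range(K):
                        for v in range(K):
                            acc += inputs[c, i + u, j + v] * filters[f, c, u, v]
                out[f, i, j] = acc          # bias and activation would follow here
    return out

rng = np.random.default_rng(0)
pooled = rng.standard_normal((6, 14, 14))     # assumed output of the first pooling layer
filters = rng.standard_normal((16, 6, 5, 5))  # 16 filters of size 5*5*6
out = conv_layer_accumulate(pooled, filters)
print(out.shape)  # (16, 10, 10)
```

In an FPGA design the three innermost loops are the natural candidates for unrolling into parallel multipliers feeding an adder tree, but that is a design choice rather than something the arithmetic requires.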