Hot questions on using neural networks with TensorFlow on the GPU

Question:

I'm training a CNN model with TensorFlow, but I only achieve a GPU utilization of about 60% (±2-3%), without big drops.

Sun Oct 23 11:34:26 2016       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 0000:01:00.0     Off |                  N/A |
|  1%   53C    P2    90W / 170W |   7823MiB /  8113MiB |     60%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      3644    C   /usr/bin/python2.7                            7821MiB |
+-----------------------------------------------------------------------------+

Since it's a Pascal card, I am using CUDA 8 with cuDNN 5.1.5. The CPU usage is around 50% (evenly distributed over 8 threads on an i7 4770k), so CPU performance should not be the bottleneck.

I'm using TensorFlow's binary file format and read it with tf.TFRecordReader().

I'm creating batches of images like this:

#Uses tf.TFRecordReader() to read single Example
label, image = read_and_decode_single_example(filename_queue=filename_queue) 
image = tf.image.decode_jpeg(image.values[0], channels=3)
jpeg = tf.cast(image, tf.float32) / 255.
jpeg.set_shape([66,200,3])
images_batch, labels_batch = tf.train.shuffle_batch(
    [jpeg, label], batch_size= FLAGS.batch_size,
    num_threads=8,
    capacity=2000, #tried bigger values here, does not change the performance
    min_after_dequeue=1000) #here too

Here is my training loop:

sess = tf.Session()

sess.run(init)
tf.train.start_queue_runners(sess=sess)
for step in xrange(FLAGS.max_steps):
    labels, images = sess.run([labels_batch, images_batch])
    feed_dict = {images_placeholder: images, labels_placeholder: labels}
    _, loss_value = sess.run([train_op, loss],
                                 feed_dict=feed_dict)

I don't have much experience with TensorFlow, and I don't know where the bottleneck could be. If you need any more code snippets to help identify the issue, I will provide them.

UPDATE: Bandwidth test results

==5172== NVPROF is profiling process 5172, command: ./bandwidthtest

Device: GeForce GTX 1070
Transfer size (MB): 3960

Pageable transfers
  Host to Device bandwidth (GB/s): 7.066359
  Device to Host bandwidth (GB/s): 6.850315

Pinned transfers
  Host to Device bandwidth (GB/s): 12.038037
  Device to Host bandwidth (GB/s): 12.683915

==5172== Profiling application: ./bandwidthtest
==5172== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 50.03%  933.34ms         2  466.67ms  327.33ms  606.01ms  [CUDA memcpy DtoH]
 49.97%  932.32ms         2  466.16ms  344.89ms  587.42ms  [CUDA memcpy HtoD]

==5172== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 46.60%  1.86597s         4  466.49ms  327.36ms  606.15ms  cudaMemcpy
 35.43%  1.41863s         2  709.31ms  632.94ms  785.69ms  cudaMallocHost
 17.89%  716.33ms         2  358.17ms  346.14ms  370.19ms  cudaFreeHost
  0.04%  1.5572ms         1  1.5572ms  1.5572ms  1.5572ms  cudaMalloc
  0.02%  708.41us         1  708.41us  708.41us  708.41us  cudaFree
  0.01%  203.58us         1  203.58us  203.58us  203.58us  cudaGetDeviceProperties
  0.00%  187.55us         1  187.55us  187.55us  187.55us  cuDeviceTotalMem
  0.00%  162.41us        91  1.7840us     105ns  61.874us  cuDeviceGetAttribute
  0.00%  79.979us         4  19.994us  1.9580us  73.537us  cudaEventSynchronize
  0.00%  77.074us         8  9.6340us  1.5860us  28.925us  cudaEventRecord
  0.00%  19.282us         1  19.282us  19.282us  19.282us  cuDeviceGetName
  0.00%  17.891us         4  4.4720us     629ns  8.6080us  cudaEventDestroy
  0.00%  16.348us         4  4.0870us     818ns  8.8600us  cudaEventCreate
  0.00%  7.3070us         4  1.8260us  1.7040us  2.0680us  cudaEventElapsedTime
  0.00%  1.6670us         3     555ns     128ns  1.2720us  cuDeviceGetCount
  0.00%     813ns         3     271ns     142ns     439ns  cuDeviceGet

Answer:

After getting some more experience with TensorFlow, I realized that GPU usage depends heavily on the network size, the batch size, and the preprocessing. Using a bigger network with more conv layers (ResNet-style, for example) increases GPU usage, because more computation is involved and the overhead of transferring data etc. becomes smaller relative to that computation.
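
A related source of overhead in the training loop above is fetching each batch into Python with sess.run and then feeding it back through feed_dict, which copies every image an extra time between the TensorFlow runtime and Python. Below is a minimal sketch of keeping the pipeline inside the graph instead; build_model is a hypothetical stand-in for the model-construction code that was not shown in the question:

# Build the model directly on the queue's output tensors so no
# feed_dict round trip through Python is needed.
train_op, loss = build_model(images_batch, labels_batch)  # hypothetical helper

sess = tf.Session()
sess.run(init)
tf.train.start_queue_runners(sess=sess)
for step in xrange(FLAGS.max_steps):
    # The input pipeline stays inside the TensorFlow graph.
    _, loss_value = sess.run([train_op, loss])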

Question:

I'm using TensorFlow with Titan X GPUs and I've noticed that when I run the CIFAR10 example, the volatile GPU utilization is pretty constant at around 30%, whereas when I train my own model the volatile GPU utilization is far from steady: it is almost always 0%, spikes to 80-90%, and drops back to 0%, over and over again.

I thought this behavior was due to the way I was feeding the data to the network (fetching the data after each step, which takes some time). But even after implementing a queue to feed the data and avoid this latency between steps, the problem persisted (see the queuing system below).

Any idea?

import threading

import tensorflow as tf

batch = 128 # size of the batch
x = tf.placeholder("float32", [None, n_steps, n_input])
y = tf.placeholder("float32", [None, n_classes])

# with a capacity of 100 batches, the bottleneck should not be the data feeding
queue = tf.RandomShuffleQueue(capacity=100*batch,
                  min_after_dequeue=80*batch,
                  dtypes=[tf.float32, tf.float32],
                  shapes=[[n_steps, n_input], [n_classes]])
enqueue_op = queue.enqueue_many([x, y])
X_batch, Y_batch = queue.dequeue_many(batch)

sess = tf.Session()

def load_and_enqueue(data):
    while True:
        X, Y = data.get_next_batch(batch)
        sess.run(enqueue_op, feed_dict={x: X, y: Y})

train_thread = threading.Thread(target=load_and_enqueue, args=(data,))
train_thread.daemon = True
train_thread.start()

for _ in xrange(max_iter):
    sess.run(train_op)

Answer:

After doing some experiments, I found the answer, so I am posting it since it could be useful to someone else.

First, get_next_batch is approximately 15x slower than train_op (thanks to Eric Platon for pointing this out).

However, I thought that the queue was filled up to capacity first, and that only then was training supposed to begin. Hence I assumed that even though get_next_batch is much slower, the queue would hide this latency, at least at the start: it holds capacity examples and only needs to fetch new data once it drains to min_after_dequeue, which is lower than capacity, so GPU utilization should be somewhat steady.

But actually, training begins as soon as the queue reaches min_after_dequeue examples. The queue is dequeued to run train_op as soon as that level is reached, and since feeding the queue is 15x slower than executing train_op, the number of elements in the queue drops below min_after_dequeue right after the first iteration, and train_op then has to wait for the queue to fill back up to min_after_dequeue examples.

When I force train_op to wait until the queue is filled up to capacity (with capacity = 100*batch) instead of starting automatically when it reaches min_after_dequeue (with min_after_dequeue = 80*batch), GPU utilization is steady for about 10 seconds before going back to 0%, which is understandable since the queue drains back down to min_after_dequeue in less than 10 seconds.
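
For reference, here is a minimal sketch of one way to wait for the queue to fill before entering the training loop, reusing the names from the question's code (the threshold of 100*batch matches the capacity used above):

import time

# Block until the queue actually holds `capacity` examples
# before the training loop starts dequeuing.
queue_size_op = queue.size()
while sess.run(queue_size_op) < 100 * batch:
    time.sleep(1.0)

for _ in xrange(max_iter):
    sess.run(train_op)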

Question:

I've made this neural net to figure out whether a house is a good buy or a bad buy. For some reason the code is not updating the weights and biases, and my loss stays the same. This is my code:

import pandas as pd
import tensorflow as tf

data = pd.read_csv("E:/workspace_py/datasets/good_bad_buy.csv")

features = data.drop(['index', 'good buy'], axis = 1)
lbls = data.drop(['index', 'area', 'bathrooms', 'price', 'sq_price'], axis = 1)

features = features[0:20]
lbls = lbls[0:20]

print(features)
print(lbls)
n_examples = len(lbls)

# Model

# Hyper parameters

epochs = 100
learning_rate = 0.1
batch_size = 1

input_data = tf.placeholder('float', [None, 4])
labels = tf.placeholder('float', [None, 1])

weights = {
            'hl1': tf.Variable(tf.random_normal([4, 10])),
            'hl2': tf.Variable(tf.random_normal([10, 10])),
            'hl3': tf.Variable(tf.random_normal([10, 4])),
            'ol': tf.Variable(tf.random_normal([4, 1]))
            }

biases = {
            'hl1': tf.Variable(tf.random_normal([10])),
            'hl2': tf.Variable(tf.random_normal([10])),
            'hl3': tf.Variable(tf.random_normal([4])),
            'ol': tf.Variable(tf.random_normal([1]))
            }

hl1 = tf.nn.relu(tf.add(tf.matmul(input_data, weights['hl1']), biases['hl1']))
hl2 = tf.nn.relu(tf.add(tf.matmul(hl1, weights['hl2']), biases['hl2']))
hl3 = tf.nn.relu(tf.add(tf.matmul(hl2, weights['hl3']), biases['hl3']))
ol = tf.nn.sigmoid(tf.add(tf.matmul(hl3, weights['ol']), biases['ol']))

loss = tf.reduce_mean((labels - ol)**2)
train = tf.train.AdamOptimizer(learning_rate).minimize(loss)

sess = tf.Session()
sess.run(tf.global_variables_initializer())

iterations = int(n_examples/batch_size)


for epoch_no in range(epochs):
    ptr = 0
    for iteration_no in range(iterations):
        epoch_input = features[ptr:ptr+batch_size]
        epoch_label = lbls[ptr: ptr+batch_size]
        ptr = ptr + batch_size
        _, err = sess.run([train, loss], feed_dict={input_data: features, labels: lbls})
    print("Error at epoch ", epoch_no, ": ", err)

print(sess.run(ol, feed_dict={input_data: [[2104, 3, 399900, 190.0665]]}))

This is the dataset:

Features:

    area  bathrooms   price    sq_price
0   2104          3  399900  190.066540
1   1600          3  329900  206.187500
2   2400          3  369000  153.750000
3   1416          2  232000  163.841808
4   3000          4  539900  179.966667
5   1985          4  299900  151.083123
6   1534          3  314900  205.280313
7   1427          3  198999  139.452698
8   1380          3  212000  153.623188
9   1494          3  242500  162.315930
10  1940          4  239999  123.710825
11  2000          3  347000  173.500000
12  1890          3  329999  174.602645
13  4478          5  699900  156.297454
14  1268          3  259900  204.968454
15  2300          4  449900  195.608696
16  1320          2  299900  227.196970
17  1236          3  199900  161.731392
18  2609          4  499998  191.643542
19  3031          4  599000  197.624546

Labels:

    good buy
0        1.0
1        0.0
2        1.0
3        0.0
4        1.0
5        0.0
6        0.0
7        1.0
8        0.0
9        0.0
10       1.0
11       1.0
12       1.0
13       1.0
14       0.0
15       1.0
16       0.0
17       1.0
18       1.0
19       1.0

Any suggestions on how to fix this? I've tried tf.reduce_sum instead of tf.reduce_mean. I've also tried a larger batch_size.


Answer:

There are several things wrong with your code. First, you presumably meant

    epoch_input = features[ptr:ptr+batch_size]
    epoch_label = lbls[ptr: ptr+batch_size]
    ptr = ptr + batch_size
    # _, err = sess.run([train, loss], feed_dict={input_data: features, labels: lbls})
    _, err = sess.run([train, loss], feed_dict={input_data: epoch_input, labels: epoch_label})

so that the loop actually trains on minibatches instead of feeding the full dataset at every step.

Debugging the gradient:

You can always check a few things by adding

loss = tf.Print(loss, [tf.reduce_sum(weights['hl1'])])

This will print the elements of that list ([tf.reduce_sum(weights['hl1'])]). To investigate your problem further, you can inspect the gradients directly instead of using minimize:

grads = tf.reduce_sum(tf.gradients(loss, ol)[0])
sess.run(grads, {input_data: features, labels: lbls})

And finally, the loss function is inappropriate and numerically unstable for classification. With your version, I get:

variables
   Variable:0
   Variable_1:0
   Variable_2:0
   Variable_3:0
   Variable_4:0
   Variable_5:0
   Variable_6:0
   Variable_7:0
I tensorflow/core/kernels/logging_ops.cc:79] [-6.2784553]
-----------------------------------------
name MatMul_grad
gradient [[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]]
value [[-0.59977376 -0.30060738  0.55068201  0.15304407  1.39992142  0.07495346
  -0.87189424 -0.22595075 -0.30094525 -1.2688272 ]
 [-0.44018757  1.08651936 -0.26267499 -0.54463315  0.47019768  0.69873857
   0.56195319  0.20222363  0.38143152 -0.92212462]
 [-0.39977714 -1.07244122  0.41926911  1.4951371  -2.28751612  0.45676312
   0.88010246 -0.88077509 -1.25860023  0.56874037]
 [-0.98260719 -1.30747247 -1.4460088   1.0717535   0.08794415 -0.53184992
  -1.17537284 -0.51598179 -0.15323587  0.91142744]]
-----------------------------------------
name MatMul_1_grad
gradient [[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]]
value [[-0.1170694   0.12174897  0.91696155  0.59427398  0.90844423  0.29010534
  -0.34039831 -0.62824941  0.37833953  0.27777222]
 [-0.34947088  1.09264851  0.27353975  1.31722498 -0.42032316 -2.74952078
  -0.66349608 -0.61844724 -0.82141227  1.21691799]
 [ 0.10453336 -1.68631995  0.45700032 -1.58120835 -1.23378754 -0.05648948
  -1.64761281 -0.57684237 -0.06499017 -0.49623618]
 [ 1.47821534 -0.5329541   0.09209292  1.78089786  1.71149898  0.30547267
   0.39544162  1.00369155  1.0097307  -0.92320329]
 [ 1.27038908 -2.17246103 -0.31276336  0.8945803   0.30964327  1.15329361
   0.9711507  -0.36301252 -0.05652813  0.63399518]
 [-0.30909851 -0.41660413 -0.50603527  0.11735299 -0.26837045  0.16547598
  -0.33875859 -0.46821991  0.25723135 -0.80380815]
 [-0.86255074 -1.11751068  0.01365725  0.66119182  0.48947951  1.6353699
  -0.794447    0.43182942 -0.97692633 -1.62605619]
 [ 1.38552308  0.83679706 -0.87287223  2.59401655 -0.61855     0.38301265
   1.09983373  0.49209142  1.03003716 -1.33537853]
 [ 0.74452382  1.57940936 -0.90974236 -1.2211293  -1.1076287   0.92846316
  -0.46856263 -0.3179535   0.75120807 -0.86442506]
 [ 0.31622764 -0.35965034 -0.02351121 -0.0650174   0.4714573   0.35687482
   1.43354905  0.39608309  0.42744714 -0.37226421]]
-----------------------------------------
name MatMul_2_grad
gradient [[ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]
 [ 0.  0.  0.  0.]]
value [[-1.50904143  0.00228321  1.45787132  0.68312413]
 [-0.16627057  1.31303644  1.16326404  0.72901946]
 [ 0.8004092   0.37329885  0.89361066 -0.19850619]
 [ 1.58354807 -1.05612624  0.69891322 -0.32565734]
 [-1.57602286 -0.41256282  0.69086516 -0.54095054]
 [ 1.72376788 -0.53928965 -0.71574098 -0.94974124]
 [-0.62061429  1.51380932 -0.72585452 -0.07695383]
 [ 0.35537818  1.49691582  0.03931179  0.93435526]
 [ 0.20697887  1.39266443  0.73217523 -0.64737892]
 [ 1.00519872  0.90984046  1.68565321 -0.28157935]]
-----------------------------------------
name MatMul_3_grad
gradient [[ 0.]
 [ 0.]
 [ 0.]
 [ 0.]]
value [[ 0.94082022]
 [ 0.14753926]
 [-0.08765228]
 [ 1.32516992]]
-----------------------------------------
name Add_grad
gradient [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
value [ 1.71239722  1.12632215  0.75409448  0.01951236  0.32135537 -1.46281374
  0.40413955  0.54653352 -0.57894999  0.2746354 ]
-----------------------------------------
name Add_1_grad
gradient [ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
value [ 0.74800217 -0.43517059 -0.77706921  1.46858656  1.09103405 -0.46681881
  0.6126743  -2.27877688  1.48809242 -1.19616997]
-----------------------------------------
name Add_2_grad
gradient [ 0.  0.  0.  0.]
value [-0.12137324 -0.23238407  0.17909229 -0.75496733]
-----------------------------------------
name Add_3_grad
gradient [ 0.]
value [-0.91176724]

As you see, almost all gradients are zero. Why? (A short derivation follows the list.)

  • by definition, (labels - ol) lies in (-1, 1)
  • its squared value is therefore smaller than one
  • the derivative of the sigmoid s(x) is s'(x) = s(x)*(1-s(x)), which is at most 1/4, and the gradients are multiplied by this value, which is again much smaller than one
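
Concretely, chaining the squared loss through the sigmoid (writing s for the sigmoid and y for the label) gives

d/dx (y - s(x))^2 = -2 * (y - s(x)) * s(x) * (1 - s(x))

Since |y - s(x)| < 1 and s(x)*(1 - s(x)) <= 1/4, this whole factor has magnitude below 1/2 and is typically far smaller, so almost no signal reaches the layers below.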

But after using sparse_softmax_cross_entropy_with_logits, which is numerically stable and operates in the log domain, I get:

variables
   Variable:0
   Variable_1:0
   Variable_2:0
   Variable_3:0
   Variable_4:0
   Variable_5:0
   Variable_6:0
   Variable_7:0
-----------------------------------------
name MatMul_grad
gradient [[ -1.42780918e-05  -1.96137808e-05  -2.44040220e-05  -2.25691911e-05
    0.00000000e+00   2.95208647e-05   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [ -2.54181440e-08  -3.49168410e-08  -4.34445262e-08  -4.01781257e-08
    0.00000000e+00   5.25536308e-08   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [ -2.45539122e-03  -3.37296468e-03  -4.19673882e-03  -3.88120394e-03
    0.00000000e+00   5.07667707e-03   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [ -1.42123906e-06  -1.95235293e-06  -2.42917258e-06  -2.24653377e-06
    0.00000000e+00   2.93850212e-06   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00]]
value [[ 0.43133125 -0.40009859 -0.08456381  0.59587955  0.57171088 -0.9824872
   1.18876612  0.9704771   0.74798232  0.15660612]
 [-1.18380785  0.22617982 -1.15734088 -0.50478351  1.43819618  1.55950046
  -1.1510663  -0.88835335  0.58378232  0.56860197]
 [ 0.29826403  0.02192715  0.62225986  2.47716165 -0.9223454   1.70159853
  -1.03968358 -0.26019615 -0.33808291 -0.30873826]
 [ 0.59774327 -1.28855145 -0.43420359 -0.4413566  -0.19220066  0.96984953
  -0.04922202  0.32994318 -1.05539823 -0.80112725]]
-----------------------------------------
name MatMul_1_grad
gradient [[  0.00000000e+00   1.15650124e-03   0.00000000e+00   0.00000000e+00
    6.59449317e-04  -1.09400018e-03   0.00000000e+00  -4.02117817e-04
    5.44495881e-04  -8.90314346e-04]
 [  0.00000000e+00   7.24206184e-05   0.00000000e+00   0.00000000e+00
    4.12950030e-05  -6.85067716e-05   0.00000000e+00  -2.51807924e-05
    3.40965707e-05  -5.57518724e-05]
 [  0.00000000e+00   2.38713808e-03   0.00000000e+00   0.00000000e+00
    1.36117137e-03  -2.25812919e-03   0.00000000e+00  -8.30012548e-04
    1.12389564e-03  -1.83770037e-03]
 [  0.00000000e+00   9.52679198e-03   0.00000000e+00   0.00000000e+00
    5.43227792e-03  -9.01193265e-03   0.00000000e+00  -3.31248436e-03
    4.48533799e-03  -7.33405072e-03]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [  0.00000000e+00   6.51591457e-03   0.00000000e+00   0.00000000e+00
    3.71544389e-03  -6.16377220e-03   0.00000000e+00  -2.26559630e-03
    3.06777749e-03  -5.01617463e-03]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00]
 [  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00]]
value [[ 0.38902158 -2.14370036 -1.02228141 -0.6492967   1.87193418 -0.06453216
   1.0013988  -1.26857054  0.59826601  0.45045251]
 [ 0.51465249 -1.09108925 -0.21368918 -0.49310678 -0.87893176 -0.07944249
  -0.15810326  1.65703297  1.01812947 -0.95572269]
 [-1.76351583 -1.46950841  1.43533802  2.15617752  1.30682683  0.77409673
  -1.50309181  0.81978178  0.6672287  -0.434971  ]
 [-0.7291944   2.16516733 -1.39850736 -1.06059277  0.40035763  1.23335707
  -0.03707252  1.88107574  0.09459961  2.11439633]
 [-1.39152992 -1.39924514 -0.35704514 -0.71152836 -2.68857026  0.78129828
  -1.0077033  -1.26149333  0.4403404  -0.10159389]
 [ 0.37354535  0.12654085  0.7632165  -0.76493222  0.68177891 -0.34254205
  -1.11582613  2.60665917  1.53196526 -0.867055  ]
 [ 0.62746197 -0.01072595  3.26629376  1.28371656 -0.88725293  3.55530715
   0.67065352 -0.61927503  1.20604384 -0.87207574]
 [-0.68954837  1.89912283  0.90083456  0.02054735 -0.23425011  0.39949065
  -0.08969283 -0.75943565  1.0924015   0.28920195]
 [-0.64865923 -1.29299021 -0.39945969  0.02289505  1.46024895  0.94282049
  -0.99704605 -1.36124468  0.76788425  0.86770487]
 [ 0.63794595  1.68530416 -0.15548207 -0.22658408 -0.45446202 -0.77308726
  -0.12694608  1.17369819  2.25879693  0.20346723]]
-----------------------------------------
name MatMul_2_grad
gradient [[ 0.          0.          0.          0.        ]
 [-0.02205572  0.          0.00960038  0.        ]
 [ 0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.        ]
 [-0.01932034  0.          0.00840973  0.        ]
 [-0.01617817  0.          0.00704201  0.        ]
 [ 0.          0.          0.          0.        ]
 [-0.05091252  0.          0.02216113  0.        ]
 [-0.0189826   0.          0.00826272  0.        ]
 [-0.01993647  0.          0.00867792  0.        ]]
value [[-0.18724969 -0.0544498  -0.69153035  0.47535184]
 [-0.75444973 -1.33321464 -0.13066645  1.56889391]
 [-0.6458627   1.17859495 -0.75926393  0.30138403]
 [ 1.0069555  -0.69344127  0.49295315  0.54917085]
 [-0.55954564 -1.13277721 -0.37167427 -0.64837182]
 [ 0.93753678  1.12197697  0.63789612  0.52438796]
 [ 0.77543265 -1.241382    1.78230286 -0.6928125 ]
 [ 0.95383584 -2.00331807  1.63409865 -0.36474878]
 [-0.73891008  2.066082   -0.94303596 -0.42322466]
 [ 0.38519588  0.03278512 -0.3487882  -1.50447905]]
-----------------------------------------
name MatMul_3_grad
gradient [[ 0.08460998]
 [ 0.        ]
 [ 0.16564058]
 [ 0.        ]]
value [[-0.35376808]
 [-0.07330427]
 [ 0.15398768]
 [-0.06484076]]
-----------------------------------------
name Add_grad
gradient [ -8.22783885e-09  -1.13025616e-08  -1.40629695e-08  -1.30056375e-08
   0.00000000e+00   1.70115797e-08   0.00000000e+00   0.00000000e+00
   0.00000000e+00   0.00000000e+00]
value [-1.00038147 -0.56519473  0.59372097 -1.1646167  -0.16213787 -0.69313556
  0.62788707  1.03768504  0.57876503 -0.5201084 ]
-----------------------------------------
name Add_1_grad
gradient [  0.00000000e+00   1.28705375e-08   0.00000000e+00   0.00000000e+00
   7.33891703e-09  -1.21749730e-08   0.00000000e+00  -4.47511184e-09
   6.05961770e-09  -9.90818183e-09]
value [ 0.02854451 -1.46039021 -0.03916361  0.40116394  0.16030532  0.88267213
 -0.46328214  0.18927227 -1.7536788  -0.46590349]
-----------------------------------------
name Add_2_grad
gradient [ -1.84504412e-08   0.00000000e+00   8.03108247e-09   0.00000000e+00]
value [ 0.94534302 -0.9080081  -1.86719894 -1.31547296]
-----------------------------------------
name Add_3_grad
gradient [ 0.29727879 -0.29727876]
value [ 0.07999782 -0.75647992]

This time the gradients, while very small, are non-zero. The code for reproducing this is:

import numpy as np
import tensorflow as tf

features = [
[2104, 3, 399900, 190.066540],
[1600, 3, 329900, 206.187500],
[2400, 3, 369000, 153.750000],
[1416, 2, 232000, 163.841808],
[3000, 4, 539900, 179.966667],
[1985, 4, 299900, 151.083123],
[1534, 3, 314900, 205.280313],
[1427, 3, 198999, 139.452698],
[1380, 3, 212000, 153.623188],
[1494, 3, 242500, 162.315930],
[1940, 4, 239999, 123.710825],
[2000, 3, 347000, 173.500000],
[1890, 3, 329999, 174.602645],
[4478, 5, 699900, 156.297454],
[1268, 3, 259900, 204.968454],
[2300, 4, 449900, 195.608696],
[1320, 2, 299900, 227.196970],
[1236, 3, 199900, 161.731392],
[2609, 4, 499998, 191.643542],
[3031, 4, 599000, 197.624546]]

lbls = [1,0,1,0,1,0,0,1,0,0,1,1,1,1,0,1,0,1,1,1]
features = np.array(features, dtype=np.float32)
lbls = np.array(lbls, dtype=np.int32)

n_examples = len(lbls)
epochs = 100
learning_rate = 0.1
batch_size = 1

input_data = tf.placeholder('float', [None, 4])
labels = tf.placeholder('int32', [None])

weights = {
            'hl1': tf.Variable(tf.random_normal([4, 10])),
            'hl2': tf.Variable(tf.random_normal([10, 10])),
            'hl3': tf.Variable(tf.random_normal([10, 4])),
            'ol': tf.Variable(tf.random_normal([4, 1]))
            }

biases = {
            'hl1': tf.Variable(tf.random_normal([10])),
            'hl2': tf.Variable(tf.random_normal([10])),
            'hl3': tf.Variable(tf.random_normal([4])),
            # 'ol': tf.Variable(tf.random_normal([1])),
            'ol': tf.Variable(tf.random_normal([2]))
            }

hl1 = tf.nn.relu(tf.add(tf.matmul(input_data, weights['hl1']), biases['hl1']))
hl2 = tf.nn.relu(tf.add(tf.matmul(hl1, weights['hl2']), biases['hl2']))
hl3 = tf.nn.relu(tf.add(tf.matmul(hl2, weights['hl3']), biases['hl3']))
# ol = tf.nn.sigmoid(tf.add(tf.matmul(hl3, weights['ol']), biases['ol']))
logits = tf.add(tf.matmul(hl3, weights['ol']), biases['ol'])

# ol = tf.Print(ol, [tf.reduce_sum(weights['hl1'])])
# loss = tf.reduce_mean((labels - ol)**2)
cost = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=labels)
# loss = tf.reduce_mean((labels - ol)**2)
loss = tf.reduce_mean(cost)
optimizer = tf.train.AdamOptimizer(learning_rate)

iterations = int(n_examples/batch_size)

def debug_minimize(optimizer, loss, sess):
    from tensorflow.python.ops import variables
    from tensorflow.python.framework import ops
    # get all trainable variables
    var_list = (variables.trainable_variables() + ops.get_collection(ops.GraphKeys.TRAINABLE_RESOURCE_VARIABLES))
    print('variables')
    for v in var_list:
        print('  ', v.name)
    # get all gradients
    grads_and_vars = optimizer.compute_gradients(loss)
    train_op = optimizer.apply_gradients(grads_and_vars)

    zipped_val = sess.run(grads_and_vars, {input_data: features, labels: lbls})

    for rsl, tensor in zip(zipped_val, grads_and_vars):
        print('-----------------------------------------')
        print('name', tensor[0].name.replace('/tuple/control_dependency_1:0', '').replace('gradients/', ''))
        print('gradient', rsl[0])
        print('value', rsl[1])
    return train_op

sess = tf.Session()
sess.run(tf.global_variables_initializer())
debug_minimize(optimizer, loss, sess)

Question:

This is a discriminative network I'm training so I can use it in a generative network. I trained it on a dataset with 2 features and it does binary classification: 1 = meditating, 0 = not meditating. (The dataset is from one of Siraj Raval's videos.)

For some reason, the output layer (ol) always outputs [1] for every test case.

My dataset: https://drive.google.com/open?id=0B5DaSp-aTU-KSmZtVmFoc0hRa3c

import pandas as pd
import tensorflow as tf

data = pd.read_csv("E:/workspace_py/datasets/simdata/linear_data_train.csv")
data_f = data.drop("lbl", axis = 1)
data_l = data.drop(["f1", "f2"], axis = 1)

learning_rate = 0.01
batch_size = 1
n_epochs = 30
n_examples = 999 # This is highly unsatisfying >:3
n_iteration = int(n_examples/batch_size)


features = tf.placeholder('float', [None, 2], name='features_placeholder')
labels = tf.placeholder('float', [None, 1], name = 'labels_placeholder')

weights = {
            'ol': tf.Variable(tf.random_normal([2, 1], stddev= -12), name = 'w_ol')
}

biases = {
            'ol': tf.Variable(tf.random_normal([1], stddev=-12), name = 'b_ol')
}

ol = tf.nn.sigmoid(tf.add(tf.matmul(features, weights['ol']), biases['ol']), name = 'ol')

loss = -tf.reduce_sum(labels*tf.log(ol), name = 'loss') # cross entropy
train = tf.train.AdamOptimizer(learning_rate).minimize(loss)

sess = tf.Session()
sess.run(tf.global_variables_initializer())

for epoch in range(n_epochs):
    ptr = 0
    for iteration in range(n_iteration):
        epoch_x = data_f[ptr: ptr + batch_size]
        epoch_y = data_l[ptr: ptr + batch_size]
        ptr = ptr + batch_size

        _, err = sess.run([train, loss], feed_dict={features: epoch_x, labels:epoch_y})
    print("Loss @ epoch ", epoch, " = ", err)

print("Testing...\n")

data = pd.read_csv("E:/workspace_py/datasets/simdata/linear_data_eval.csv")
test_data_l = data.drop(["f1", "f2"], axis = 1)
test_data_f = data.drop("lbl", axis = 1)
#vvvHERE    
print(sess.run(ol, feed_dict={features: test_data_f})) #<<<HERE
#^^^HERE
saver = tf.train.Saver()
saver.save(sess, save_path="E:/workspace_py/saved_models/meditation_disciminative_model.ckpt")
sess.close()

output:

2017-10-11 00:49:47.453721: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-11 00:49:47.454212: W C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\platform\cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-11 00:49:49.608862: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:955] Found device 0 with properties: 
name: GeForce GTX 960M
major: 5 minor: 0 memoryClockRate (GHz) 1.176
pciBusID 0000:01:00.0
Total memory: 4.00GiB
Free memory: 3.35GiB
2017-10-11 00:49:49.609281: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:976] DMA: 0 
2017-10-11 00:49:49.609464: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:986] 0:   Y 
2017-10-11 00:49:49.609659: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 960M, pci bus id: 0000:01:00.0)
Loss @ epoch  0  =  0.000135789
Loss @ epoch  1  =  4.16049e-05
Loss @ epoch  2  =  1.84776e-05
Loss @ epoch  3  =  9.41758e-06
Loss @ epoch  4  =  5.24522e-06
Loss @ epoch  5  =  2.98024e-06
Loss @ epoch  6  =  1.66893e-06
Loss @ epoch  7  =  1.07288e-06
Loss @ epoch  8  =  5.96047e-07
Loss @ epoch  9  =  3.57628e-07
Loss @ epoch  10  =  2.38419e-07
Loss @ epoch  11  =  1.19209e-07
Loss @ epoch  12  =  1.19209e-07
Loss @ epoch  13  =  1.19209e-07
Loss @ epoch  14  =  -0.0
Loss @ epoch  15  =  -0.0
Loss @ epoch  16  =  -0.0
Loss @ epoch  17  =  -0.0
Loss @ epoch  18  =  -0.0
Loss @ epoch  19  =  -0.0
Loss @ epoch  20  =  -0.0
Loss @ epoch  21  =  -0.0
Loss @ epoch  22  =  -0.0
Loss @ epoch  23  =  -0.0
Loss @ epoch  24  =  -0.0
Loss @ epoch  25  =  -0.0
Loss @ epoch  26  =  -0.0
Loss @ epoch  27  =  -0.0
Loss @ epoch  28  =  -0.0
Loss @ epoch  29  =  -0.0
Testing...

[[ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]
 [ 1.]]
Saving model...
[Finished in 57.9s]

Answer:

Main problem

First of all, this is not a valid cross-entropy loss. The equation you are using works only with two or more outputs. With a single sigmoid output you have to do

-tf.reduce_sum(labels*tf.log(ol) + (1-labels)*tf.log(1-ol), name = 'loss')

otherwise the optimal solution is to always answer "1" (which is exactly what is happening right now).

Why?

Note that labels is only ever 0 or 1, and your whole loss is a multiplication of the label and the logarithm of the prediction. Consequently, when the true label is 0, your loss is 0 no matter the prediction, since 0 * log(x) = 0 for any x (as long as log(x) is defined). The model is therefore only penalised for failing to predict "1" when it should, and so it learns to output 1 all the time.

Some other odd things
  1. You are passing a negative stddev to the normal distribution, which you should not do (unless this is some undocumented feature of random_normal, but according to the docs it accepts a single positive float, and you should pass a small number there).

  2. Computing cross entropy this way (naively) is not numerically stable; take a look at tf.nn.sigmoid_cross_entropy_with_logits (see the sketch after this list).

  3. You are not shuffling your dataset, so you always process the data in the same order, which can have bad consequences (periodic increases in the loss, harder convergence, or no convergence at all).
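
A minimal sketch combining points 1 and 2, reusing the placeholder and variable names from the question (the stddev of 0.1 is just an illustrative choice, not a tuned value):

import tensorflow as tf

features = tf.placeholder('float', [None, 2], name='features_placeholder')
labels = tf.placeholder('float', [None, 1], name='labels_placeholder')

# Point 1: a small positive stddev instead of -12
weights = {'ol': tf.Variable(tf.random_normal([2, 1], stddev=0.1), name='w_ol')}
biases = {'ol': tf.Variable(tf.random_normal([1], stddev=0.1), name='b_ol')}

# Point 2: keep the pre-sigmoid logits and let TensorFlow compute the
# cross entropy in a numerically stable way
logits = tf.add(tf.matmul(features, weights['ol']), biases['ol'])
ol = tf.nn.sigmoid(logits, name='ol')  # used only for predictions

loss = tf.reduce_mean(
    tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits),
    name='loss')
train = tf.train.AdamOptimizer(0.01).minimize(loss)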

Question:

I've just used the ModelCheckpoint callback for the first time to save the best model (save_best_only=True) and wanted to test its performance. When the model was saved, it reported a val_acc of 83.3% before saving. I loaded the model and ran evaluate_generator on validation_generator, but the result for val_acc was 0.639. Confused, I ran it again and got 0.654, then 0.647, 0.744, and so on. I've tested the same configuration on my PC (no GPUs) and it consistently shows the same results (with maybe small rounding errors sometimes).

  1. Why do the results differ between evaluate_generator executions only on the GPU?
  2. Why is the model val_acc different from the one reported?

I am using TensorFlow's implementation of Keras.

model.compile(loss='categorical_crossentropy',
              optimizer=optimizers.SGD(lr=1e-4, momentum=0.9),
              metrics=['accuracy'])
checkpointer = ModelCheckpoint(filepath='/tmp/weights.hdf5', monitor = "val_acc", verbose=1, save_best_only=True)
# prepare data augmentation configuration
train_datagen = ImageDataGenerator(
    rescale = 1./ 255,
    shear_range = 0.2,
    zoom_range = 0.2,
    horizontal_flip = True)
test_datagen = ImageDataGenerator(rescale=1. / 255)
train_generator = train_datagen.flow_from_directory(
    train_data_dir,
    target_size = (img_height, img_width),
    batch_size = batch_size)
validation_generator = test_datagen.flow_from_directory(
    validation_data_dir,
    target_size = (img_height, img_width),
    batch_size = batch_size)
# fine-tune the model
model.fit_generator(
    train_generator,
    steps_per_epoch = math.ceil(train_samples/batch_size),
    epochs=100,
    workers = 120,
    validation_data=validation_generator,
    validation_steps=math.ceil(val_samples/batch_size),
    callbacks=[checkpointer])
model.load_weights(filepath='/tmp/weights.hdf5')
model.predict_generator(validation_generator, steps = math.ceil(val_samples/batch_size) )
temp_model = load_model('/tmp/weights.hdf5')
temp_model.evaluate_generator(validation_generator, steps = math.ceil(val_samples/batch_size), workers = 120)
>>> [2.1996076788221086, 0.17857142857142858]
temp_model.evaluate_generator(validation_generator, steps = math.ceil(val_samples/batch_size), workers = 120)
>>> [2.2661823204585483, 0.25]

Answer:

It is because you only save the model weights, which means you are not saving the optimizer state; that explains the difference in accuracy when you reload the model. If you add save_weights_only=False when you create the ModelCheckpoint, the issue is resolved. Also, when you reload the model, use Keras's load_model function; otherwise you will still only load the weights:

checkpointer = ModelCheckpoint(filepath='/tmp/full_model.hdf5', monitor = "val_acc", verbose=1, save_best_only=True, save_weights_only=False)

#reload model
from keras.models import load_model
model = load_model('/tmp/full_model.hdf5')

Question:

I am trying to create an end-to-end trainable offline English handwriting recognition model (without segmenting individual characters). The word dataset from the IAM Handwriting Database is being used to train the model.

The model is training very slowly and GPU utilization hovers around just 30%. I am also getting a PoolAllocator warning:

PoolAllocator: After 89632424 get requests, put_count=89632402 evicted_count=175000 eviction_rate=0.00195242 and unsatisfied allocation rate=0.00195474

I tried changing the batch size, but to no avail. The data is fed through a TFRecords file (could that be causing an issue?). I am new to TensorFlow, so I could have made some naive error. The code used:

import time

import numpy as np
import tensorflow as tf

class Config():
    im_height = 28
    num_epochs = 25
    batch_size = 1

    # RNN
    rnn_num_hidden = 256

    # Number of classes
    num_classes = 81

    tfrecord_filename = 'sequence_data_lengths_3_4.tfrecords'

config = Config()

class CRNN(object):

    def __init__(self, config):

        self.config = config
        tf.reset_default_graph()

    def read_and_decode(self, filename_queue):

        reader = tf.TFRecordReader()

        _, serialized_example = reader.read(filename_queue)

        # Define how to parse the example
        context_features = {
            'length': tf.FixedLenFeature([], dtype=tf.int64),
            'out_length': tf.FixedLenFeature([], dtype=tf.int64)
        }
        sequence_features = {
            'token': tf.FixedLenSequenceFeature([], dtype=tf.float32),
            'labels': tf.FixedLenSequenceFeature([], dtype=tf.int64)
        }

        context_parsed, sequence_parsed = tf.parse_single_sequence_example(
            serialized=serialized_example,
            context_features=context_features,
            sequence_features=sequence_features)

        image = sequence_parsed['token']
        label = tf.cast(sequence_parsed['labels'], tf.int32)
        length = tf.cast(context_parsed['length'], tf.int32)
        lab_length = tf.cast(context_parsed['out_length'], tf.int32)

        image_shape = tf.cast(tf.stack([self.config.im_height,
                                        length/self.config.im_height]), tf.int32)
        image = tf.reshape(image, image_shape)

        # Updating length to represent image width
        length = tf.shape(image)[1]

        # Batch the variable length tensor with dynamic padding
        self.images, self.labels, self.lengths, self.lab_lengths = tf.train.batch(
            tensors=[image, label, length, lab_length],
            batch_size=self.config.batch_size, dynamic_pad=True)

    def net(self):

        batch_lab_length = tf.reduce_max(self.lab_lengths)
        batch_im_length = tf.reduce_max(self.lengths)

        # Reshape to time major
        sequences = tf.reshape(self.images, [batch_im_length, self.config.batch_size,
                                             self.config.im_height])

        # Feed sequences into RNN
        with tf.name_scope('RNN'):
            self.cell_fw = tf.nn.rnn_cell.LSTMCell(num_units=self.config.rnn_num_hidden,
                                                   state_is_tuple=True)
            self.cell_bw = tf.nn.rnn_cell.LSTMCell(num_units=self.config.rnn_num_hidden,
                                                   state_is_tuple=True)
            self.output, self.state = tf.nn.bidirectional_dynamic_rnn(
                cell_fw=self.cell_fw,
                cell_bw=self.cell_bw,
                inputs=sequences,
                dtype=tf.float32,
                sequence_length=self.lengths,
                time_major=True,
                scope='RNN'
            )

            # Reshaping to apply the same weights over the timesteps
            self.output = tf.reshape(self.output, [-1, self.config.rnn_num_hidden])

            self.out_W = tf.Variable(tf.truncated_normal([self.config.rnn_num_hidden,
                                                          self.config.num_classes],
                                                         stddev=0.1), name='out_W')
            self.out_b = tf.Variable(tf.constant(0., shape=[self.config.num_classes]), name='out_b')

            # Doing the affine projection
            logits = tf.matmul(self.output, self.out_W) + self.out_b

        # Reshaping back to the original shape
        logits = tf.reshape(logits, [self.config.batch_size, -1, self.config.num_classes])

        # Time major
        logits = tf.transpose(logits, (1, 0, 2))

        # Training computation

        # Prepare sparse tensor for CTC loss
        labs = tf.reshape(self.labels, (self.config.batch_size, batch_lab_length))
        sparse_tensor_indices = tf.where(tf.less(tf.cast(0, tf.int32), labs))

        labels_vals = tf.reshape(self.labels, [batch_lab_length*self.config.batch_size])
        mask = tf.cast(tf.sign(labels_vals), dtype=tf.bool)
        labels_vals = tf.boolean_mask(labels_vals, mask)

        labels_sparse = tf.SparseTensor(indices=sparse_tensor_indices, values=labels_vals,
                                        dense_shape=[self.config.batch_size,
                                                     tf.cast(batch_lab_length, tf.int64)])
        self.loss = tf.nn.ctc_loss(labels_sparse, logits, sequence_length=self.lab_lengths,
                                   preprocess_collapse_repeated=False, ctc_merge_repeated=False,
                                   time_major=True)
        self.cost = tf.reduce_mean(self.loss)

        # Optimizer
        self.optimizer = tf.train.MomentumOptimizer(learning_rate=0.01,
                                                    momentum=0.9, use_nesterov=True).minimize(self.cost)

        # Predictions for the training, validation, and test data.
        self.train_prediction = tf.nn.ctc_beam_search_decoder(logits,
                                                              sequence_length=self.lab_lengths)


    def train(self):
        num_steps = int((self.config.num_epochs*self.config.sample_size)/self.config.batch_size)
        tf.reset_default_graph()

        filename_queue = tf.train.string_input_producer(
            [self.config.tfrecord_filename], num_epochs=self.config.num_epochs)

        self.read_and_decode(filename_queue)
        self.net()

        # The op for initializing the variables.
        init_op = tf.group(tf.global_variables_initializer(),
                           tf.local_variables_initializer())
        saver = tf.train.Saver()

        with tf.Session() as sess:

            training_summary = tf.summary.scalar("training_cost", self.cost)
            writer = tf.summary.FileWriter("./TensorBoard/graph", sess.graph)

            sess.run(init_op)
            print('Initialized')
            coord = tf.train.Coordinator()
            threads = tf.train.start_queue_runners(coord=coord)

            start = time.time()
            steps_time = start

            epoch = 1
            for step in range(num_steps):
                _, c, predictions, actual_labels, train_summ = sess.run([self.optimizer, self.cost,
                                                                         self.train_prediction,
                                                                         self.labels, training_summary])
                writer.add_summary(train_summ, step)

                if (step % 10000 == 0):
                    preds = np.zeros((predictions[0][0].dense_shape))
                    i = 0
                    for idx in predictions[0][0].indices:
                        preds[idx[0]][idx[1]] = predictions[0][0].values[i]
                        i += 1
                    print(time.time() - steps_time)
                    steps_time = time.time()
                    print('Minibatch cost at step %d: %f' % (step, c))
                    print('Label =', [''.join([char_map_inv[j] for j in i]) for i in actual_labels],
                          'Prediction =', [''.join([char_map_inv[j] for j in i]) for i in preds])

                if (step != 0 and step % int(self.config.sample_size/self.config.batch_size) == 0):
                    print('Epoch', epoch, 'Completed')
                    epoch += 1

                last_step = step
            saver.save(sess, "model_BLSTM", global_step=last_step)
            writer.close()
            print(time.time() - start)

model = CRNN(config=config)
model.train()

Answer:

The issue was that TensorFlow's CTC implementation does not support the GPU (see https://github.com/tensorflow/tensorflow/issues/2146).

Using Baidu's CTC GPU implementation (https://github.com/baidu-research/warp-ctc) increased GPU utilization and sped up training.
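
For reference, a rough sketch of what the swap can look like with the warpctc_tensorflow binding from that repository. The function name and argument order here follow its README and are an assumption, not tested against the question's exact setup; verify them against the binding version you install:

# Hypothetical replacement for the tf.nn.ctc_loss call in net().
# warp-ctc takes time-major activations and the labels as a flat
# dense vector together with their per-example lengths.
import warpctc_tensorflow

self.loss = warpctc_tensorflow.ctc(activations=logits,
                                   flat_labels=labels_vals,
                                   label_lengths=self.lab_lengths,
                                   input_lengths=self.lengths)
self.cost = tf.reduce_mean(self.loss)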