Hot questions for using neural networks with ImageNet

Question:

I'm using the pretrained ImageNet model provided with the caffe (CNN) library ('bvlc_reference_caffenet.caffemodel'). I can output a 1000-dim vector of object scores for any image using this model. However, I don't know what the actual object categories are. Did someone find a file where the corresponding object categories are listed?


Answer:

You should look for the file 'synset_words.txt': it has 1000 lines, and each line provides a description of a different class.

For more information on how to get this file (and some others you might need) you can read this.


If you want all the labels to be ready-for-use in Matlab, you can read the txt file into a cell array (a cell per class):

C = textread('/path/to/synset_words.txt','%s','delimiter','\n');
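
If you are working in Python instead, a minimal sketch of the same idea (the file path and the 'scores' array are placeholders, not part of the original answer) would be:

import numpy as np

# Read the 1000 class descriptions, one per line.
with open('/path/to/synset_words.txt') as f:
    labels = [line.strip() for line in f]

scores = np.random.rand(1000)            # stand-in for the network's 1000-dim output
print(labels[int(np.argmax(scores))])    # description of the highest-scoring class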

Question:

After several months working with caffe, I've been able to train my own models successfully. For example, beyond my own models, I've been able to train ImageNet with 1000 classes.

In my project, I'm now trying to extract the region of my class of interest. I've compiled and run the Fast R-CNN demo and it works fine, but the sample models contain only 20 classes and I'd like to have more classes, for example all of them.

I've already downloaded the bounding boxes of ImageNet, with the real images.

Now I've gone blank: I can't figure out the next steps, and there's no documentation on how to do it. The only thing I've found is how to train the INRIA person model, where they provide the dataset + annotations + a python script.

My questions are:

  • Is there maybe any tutorial or guide that I've missed?
  • Is there already a model trained with 1000 classes able to classify images and extract the bounding boxes?

Thank you very much in advance.

Regards.

Rafael.


Answer:

Ross Girshick has done a lot of work on object detection. You can learn a lot from his detailed GitHub repository on Fast R-CNN: you should be able to find a caffe branch there, with a demo. I did not use it myself, but it seems very comprehensible.

Another direction you might find interesting is LSDA: using weak supervision to train object detection for many classes.

BTW, have you looked into faster-rcnn?

Question:

ImageNet images are all different sizes, but neural networks need a fixed size input.

One solution is to take a crop that is as large as will fit in the image, centered on the image's center point. This works but has some drawbacks. Oftentimes, important parts of the object of interest are cut out, and there are even cases where the correct object is completely missing while an object belonging to a different class is visible, meaning your model will be trained on the wrong label for that image.

Another solution would be to use the entire image and zero pad it to where each image has the same dimensions. This seems like it would interfere with the training process though, and the model would learn to look for vertical/horizontal patches of black near the edge of images.

What is commonly done?


Answer:

There are several approaches:

  • Multiple crops. For example, AlexNet used 5 different crops: center, top-left, top-right, bottom-left, bottom-right (plus their horizontal reflections).
  • Random crops. Just take a number of random crops from the image and hope that the neural network will not be biased.
  • Resize and deform. Resize the image to a fixed size without considering the aspect ratio. This will deform the image contents, but you can be sure that no content is cut off.
  • Variable-sized inputs. Do not crop; train the network on variable-sized images, using something like Spatial Pyramid Pooling to extract a fixed-size feature vector that can be used with the fully connected layers.

You could take a look at how the latest ImageNet networks are trained, like VGG and ResNet. They usually describe this step in detail.
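
As an illustration, here is a minimal sketch of the two simplest options, a center crop and a resize-with-deformation (using PIL and NumPy; the file name is an assumption, and this is not taken from any particular paper):

from PIL import Image
import numpy as np

def center_crop(img, size=224):
    # Crop the largest centered square, then resize it to the target size.
    w, h = img.size
    s = min(w, h)
    left, top = (w - s) // 2, (h - s) // 2
    return img.crop((left, top, left + s, top + s)).resize((size, size))

def resize_deform(img, size=224):
    # Ignore the aspect ratio: nothing is cut off, but the content is distorted.
    return img.resize((size, size))

img = Image.open('example.jpg')          # hypothetical input image
x = np.asarray(center_crop(img))         # fixed 224x224x3 array for the network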

Question:

def preprocess_input(x):
    x /= 255.
    x -= 0.5
    x *= 2.
    return x

I am using the Keras inception_v3 ImageNet pretrained model (inception_v3.py) to fine-tune on my own dataset. When I want to subtract the ImageNet mean value [123.68, 116.779, 103.939] and reverse the axes from RGB to BGR as we often do, I find that the author provided a preprocess_input() function at the end. I am confused about this.

Should I use the provided preprocess_input() function, or subtract the mean value and reverse the axes as usual? Thanks a lot.


Answer:

Actually, in the original Inception paper the authors mention the function you provided as the data preprocessor (one which zero-centers all channels and rescales them to the [-1, 1] interval). As the InceptionV3 paper provides no new data transformation, I think you may assume that you should use the following function:

def preprocess_input(x):
    x /= 255.
    x -= 0.5
    x *= 2.
    return x
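
If you do use it, a minimal sketch of the full preprocessing step could look like this (the image file name, target size, and use of the stock InceptionV3 weights are assumptions for illustration):

import numpy as np
from keras.preprocessing import image
from keras.applications.inception_v3 import InceptionV3

# Hypothetical example image; InceptionV3 expects 299x299 RGB inputs.
img = image.load_img('my_image.jpg', target_size=(299, 299))
x = image.img_to_array(img)            # float32 array, shape (299, 299, 3)
x = np.expand_dims(x, axis=0)          # add the batch dimension
x = preprocess_input(x)                # rescale to [-1, 1], no mean subtraction

model = InceptionV3(weights='imagenet')
preds = model.predict(x)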

Question:

I have already implemented image captioning using VGG as the image classification model. I have read about YOLO being a fast image classification and detection model, and it is primarily used for multiple object detection. However, for image captioning I just want the classes, not the bounding boxes.


Answer:

I completely agree with what Parag S. Chandakkar mentioned in his answer. YOLO and R-CNN, the two most used object detection models, are slow if used just for classification compared to VGG-16 and other object classification networks. However, in support of YOLO, I would mention that you can create a single model for image captioning and object detection.

YOLO generates a vector of length 1470 (for the original 20-class model: 49*20 + 98 + 392).

  • Tune YOLO to the number of classes supported by your dataset, i.e. make YOLO generate a vector of length 49*(number of classes in your dataset) + 98 + 392.

  • Use this vector to generate the bounding boxes.

  • Further tune this vector to generate a vector of size equal to the number of classes. You can use a dense layer for the same.
  • Pass this vector to your language model for generating captions.

Thus to sum up, you can generate the bounding boxes first and then further tune that vector to generate captions.
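
As a rough sketch of the last two steps (Keras layers; the 7x7 grid and 2 boxes of the original YOLO are assumed, and the input stands in for the output of your re-tuned YOLO network), the reduction to a class vector could look like this:

from keras.layers import Input, Dense
from keras.models import Model

num_classes = 20                          # replace with the number of classes in your dataset
yolo_dim = 49 * num_classes + 98 + 392    # 1470 for the original 20 classes

# 'yolo_inputs' stands in for the (re-tuned) YOLO output vector.
yolo_inputs = Input(shape=(yolo_dim,))

# Dense layer reducing the detection vector to a plain class vector,
# which is then passed on to the captioning/language model.
class_vector = Dense(num_classes, activation='softmax')(yolo_inputs)

class_branch = Model(inputs=yolo_inputs, outputs=class_vector)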

Question:

I understand that bigger batch size gives more accurate results from here. But I'm not sure which batch size is "good enough". I guess bigger batch sizes will always be better but it seems like at a certain point you will only get a slight improvement in accuracy for every increase in batch size. Is there a heuristic or a rule of thumb on finding the optimal batch size?

Currently, I have 40000 training data and 10000 test data. My batch size is the default which is 256 for training and 50 for the test. I am using NVIDIA GTX 1080 which has 8Gigs of memory.


Answer:

Test-time batch size does not affect accuracy; you should set it to the largest you can fit into memory so that the validation step takes less time.

As for train-time batch size, you are right that larger batches yield more stable training. However, having larger batches will slow training significantly. Moreover, you will have fewer backprop updates per epoch. So you do not want the batch size to be too large. Using the default values is usually a good strategy.
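
For a concrete feel of that trade-off with your numbers (40000 training images), you can check how the batch size changes the number of updates per epoch:

num_train = 40000
for batch_size in (64, 128, 256, 512):
    updates_per_epoch = num_train // batch_size
    print(batch_size, updates_per_epoch)   # e.g. batch size 256 -> 156 updates per epoch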

Question:

When I try to run Google's Inception model in a loop over a list of images, I get the issue below after about 100 or so images. It seems to be running out of memory. I'm running on a CPU. Has anyone else encountered this issue?

Traceback (most recent call last):
  File "clean_dataset.py", line 33, in <module>
    description, score = inception.run_inference_on_image(f.read())
  File "/Volumes/EXPANSION/research/dcgan-transfer/data/classify_image.py", line 178, in run_inference_on_image
    node_lookup = NodeLookup()
  File "/Volumes/EXPANSION/research/dcgan-transfer/data/classify_image.py", line 83, in __init__
    self.node_lookup = self.load(label_lookup_path, uid_lookup_path)
  File "/Volumes/EXPANSION/research/dcgan-transfer/data/classify_image.py", line 112, in load
    proto_as_ascii = tf.gfile.GFile(label_lookup_path).readlines()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tensorflow/python/lib/io/file_io.py", line 110, in readlines
    self._prereadline_check()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tensorflow/python/lib/io/file_io.py", line 72, in _prereadline_check
    compat.as_bytes(self.__name), 1024 * 512, status)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/contextlib.py", line 24, in __exit__
    self.gen.next()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/tensorflow/python/framework/errors.py", line 463, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors.ResourceExhaustedError: /tmp/imagenet/imagenet_2012_challenge_label_map_proto.pbtxt


real    6m32.403s
user    7m8.210s
sys     1m36.114s

https://github.com/tensorflow/models/tree/master/inception


Answer:

The issue is that you cannot simply import the original 'classify_image.py' (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/models/image/imagenet/classify_image.py) into your own code, especially when you put it into a huge loop to classify thousands of images 'in batch mode'.

Look at the original code here:

with tf.Session() as sess:
  # Some useful tensors:
  # 'softmax:0': A tensor containing the normalized prediction across
  #   1000 labels.
  # 'pool_3:0': A tensor containing the next-to-last layer containing 2048
  #   float description of the image.
  # 'DecodeJpeg/contents:0': A tensor containing a string providing JPEG
  #   encoding of the image.
  # Runs the softmax tensor by feeding the image_data as input to the graph.
  softmax_tensor = sess.graph.get_tensor_by_name('softmax:0')
  predictions = sess.run(softmax_tensor,
                         {'DecodeJpeg/contents:0': image_data})
  predictions = np.squeeze(predictions)

  # Creates node ID --> English string lookup.
  node_lookup = NodeLookup()

  top_k = predictions.argsort()[-FLAGS.num_top_predictions:][::-1]
  for node_id in top_k:
    human_string = node_lookup.id_to_string(node_id)
    score = predictions[node_id]
    print('%s (score = %.5f)' % (human_string, score))

From the above you can see that for each classification task it generates a new instance of the class 'NodeLookup', which loads the following from files:

  • label_lookup="imagenet_2012_challenge_label_map_proto.pbtxt"
  • uid_lookup_path="imagenet_synset_to_human_label_map.txt"

So each instance is really large, and in your code's loop it will generate hundreds of instances of this class, which results in 'tensorflow.python.framework.errors.ResourceExhaustedError'.

What I suggest to get rid of this is to write a new script, modify those classes and functions from 'classify_image.py', and avoid instantiating the NodeLookup class in each loop iteration; just instantiate it once and use it inside the loop. Something like this:

with tf.Session() as sess:
        softmax_tensor = sess.graph.get_tensor_by_name('softmax:0')
        print 'Making classifications:'

        # Creates node ID --> English string lookup.
        node_lookup = NodeLookup(label_lookup_path=self.Model_Save_Path + self.label_lookup,
                                 uid_lookup_path=self.Model_Save_Path + self.uid_lookup_path)

        current_counter = 1
        for (tensor_image, image) in self.tensor_files:
            print 'On ' + str(current_counter)

            predictions = sess.run(softmax_tensor, {'DecodeJpeg/contents:0': tensor_image})
            predictions = np.squeeze(predictions)

            top_k = predictions.argsort()[-int(self.filter_level):][::-1]

            for node_id in top_k:
                human_string = node_lookup.id_to_string(node_id)
                score = predictions[node_id]