Hot questions for Using Neural networks in detection


I generate images of a single coin pasted over a white background of size 200x200. The coin is randomly chosen among 8 euro coin images (one for each coin) and has:

  • random rotation;
  • random size (between fixed bounds);
  • random position (so that the coin is not cropped).

Here are two examples (center markers added): [image: two dataset examples]

I am using Python + Lasagne. I feed the color image into the neural network that has an output layer of 2 linear neurons fully connected, one for x and one for y. The targets associated to the generated coin images are the coordinates (x,y) of the coin center.

I have tried (from Using convolutional neural nets to detect facial keypoints tutorial):

  • Dense layer architecture with various numbers of layers and units (500 max);
  • Convolution architecture (with 2 dense layers before output);
  • Sum or mean of squared difference (MSE) as loss function;
  • Target coordinates in the original range [0,199] or normalized [0,1];
  • Dropout layers between layers, with dropout probability of 0.2.

I always used simple SGD, tuning the learning rate trying to have a nice decreasing error curve.

I found that as I train the network, the error decreases until a point where the output is always the center of the image. It looks like the output is independent of the input. It seems that the network output is the average of the targets I give. This behavior looks like a simple minimization of the error since the positions of the coins are uniformly distributed on the image. This is not the wanted behavior.
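To make the observation above concrete, here is a minimal NumPy sketch (not the original Lasagne code) showing that, if the output ignores the input, the constant prediction with the lowest MSE is the mean of the targets. The uniformly sampled targets and the helper name `mse_of_constant` are assumptions made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical targets: coin centers drawn uniformly over a 200x200 image.
targets = rng.uniform(0, 199, size=(10_000, 2))


def mse_of_constant(guess, targets):
    """Mean squared error when the same (x, y) is predicted for every image."""
    return float(np.mean(np.sum((targets - guess) ** 2, axis=1)))


mean_guess = targets.mean(axis=0)   # close to the image center
corner_guess = np.array([0.0, 0.0])  # an arbitrary other constant

loss_at_mean = mse_of_constant(mean_guess, targets)
loss_at_corner = mse_of_constant(corner_guess, targets)

print("mean of targets:", mean_guess)
print("MSE at mean:", loss_at_mean, "MSE at corner:", loss_at_corner)
```

So a network that has not yet learned anything useful from the pixels can still lower its loss simply by drifting toward the image center, which matches the behavior described.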

I have the feeling that the network is not learning but is just trying to optimize the output coordinates to minimize the mean error against the targets. Am I right? How can I prevent this? I tried removing the bias of the output neurons, because I thought that maybe only the bias was being adjusted while all other parameters were driven to zero, but this didn't work.

Is it possible for a neural network alone to perform well at this task? I have read that one can also train a net for present/not present binary classification and then scan the image to find possible locations of objects. But I just wondered if it was possible just using the forward computation of a neural net.


Question : How can I prevent this [overfitting without improvement to test scores]?

What needs to be done is to re-architect your neural net. A neural net just isn't going to do a good job at predicting an X and Y coordinate. It can, though, create a heat map of where it detects a coin. Said another way, you could have it turn your color picture into a "coin-here" probability map.

Why? Neurons have a good ability to be used to measure probability, not coordinates. Neural nets are not the magic machines they are sold to be but instead really do follow the program laid out by their architecture. You'd have to lay out a pretty fancy architecture to have the neural net first create an internal space representation of where the coins are, then another internal representation of their center of mass, then another to use the center of mass and the original image size to somehow learn to scale the X coordinate, then repeat the whole thing for Y.

Easier, much easier, is to create a convolutional coin detector that converts your color image into a black-and-white image, i.e. a probability-a-coin-is-here matrix. Then feed that output to your own hand-written code that turns the probability matrix into an X/Y coordinate.
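As a rough sketch of that last hand-written step, one option is to take the probability-weighted center of mass of the map. The function name `heatmap_to_xy` and the synthetic disc used in place of a real network output are assumptions for illustration only.

```python
import numpy as np


def heatmap_to_xy(prob_map):
    """Return the probability-weighted center of mass (x, y) of a 2-D map."""
    prob_map = np.asarray(prob_map, dtype=float)
    total = prob_map.sum()
    if total <= 0:
        raise ValueError("heat map contains no positive mass")
    ys, xs = np.indices(prob_map.shape)  # row (y) and column (x) index grids
    x = float((xs * prob_map).sum() / total)
    y = float((ys * prob_map).sum() / total)
    return x, y


# Synthetic stand-in for a network output: a disc of radius 10 at (x=120, y=60).
height, width = 200, 200
grid_y, grid_x = np.indices((height, width))
fake_map = ((grid_x - 120) ** 2 + (grid_y - 60) ** 2 <= 10 ** 2).astype(float)

center = heatmap_to_xy(fake_map)
print(center)
```

With a noisy real heat map you would probably threshold it first, or take the arg-max, before computing the center.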

Question : Is it possible for a neural network alone to perform well at this task?

A resounding YES, so long as you set up the right neural net architecture (like the above), but it would probably be much easier to implement and faster to train if you broke the task into steps and only applied the Neural Net to the coin detection step.


As far as I know, CNNs rely on sliding window techniques and can only indicate whether a certain pattern is present anywhere in given bounding boxes. Is that true?

Can one achieve localization with a CNN without the help of such techniques?


That's an open problem in image recognition. Besides sliding windows, existing approaches include predicting the object location in the image as a CNN output, predicting borders (classifying pixels as belonging to the object boundary or not) and so on. See for example this paper and references therein.

Also note that with CNN using max-pooling, one can identify positions of feature detectors that contributed to object recognition, and use that to suggest possible object location region.


I am building a RCNN detection network using Tensorflow's object detection API.

My goal is to detect bounding boxes for animals in outdoor videos. Most frames do not have animals and are just of dynamic backgrounds.

Most tutorials focus on training custom labels, but make no mention of negative training samples. How does this class of detectors deal with images that do not contain objects of interest? Does it just output a low probability, or will it try to force a bounding box somewhere in the image?

My current plan is to use traditional background subtraction in opencv to generate potential frames and pass them to a trained network. Should I also include a class of 'background' bounding boxes as 'negative data'?

The final option would be to use opencv for background subtraction, RCNN to generate bounding boxes, then a classification model of crops to identify animals versus background.


In general it's not necessary to explicitly include "negative images". What happens in these detection models is that they use the parts of the image that don't belong to the annotated objects as negatives.


I want to create a face detection mobile app, and I want to do it with regular deep learning (a convolutional network). I will train it on my computer and use the trained model in the mobile app.

My question is: can I get very fast computation on a regular phone like an iPhone? I need it to be very fast and to detect a face in the video in under 1 second. Is that possible on a mobile device, or does this kind of task need more powerful hardware?

I know the training phase must run on a powerful computer; I am asking about the production phase on a mobile device.

For example, if I put my phone in a street, can it detect all people's faces with the same deep network that was used in the training phase?


Yes, this is possible, but not with standard CNN architectures. Some changes are needed:

  • One approach is CNNs with binary weights, so evaluating the CNN can just be done with bit operations. There are many publications about this, like this, this or this. I have seen an implementation of YOLO with binary weights running in real-time on an iPhone, so it is definitely possible.
  • A second approach is to reduce the number of parameters of the neural network. For example, if you train a network with 5000 weights and get detection performance that is close to what you want, then this network might run in real-time. But this is harder.
  • A third approach is just to optimize the neural network architecture to minimize parameters, and combine it with a very optimized implementation. There are algorithms to speed up convolution operations, such as L-CNN, or the ones implemented by cuDNN.
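To illustrate why binary weights allow evaluation with bit operations (the first approach above), here is a small pure-Python sketch. It assumes that both weights and inputs have been binarized to +1/-1, which is only one of several binarization schemes; the helper names `pack_signs` and `binary_dot` are made up for this example.

```python
def pack_signs(values):
    """Pack a sequence of +1/-1 values into an int, one bit per value (+1 -> 1)."""
    bits = 0
    for i, v in enumerate(values):
        if v > 0:
            bits |= 1 << i
    return bits


def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n +1/-1 vectors from their packed bits.

    Matching bits contribute +1 and mismatching bits contribute -1,
    so the result is n - 2 * (number of mismatches).
    """
    mismatches = bin(a_bits ^ b_bits).count("1")  # XOR followed by popcount
    return n - 2 * mismatches


weights = [1, -1, 1, 1, -1, -1, 1, -1]
inputs = [1, 1, -1, 1, -1, 1, 1, -1]

fast = binary_dot(pack_signs(weights), pack_signs(inputs), len(weights))
slow = sum(w * x for w, x in zip(weights, inputs))
print(fast, slow)
```

On real hardware the XOR and popcount run on whole machine words at once, which is where the speed-up comes from.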

A very good related resource is the set of presentations and papers from the 1st International Workshop on Efficient Methods for Deep Neural Networks.


I am trying to implement a neural network for email spam detection. I have a neural network for solving the XOR problem, and I want to edit that network for my purpose and use ba. It's accessible here:

I downloaded a database of spam and ham emails in text format for training the network, so I have some training sets. But my question is:

What should be inputs for that neural network?

Thanks for every comment! :)


The short answer: the input will be your spam emails.

The longer answer, at a very basic level: assume your emails are free of weird characters. Imagine a vector in which each element represents one of the words that appear in those emails. For each email, you create one of those vectors, and for each element you calculate the frequency of that word in the email. All these vectors, one for each email, will be your inputs.

That's the basic idea. You can then refine it by applying stemming, using tf-idf instead of plain frequency, and bringing in other input elements (from the email headers, for example).
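A minimal sketch of the word-frequency vectors described above, using only the standard library. The two toy emails and the helper names (`tokenize`, `build_vocabulary`, `to_frequency_vector`) are invented for illustration; a real pipeline would add stemming or tf-idf as mentioned.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(emails):
    """Map every distinct word in the corpus to a fixed vector index."""
    words = sorted({word for email in emails for word in tokenize(email)})
    return {word: index for index, word in enumerate(words)}


def to_frequency_vector(email, vocab):
    """Return one input vector: relative frequency of each vocabulary word."""
    counts = Counter(tokenize(email))
    total = sum(counts.values()) or 1  # avoid division by zero on empty emails
    vector = [0.0] * len(vocab)
    for word, count in counts.items():
        if word in vocab:  # words unseen at training time are ignored
            vector[vocab[word]] = count / total
    return vector


emails = ["Win money now, win big", "Meeting moved to Monday"]
vocab = build_vocabulary(emails)
vectors = [to_frequency_vector(email, vocab) for email in emails]
print(vocab)
print(vectors)
```

The length of each vector (the vocabulary size) is then the number of input neurons, and the label for each vector is spam or ham.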


I am thinking about a toy project that would use a neural network for object recognition. Some of my objects are quite similar when viewed from one specific angle but easily distinguishable when viewed from a different angle. Thus my question:

What are methods to feed multiple images of the same object into a network? Or which network architectures exist that can take advantage of multiple images taken at different angles?

I have a good understanding of machine learning techniques but only basic understanding of neural networks. So what I am looking for here is both names of methods, techniques and other jargon that would be relevant for a google search as well as links to specific papers or articles that could be of interest.


The most common approaches for multidimensional data use either multidimensional convolutions, recurrent networks, or multiple inputs, which is somewhat similar to multidimensional convolutions.

Recurrent networks handle sequences of data, and a stack of images can be seen as a sequence. In contrast, multidimensional convolutions mostly exploit nearby data, so it is important that the same spatial location is highly correlated across your image stack. If this is not the case, you might want to consider using multiple inputs into your neural network.


First of all here is my github link for the question.

And here is my question:

I would like to do a face comparison function using Python. And I can successfully(?) recognize faces using OpenCV. Now, how do I do the comparison thing?

What I understand is this:

In the general machine learning approach, I would need to gather lots of data about that particular person and train a CNN on it.

However, I only have 2 images. How do I do the comparison? Should I think of it in terms of classification or clustering (using KNN)?

Thank you very much in advance for all your help.


You can use the idea of face-embeddings, which for example is proposed in the highly-cited paper FaceNet and implemented in OpenFace (which also comes pre-trained).

The general idea: take a preprocessed face (frontal, cropped, ...) and embed it into a lower-dimensional space with the property that faces that are similar in the input have a low Euclidean distance in the output.

So in your case: use the embedding CNN to map your faces to the reduced space (usually a vector of size 128) and calculate the Euclidean distance between them. Of course you could also cluster faces this way, but that's not your task.
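A sketch of the comparison step, assuming you already have two 128-dimensional embeddings from some embedding network. The random vectors below only stand in for real embeddings, and the threshold of 1.0 is a made-up value that would need tuning on real data.

```python
import numpy as np


def embedding_distance(a, b):
    """Euclidean distance between two face embeddings."""
    return float(np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))


def same_person(a, b, threshold=1.0):
    """Decide 'same face' when the distance is below a (hypothetical) threshold."""
    return embedding_distance(a, b) < threshold


def unit(v):
    """Scale a vector to unit length, as embeddings on a hypersphere would be."""
    return v / np.linalg.norm(v)


rng = np.random.default_rng(0)

# Stand-ins for network outputs: two nearby vectors and one unrelated vector.
face_a = unit(rng.normal(size=128))
face_a_again = unit(face_a + 0.02 * rng.normal(size=128))
face_b = unit(rng.normal(size=128))

print(embedding_distance(face_a, face_a_again), same_person(face_a, face_a_again))
print(embedding_distance(face_a, face_b), same_person(face_a, face_b))
```

With only 2 images there is nothing to train: the pre-trained embedding does the work and the comparison reduces to one distance and one threshold.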

The good thing here, besides the general idea: OpenFace is a nice ready-to-use implementation, and its homepage also explains the idea:

Use a deep neural network to represent (or embed) the face on a 128-dimensional unit hypersphere.

The embedding is a generic representation for anybody's face. Unlike other face representations, this embedding has the nice property that a larger distance between two face embeddings means that the faces are likely not of the same person.

This property makes clustering, similarity detection, and classification tasks easier than other face recognition techniques where the Euclidean distance between features is not meaningful.

They even have a comparison-demo here.


I understand how CNNs work for classification problems, such as on the MNIST dataset, where each image represents a hand-written digit. Images are evaluated, and classifications are given with some confidence.

I would like to know what approach I should take if I wish to identify several objects in one image, with a confidence for each. For example - if I evaluated an image of a cat and a dog, I would like a high confidence for both 'cat' and 'dog'. I do not care where the object is in the picture.

My current knowledge would lead me to build a dataset of images containing JUST dogs, and a dataset of images containing JUST cats. I would retrain the top-level of say, the Inception V3 network, and it would be able to identify which images are of cats, and which images are of dogs.

The problem with this is that evaluating an image of a dog and a cat will lead to 50% dog and 50% cat - because it is trying to classify the image, but I want to 'tag' the image (ideally reaching ~100% dog, ~100% cat).

I have briefly looked at region-based CNNs, which address a similar problem, but I don't care where in the picture the objects are - just that they can each be identified.

What approaches exist to solve this problem? I would like to achieve this in Python using something like Tensorflow or Keras.


First, to make this easy to understand, imagine you have 2 separate neural networks: one only identifies whether a cat is in the image or not, and the other identifies whether a dog is in the image or not. The neurons will surely learn to recognize that pretty well.

More interestingly, those 2 networks can be combined into a single network that shares weights and has 2 outputs, one for dog and one for cat. To do that, you just need to notice:

  • The 2 classes (cat and dog) can be in the same image, so [cat_label, dog_label] = {[0, 0], [0, 1], [1, 0], [1, 1]}. This is unlike MNIST or an ordinary classification model, where [cat_label, dog_label] = {[0, 1], [1, 0]} (one-hot label).
  • When you predict, you may choose some threshold to determine whether cat and dog appear. For example, if y_cat > 0.5 and y_dog > 0.5, then both cat and dog are in the image.
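The two points above can be sketched in NumPy as follows. This assumes the network ends in one independent sigmoid output per label instead of a softmax; the logits are invented numbers, not outputs of a trained model.

```python
import numpy as np

LABELS = ["cat", "dog"]


def sigmoid(z):
    """Independent per-label probability; outputs do not have to sum to 1."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))


def softmax(z):
    """Mutually exclusive class probabilities; outputs always sum to 1."""
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()


def tags_from_logits(logits, threshold=0.5):
    """Return every label whose sigmoid probability exceeds the threshold."""
    probs = sigmoid(logits)
    return [label for label, p in zip(LABELS, probs) if p > threshold]


both = [3.0, 3.0]       # hypothetical logits for an image with a cat and a dog
only_dog = [-2.0, 4.0]
neither = [-3.0, -3.0]

print(softmax(both))            # forced to split the probability between classes
print(sigmoid(both))            # both labels can be confident at the same time
print(tags_from_logits(both), tags_from_logits(only_dog), tags_from_logits(neither))
```

In Keras or TensorFlow this usually corresponds to a final dense layer with sigmoid activation trained with binary cross-entropy, with multi-hot labels such as [1, 1].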

Hope this helps!


I want to detect small objects (9x9 px) in my images (around 1200x900) using neural networks. Searching the net, I've found several webpages with Keras code that uses customized layers for custom object classification. In this case, I understand that you need to provide images where your object is alone. Although the training is good and it classifies them properly, unfortunately I haven't found out how to later load this trained network to find objects in my big images.

On the other hand, I have found that I can do this using the cnn class in cv if I load the weights from the YOLOv3 network. In this case I provide the big images with the proper annotations, but the network is not well trained...

Given this context, could someone show me how to load weights in cnn that were trained with a customized network, and how to train that network?


After a lot of search, I've found a better approach:

  1. Cut your images in subimages (I cut it in 2 rows and 4 columns).
  2. Feed yolo with these subimages and their proper annotations. I used yolov3 tiny, with a size of 960x960 for 10k steps. In my case, intensity and color were important, so random parameters such as hue, saturation and exposure were kept at 0. Use random angles. If your objects do not change in size, disable random at the yolo layers (random=0 in the cfg files; it only controls whether the training size is randomly changed between steps). For this, I'm using Alexey's darknet fork. If you have some blurred objects, add blur=1 in the [net] properties in the cfg file (after hue). For blur you need Alexey's fork, compiled with opencv (apart from cuda if you can).
  3. Calculate anchors with Alexey's fork. Cluster_num is the number of anchor pairs you use. You can find it by opening your cfg and looking at any anchors= line. Anchors are the sizes of the boxes that darknet will use to predict the positions.
  4. Change the cfg with your new anchors. If you have fixed-size objects, the anchors will be very close in size. I left the ones for bigger objects (first yolo layer) unchanged, but for the second layer, the tiny ones, I modified them and even removed 1 pair. If you remove some, then change the order in mask in every [yolo] section. Mask refers to the indices of the anchors, starting at index 0. If you remove some, also change the num= inside each [yolo] section.
  5. After that, detection is quite good. It could happen that, if you detect on a video, some objects are lost in some frames. You can try to avoid this by using the lstm cfg.
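Step 1 of the list above (cutting each image into 2 rows and 4 columns) could look roughly like this in NumPy. The function names and the plain (x_min, y_min, x_max, y_max) pixel box format are assumptions for illustration; darknet itself expects normalized center/size annotations, so a conversion would still be needed.

```python
import numpy as np


def split_into_tiles(image, rows=2, cols=4):
    """Cut an H x W (x C) image into rows*cols tiles.

    Returns a list of ((x_offset, y_offset), tile) pairs in row-major order.
    Pixels left over when H or W is not divisible are dropped in this sketch.
    """
    height, width = image.shape[:2]
    tile_h, tile_w = height // rows, width // cols
    tiles = []
    for r in range(rows):
        for c in range(cols):
            y0, x0 = r * tile_h, c * tile_w
            tiles.append(((x0, y0), image[y0:y0 + tile_h, x0:x0 + tile_w]))
    return tiles


def shift_box_to_tile(box, offset):
    """Translate a full-image box (x_min, y_min, x_max, y_max) into tile coordinates."""
    x_min, y_min, x_max, y_max = box
    x0, y0 = offset
    return (x_min - x0, y_min - y0, x_max - x0, y_max - y0)


# Blank stand-in for a 1200x900 RGB image (height 900, width 1200).
image = np.zeros((900, 1200, 3), dtype=np.uint8)
tiles = split_into_tiles(image, rows=2, cols=4)

offset, tile = tiles[5]                        # second row, second column
box_in_tile = shift_box_to_tile((310, 460, 319, 469), offset)
print(len(tiles), tile.shape, offset, box_in_tile)
```

Objects that straddle a tile border get cut by this simple scheme, so overlapping tiles may be worth considering.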

Now, if you also want to track them, you can apply a deep sort algorithm with your yolo pretrained network. For example, you can convert your pretrained network to keras using (add this commit for tiny yolov3 and then use

As an alternative, you can train it with mask-rcnn or any other faster-rcnn algorithm and then look for deep-sort.


I have trained a model using tensorflow object detection/SSD mobilenet. It works great!

I'd like to add a class to it - just to detect pens or something.

How can I do this?

I have created my image set already, I just cannot find any tutorials or info on how to add a single class to an existing model.



Your idea of adding a class to an existing model, in TensorFlow Object Detection API lingo, amounts to retraining a custom object detection model on a custom dataset (in this case, your pen dataset).

There are quite a few good tutorials on how to build a custom object detector with the TensorFlow Object Detection API.

For example, sentdex posted a very good step-by-step tutorial here. The official GitHub repo page also contains some good tutorials like this one: bringing in your own dataset. In some sense, this is the same as adding or deleting classes from the pretrained model.

But again, I think the above tutorials don't serve the exact goal of adding a class to the model. That only works if you have data for both the old classes and the new class and retrain on all of them. Since in your case you only have data for the new class, it is more properly described as retraining a custom object detection model.


Hi everybody. I'd like to use Caffe to train a 5-class detection task with "SSD: Single Shot MultiBox Detector", so I changed num_classes from 21 to 6. However, I get the following error:

"Check failed: num_priors_ * num_classes_ == bottom[1]->channels() (52392 vs. 183372) Number of priors must match number of confidence predictions."

I understand this error, and I noticed that 52392/6 = 183372/21. In other words, although I changed num_classes to 6, the number of confidence predictions is still 183372. How can I solve this problem? Thank you very much!


Since SSD depends on the number of labels not only for the classification output, but also for the BB prediction, you would need to change num_output in several other places in the model. I would strongly suggest you wouldn't do that manually, but rather use the python scripts provided in the 'examples/ssd' folder. For instance, you can change line 277 in 'examples/ssd/' to:

num_classes = 5 # instead of 21

And then use the model files this script provides.
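For reference, the arithmetic behind the failed check can be spelled out with the two numbers from the error message. This is only a sanity check, not part of the Caffe scripts, and the variable names are made up.

```python
# Numbers taken from the error message in the question.
priors_times_classes = 52392   # num_priors_ * num_classes_ with num_classes = 6
confidence_channels = 183372   # bottom[1]->channels(), produced by the conf layers

num_priors = priors_times_classes // 6
classes_in_conf_layers = confidence_channels // num_priors

print(num_priors)               # number of prior boxes in the model
print(classes_in_conf_layers)   # classes the confidence layers still predict
```

The confidence layers still emit 21 scores per prior, which is why their num_output has to be regenerated along with num_classes.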