Hot questions on using neural networks with YOLO

Question:

I have created my own dataset, which is a set of soccer ball images. Since I have only one class, I modified ball-yolov3-tiny.cfg, setting filters to 18 and classes to 1.

Then I annotated the images and put the generated .txt files in the same directory as the images. Finally, I started training from the darknet53.conv.74 weights by executing the command darknet detector train custom/ball-obj.data custom/ball-yolov3-tiny.cfg darknet53.conv.74.

I have 134 images for training and 15 images for testing. Here is a sample of the training output:

95: 670.797241, 597.741333 avg, 0.000000 rate, 313.254830 seconds, 6080 images
Loaded: 0.000302 seconds
Region 16 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.499381, .5R: -nan, .75R: -nan,  count: 0
Region 23 Avg IOU: 0.344946, Class: 0.498204, Obj: 0.496005, No Obj: 0.496541, .5R: 0.000000, .75R: 0.000000,  count: 32
Region 16 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.499381, .5R: -nan, .75R: -nan,  count: 0
Region 23 Avg IOU: 0.344946, Class: 0.498204, Obj: 0.496005, No Obj: 0.496541, .5R: 0.000000, .75R: 0.000000,  count: 32
96: 670.557190, 605.022949 avg, 0.000000 rate, 312.962750 seconds, 6144 images
Loaded: 0.000272 seconds
Region 16 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.499360, .5R: -nan, .75R: -nan,  count: 0
Region 23 Avg IOU: 0.344946, Class: 0.498204, Obj: 0.495868, No Obj: 0.496454, .5R: 0.000000, .75R: 0.000000,  count: 32
Region 16 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.499360, .5R: -nan, .75R: -nan,  count: 0
Region 23 Avg IOU: 0.344946, Class: 0.498204, Obj: 0.495868, No Obj: 0.496454, .5R: 0.000000, .75R: 0.000000,  count: 32
97: 670.165161, 611.537170 avg, 0.000000 rate, 312.681998 seconds, 6208 images
Loaded: 0.000282 seconds
Region 16 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.499331, .5R: -nan, .75R: -nan,  count: 0
Region 23 Avg IOU: 0.344946, Class: 0.498204, Obj: 0.495722, No Obj: 0.496397, .5R: 0.000000, .75R: 0.000000,  count: 32
Region 16 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.499331, .5R: -nan, .75R: -nan,  count: 0
Region 23 Avg IOU: 0.344946, Class: 0.498204, Obj: 0.495722, No Obj: 0.496397, .5R: 0.000000, .75R: 0.000000,  count: 32
98: 669.815918, 617.365051 avg, 0.000000 rate, 319.203044 seconds, 6272 images
Loaded: 0.000244 seconds
Region 16 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.499294, .5R: -nan, .75R: -nan,  count: 0
Region 23 Avg IOU: 0.344947, Class: 0.498204, Obj: 0.495569, No Obj: 0.496253, .5R: 0.000000, .75R: 0.000000,  count: 32
Region 16 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.499294, .5R: -nan, .75R: -nan,  count: 0
Region 23 Avg IOU: 0.344947, Class: 0.498204, Obj: 0.495569, No Obj: 0.496253, .5R: 0.000000, .75R: 0.000000,  count: 32
99: 669.555664, 622.584106 avg, 0.000000 rate, 320.330266 seconds, 6336 images
Loaded: 0.000244 seconds
Region 16 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.499246, .5R: -nan, .75R: -nan,  count: 0
Region 23 Avg IOU: 0.344948, Class: 0.498204, Obj: 0.495409, No Obj: 0.496197, .5R: 0.000000, .75R: 0.000000,  count: 32
Region 16 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.499246, .5R: -nan, .75R: -nan,  count: 0
Region 23 Avg IOU: 0.344948, Class: 0.498204, Obj: 0.495409, No Obj: 0.496197, .5R: 0.000000, .75R: 0.000000,  count: 32
100: 669.132629, 627.238953 avg, 0.000000 rate, 329.954091 seconds, 6400 images
Saving weights to backup//ball-yolov3-tiny.backup
Saving weights to backup//ball-yolov3-tiny_100.weights
Resizing
576
Loaded: 1.764142 seconds
Region 16 Avg IOU: -nan, Class: -nan, Obj: -nan, No Obj: 0.499216, .5R: -nan, .75R: -nan,  count: 0
Region 23 Avg IOU: 0.430712, Class: 0.498203, Obj: 0.495251, No Obj: 0.496154, .5R: 0.000000, .75R: 0.000000,  count: 32

Here are the other configuration files:

ball-obj.data

classes= 1
train  = custom/ball-train.txt
valid  = custom/ball-test.txt
names = custom/ball-obj.names
backup = backup/

ball-obj.names

ball

When I use the resulting weights to test a single image, the model simply fails to find the soccer balls. Do I need many more images (e.g. 10K)? Or do I need to train the model for many more hours? I just want to be sure that everything in my setup is OK.

Please feel free to ask any questions about my experiment. Your help is really appreciated. Thanks in advance.

P.S. Here is the whole content of my ball-yolov3-tiny.cfg:

[net]
# Testing
batch=1
subdivisions=1
# Training
#batch=64
#subdivisions=2
width=416
height=416
channels=3
momentum=0.9
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1

learning_rate=0.001
burn_in=1000
max_batches = 500200
policy=steps
steps=400000,450000
scales=.1,.1

[convolutional]
batch_normalize=1
filters=16
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=32
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=64
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=128
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=2

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[maxpool]
size=2
stride=1

[convolutional]
batch_normalize=1
filters=1024
size=3
stride=1
pad=1
activation=leaky

###########

[convolutional]
batch_normalize=1
filters=256
size=1
stride=1
pad=1
activation=leaky

[convolutional]
batch_normalize=1
filters=512
size=3
stride=1
pad=1
activation=leaky

[convolutional]
size=1
stride=1
pad=1
filters=18
activation=linear



[yolo]
mask = 3,4,5
anchors = 10,14,  23,27,  37,58,  81,82,  135,169,  344,319
classes=1
num=6
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1

[route]
layers = -4

[convolutional]
batch_normalize=1
filters=128
size=1
stride=1
pad=1
activation=leaky

[upsample]
stride=2

[route]
layers = -1, 8

[convolutional]
batch_normalize=1
filters=256
size=3
stride=1
pad=1
activation=leaky

[convolutional]
size=1
stride=1
pad=1
filters=18
activation=linear

[yolo]
mask = 0,1,2
anchors = 10,14,  23,27,  37,58,  81,82,  135,169,  344,319
classes=1
num=6
jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1

The command I execute is:

darknet detector train custom/ball-obj.data custom/ball-yolov3-tiny.cfg darknet53.conv.74

Answer:

  1. Your dataset is small, and 15 images for testing is too few, I think.
  2. batch=1 subdivisions=1 means you train on only one image per iteration; if you have enough GPU memory, increase these values to help the model converge better (see the cfg snippet below).
  3. The most obvious sign that your model is not good yet is the line 100: 669.132629, 627.238953 avg. The average loss is 627.238953, which is far too high. A good YOLO model typically ends with an average loss of around 0.06 to 1.

So, following the points above: continue training (100 iterations is not nearly enough, especially for Tiny YOLO), increase batch and subdivisions, and enlarge your dataset.
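
For point 2, the [net] section of the cfg you posted already carries the training values as comments; the usual change is simply to swap which lines are commented out:

[net]
# Testing
#batch=1
#subdivisions=1
# Training
batch=64
subdivisions=2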

Addition: recalculating the anchor boxes from your own dataset is also a good option; you can find many good examples and code for this online.
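
As a sketch of that anchor idea, here is a minimal k-means over labeled box sizes using the 1 - IoU distance that is commonly used for YOLO anchors. This is illustrative, not darknet's own tool: the Box type and the naive seeding are mine, it assumes boxes.count >= k, and the label boxes should be scaled to the network input size (416x416 here) before clustering.

import Foundation

struct Box { var w: Double; var h: Double }

// IoU of two boxes aligned at the origin (only sizes matter for anchors).
func iou(_ a: Box, _ b: Box) -> Double {
    let inter = min(a.w, b.w) * min(a.h, b.h)
    return inter / (a.w * a.h + b.w * b.h - inter)
}

func kmeansAnchors(_ boxes: [Box], k: Int, iterations: Int = 100) -> [Box] {
    var centroids = Array(boxes.shuffled().prefix(k))   // naive seeding
    for _ in 0..<iterations {
        var clusters = Array(repeating: [Box](), count: k)
        for box in boxes {
            // Assign each box to the centroid with the highest IoU (lowest 1 - IoU).
            let best = (0..<k).max { iou(box, centroids[$0]) < iou(box, centroids[$1]) }!
            clusters[best].append(box)
        }
        for i in 0..<k where !clusters[i].isEmpty {
            // Move each centroid to the mean size of its cluster.
            let n = Double(clusters[i].count)
            centroids[i] = Box(w: clusters[i].map { $0.w }.reduce(0, +) / n,
                               h: clusters[i].map { $0.h }.reduce(0, +) / n)
        }
    }
    return centroids.sorted { $0.w * $0.h < $1.w * $1.h } // small to large, as in the cfg
}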

Question:

I'm trying to train a neural net using YOLOv2 to recognize characters and objects in a video game. As input data, I took screenshots of in-game assets from various angles. However, these character models come with no backgrounds, only the models themselves. In the game, of course, there will be backgrounds behind the characters.

Will this confuse the neural network? And if so, should I gather some sample background images from the game and apply them randomly to the input data?


Answer:

Yes, you should add in-game backgrounds to your models, or you will never get decent detection quality. The network needs to learn the background, the placement of the objects on the background, and even the lighting of the objects in the scene; all of these contribute to the final detection quality.

The technique you use to blend the backgrounds with your images is also important.
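
As a starting point, here is a minimal CoreGraphics compositing sketch that pastes a transparent character render at a random position on a background screenshot. The function name is illustrative, it assumes the character render is smaller than the background, and Int.random requires Swift 4.2; remember to record the paste position and size so you can generate the matching YOLO label.

import CoreGraphics

func composite(character: CGImage, background: CGImage) -> CGImage? {
    guard let ctx = CGContext(data: nil,
                              width: background.width,
                              height: background.height,
                              bitsPerComponent: 8,
                              bytesPerRow: 0,
                              space: CGColorSpaceCreateDeviceRGB(),
                              bitmapInfo: CGImageAlphaInfo.premultipliedLast.rawValue)
    else { return nil }
    // Draw the background first, then the character at a random offset.
    ctx.draw(background, in: CGRect(x: 0, y: 0,
                                    width: background.width, height: background.height))
    let x = Int.random(in: 0...(background.width - character.width))
    let y = Int.random(in: 0...(background.height - character.height))
    // (x, y, character.width, character.height) is the box for the YOLO label.
    ctx.draw(character, in: CGRect(x: x, y: y,
                                   width: character.width, height: character.height))
    return ctx.makeImage()
}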

A good read about the subject: Synthesizing Training Data for Object Detection in Indoor Scenes

Question:

I am currently working on replicating YOLOv2 (not tiny) on iOS (Swift 4) using MPS.

The problem is that I find it hard to implement the space_to_depth function (https://www.tensorflow.org/api_docs/python/tf/space_to_depth) and the concatenation of two convolution outputs (13x13x256 + 13x13x1024 -> 13x13x1280). Could you give me some advice on implementing these parts? My code is below.

...

let conv19 = MPSCNNConvolutionNode(source: conv18.resultImage,
                                   weights: DataSource("conv19", 3, 3, 1024, 1024))

let conv20 = MPSCNNConvolutionNode(source: conv19.resultImage,
                                   weights: DataSource("conv20", 3, 3, 1024, 1024))

let conv21 = MPSCNNConvolutionNode(source: conv13.resultImage,
                                   weights: DataSource("conv21", 1, 1, 512, 64))

/*****
    1. space_to_depth with conv21
    2. concatenate the result of conv20 (13x13x1024) to the result of 1 (13x13x256)
    I need your help to implement this part!
******/

Answer:

  1. I believe space_to_depth can be expressed in form of a convolution: For instance, for an input with dimension [1,2,2,1], Use 4 convolution kernels that each output one number to one channel, ie. [[1,0],[0,0]] [[0,1],[0,0]] [[0,0],[1,0]] [[0,0],[0,1]], this should put all input numbers from spatial dimension to depth dimension.

  2. MPS actually has a concat node. See here: https://developer.apple.com/documentation/metalperformanceshaders/mpsnnconcatenationnode

    You can use it like this: concatNode = [[MPSNNConcatenationNode alloc] initWithSources:@[layerA.resultImage, layerB.resultImage]];
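
Since the question uses Swift, the same concat node in Swift (assuming layerA and layerB are existing filter nodes in your graph):

import MetalPerformanceShaders

// Concatenates the two feature maps along the channel dimension.
let concat = MPSNNConcatenationNode(sources: [layerA.resultImage,
                                              layerB.resultImage])

And, as a plain CPU reference for what space_to_depth computes (block size 2, HWC layout, matching TensorFlow's output channel ordering), which can be useful for validating whichever GPU implementation you settle on; this is a sketch, not MPS code:

func spaceToDepth(_ input: [Float], h: Int, w: Int, c: Int, block: Int = 2) -> [Float] {
    let oh = h / block, ow = w / block, oc = c * block * block
    var out = [Float](repeating: 0, count: oh * ow * oc)
    for y in 0..<h {
        for x in 0..<w {
            for ch in 0..<c {
                // The position inside the block selects the output channel group.
                let outCh = ((y % block) * block + (x % block)) * c + ch
                out[((y / block) * ow + (x / block)) * oc + outCh] =
                    input[(y * w + x) * c + ch]
            }
        }
    }
    return out
}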

Question:

I looked into Yahoo's old NSFW detector and can't help but wonder: is there a YOLO DNN version, trained on similar (unreleased) datasets, that would detect and locate human nudity in pictures?

Is there at least a public database for this, or must I gather my own?


Answer:

A recent effort produced a scraper for exactly that kind of data. As described in this article, it resulted in a 220k-image dataset you can find in this repo's /raw_data folder.

It may already be useful for you, but that dataset has very generic and sparsely defined categories, which inspired this newer, better-organized dataset. It has 159 defined categories, with a total of 1.58 million imgur URLs. These were taken mostly from Reddit channels, which - in all of Reddit's categorization glory - contributed to the overall placement of tags. The repo's README claims that after data cleaning - e.g. removing duplicate, corrupted, and deleted entries - the total volume should be ~500 GB and ~1.3 million images.

As for a pretrained YOLO, there's no published work on that. If you're okay with the dependency and cost of delegating that content filtering to Google's Cloud Vision API, they claim to be good at classifying visual adult content. Otherwise, since most work of this nature seems to be kept private, you'd have to train your own.
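
If you do go the Cloud Vision route, here is a rough sketch of a SafeSearch request over REST. Verify the endpoint and request shape against the current API docs; the apiKey handling is a placeholder and error handling is omitted. Note that SafeSearch classifies the whole image; it does not locate nudity within it.

import Foundation

func safeSearch(imageData: Data, apiKey: String) {
    let url = URL(string: "https://vision.googleapis.com/v1/images:annotate?key=\(apiKey)")!
    var request = URLRequest(url: url)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    // One request asking only for SAFE_SEARCH_DETECTION on the given image.
    let body: [String: Any] = [
        "requests": [[
            "image": ["content": imageData.base64EncodedString()],
            "features": [["type": "SAFE_SEARCH_DETECTION"]]
        ]]
    ]
    request.httpBody = try? JSONSerialization.data(withJSONObject: body)
    URLSession.shared.dataTask(with: request) { data, _, _ in
        if let data = data, let json = String(data: data, encoding: .utf8) {
            print(json)  // inspect safeSearchAnnotation (adult/racy likelihoods)
        }
    }.resume()
}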