Hot questions about using neural networks with LMDB

Question:

I am trying to build a deep learning model for saliency analysis using caffe (I am using the python wrapper), but I am unable to understand how to generate the LMDB data structure for this purpose. I have gone through the ImageNet and MNIST examples, and I understand that I should generate labels in the format

my_test_dir/picture-foo.jpg 0

But in my case, I will be labeling each pixel with 0 or 1 indicating whether that pixel is salient or not. That won't be a single label for an image.

How can I generate LMDB files for per-pixel labeling?


Answer:

You can approach this problem in two ways:

1. Use an HDF5 data layer instead of LMDB. HDF5 is more flexible and can support labels the size of the image. You can see this answer for an example of constructing and using an HDF5 input data layer.

2. Use two LMDB input layers: one for the images and one for the labels (a minimal sketch of building such paired LMDBs follows below). Note that when you build the LMDBs you must not use the 'shuffle' option, so that the images and their labels stay in sync.
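For the second approach, here is a minimal sketch of writing the two LMDBs in the same (unshuffled) order; the names write_lmdb, images and label_maps, as well as the shapes, are assumptions:

import lmdb
import caffe

def write_lmdb(db_path, arrays):
    # write a list of HxWxC uint8 arrays to an LMDB, preserving order
    env = lmdb.open(db_path, map_size=int(1e12))
    with env.begin(write=True) as txn:
        for i, arr in enumerate(arrays):
            datum = caffe.io.array_to_datum(arr.transpose(2, 0, 1))  # to CxHxW
            # zero-padded keys make LMDB iteration order match insertion order
            txn.put('{:0>10d}'.format(i), datum.SerializeToString())
    env.close()

# images[i] is an HxWx3 uint8 image, label_maps[i] is the matching HxWx1
# per-pixel label map; both lists are in the same order and neither is shuffled
write_lmdb('image_lmdb', images)
write_lmdb('label_lmdb', label_maps)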

Update: I recently gave a more detailed answer here.

Question:

I have two LMDB files; with the first one my network trains fine, while with the other one it doesn't really work (the loss starts and stays at 0). So I figured that maybe there's something wrong with the second LMDB. I tried writing some python code (mostly taken from here) to fetch the data from my LMDBs and inspect it, but so far no luck with either of the two databases. The LMDBs contain images as data and bounding-box information as labels.

Doing this:

for key, value in lmdb_cursor:
    datum.ParseFromString(value)
    label = datum.label
    data = caffe.io.datum_to_array(datum)

on either one of the LMDBs gives me a key which is correctly the name of the image, but the datum.ParseFromString call is not able to retrieve anything from value: label is always 0, while data is an empty ndarray. Nonetheless, the data is there; value is a binary string of around 140 KB, which I guess correctly accounts for the size of the image plus the bounding-box information.

I tried browsing several answers and discussions dealing with reading data from LMDBs in python, but I couldn't find any clue on how to read structured information such as bounding-box labels. My guess is that the parsing function expects a scalar label and interprets the first bytes as such, with the remaining data then being lost because the rest of the binary string no longer makes sense?

I know for a fact that at least the first LMDB is correct since my network performs correctly in both training and testing using it.

Any inputs will be greatly appreciated!


Answer:

The basic element stored in your LMDB is not Datum, but rather AnnotatedDatum. Therefore, you need to parse it accordingly:

annotated_datum = caffe.proto.caffe_pb2.AnnotatedDatum()
annotated_datum.ParseFromString(value)
datum = annotated_datum.datum                    # the image data
annotations = annotated_datum.annotation_group   # the bounding-box annotations
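For completeness, a reading loop might look like the sketch below; this assumes the SSD branch of caffe (where AnnotatedDatum is defined) and a placeholder LMDB path:

import lmdb
import caffe
from caffe.proto import caffe_pb2

env = lmdb.open('path/to/your_lmdb', readonly=True)
with env.begin() as txn:
    for key, value in txn.cursor():
        annotated_datum = caffe_pb2.AnnotatedDatum()
        annotated_datum.ParseFromString(value)
        datum = annotated_datum.datum                 # the image part
        if not datum.encoded:
            img = caffe.io.datum_to_array(datum)      # CxHxW uint8 array
        for group in annotated_datum.annotation_group:
            for annotation in group.annotation:
                bbox = annotation.bbox                # NormalizedBBox (xmin, ymin, xmax, ymax)
                print('{} {} {} {} {} {}'.format(key, group.group_label,
                                                 bbox.xmin, bbox.ymin, bbox.xmax, bbox.ymax))
env.close()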

Question:

I am relatively new to using caffe and am trying to create minimal working examples that I can (later) tweak. I had no difficulty using caffe's examples with the MNIST data. I downloaded the ImageNet data (ILSVRC12) and used caffe's tool to convert it to an lmdb database using:

$CAFFE_ROOT/build/install/bin/convert_imageset -shuffle -encoded=true top_level_data_dir/ fileNames.txt lmdb_name

to create an lmdb containing encoded (JPEG) image data. The reason for this is that, encoded, the lmdb is about 64 GB, versus about 240 GB unencoded.

My .prototxt file that describes the net is minimal (a pair of inner product layers, mostly borrowed from the MNIST example--not going for accuracy here, I just want something to work).

name: "example"
layer {
  name: "imagenet"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    scale: 0.00390625
  }
  data_param {
    source: "train-lmdb"
    batch_size: 100
    backend: LMDB
  }
}
layer {
  name: "imagenet"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TEST
  }
  transform_param {
    scale: 0.00390625
  }
  data_param {
    source: "test-lmdb"
    batch_size: 100
    backend: LMDB
  }
}
layer {
  name: "ip1"
  type: "InnerProduct"
  bottom: "data"
  top: "ip1"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  inner_product_param {
    num_output: 1000
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "relu1"
  type: "ReLU"
  bottom: "ip1"
  top: "ip1"
}
layer {
  name: "ip2"
  type: "InnerProduct"
  bottom: "ip1"
  top: "ip2"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 2
  }
  inner_product_param {
    num_output: 1000
    weight_filler {
      type: "xavier"
    }
    bias_filler {
      type: "constant"
    }
  }
}
layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "ip2"
  bottom: "label"
  top: "accuracy"
  include {
    phase: TEST
  }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "ip2"
  bottom: "label"
  top: "loss"
}

When train-lmdb is unencoded, this .prototxt file works fine (accuracy is abysmal, but caffe does not crash). However, if train-lmdb is encoded then I get the following error:

data_transformer.cpp:239] Check failed: channels == img_channels (3 vs. 1)

Question: Is there some "flag" I must set in the .prototxt file to indicate that train-lmdb contains encoded images? (The same flag would likely have to be given for the testing data layer, test-lmdb.)

A little research:

Poking around with Google I found a resolved issue which seemed promising. However, setting 'force_encoded_color' to true did not resolve my problem.

I also found this answer very helpful with creating the lmdb (specifically, with directions for enabling the encoding), however, no mention was made of what should be done so that caffe is aware that the images are encoded.


Answer:

The error message you got:

data_transformer.cpp:239] Check failed: channels == img_channels (3 vs. 1)

means caffe's data transformer is expecting an input with 3 channels (i.e., a color image), but is getting an image with only 1 channel (i.e., a grayscale image).

Looking at caffe.proto, it seems like you should set the force_color parameter in the transform_param:

layer {
  name: "imagenet"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    scale: 0.00390625
    force_color: true  ##  try this
  }
  data_param {
    source: "train-lmdb"
    batch_size: 100
    backend: LMDB
    force_encoded_color: true  ## cannot hurt...
  }
}

Question:

I looked at the Python example for LeNet and saw that the number of iterations needed to run over the entire MNIST test dataset is hard-coded. Can this value be computed instead of hard-coded? How can I get, in Python, the number of samples of the dataset a network points to?


Answer:

You can use the lmdb library to access the LMDB directly:

import lmdb

db = lmdb.open('/path/to/lmdb_folder')   # requires the lmdb python package
num_examples = int(db.stat()['entries'])

Should do the trick for you.
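From there, the number of test iterations no longer needs to be hard-coded; a small sketch, where the batch size (100 here) is an assumption that must match the batch_size of the TEST "Data" layer:

import math
import lmdb

batch_size = 100
db = lmdb.open('/path/to/lmdb_folder', readonly=True)
num_examples = int(db.stat()['entries'])
db.close()

# forward passes needed to sweep the whole test set exactly once
test_iter = int(math.ceil(num_examples / float(batch_size)))
print(test_iter)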

Question:

I have been trying to run the SqueezeNet model for quite some time now; after resolving multiple errors, I am stuck with this one.

When I run the command

./build/tools/caffe train -solver SqueezeNet/SqueezeNet_v1.0/solver.prototxt

I get

I0723 16:26:58.532799 11108 layer_factory.hpp:77] Creating layer data
F0723 16:26:58.629655 11108 db_lmdb.hpp:15] Check failed: mdb_status 
== 0 (2 vs. 0) No such file or directory
***     Check failure stack trace: ***
    @     0x7fb24de835cd  google::LogMessage::Fail()
    @     0x7fb24de85433  google::LogMessage::SendToLog()
    @     0x7fb24de8315b  google::LogMessage::Flush()
    @     0x7fb24de85e1e  google::LogMessageFatal::~LogMessageFatal()
    @     0x7fb24e23efd8  caffe::db::LMDB::Open()
    @     0x7fb24e2b541f  caffe::DataLayer<>::DataLayer()
    @     0x7fb24e2b55b2  caffe::Creator_DataLayer<>()
    @     0x7fb24e290a59  caffe::Net<>::Init()
    @     0x7fb24e29343e  caffe::Net<>::Net()
    @     0x7fb24e22a315  caffe::Solver<>::InitTrainNet()
    @     0x7fb24e22b6f5  caffe::Solver<>::Init()
    @     0x7fb24e22ba0f  caffe::Solver<>::Solver()
    @     0x7fb24e21c851  caffe::Creator_SGDSolver<>()
    @           0x40a958  train()
    @           0x4072f8  main
    @     0x7fb24c50c830  __libc_start_main
    @           0x407bc9  _start
    @              (nil)  (unknown)
Aborted (core dumped)

Any suggestions?


Answer:

It seems like caffe cannot find the LMDB database storing your training/validation data.

Make sure that the LMDB pointed to by the source: ... parameter in your "Data" layer exists, and that you have read permissions for this dataset.
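A quick way to sanity-check the path from Python; the path below is a placeholder, and note that a relative source: path is resolved from the directory you launch caffe from:

import os
import lmdb

source = 'examples/my_dataset/train_lmdb'   # copy the exact source: value here

print(os.path.isdir(source))                # the LMDB directory must exist
env = lmdb.open(source, readonly=True)      # and must be readable
print(env.stat()['entries'])                # number of stored examples
env.close()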

Question:

I am trying to use the LMDB file that I created to define the data layer in a caffe net, and I get the error below:

TypeError: 'LMDB' has type (type 'str'), but expected one of: (type 'int', type 'long')

I checked the labels in the text file that I passed to the script that generates the lmdb file (caffe/build/tools/convert_imageset). Am I missing something here?

Edit -1: Here is my data layer definition:

n.data,n.labels = L.Data(batch_size = batch_size, 
                         source=lmdb_src, 
                         backend = "LMDB", 
                         transform_param = dict(mean_file = mean_file),
                         ntop=2)

Answer:

You are trying to set

backend: "LMDB"

in your net definition, instead of

backend: LMDB

Note that LMDB is not passed as a string, but rather as an enumerated integer.

What you should do is set

backend = P.Data.LMDB

i.e., use the enum value defined by caffe's protobuf (with from caffe import params as P; it is also reachable as caffe.params.Data.LMDB).
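Applied to the layer definition from the question, a sketch (assuming n is the caffe.NetSpec() from the question, and lmdb_src, mean_file and batch_size are defined as before):

from caffe import layers as L, params as P

n.data, n.labels = L.Data(batch_size=batch_size,
                          source=lmdb_src,
                          backend=P.Data.LMDB,   # enum value, not the string "LMDB"
                          transform_param=dict(mean_file=mean_file),
                          ntop=2)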

Question:

I am trying to learn Caffe by training AlexNet on black-and-white images with circles (label: "1") and rectangles (label: "0"). I'm using 1800 training images (900 circles and 900 rectangles).

My train_val.prototxt looks like this:

name: "AlexNet"
layer {
   name: "data"
   type: "Data"
   top: "data"
   top: "label"
   include {
      phase: TRAIN
   }
   data_param {
      source: "newlmdb"
      batch_size: 100
      backend: LMDB
   }
}
layer {
   name: "data"
   type: "Data"
   top: "data"
   top: "label"
   include {
      phase: TEST
   }
   data_param {
      source: "newvallmdb"
      batch_size: 50
      backend: LMDB
   }
}
layer {
   name: "conv1"
   type: "Convolution"
   bottom: "data"
   top: "conv1"
   param {
      lr_mult: 1
       decay_mult: 1
   }
   param {
      lr_mult: 2
      decay_mult: 0
   }
   convolution_param {
      num_output: 96
      kernel_size: 11
      stride: 4
      weight_filler {
         type: "gaussian"
         std: 0.01
      }
      bias_filler {
         type: "constant"
         value: 0
      }
   }
}
layer {
   name: "relu1"
   type: "ReLU"
   bottom: "conv1"
   top: "conv1"
}
layer {
   name: "norm1"
   type: "LRN"
   bottom: "conv1"
   top: "norm1"
   lrn_param {
      local_size: 5
      alpha: 0.0001
      beta: 0.75
   }
}
layer {
   name: "pool1"
   type: "Pooling"
   bottom: "norm1"
   top: "pool1"
   pooling_param {
      pool: MAX
      kernel_size: 3
      stride: 2
   }
}
layer {
  name: "conv2"
  type: "Convolution"
  bottom: "pool1"
  top: "conv2"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  convolution_param {
    num_output: 256
    pad: 2
    kernel_size: 5
    group: 2
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0.1
    }
  }
}
layer {
   name: "relu2"
   type: "ReLU"
   bottom: "conv2"
   top: "conv2"
}
layer {
  name: "norm2"
  type: "LRN"
  bottom: "conv2"
  top: "norm2"
  lrn_param {
    local_size: 5
    alpha: 0.0001
    beta: 0.75
  }
}
layer {
  name: "pool2"
  type: "Pooling"
  bottom: "norm2"
  top: "pool2"
  pooling_param {
    pool: MAX
    kernel_size: 3
    stride: 2
  }
}
layer {
  name: "conv3"
  type: "Convolution"
  bottom: "pool2"
  top: "conv3"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  convolution_param {
    num_output: 384
    pad: 1
    kernel_size: 3
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "relu3"
  type: "ReLU"
  bottom: "conv3"
  top: "conv3"
}
layer {
  name: "conv4"
  type: "Convolution"
  bottom: "conv3"
  top: "conv4"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  convolution_param {
    num_output: 384
    pad: 1
    kernel_size: 3
    group: 2
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0.1
    }
  }
}
layer {
  name: "relu4"
  type: "ReLU"
  bottom: "conv4"
  top: "conv4"
}
layer {
  name: "conv5"
  type: "Convolution"
  bottom: "conv4"
  top: "conv5"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  convolution_param {
    num_output: 256
    pad: 1
    kernel_size: 3
    group: 2
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0.1
    }
  }
}

layer {
  name: "relu5"
  type: "ReLU"
  bottom: "conv5"
  top: "conv5"
}
layer {
  name: "pool5"
  type: "Pooling"
  bottom: "conv5"
  top: "pool5"
  pooling_param {
    pool: MAX
    kernel_size: 3
    stride: 2
  }
}
layer {
  name: "fc6"
  type: "InnerProduct"
  bottom: "pool5"
  top: "fc6"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  inner_product_param {
    num_output: 4096
    weight_filler {
      type: "gaussian"
      std: 0.005
    }
    bias_filler {
      type: "constant"
      value: 0.1
    }
  }
}
layer {
  name: "relu6"
  type: "ReLU"
  bottom: "fc6"
  top: "fc6"
}
layer {
  name: "drop6"
  type: "Dropout"
  bottom: "fc6"
  top: "fc6"
  dropout_param {
    dropout_ratio: 0.5
  }
}
layer {
  name: "fc7"
  type: "InnerProduct"
  bottom: "fc6"
  top: "fc7"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  inner_product_param {
    num_output: 4096
    weight_filler {
      type: "gaussian"
      std: 0.005
    }
    bias_filler {
      type: "constant"
      value: 0.1
    }
  }
}
layer {
  name: "relu7"
  type: "ReLU"
  bottom: "fc7"
  top: "fc7"
}
layer {
  name: "drop7"
  type: "Dropout"
  bottom: "fc7"
  top: "fc7"
  dropout_param {
    dropout_ratio: 0.5
  }
}
layer {
  name: "fc8"
  type: "InnerProduct"
  bottom: "fc7"
  top: "fc8"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  inner_product_param {
    num_output: 1
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}
layer {
  name: "accuracy"
  type: "Accuracy"
  bottom: "fc8"
  bottom: "label"
  top: "accuracy"
  include {
    phase: TEST
  }
}
layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  bottom: "fc8"
  bottom: "label"
  top: "loss"
}

My solver.prototxt looks like this:

net: "train_val.prototxt"
test_iter: 200
test_interval: 200
base_lr: 0.01
lr_policy: "step"
gamma: 0.1
stepsize: 50
display: 20
max_iter: 500
momentum: 0.9
weight_decay: 0.0005
snapshot: 100
snapshot_prefix: "training"
solver_mode: GPU

While training I get this output:

I1018 10:13:04.936286  7404 solver.cpp:330] Iteration 0, Testing net (#0)
I1018 10:13:06.262091  7792 data_layer.cpp:73] Restarting data prefetching from start.
I1018 10:13:07.556700  7792 data_layer.cpp:73] Restarting data prefetching from start.
I1018 10:13:11.440527  7792 data_layer.cpp:73] Restarting data prefetching from start.
I1018 10:13:12.267205  7404 solver.cpp:397]     Test net output #0: accuracy = 0.94
I1018 10:13:12.267205  7404 solver.cpp:397]     Test net output #1: loss = 0.104804 (* 1 = 0.104804 loss)
I1018 10:13:12.594758  7404 solver.cpp:218] Iteration 0 (-9.63533e-42 iter/s, 7.69215s/20 iters), loss = 0.873365
I1018 10:13:12.594758  7404 solver.cpp:237]     Train net output #0: loss = 0.873365 (* 1 = 0.873365 loss)
I1018 10:13:12.594758  7404 sgd_solver.cpp:105] Iteration 0, lr = 0.01
I1018 10:13:15.807883  7748 data_layer.cpp:73] Restarting data prefetching from start.
I1018 10:13:17.305263  7404 solver.cpp:218] Iteration 20 (4.25024 iter/s, 4.70562s/20 iters), loss = 0.873365
I1018 10:13:17.305263  7404 solver.cpp:237]     Train net output #0: loss = 0.873365 (* 1 = 0.873365 loss)
I1018 10:13:17.305263  7404 sgd_solver.cpp:105] Iteration 20, lr = 0.01
I1018 10:13:20.019263  7748 data_layer.cpp:73] Restarting data prefetching from start.
I1018 10:13:21.984572  7404 solver.cpp:218] Iteration 40 (4.26967 iter/s, 4.6842s/20 iters), loss = 0.873365
I1018 10:13:21.984572  7404 solver.cpp:237]     Train net output #0: loss = 0.873365 (* 1 = 0.873365 loss)
I1018 10:13:21.984572  7404 sgd_solver.cpp:105] Iteration 40, lr = 0.01
I1018 10:13:24.246239  7748 data_layer.cpp:73] Restarting data prefetching from start.
I1018 10:13:26.695078  7404 solver.cpp:218] Iteration 60 (4.25863 iter/s, 4.69634s/20 iters), loss = 0.873365
I1018 10:13:26.695078  7404 solver.cpp:237]     Train net output #0: loss = 0.873365 (* 1 = 0.873365 loss)
I1018 10:13:26.695078  7404 sgd_solver.cpp:105] Iteration 60, lr = 0.001
I1018 10:13:28.426422  7748 data_layer.cpp:73] Restarting data prefetching from start.
I1018 10:13:31.421181  7404 solver.cpp:218] Iteration 80 (4.22339 iter/s, 4.73554s/20 iters), loss = 0.873365
I1018 10:13:31.421181  7404 solver.cpp:237]     Train net output #0: loss = 0.873365 (* 1 = 0.873365 loss)
I1018 10:13:31.421181  7404 sgd_solver.cpp:105] Iteration 80, lr = 0.001
I1018 10:13:32.731387  7748 data_layer.cpp:73] Restarting data prefetching from start.
[I 10:13:32.934 NotebookApp] Saving file at /Untitled2.ipynb
I1018 10:13:35.788537  7404 solver.cpp:447] Snapshotting to binary proto file training_iter_100.caffemodel
I1018 10:13:37.317111  7404 sgd_solver.cpp:273] Snapshotting solver state to binary proto file training_iter_100.solverstate
I1018 10:13:38.081399  7404 solver.cpp:218] Iteration 100 (3.00631 iter/s, 6.65267s/20 iters), loss = 0
I1018 10:13:38.081399  7404 solver.cpp:237]     Train net output #0: loss = 0 (* 1 = 0 loss)
I1018 10:13:38.081399  7404 sgd_solver.cpp:105] Iteration 100, lr = 0.0001
I1018 10:13:38.908077  7748 data_layer.cpp:73] Restarting data prefetching from start.
I1018 10:13:42.791904  7404 solver.cpp:218] Iteration 120 (4.23481 iter/s, 4.72276s/20 iters), loss = 0
I1018 10:13:42.807502  7404 solver.cpp:237]     Train net output #0: loss = 0 (* 1 = 0 loss)
I1018 10:13:42.807502  7404 sgd_solver.cpp:105] Iteration 120, lr = 0.0001
I1018 10:13:43.088260  7748 data_layer.cpp:73] Restarting data prefetching from start.
I1018 10:13:47.393225  7748 data_layer.cpp:73] Restarting data prefetching from start.
I1018 10:13:47.549202  7404 solver.cpp:218] Iteration 140 (4.21716 iter/s, 4.74253s/20 iters), loss = 0
I1018 10:13:47.549202  7404 solver.cpp:237]     Train net output #0: loss = 0 (* 1 = 0 loss)
I1018 10:13:47.549202  7404 sgd_solver.cpp:105] Iteration 140, lr = 0.0001
I1018 10:13:51.635800  7748 data_layer.cpp:73] Restarting data prefetching from start.
I1018 10:13:52.290904  7404 solver.cpp:218] Iteration 160 (4.21268 iter/s, 4.74757s/20 iters), loss = 0
I1018 10:13:52.290904  7404 solver.cpp:237]     Train net output #0: loss = 0 (* 1 = 0 loss)
I1018 10:13:52.290904  7404 sgd_solver.cpp:105] Iteration 160, lr = 1e-05
I1018 10:13:56.003156  7748 data_layer.cpp:73] Restarting data prefetching from start.
I1018 10:13:57.048202  7404 solver.cpp:218] Iteration 180 (4.20926 iter/s, 4.75142s/20 iters), loss = 0.873365
I1018 10:13:57.048202  7404 solver.cpp:237]     Train net output #0: loss = 0.873365 (* 1 = 0.873365 loss)
I1018 10:13:57.048202  7404 sgd_solver.cpp:105] Iteration 180, lr = 1e-05
I1018 10:14:00.214535  7748 data_layer.cpp:73] Restarting data prefetching from start.
I1018 10:14:01.431155  7404 solver.cpp:447] Snapshotting to binary proto file training_iter_200.caffemodel
I1018 10:14:03.053316  7404 sgd_solver.cpp:273] Snapshotting solver state to binary proto file training_iter_200.solverstate
I1018 10:14:03.552443  7404 solver.cpp:330] Iteration 200, Testing net (#0)
I1018 10:14:04.082764  7792 data_layer.cpp:73] Restarting data prefetching from start.
I1018 10:14:05.439764  7792 data_layer.cpp:73] Restarting data prefetching from start.
I1018 10:14:10.727385  7792 data_layer.cpp:73] Restarting data prefetching from start.
I1018 10:14:10.789775  7404 blocking_queue.cpp:49] Waiting for data
I1018 10:14:10.961350  7404 solver.cpp:397]     Test net output #0: accuracy = 0.94
I1018 10:14:10.961350  7404 solver.cpp:397]     Test net output #1: loss = 0.104804 (* 1 = 0.104804 loss)
I1018 10:14:11.179718  7404 solver.cpp:218] Iteration 200 (1.41459 iter/s, 14.1384s/20 iters), loss = 0.873365
I1018 10:14:11.179718  7404 solver.cpp:237]     Train net output #0: loss = 0.873365 (* 1 = 0.873365 loss)
I1018 10:14:11.179718  7404 sgd_solver.cpp:105] Iteration 200, lr = 1e-06
I1018 10:14:13.846925  7748 data_layer.cpp:73] Restarting data prefetching from start.
I1018 10:14:15.952615  7404 solver.cpp:218] Iteration 220 (4.19673 iter/s, 4.76562s/20 iters), loss = 0.873365
I1018 10:14:15.952615  7404 solver.cpp:237]     Train net output #0: loss = 0.873365 (* 1 = 0.873365 loss)
I1018 10:14:15.952615  7404 sgd_solver.cpp:105] Iteration 220, lr = 1e-06
I1018 10:14:18.198683  7748 data_layer.cpp:73] Restarting data prefetching from start.
I1018 10:14:20.709913  7404 solver.cpp:218] Iteration 240 (4.19817 iter/s, 4.76398s/20 iters), loss = 0.873365
I1018 10:14:20.709913  7404 solver.cpp:237]     Train net output #0: loss = 0.873365 (* 1 = 0.873365 loss)
I1018 10:14:20.709913  7404 sgd_solver.cpp:105] Iteration 240, lr = 1e-06
I1018 10:14:22.441257  7748 data_layer.cpp:73] Restarting data prefetching from start.
I1018 10:14:25.498407  7404 solver.cpp:218] Iteration 260 (4.18243 iter/s, 4.78191s/20 iters), loss = 0.873365
I1018 10:14:25.498407  7404 solver.cpp:237]     Train net output #0: loss = 0.873365 (* 1 = 0.873365 loss)
I1018 10:14:25.498407  7404 sgd_solver.cpp:105] Iteration 260, lr = 1e-07
I1018 10:14:26.761821  7748 data_layer.cpp:73] Restarting data prefetching from start.
I1018 10:14:30.271303  7404 solver.cpp:218] Iteration 280 (4.18629 iter/s, 4.7775s/20 iters), loss = 0
I1018 10:14:30.271303  7404 solver.cpp:237]     Train net output #0: loss = 0 (* 1 = 0 loss)
I1018 10:14:30.271303  7404 sgd_solver.cpp:105] Iteration 280, lr = 1e-07
I1018 10:14:31.129176  7748 data_layer.cpp:73] Restarting data prefetching from start.
I1018 10:14:34.701050  7404 solver.cpp:447] Snapshotting to binary proto file training_iter_300.caffemodel
I1018 10:14:36.136039  7404 sgd_solver.cpp:273] Snapshotting solver state to binary proto file training_iter_300.solverstate
I1018 10:14:36.931521  7404 solver.cpp:218] Iteration 300 (3.00228 iter/s, 6.66161s/20 iters), loss = 0
I1018 10:14:36.931521  7404 solver.cpp:237]     Train net output #0: loss = 0 (* 1 = 0 loss)
I1018 10:14:36.931521  7404 sgd_solver.cpp:105] Iteration 300, lr = 1e-08
I1018 10:14:37.337061  7748 data_layer.cpp:73] Restarting data prefetching from start.
I1018 10:14:41.595233  7748 data_layer.cpp:73] Restarting data prefetching from start.
I1018 10:14:41.688819  7404 solver.cpp:218] Iteration 320 (4.20513 iter/s, 4.7561s/20 iters), loss = 0
I1018 10:14:41.688819  7404 solver.cpp:237]     Train net output #0: loss = 0 (* 1 = 0 loss)
I1018 10:14:41.688819  7404 sgd_solver.cpp:105] Iteration 320, lr = 1e-08
I1018 10:14:45.884600  7748 data_layer.cpp:73] Restarting data prefetching from start.
I1018 10:14:46.461715  7404 solver.cpp:218] Iteration 340 (4.19496 iter/s, 4.76763s/20 iters), loss = 0
I1018 10:14:46.461715  7404 solver.cpp:237]     Train net output #0: loss = 0 (* 1 = 0 loss)
I1018 10:14:46.461715  7404 sgd_solver.cpp:105] Iteration 340, lr = 1e-08
I1018 10:14:50.111598  7748 data_layer.cpp:73] Restarting data prefetching from start.
I1018 10:14:51.234639  7404 solver.cpp:218] Iteration 360 (4.1858 iter/s, 4.77806s/20 iters), loss = 0.873365
I1018 10:14:51.234639  7404 solver.cpp:237]     Train net output #0: loss = 0.873365 (* 1 = 0.873365 loss)
I1018 10:14:51.234639  7404 sgd_solver.cpp:105] Iteration 360, lr = 1e-09
I1018 10:14:54.478982  7748 data_layer.cpp:73] Restarting data prefetching from start.
I1018 10:14:56.007566  7404 solver.cpp:218] Iteration 380 (4.19437 iter/s, 4.76829s/20 iters), loss = 0.873365
I1018 10:14:56.007566  7404 solver.cpp:237]     Train net output #0: loss = 0.873365 (* 1 = 0.873365 loss)
I1018 10:14:56.007566  7404 sgd_solver.cpp:105] Iteration 380, lr = 1e-09
I1018 10:14:58.705986  7748 data_layer.cpp:73] Restarting data prefetching from start.
I1018 10:15:00.421743  7404 solver.cpp:447] Snapshotting to binary proto file training_iter_400.caffemodel
I1018 10:15:01.903534  7404 sgd_solver.cpp:273] Snapshotting solver state to binary proto file training_iter_400.solverstate
I1018 10:15:02.371469  7404 solver.cpp:330] Iteration 400, Testing net (#0)
I1018 10:15:03.478912  7792 data_layer.cpp:73] Restarting data prefetching from start.
I1018 10:15:04.820323  7792 data_layer.cpp:73] Restarting data prefetching from start.
I1018 10:15:06.146136  7792 data_layer.cpp:73] Restarting data prefetching from start.
I1018 10:15:07.471949  7792 data_layer.cpp:73] Restarting data prefetching from start.
I1018 10:15:08.813360  7792 data_layer.cpp:73] Restarting data prefetching from start.
I1018 10:15:09.796021  7404 solver.cpp:397]     Test net output #0: accuracy = 0.95
I1018 10:15:09.796021  7404 solver.cpp:397]     Test net output #1: loss = 0.0873365 (* 1 = 0.0873365 loss)
I1018 10:15:10.014390  7404 solver.cpp:218] Iteration 400 (1.4278 iter/s, 14.0076s/20 iters), loss = 0.873365
I1018 10:15:10.014390  7404 solver.cpp:237]     Train net output #0: loss = 0.873365 (* 1 = 0.873365 loss)
I1018 10:15:10.014390  7404 sgd_solver.cpp:105] Iteration 400, lr = 1e-10
I1018 10:15:12.291669  7748 data_layer.cpp:73] Restarting data prefetching from start.
I1018 10:15:14.787317  7404 solver.cpp:218] Iteration 420 (4.18883 iter/s, 4.7746s/20 iters), loss = 0.873365
I1018 10:15:14.787317  7404 solver.cpp:237]     Train net output #0: loss = 0.873365 (* 1 = 0.873365 loss)
I1018 10:15:14.787317  7404 sgd_solver.cpp:105] Iteration 420, lr = 1e-10
I1018 10:15:16.582064  7748 data_layer.cpp:73] Restarting data prefetching from start.
I1018 10:15:19.545646  7404 solver.cpp:218] Iteration 440 (4.20273 iter/s, 4.75881s/20 iters), loss = 0.873365
I1018 10:15:19.545646  7404 solver.cpp:237]     Train net output #0: loss = 0.873365 (* 1 = 0.873365 loss)
I1018 10:15:19.545646  7404 sgd_solver.cpp:105] Iteration 440, lr = 1e-10
I1018 10:15:20.824666  7748 data_layer.cpp:73] Restarting data prefetching from start.
I1018 10:15:24.334172  7404 solver.cpp:218] Iteration 460 (4.18022 iter/s, 4.78443s/20 iters), loss = 0
I1018 10:15:24.334172  7404 solver.cpp:237]     Train net output #0: loss = 0 (* 1 = 0 loss)
I1018 10:15:24.334172  7404 sgd_solver.cpp:105] Iteration 460, lr = 1e-11
I1018 10:15:25.114061  7748 data_layer.cpp:73] Restarting data prefetching from start.
I1018 10:15:29.107098  7404 solver.cpp:218] Iteration 480 (4.18678 iter/s, 4.77694s/20 iters), loss = 0
I1018 10:15:29.107098  7404 solver.cpp:237]     Train net output #0: loss = 0 (* 1 = 0 loss)
I1018 10:15:29.107098  7404 sgd_solver.cpp:105] Iteration 480, lr = 1e-11
I1018 10:15:29.497043  7748 data_layer.cpp:73] Restarting data prefetching from start.
I1018 10:15:33.505677  7404 solver.cpp:447] Snapshotting to binary proto file training_iter_500.caffemodel
I1018 10:15:35.112251  7404 sgd_solver.cpp:273] Snapshotting solver state to binary proto file training_iter_500.solverstate
I1018 10:15:35.751760  7404 solver.cpp:310] Iteration 500, loss = 0
I1018 10:15:35.751760  7404 solver.cpp:315] Optimization Done.

As you can see, the loss is either a constant 0.873365 or 0, and I don't know why. When I use the following code to test images, I always get zero in return:

import numpy as np
import caffe

img = caffe.io.load_image('val/img911.png', color=False)
grayimg = img[:,:,0]
gi = np.reshape(grayimg, (260,260,1))

net = caffe.Net('deploy.prototxt',
                'training_iter_500.caffemodel',
                caffe.TEST)

transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2,0,1))
transformer.set_raw_scale('data', 255.0)

net.blobs['data'].reshape(1,1,260,260)
net.blobs['data'].data[...] = transformer.preprocess('data', gi)

out = net.forward()

print out['prob'].argmax()

To create the LMDB file I used this script:

import numpy as np
import lmdb
import caffe
import cv2

N = 1800

X = np.zeros((N, 1, 260, 260), dtype=np.uint8)
y = np.zeros(N, dtype=np.int64)
map_size = X.nbytes * 10

file = open("train.txt", "r") 
files =  file.readlines() 
print(len(files))

for i in range(0,len(files)):
    line = files[i]
    img_path = line.split()[0]
    label = line.split()[1]
    img = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
    X[i]=img

env = lmdb.open('newlmdb', map_size=map_size)

with env.begin(write=True) as txn:
    # txn is a Transaction object
    for i in range(N):
        datum = caffe.proto.caffe_pb2.Datum()
        datum.channels = X.shape[1]
        datum.height = X.shape[2]
        datum.width = X.shape[3]
        datum.data = X[i].tobytes()  # or .tostring() if numpy < 1.9
        datum.label = int(y[i])
        y[i]=label

Is this a mistake in my code, or did I choose bad parameters for the network?

EDIT

I edited my data layer to get zero-mean inputs:

layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    mirror: true
    crop_size: 260
    mean_file: "formen_mean.binaryproto"
  }
  data_param {
    source: "newlmdb"
    batch_size: 10
    backend: LMDB
  }
}

I increased the number of training images to 10000 and the test images to 1000, shuffled my data, and edited my solver.prototxt:

net: "train_val.prototxt"
test_iter: 20
test_interval: 50
base_lr: 0.01
lr_policy: "step"
gamma: 0.1
stepsize: 50
display: 20
max_iter: 1000
momentum: 0.9
weight_decay: 0.0005
snapshot: 200
debug_info: true
snapshot_prefix: "training"
solver_mode: GPU

At some point in the Debug info the following happened:

I1018 14:21:16.238169  5540 net.cpp:619]     [Backward] Layer drop6, bottom blob fc6 diff: 2.64904e-05
I1018 14:21:16.238169  5540 net.cpp:619]     [Backward] Layer relu6, bottom blob fc6 diff: 1.33896e-05
I1018 14:21:16.269316  5540 net.cpp:619]     [Backward] Layer fc6, bottom blob pool2 diff: 8.48778e-06
I1018 14:21:16.269316  5540 net.cpp:630]     [Backward] Layer fc6, param blob 0 diff: 0.000181272
I1018 14:21:16.269316  5540 net.cpp:630]     [Backward] Layer fc6, param blob 1 diff: 0.000133896
I1018 14:21:16.269316  5540 net.cpp:619]     [Backward] Layer pool2, bottom blob norm2 diff: 1.82455e-06
I1018 14:21:16.269316  5540 net.cpp:619]     [Backward] Layer norm2, bottom blob conv2 diff: 1.82354e-06
I1018 14:21:16.269316  5540 net.cpp:619]     [Backward] Layer relu2, bottom blob conv2 diff: 1.41858e-06
I1018 14:21:16.284889  5540 net.cpp:619]     [Backward] Layer conv2, bottom blob pool1 diff: 1.989e-06
I1018 14:21:16.284889  5540 net.cpp:630]     [Backward] Layer conv2, param blob 0 diff: 0.00600851
I1018 14:21:16.284889  5540 net.cpp:630]     [Backward] Layer conv2, param blob 1 diff: 0.00107259
I1018 14:21:16.284889  5540 net.cpp:619]     [Backward] Layer pool1, bottom blob norm1 diff: 4.57322e-07
I1018 14:21:16.284889  5540 net.cpp:619]     [Backward] Layer norm1, bottom blob conv1 diff: 4.54691e-07
I1018 14:21:16.284889  5540 net.cpp:619]     [Backward] Layer relu1, bottom blob conv1 diff: 2.18649e-07
I1018 14:21:16.284889  5540 net.cpp:630]     [Backward] Layer conv1, param blob 0 diff: 0.0333731
I1018 14:21:16.284889  5540 net.cpp:630]     [Backward] Layer conv1, param blob 1 diff: 0.000384605
E1018 14:21:16.331610  5540 net.cpp:719]     [Backward] All net params (data, diff): L1 norm = (1.0116e+06, 55724.3); L2 norm = (80.218, 24.0218)
I1018 14:21:16.331610  5540 solver.cpp:218] Iteration 0 (0 iter/s, 1.69776s/20 iters), loss = 8.73365
I1018 14:21:16.331610  5540 solver.cpp:237]     Train net output #0: loss = 8.73365 (* 1 = 8.73365 loss)
I1018 14:21:16.331610  5540 sgd_solver.cpp:105] Iteration 0, lr = 0.01
I1018 14:21:19.726611  5540 net.cpp:591]     [Forward] Layer data, top blob data data: 44.8563
I1018 14:21:19.742184  5540 net.cpp:591]     [Forward] Layer data, top blob label data: 1
I1018 14:21:19.742184  5540 net.cpp:591]     [Forward] Layer conv1, top blob conv1 data: nan
I1018 14:21:19.742184  5540 net.cpp:603]     [Forward] Layer conv1, param blob 0 data: nan
I1018 14:21:19.742184  5540 net.cpp:603]     [Forward] Layer conv1, param blob 1 data: nan
I1018 14:21:19.742184  5540 net.cpp:591]     [Forward] Layer relu1, top blob conv1 data: nan
I1018 14:21:19.742184  5540 net.cpp:591]     [Forward] Layer norm1, top blob norm1 data: nan
I1018 14:21:19.742184  5540 net.cpp:591]     [Forward] Layer pool1, top blob pool1 data: inf
I1018 14:21:19.742184  5540 net.cpp:591]     [Forward] Layer conv2, top blob conv2 data: nan
I1018 14:21:19.742184  5540 net.cpp:603]     [Forward] Layer conv2, param blob 0 data: nan
I1018 14:21:19.742184  5540 net.cpp:603]     [Forward] Layer conv2, param blob 1 data: nan
I1018 14:21:19.742184  5540 net.cpp:591]     [Forward] Layer relu2, top blob conv2 data: nan
I1018 14:21:19.742184  5540 net.cpp:591]     [Forward] Layer norm2, top blob norm2 data: nan
I1018 14:21:19.742184  5540 net.cpp:591]     [Forward] Layer pool2, top blob pool2 data: inf

So I reduced the base_lr to 0.0001. But at some later point the gradient drops to zero:

I1018 14:24:40.919765  5500 net.cpp:591]     [Forward] Layer loss, top blob loss data: 0
I1018 14:24:40.919765  5500 net.cpp:619]     [Backward] Layer loss, bottom blob fc8 diff: 0
I1018 14:24:40.919765  5500 net.cpp:619]     [Backward] Layer fc8, bottom blob fc7 diff: 0
I1018 14:24:40.919765  5500 net.cpp:630]     [Backward] Layer fc8, param blob 0 diff: 0
I1018 14:24:40.919765  5500 net.cpp:630]     [Backward] Layer fc8, param blob 1 diff: 0
I1018 14:24:40.919765  5500 net.cpp:619]     [Backward] Layer drop7, bottom blob fc7 diff: 0
I1018 14:24:40.919765  5500 net.cpp:619]     [Backward] Layer relu7, bottom blob fc7 diff: 0
I1018 14:24:40.919765  5500 net.cpp:619]     [Backward] Layer fc7, bottom blob fc6 diff: 0
I1018 14:24:40.919765  5500 net.cpp:630]     [Backward] Layer fc7, param blob 0 diff: 0
I1018 14:24:40.919765  5500 net.cpp:630]     [Backward] Layer fc7, param blob 1 diff: 0
I1018 14:24:40.919765  5500 net.cpp:619]     [Backward] Layer drop6, bottom blob fc6 diff: 0
I1018 14:24:40.919765  5500 net.cpp:619]     [Backward] Layer relu6, bottom blob fc6 diff: 0
I1018 14:24:40.936337  5500 net.cpp:619]     [Backward] Layer fc6, bottom blob pool2 diff: 0
I1018 14:24:40.936337  5500 net.cpp:630]     [Backward] Layer fc6, param blob 0 diff: 0
I1018 14:24:40.936337  5500 net.cpp:630]     [Backward] Layer fc6, param blob 1 diff: 0
I1018 14:24:40.936337  5500 net.cpp:619]     [Backward] Layer pool2, bottom blob norm2 diff: 0
I1018 14:24:40.951910  5500 net.cpp:619]     [Backward] Layer norm2, bottom blob conv2 diff: 0
I1018 14:24:40.967483  5500 net.cpp:619]     [Backward] Layer relu2, bottom blob conv2 diff: 0
I1018 14:24:40.967483  5500 net.cpp:619]     [Backward] Layer conv2, bottom blob pool1 diff: 0
I1018 14:24:40.967483  5500 net.cpp:630]     [Backward] Layer conv2, param blob 0 diff: 0
I1018 14:24:40.967483  5500 net.cpp:630]     [Backward] Layer conv2, param blob 1 diff: 0
I1018 14:24:40.967483  5500 net.cpp:619]     [Backward] Layer pool1, bottom blob norm1 diff: 0
I1018 14:24:40.967483  5500 net.cpp:619]     [Backward] Layer norm1, bottom blob conv1 diff: 0
I1018 14:24:40.967483  5500 net.cpp:619]     [Backward] Layer relu1, bottom blob conv1 diff: 0

Answer:

I don't know why your net does not learn. But here are some points you might want to consider:

  1. Your test phase: the test batch_size is 50 and test_iter is 200, meaning you are validating on 50*200 = 10,000 examples. Since you only have 1,800 examples in total, what is the meaning of such a large test_iter value? Look at this thread for more information about this issue.
  2. It seems like you are using the images "as is", meaning your input values are in the range [0..255]. It is very common to subtract the mean from the net's inputs so that the net sees zero-mean inputs.
  3. Consider looking at your training's debug info: does your gradient vanish? Do you have layers that are not "active" (e.g., a layer whose outputs are all negative with a "ReLU" on top is practically inactive)?
  4. Getting a constant loss value suggests that your net predicts only one label regardless of the input; consider shuffling your dataset (a minimal sketch of writing a shuffled LMDB follows below).
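For point 4, a minimal sketch of writing a shuffled training LMDB, based on the script in the question (train.txt, newlmdb and the 260x260 grayscale size come from there; the rest is an assumption). Note that each datum must be written into the transaction with txn.put, and the label must be set before the datum is serialized:

import random
import caffe
import cv2
import lmdb

# read "path label" lines and shuffle them
with open('train.txt', 'r') as f:
    lines = [l.split() for l in f if l.strip()]
random.shuffle(lines)

H, W = 260, 260
map_size = len(lines) * H * W * 10   # generous upper bound, as in the question

env = lmdb.open('newlmdb', map_size=map_size)
with env.begin(write=True) as txn:
    for i, (img_path, label) in enumerate(lines):
        img = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
        datum = caffe.proto.caffe_pb2.Datum()
        datum.channels, datum.height, datum.width = 1, H, W
        datum.data = img.tobytes()           # or .tostring() for numpy < 1.9
        datum.label = int(label)             # set the label before serializing
        txn.put('{:0>8d}'.format(i), datum.SerializeToString())  # actually store it
env.close()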

Question:

I want to create an lmdb dataset from images, some of which contain the feature I want caffe to learn and some of which don't. My question is: in the text input file passed to convert_imageset, how should I label the images that don't contain the feature? I know the format is

PATH_TO_IMAGE LABEL
PATH_TO_IMAGE LABEL
PATH_TO_IMAGE LABEL

But which label should I assign to images without the feature? For example, img1.jpg contains the feature, while img2.jpg and img3.jpg don't. So should the text file look like this:

img1.jpg 0
img2.jpg 1?
img3.jpg 1?

Thanks!


Answer:

Got an answer from the Caffe-users Google Group: yes, creating a dummy label for the images without the feature is the right way to do this. So it is:

img1.jpg 0
img2.jpg 1
img3.jpg 1
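For completeness, a small sketch that generates such a list file from two folders; the folder names with_feature and without_feature are assumptions:

import os

# images containing the feature get label 0, the others get label 1
folders = [('with_feature', 0), ('without_feature', 1)]

with open('train.txt', 'w') as f:
    for folder, label in folders:
        for name in sorted(os.listdir(folder)):
            if name.lower().endswith(('.jpg', '.jpeg', '.png')):
                f.write('{} {}\n'.format(os.path.join(folder, name), label))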

Question:

I am a beginner in Caffe and Python. I installed Caffe and compiled it successfully on Ubuntu 16.04. I created an environment in Anaconda 2 and used CMake for compiling. I ran this code and it printed the caffe version.

$ python -c "import caffe;print caffe.__version__"
1.0.0-rc3

So I suppose that I have installed it correctly. I wanted to have my first experience with caffe, so I followed the instructions in this link. But I am not really familiar with this. It is giving me this error:

~/deeplearning-cats-dogs-tutorial/code$ python create_lmdb.py
Traceback (most recent call last):
  File "create_lmdb.py", line 21, in <module>
    import lmdb
ImportError: No module named lmdb

I would really appreciate it if someone could guide me on how to start running examples and models in caffe.


Answer:

It seems like you need to install the LMDB Python package: https://lmdb.readthedocs.io/en/release/
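The package on PyPI is simply called lmdb, so pip install lmdb inside the same Anaconda environment should be enough. Afterwards, a quick check that the import works:

import lmdb

print(lmdb.version())   # version tuple of the LMDB library bundled with the binding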

Question:

I'm using two LMDB inputs for identifying the eyes, nose tip and mouth regions of a face. The data LMDB has dimensions Nx3xHxW while the label LMDB has dimensions Nx1xH/4xW/4. The label image is created by masking regions using the numbers 1-4 on an OpenCV Mat that was initialized to all 0s (so in total there are 5 labels, with 0 being the background label). I scaled the label image down to 1/4 of the width and height of the corresponding image because I have 2 pooling layers in my net. This downscaling ensures the label image dimensions match the output of the last convolution layer.

My train_val.prototxt:

name: "facial_keypoints"
layer {
name: "images"
type: "Data"
top: "images"
include {
phase: TRAIN
}
transform_param {
mean_file: "../mean.binaryproto"
}
data_param {
source: "../train_lmdb"
batch_size: 100
backend: LMDB
}
}
layer {
name: "labels"
type: "Data"
top: "labels"
include {
phase: TRAIN
}
data_param {
source: "../train_label_lmdb"
batch_size: 100
backend: LMDB
}
}
layer {
name: "images"
type: "Data"
top: "images"
include {
phase: TEST
}
transform_param {
mean_file: "../mean.binaryproto"
}
data_param {
source: "../test_lmdb"
batch_size: 100
backend: LMDB
}
}
layer {
name: "labels"
type: "Data"
top: "labels"
include {
phase: TEST
}
data_param {
source: "../test_label_lmdb"
batch_size: 100
backend: LMDB
}
}
layer {
name: "conv1"
type: "Convolution"
bottom: "images"
top: "conv1"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 32
pad: 2
kernel_size: 5
stride: 1
weight_filler {
type: "gaussian"
std: 0.0001
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "pool1"
type: "Pooling"
bottom: "conv1"
top: "pool1"
pooling_param {
pool: MAX
kernel_size: 3
stride: 2
}
}
layer {
name: "relu1"
type: "ReLU"
bottom: "pool1"
top: "pool1"
}
layer {
name: "conv2"
type: "Convolution"
bottom: "pool1"
top: "conv2"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 64
pad: 2
kernel_size: 5
stride: 1
weight_filler {
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "relu2"
type: "ReLU"
bottom: "conv2"
top: "conv2"
}
layer {
name: "pool2"
type: "Pooling"
bottom: "conv2"
top: "pool2"
pooling_param {
pool: AVE
kernel_size: 3
stride: 2
}
}
layer {
name: "conv_last"
type: "Convolution"
bottom: "pool2"
top: "conv_last"
param {
lr_mult: 1
}
param {
lr_mult: 2
}
convolution_param {
num_output: 5
pad: 2
kernel_size: 5
stride: 1
weight_filler {
#type: "xavier"
type: "gaussian"
std: 0.01
}
bias_filler {
type: "constant"
}
}
}
layer {
name: "relu2"
type: "ReLU"
bottom: "conv_last"
top: "conv_last"
}

layer {
name: "accuracy"
type: "Accuracy"
bottom: "conv_last"
bottom: "labels"
top: "accuracy"
include {
phase: TEST
}
}
layer {
name: "loss"
type: "SoftmaxWithLoss"
bottom: "conv_last"
bottom: "labels"
top: "loss"
}

In the last convolution layer I set the output size to 5 because I have 5 label classes. The training converged with a final loss of about 0.3 and an accuracy of 0.9 (although some sources suggest this accuracy is not correctly measured for multi-label outputs). When using the trained model, the output layer correctly produces a blob of dimension 1x5xH/4xW/4, which I managed to visualize as 5 separate single-channel images. However, while the first image correctly highlighted the background pixels, the remaining 4 images look almost the same, with all 4 regions highlighted.

Visualization of the 5 output channels (intensity increases from blue to red):

Original image (the concentric circles mark the highest-intensity point from each channel; some are drawn bigger just to distinguish them from the others. As you can see, apart from the background, the remaining channels have their highest activations on almost the same mouth region, which should not be the case.)

Could someone help me spot the mistake I made?

Thanks.


Answer:

It seems like you are facing class imbalance: most of your labeled pixels are labeled 0 (background); hence, during training, the net learns to predict background almost regardless of what it "sees". Since predicting background is correct most of the time, the training loss decreases and the accuracy increases up to a certain point. However, when you actually try to visualize the output prediction, it is mostly background, with little information regarding the other, scarce labels.

One way of tackling class imbalance in caffe is to use an "InfogainLoss" layer with weights tuned to counteract the imbalance of the labels.
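"InfogainLoss" expects an information-gain matrix H of shape LxL (here 5x5). A minimal sketch of writing such a matrix as a binaryproto; the per-class weights here are hypothetical and should be tuned from your actual label frequencies:

import caffe
import numpy as np

num_labels = 5
# hypothetical weights: down-weight the dominant background class (label 0)
class_weights = np.array([0.2, 1.0, 1.0, 1.0, 1.0], dtype='f4')

H = np.eye(num_labels, dtype='f4') * class_weights   # diagonal H: one weight per class

blob = caffe.io.array_to_blobproto(H.reshape(1, 1, num_labels, num_labels))
with open('infogain_H.binaryproto', 'wb') as f:
    f.write(blob.SerializeToString())

The loss layer then points at this file through infogain_loss_param { source: "infogain_H.binaryproto" }; depending on your caffe version you may also need an explicit "Softmax" layer in front of "InfogainLoss".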