Hot questions for using neural networks in CNTK

Question:

I've gone through the documentation of Microsoft's open-source AI library CNTK and understood how to create and train neural networks. I've also understood how to "save" the trained results into an output directory.

However, I don't see a way to load those results back into a neural network, and, more importantly: how do I wrap my trained neural network into an application so I can actually use it in production instead of just for academic research?

I want to integrate my neural network into a Python or C# application. How do I wrap it, and how do I create an interface to its input and output?


Answer:

They added wrappers for C# and C++ a short time ago.

C# https://github.com/Microsoft/CNTK/tree/master/Source/Extensibility/CSEvalClient

C++ https://github.com/Microsoft/CNTK/tree/master/Source/Extensibility/EvalWrapper

A Python wrapper is also being worked on. In the meantime, by wrapping the C++ evaluation library (for example with Boost.Python) you can already integrate the C++ solution into a Python application. See: http://www.boost.org/doc/libs/1_49_0/libs/python/doc/
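
If you are on a CNTK release that ships the Python API, loading and evaluating a saved model can also be done directly from Python. A minimal sketch, assuming the CNTK 2.x Python API, a hypothetical file name myModel.dnn and a hypothetical single input of width 2:

import cntk as C
import numpy as np

# Load the trained network that was previously saved to disk
model = C.load_model("myModel.dnn")

# Build one input sample matching the model's input shape (hypothetical here)
sample = np.array([[0.5, 0.25]], dtype=np.float32)

# Run a forward pass; model.arguments[0] is the network's input variable
result = model.eval({model.arguments[0]: sample})
print(result)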

Question:

I installed CNTK version 2.0.beta7 on an Azure NC24 GPU VM running Ubuntu (Python 3.4). The machine has 4 NVIDIA K80 GPUs. Build info:

            Build type: release
            Build target: GPU
            With 1bit-SGD: yes
            With ASGD: yes
            Math lib: mkl
            CUDA_PATH: /usr/local/cuda-8.0
            CUB_PATH: /usr/local/cub-1.4.1
            CUDNN_PATH: /usr/local
            Build Branch: HEAD
            Build SHA1: 8e8b5ff92eff4647be5d41a5a515956907567126
            Built by Source/CNTK/buildinfo.h$$0 on bbdadbf3455d
            Build Path: /home/philly/jenkins/workspace/CNTK-Build-Linux

I was running the CIFAR example in distributed mode:

mpiexec -n 4 python TrainResNet_CIFAR10_Distributed.py -n resnet20 -q 32

Finished Epoch [1]: [Training] loss = 1.675002 * 50176, metric = 62.5% * 50176 112.019s (447.9 samples per second)
Finished Epoch [1]: [Training] loss = 1.675002 * 50176, metric = 62.5% * 50176 112.019s (447.9 samples per second)
Finished Epoch [1]: [Training] loss = 1.675002 * 50176, metric = 62.5% * 50176 112.018s (447.9 samples per second)
Finished Epoch [1]: [Training] loss = 1.675002 * 50176, metric = 62.5% * 50176 112.019s (447.9 samples per second)
Finished Epoch [2]: [Training] loss = 1.247423 * 50176, metric = 45.4% * 50176 8.210s (6111.3 samples per second)
Finished Epoch [2]: [Training] loss = 1.247423 * 50176, metric = 45.4% * 50176 8.210s (6111.4 samples per second)
Finished Epoch [2]: [Training] loss = 1.247423 * 50176, metric = 45.4% * 50176 8.210s (6111.8 samples per second)
Finished Epoch [2]: [Training] loss = 1.247423 * 50176, metric = 45.4% * 50176 8.210s (6111.6 samples per second)
...
...
Finished Epoch [160]: [Training] loss = 0.037745 * 49664, metric = 1.2% * 49664 7.883s (6300.4 samples per second)
Finished Epoch [160]: [Training] loss = 0.037745 * 49664, metric = 1.2% * 49664 7.883s (6299.7 samples per second)
Finished Epoch [160]: [Training] loss = 0.037745 * 49664, metric = 1.2% * 49664 7.884s (6299.7 samples per second)
Finished Epoch [160]: [Training] loss = 0.037745 * 49664, metric = 1.2% * 49664 7.884s (6299.2 samples per second)

However, when I run it with 1-bit SGD I get:

mpiexec -n 4 python TrainResNet_CIFAR10_Distributed.py -n resnet20 -q 1 -a 50000

...
Finished Epoch [160]: [Training] loss = 0.059290 * 49664, metric = 2.1% * 49664 10.055s (4939.1 samples per second)
Finished Epoch [160]: [Training] loss = 0.059290 * 49664, metric = 2.1% * 49664 10.056s (4938.9 samples per second)
Finished Epoch [160]: [Training] loss = 0.059290 * 49664, metric = 2.1% * 49664 10.056s (4938.9 samples per second)
Finished Epoch [160]: [Training] loss = 0.059290 * 49664, metric = 2.1% * 49664 10.056s (4938.9 samples per second)

As explained here, 1-bit SGD should be faster than its non-quantized counterpart. Any help is appreciated.


Answer:

1-bit SGD is an effective strategy when the communication time between GPUs is large compared to the computation time for a minibatch.

There are two "issues" with your experiment above: the model you are training has few parameters (so there isn't much computation per minibatch), and the 4 GPUs are in the same machine (so communication is cheap compared to, say, going over a network). Also, inside a single machine CNTK uses NVIDIA NCCL, which is much better optimized than the generic MPI implementation that 1-bit SGD uses. Update: at the time of this comment, NCCL is not used by default.
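
For reference, the script's -q flag maps to the num_quantization_bits argument of the distributed learner, and -a to the warm-start sample count. A minimal sketch of that wiring, assuming the CNTK 2.x Python API (module paths and argument names may differ slightly between beta releases); the toy model and shapes are hypothetical:

import cntk as C

# Toy model, just to have parameters to train
x = C.input_variable(32)
y = C.input_variable(2)
z = C.layers.Dense(2)(x)
loss = C.cross_entropy_with_softmax(z, y)
metric = C.classification_error(z, y)

local_learner = C.sgd(z.parameters,
                      lr=C.learning_rate_schedule(0.01, C.UnitType.minibatch))

# num_quantization_bits=32 -> plain data-parallel SGD; 1 -> 1-bit SGD
# distributed_after=50000  -> samples processed before the quantized exchange starts
dist_learner = C.train.distributed.data_parallel_distributed_learner(
    local_learner,
    num_quantization_bits=1,
    distributed_after=50000)

trainer = C.Trainer(z, (loss, metric), [dist_learner])
# As with the example script, training itself must be launched under mpiexec.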

Question:

I've gone through some of the CNTK Python tutorials and I'm trying to write an extremely basic one layer neural network that can compute a logical AND. I have functioning code, but the network isn't learning - in fact, loss gets worse and worse with each minibatch trained.

import numpy as np
from cntk import Trainer
from cntk.learner import sgd
from cntk import ops
from cntk.utils import get_train_eval_criterion, get_train_loss

input_dimensions = 2
# Define the training set
input_data = np.array([
    [0, 0], 
    [0, 1],
    [1, 0],
    [1, 1]], dtype=np.float32)

# Each index matches with an index in input data
correct_answers = np.array([[0], [0], [0], [1]])

# Create the input layer
net_input = ops.input_variable(2, np.float32)
weights = ops.parameter(shape=(2, 1))
bias = ops.parameter(shape=(1))

network_output = ops.times(net_input, weights) + bias

# Set up training
expected_output = ops.input_variable((1), np.float32)
loss_function = ops.cross_entropy_with_softmax(network_output, expected_output)
eval_error = ops.classification_error(network_output, expected_output)

learner = sgd(network_output.parameters, lr=0.02)
trainer = Trainer(network_output, loss_function, eval_error, [learner])

minibatch_size = 4
num_samples_to_train = 1000
num_minibatches_to_train = int(num_samples_to_train/minibatch_size)
training_progress_output_freq = 20

def print_training_progress(trainer, mb, frequency, verbose=1):
    training_loss, eval_error = "NA", "NA"

    if mb % frequency == 0:
        training_loss = get_train_loss(trainer)
        eval_error = get_train_eval_criterion(trainer)
        if verbose:
            print("Minibatch: {0}, Loss: {1:.4f}, Error: {2:.2f}".format(
            mb, training_loss, eval_error))

    return mb, training_loss, eval_error


for i in range(0, num_minibatches_to_train):
    trainer.train_minibatch({net_input: input_data, expected_output: correct_answers})
    batchsize, loss, error = print_training_progress(trainer, i, training_progress_output_freq, verbose=1)

Sample training output:

Minibatch: 0, Loss: -164.9998, Error: 0.75
Minibatch: 20, Loss: -166.0998, Error: 0.75
Minibatch: 40, Loss: -167.1997, Error: 0.75
Minibatch: 60, Loss: -168.2997, Error: 0.75
Minibatch: 80, Loss: -169.3997, Error: 0.75
Minibatch: 100, Loss: -170.4996, Error: 0.75
Minibatch: 120, Loss: -171.5996, Error: 0.75
Minibatch: 140, Loss: -172.6996, Error: 0.75
Minibatch: 160, Loss: -173.7995, Error: 0.75
Minibatch: 180, Loss: -174.8995, Error: 0.75
Minibatch: 200, Loss: -175.9995, Error: 0.75
Minibatch: 220, Loss: -177.0994, Error: 0.75
Minibatch: 240, Loss: -178.1993, Error: 0.75

I'm not really sure what's going on here. The error is stuck at 0.75, which, I think, means the network is performing no better than chance. I'm uncertain whether I've misunderstood a requirement of ANN architecture or whether I'm misusing the library.

Any help would be appreciated.


Answer:

You are trying to solve a binary classification problem with a softmax as your final layer. The softmax layer is not the right layer here; it is only effective for multiclass (classes >= 3) problems.

For binary classification problems you should do the following two modifications:

  • Add a sigmoid layer to your output (this will make your output look like a probability)
  • Use binary_cross_entropy as your criterion (you will have to be on at least this release)
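
Applied to the code in the question, the two changes look roughly like this (a minimal sketch; it assumes a CNTK release whose ops module exposes binary_cross_entropy and keeps the rest of the question's setup unchanged):

import numpy as np
from cntk import ops

net_input = ops.input_variable(2, np.float32)
expected_output = ops.input_variable(1, np.float32)
weights = ops.parameter(shape=(2, 1))
bias = ops.parameter(shape=(1))

# Change 1: squash the linear output with a sigmoid so it reads as a probability
network_output = ops.sigmoid(ops.times(net_input, weights) + bias)

# Change 2: use binary cross entropy as the training criterion
loss_function = ops.binary_cross_entropy(network_output, expected_output)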

Question:

When using an AlexNet neural network, be it with Caffe or CNTK, it needs a mean file as input. What is this mean file for? How does it affect the training? How is it generated, only from the training samples?


Answer:

Mean subtraction removes the DC component from the images. Geometrically, it centers the cloud of data around the origin along every dimension. It reduces the correlation between images, which improves training; in my experience it improves training accuracy significantly. The mean is computed from the training data only; computing it from the test data makes no sense.
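
As an illustration (plain NumPy, with hypothetical array names and shapes), the mean image is computed once over the training set and that same mean is then subtracted from every image, at both training and evaluation time:

import numpy as np

# Hypothetical training set: 50,000 RGB images of 32x32 pixels
train_images = np.random.rand(50000, 3, 32, 32).astype(np.float32)

# Per-pixel mean over the training set only
mean_image = train_images.mean(axis=0)

# Subtract the same mean from training images...
train_centered = train_images - mean_image

# ...and from validation/test images at evaluation time
test_images = np.random.rand(10000, 3, 32, 32).astype(np.float32)
test_centered = test_images - mean_image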

Question:

Is it possible to create a "conditional" network in CNTK and apply it only on one of the inputs depending on another input variable? See the following code:

import cntk as ct

a_in = ct.input_variable(shape=[16,16])
b_in = ct.input_variable(shape=[16,16])
flag = ct.input_variable(shape=[])

a_branch = ct.layers.Sequential([...])
b_branch = ct.layers.Sequential([...])

sel_branch = ct.element_select(flag, a_branch, b_branch)

out = sel_branch(a_in, b_in)

However, this doesn't work, since sel_branch expects 3 arguments rather than only the ones required by either a_branch or b_branch (which is to be expected, since I am using element_select the wrong way here).

Keep in mind that the objective is to avoid executing both branches.


Answer:

The answer is no: at this moment there is no conditional execution in CNTK. In the general case flag is a vector/tensor, and some of its elements would be 0 while others would be 1. There is an obvious optimization when all the elements have the same value, but it is not implemented. However, even if it were implemented, the signature of sel_branch would still require 3 arguments, because the signature is a "compile-time" property, while the aforementioned optimization can only be decided at runtime. Even in your case, where flag is a scalar, it might be 0 in one batch and 1 in another, and the signature of sel_branch cannot change from batch to batch.
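
For completeness, the pattern that does work (but is not lazy) evaluates both branches and selects per element. A minimal sketch, with hypothetical Dense layers standing in for the question's Sequential branches:

import cntk as ct

a_in = ct.input_variable(shape=(16, 16))
b_in = ct.input_variable(shape=(16, 16))
flag = ct.input_variable(shape=(16, 16))   # elementwise selector, 0 or 1 per element

# Stand-ins for the question's Sequential branches
a_branch = ct.layers.Dense((16, 16))(a_in)
b_branch = ct.layers.Dense((16, 16))(b_in)

# Both branches are computed; element_select then picks per element based on flag
out = ct.element_select(flag, a_branch, b_branch)

Note that out still requires all three inputs (a_in, b_in and flag) to be fed, which matches the 3-argument signature described above.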