Hot questions for Using Neural networks in frameworks

Question:

Hello, I am trying to create a program that will calculate weights using the perceptron algorithm. I have it all working right now, but only with the very basics: it uses a step function and it is single-layered. Before I move on to multi-layer networks, I am trying to make it optimal and switch to a sigmoid activation.

So those are my questions. I have a general idea of the sigmoid, but I can't find any information on how to make the line optimal, i.e. equidistant from the data points. Does anyone have any ideas?

double net = 0.0;
for (int i = 0; i < N; i++)
{
    net += x[i] * w[i];               // Accumulate the weighted sum of inputs
}

if (net >= threshold) output =  1;    // Step activation: +1 at or above threshold,
if (net <  threshold) output = -1;    // -1 below it

This is my current code for finding the "y" variable (I called it output here); it uses a simple step function with threshold = 0. How can I convert this to a sigmoid?


Answer:

The following may work:

output = 1 / (1 + exp(-net));

You may also like the tanh(net) or net / (1 + abs(net)) functions, which are much faster according to this answer.
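
For reference, here is a minimal Python sketch of those three activations (the function names are mine, not from any library):

    import math

    def sigmoid(net):
        # Logistic sigmoid: smooth, output in (0, 1)
        return 1.0 / (1.0 + math.exp(-net))

    def tanh_activation(net):
        # Hyperbolic tangent: smooth, output in (-1, 1)
        return math.tanh(net)

    def fast_sigmoid(net):
        # Cheap approximation, output in (-1, 1); avoids exp()
        return net / (1.0 + abs(net))

Note that the sigmoid outputs values in (0, 1) rather than the -1/1 of your step function, so your labels (or a tanh-style rescaling) need to match.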

Question:

I'm trying to model a technical process (a number of nonlinear equations) with artificial neural networks. The function has a number of inputs and a number of outputs (e.g. 50 inputs, 150 outputs - all floats).

I have tried the python library ffnet (wrapper for a fortran library) with great success. The errors for a certain dataset are well below 0.2%.

It uses a fully connected graph with the following additional parameters.

Basic assumptions and limitations:
    Network has feed-forward architecture.
    Input units have identity activation function, all other units have sigmoid activation function.
    Provided data are automatically normalized, both input and output, with a linear mapping to the range (0.15, 0.85). Each input and output is treated separately (i.e. the linear map is unique for each input and output; see the sketch after this list).
    Function minimized during training is a sum of squared errors of each output for each training pattern.
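
A minimal numpy sketch of that per-column linear normalization, assuming the data is a 2-D array with one sample per row (the function name and signature are mine):

    import numpy as np

    def normalize(data, lo=0.15, hi=0.85):
        # Map each column linearly into the range (lo, hi), using a
        # separate min/max per column, as ffnet does.
        mins = data.min(axis=0)
        maxs = data.max(axis=0)
        return lo + (data - mins) * (hi - lo) / (maxs - mins)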

I am using one input layer, one hidden layer (size: 2/3 of the input vector size plus the size of the output vector) and an output layer. I'm using the scipy conjugate gradient optimizer.
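
For the example sizes above, that heuristic gives a concrete hidden-layer size (a quick sketch; rounding down is my assumption):

    n_in, n_out = 50, 150                  # example sizes from above
    n_hidden = (2 * n_in) // 3 + n_out     # 2/3 of the inputs plus the outputs
    print(n_hidden)                        # -> 183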

The downside of ffnet is the long training time and the lack of GPU support. Therefore I want to switch to a different framework and have chosen Keras with TensorFlow as the backend.

I have tried to model the previous configuration:

from keras.models import Sequential
from keras.layers import Dense, Activation
from keras.layers.normalization import BatchNormalization

model = Sequential()
model.add(Dense(n_hidden, input_dim=n_in))
model.add(BatchNormalization())
model.add(Dense(n_hidden))
model.add(Activation('sigmoid'))
model.add(Dense(n_out))
model.add(Activation('sigmoid'))
model.summary()
model.compile(loss='mean_squared_error',
              optimizer='Adamax',
              metrics=['accuracy'])

However, the results are far worse: the error is up to 0.5% after a few thousand (!) epochs of training, whereas the ffnet training stopped automatically at 292 epochs. Furthermore, the differences between the network response and the validation targets are not centered around 0, but are mostly negative. I have tried all the optimizers and different loss functions. I have also skipped the BatchNormalization layer and normalized the data manually in the same way ffnet does. Nothing helps.

Does anyone have a suggestion for obtaining better results with Keras?


Answer:

I understand you are trying to re-train the same architecture from scratch with a different library. The first fundamental issue to keep in mind here is that neural nets are not necessarily reproducible when the weights are initialized randomly.

For example, here is the default constructor parameter for Dense in Keras:

init='glorot_uniform'
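
If you want run-to-run reproducibility while you experiment, fixing the seeds before building the model helps; a sketch, assuming the TensorFlow 1.x backend of that era:

    import numpy as np
    import tensorflow as tf

    np.random.seed(42)        # seeds the numpy-based weight initializers
    tf.set_random_seed(42)    # seeds TensorFlow's graph-level RNG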

But even before trying to evaluate how well the Keras optimization converges, I would recommend porting the weights for which you got good results from ffnet into your Keras model. You can do so either with the kwarg Dense(..., weights=) for each layer, or globally at the end with model.set_weights(...).

Using the same weights should yield the exact same result between the two libraries, barring floating-point rounding issues. I believe that until the ported weights reproduce the ffnet results consistently, working on the optimization is unlikely to help.
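
For illustration, a minimal sketch of the global approach, assuming a plain two-Dense-layer architecture (without the BatchNormalization layer, which carries its own parameters); the ffnet export itself is up to you:

    import numpy as np

    n_in, n_hidden, n_out = 50, 183, 150   # example sizes; adjust to your model

    # Placeholders: replace these with the arrays exported from ffnet.
    # Keras expects, per Dense layer, a (n_inputs, n_units) kernel matrix
    # followed by an (n_units,) bias vector, in layer order.
    W_hidden = np.zeros((n_in, n_hidden))
    b_hidden = np.zeros(n_hidden)
    W_out    = np.zeros((n_hidden, n_out))
    b_out    = np.zeros(n_out)

    model.set_weights([W_hidden, b_hidden, W_out, b_out])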

Question:

Sorry for this rather simple question; however, there is still little documentation on the usage of Microsoft's open-source AI library CNTK.

I keep seeing people set the reader's features start to 1 while setting the labels start to 0. But shouldn't both of them always be 0, since indexing in computer science always starts from zero? Wouldn't we lose one piece of information this way?

Example from CIFAR10 02_BatchNormConv:

    features=[
        # dimension = 3 (rgb) * 32 (width) * 32 (height)
        dim=3072
        start=1
    ]
    labels=[
        dim=1
        start=0
        labelDim=10
        labelMappingFile=$DataDir$/labelsmap.txt
    ]
Update: New format

Microsoft has recently updated this in order to get rid of this confusion and to make the CNTK definition language more readable.

Instead of having to define where in the line each block of values starts, you can now simply tag the type of data inline in the dataset itself:

|labels <tab separated values> |features <tab separated values> [EndOfLine/EOL]

If you want to reverse the order of features and labels, you can simply write:

|features <tab separated values> |labels <tab separated values> [EndOfLine/EOL]

You still have to define the dim value, in order to specify the number of values you want to input.

Note: There's no | at the end of the row. EOL indicates the end of the row.
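
For example, a single row for a 10-class problem (a one-hot label) with four feature values might look like this (the numbers are purely illustrative):

    |labels 0 0 0 1 0 0 0 0 0 0 |features 0.2 0.4 0.1 0.9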

For more information visit the CNTK Wiki on this topic.


Answer:

You are misunderstanding how the reader works. The UCIFastReader works on a file which contains tab-separated feature vectors. Each line in this file corresponds to one entry (an image in this case) together with its classification.

So dim tells the reader how many columns to read, while start tells it from which column to start reading. If you had a 2x2 image with 2 label columns for each entry, your file could be of the form <image_pixel_columns><label_columns>:

0 0 0 0 0 0
0 0 1 0 1 0
...

So the first 4 entries in the line are your features (image pixel values), and the last two are your labels. Your reader would be of the form:

reader=[
    readerType=UCIFastReader
    file=$DataDir$/Train.txt
    randomize=None
    features=[
        dim=4
        start=0
    ]
    labels=[
        dim=2
        start=4
        labelDim=10
        labelMappingFile=$DataDir$/labelsmap.txt
    ]
]

It's just that all examples given have the label placed in the first column.

Question:

I'm new to deep learning. I followed some course materials on the internet and found that they use deep learning frameworks like Keras and TensorFlow to build deep neural networks, while other course materials build deep neural networks from scratch rather than using frameworks. Since I am new to deep learning, I want to know which is the better approach for me: using frameworks to build neural networks, or building them from scratch. Is there anything I would miss if I used frameworks directly rather than building from scratch?

I know that frameworks like Keras were developed for us to use. But my concern is: if we depend on those frameworks, will we miss the basic theory behind them, since I am new to this field?


Answer:

Yes, it's fine; otherwise, what would be the purpose of making those frameworks in the first place?

As you found out, building neural networks from scratch is not easy. At least by using Keras or other frameworks, you know that they come with some degree of validation that they work correctly.

Question:

I want to create my own simple CNN, but I need some reference implementations. Can you share links or articles where I can find ready implementations of CNNs (without any frameworks like Keras, but maybe with numpy/scipy), where I can see the implementation of each operation, like matrix multiplication and so on?


Answer:

Yes, it is possible to implement your own bare-bones CNN without the help of any frameworks like Keras, TF, etc. You can check out this simple implementation of CNNs using numpy/cython and the code repository here.
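
To give a flavour of what such an implementation looks like, here is a minimal numpy sketch of the core convolution operation (a naive 'valid' cross-correlation, which is what most CNN code actually computes; the function name is mine):

    import numpy as np

    def conv2d_valid(image, kernel):
        # Naive 2-D 'valid' cross-correlation of a single-channel image
        # with a single kernel; no padding, stride 1.
        H, W = image.shape
        kH, kW = kernel.shape
        out = np.zeros((H - kH + 1, W - kW + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                # Element-wise product of the image patch and the kernel, summed.
                out[i, j] = np.sum(image[i:i+kH, j:j+kW] * kernel)
        return out

A full CNN layer repeats this for every filter and input channel, adds a bias, and applies a nonlinearity.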

Question:

I just don't understand what kind of filter the Keras framework uses for a convolutional neural network in the following line of code. Is it for horizontal edge detection, vertical edges, any edge, or some other feature? Here it is 32 filters of size 7x7 with a stride of 1, which we convolve with X:

x= Conv2D(32, (7, 7), strides = (1, 1), name = 'conv0')(X)

Answer:

Convolutional filters are not predisposed to any particular feature; rather, they "learn" their duties through training. The features evolve organically, depending on what enhances the prediction accuracy at the far end of the model. The model will gradually learn which features work well for the given inputs, driven by the ground truth and back-propagation.

The critical trick in this is a combination of back prop and initialization. When we randomly initialize the filters, the important part isn't so much what distribution we choose; rather, it's that there are some differences, so that the filters will differentiate well.

For instance, in typical visual-processing applications, the model's first layer (taking the conv0 label as a hint) will learn simple features: lines, curves, colour blobs, etc. Whatever filter happens to be initialized closest to a vertical-line detector will eventually evolve into exactly that filter: early in training it receives the strongest reinforcement from back-propagation's "need" for vertical lines. Filters that are weaker at verticals get less reinforcement, see their weights reduced (since our "star pupil" is sufficient to drive the vertical-line needs), and eventually evolve to recognize some other feature.

Overall, the filters will evolve into a set of distinct features, as needed by the eventual output. One brute-force method of finding the correct quantity of features is to put in too many -- see how many of them learn something useful, then reduce the quantity until you have clean differentiation on a minimal set of filters. In the line of code you present, someone has already done this, and found that CONV0 needs about 32 filters for this topology and application.
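
If you are curious what the filters have become, you can inspect the learned kernels after training (a sketch, assuming the Conv2D layer above lives in a trained Keras model named model):

    # Fetch the learned weights of the layer named 'conv0'.
    kernels, biases = model.get_layer('conv0').get_weights()
    print(kernels.shape)   # (7, 7, n_input_channels, 32): a 7x7 patch per
                           # input channel, for each of the 32 filters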

Does that clear up the meaning?