Hot questions for Using Neural networks in convergence


I have written a basic program to understand what's happening in an MLP classifier.

from sklearn.neural_network import MLPClassifier

data: a dataset of body metrics (height, weight, and shoe size) labeled male or female:

X = [[181, 80, 44], [177, 70, 43], [160, 60, 38], [154, 54, 37], [166, 65, 40],
     [190, 90, 47], [175, 64, 39],
     [177, 70, 40], [159, 55, 37], [171, 75, 42], [181, 85, 43]]
y = ['male', 'male', 'female', 'female', 'male', 'male', 'female', 'female',
     'female', 'male', 'male']

prepare the model:

clf = MLPClassifier(hidden_layer_sizes=(3,), activation='logistic',
                    solver='adam', alpha=0.0001, learning_rate='constant')

clf = clf.fit(X, y)

attributes of the learned classifier:

print('current loss computed with the loss function: ',clf.loss_)
print('coefs: ', clf.coefs_)
print('intercepts: ',clf.intercepts_)
print(' number of iterations the solver: ', clf.n_iter_)
print('num of layers: ', clf.n_layers_)
print('Num of o/p: ', clf.n_outputs_)


print('prediction: ', clf.predict([  [179, 69, 40],[175, 72, 45] ]))

calc. accuracy

print( 'accuracy: ',clf.score( [ [179, 69, 40],[175, 72, 45] ], ['female','male'], sample_weight=None ))
RUN1:
current loss computed with the loss function:  0.617580287851
coefs:  [array([[ 0.17222046, -0.02541928,  0.02743722],
       [-0.19425909,  0.14586716,  0.17447281],
       [-0.4063903 ,  0.148889  ,  0.02523247]]), array([[-0.66332919],
       [ 0.04249613],
intercepts:  [array([-0.05611057,  0.32634023,  0.51251098]), array([ 0.17996649])]
 number of iterations the solver:  200
num of layers:  3
Num of o/p:  1
prediction:  ['female' 'male']
accuracy:  1.0
/home/anubhav/anaconda3/envs/mytf/lib/python3.6/site-packages/sklearn/neural_network/ ConvergenceWarning: Stochastic Optimizer: Maximum iterations reached and the optimization hasn't converged yet.
  % (), ConvergenceWarning)
RUN2:
current loss computed with the loss function:  0.639478303643
coefs:  [array([[ 0.02300866,  0.21547873, -0.1272455 ],
       [-0.2859666 ,  0.40159542,  0.55881399],
       [ 0.39902066, -0.02792529, -0.04498812]]), array([[-0.64446013],
       [ 0.60580985],
intercepts:  [array([-0.10482234,  0.0281211 , -0.16791644]), array([-0.19614561])]
 number of iterations the solver:  39
num of layers:  3
Num of o/p:  1
prediction:  ['female' 'female']
accuracy:  0.5
RUN3:
current loss computed with the loss function:  0.691966937074
coefs:  [array([[ 0.21882191, -0.48037975, -0.11774392],
       [-0.15890357,  0.06887471, -0.03684797],
       [-0.28321762,  0.48392007,  0.34104955]]), array([[ 0.08672174],
       [ 0.1071615 ],
intercepts:  [array([-0.36606747,  0.21969636,  0.10138625]), array([-0.05670653])]
 number of iterations the solver:  4
num of layers:  3
Num of o/p:  1
prediction:  ['male' 'male']
accuracy:  0.5
RUN4:
current loss computed with the loss function:  0.697102567593
coefs:  [array([[ 0.32489731, -0.18529689, -0.08712877],
       [-0.35425908,  0.04214241,  0.41249622],
       [-0.19993622, -0.38873908, -0.33057999]]), array([[ 0.43304555],
       [ 0.37959392],
       [ 0.55998979]])]
intercepts:  [array([ 0.11555407, -0.3473817 , -0.16852093]), array([ 0.31326347])]
 number of iterations the solver:  158
num of layers:  3
Num of o/p:  1
prediction:  ['male' 'male']
accuracy:  0.5

I have the following questions:

1. Why did the optimizer not converge in RUN1?
2. Why did the number of iterations suddenly become so low in RUN3 and so high in RUN4?
3. What else can be done to increase the accuracy I got in RUN1?


1. Your MLP didn't converge: the algorithm optimizes by stepping iteratively toward a minimum, and in run 1 no minimum was found within the iteration limit (the 200 iterations reported match the default max_iter=200 of MLPClassifier).

2. Difference between runs: the MLP starts from random initial weights, so you don't get the same result on every run, as you can see in your output. It seems that you started very close to a minimum in the run that stopped after only 4 iterations. You can set the random_state parameter of your MLP to a constant, e.g. random_state=0, to get the same result over and over.
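
As a rough illustration of why a fixed seed makes runs repeatable, here is a minimal sketch in plain Python. The function init_weights and its uniform range are hypothetical stand-ins; they do not reproduce scikit-learn's actual initialization, only the idea behind random_state.

```python
import random

def init_weights(n_inputs, n_hidden, seed=None):
    """Draw a small random weight matrix; `seed` plays the role of random_state."""
    rng = random.Random(seed)   # seed=None: a different starting point on each call
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
            for _ in range(n_inputs)]

# Same seed: identical starting weights, hence identical training runs.
print(init_weights(3, 3, seed=0) == init_weights(3, 3, seed=0))
# Different seed: a different starting point, hence possibly a different result.
print(init_weights(3, 3, seed=0) == init_weights(3, 3, seed=1))
```

Because gradient-based training is deterministic once the starting weights and the sample order are fixed, pinning the seed is enough to make the loss, iteration count and predictions repeat.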

3 is the most difficult point. You can optimize parameters with

from sklearn.model_selection import GridSearchCV

GridSearchCV splits your data set into equally sized parts (folds), uses one part as test data and the rest as training data, and rotates through the parts. So for every parameter combination it fits as many classifiers as there are parts.
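
The splitting step can be sketched in plain Python as follows. This is a simplified illustration with a hypothetical helper name (k_fold_indices); scikit-learn additionally stratifies the folds by class label when the estimator is a classifier, which this sketch ignores.

```python
def k_fold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds contiguous (train, test) index pairs."""
    # Fold sizes differ by at most one when n_samples is not divisible by n_folds.
    base, extra = divmod(n_samples, n_folds)
    folds = []
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds

# 11 samples (as in the question) split into 3 folds.
for train_idx, test_idx in k_fold_indices(11, 3):
    print("train:", train_idx, "test:", test_idx)
```

With only 11 samples each test fold holds 3 or 4 examples, so the cross-validated scores will be noisy.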

You need to specify the number of parts to split into (your data set is small, so I suggest 2 or 3), a classifier (your MLP), and a grid of parameters you want to optimize, like this:

param_grid = [
        {
            'activation' : ['identity', 'logistic', 'tanh', 'relu'],
            'solver' : ['lbfgs', 'sgd', 'adam'],
            'hidden_layer_sizes': [
             (1,),(2,),(3,),(4,),(5,),(6,),(7,),(8,),(9,),(10,),(11,), (12,),(13,),(14,),(15,),(16,),(17,),(18,),(19,),(20,),(21,)
             ]
        }
       ]

Because you once got 100 percent accuracy with a hidden layer of three neurons, you can try to optimize parameters like the learning rate and momentum instead of the hidden layer size.

Use Gridsearch like that:

clf = GridSearchCV(MLPClassifier(), param_grid, cv=3)
clf.fit(X, y)

print("Best parameters set found on development set:")
print(clf.best_params_)


The code for the network below works okay, but it's too slow. This site implies that the network should get 99% accuracy after 100 epochs with a learning rate of 0.2, while my network never gets past 97% even after 1900 epochs.

Epoch 0, Inputs [0 0], Outputs [-0.83054376], Targets [0]
Epoch 100, Inputs [0 1], Outputs [ 0.72563824], Targets [1]
Epoch 200, Inputs [1 0], Outputs [ 0.87570863], Targets [1]
Epoch 300, Inputs [0 1], Outputs [ 0.90996706], Targets [1]
Epoch 400, Inputs [1 1], Outputs [ 0.00204791], Targets [0]
Epoch 500, Inputs [0 1], Outputs [ 0.93396672], Targets [1]
Epoch 600, Inputs [0 0], Outputs [ 0.00006375], Targets [0]
Epoch 700, Inputs [0 1], Outputs [ 0.94778227], Targets [1]
Epoch 800, Inputs [1 1], Outputs [-0.00149935], Targets [0]
Epoch 900, Inputs [0 0], Outputs [-0.00122716], Targets [0]
Epoch 1000, Inputs [0 0], Outputs [ 0.00457281], Targets [0]
Epoch 1100, Inputs [0 1], Outputs [ 0.95921556], Targets [1]
Epoch 1200, Inputs [0 1], Outputs [ 0.96001748], Targets [1]
Epoch 1300, Inputs [1 0], Outputs [ 0.96071742], Targets [1]
Epoch 1400, Inputs [1 1], Outputs [ 0.00110912], Targets [0]
Epoch 1500, Inputs [0 0], Outputs [-0.00012382], Targets [0]
Epoch 1600, Inputs [1 0], Outputs [ 0.9640324], Targets [1]
Epoch 1700, Inputs [1 0], Outputs [ 0.96431516], Targets [1]
Epoch 1800, Inputs [0 1], Outputs [ 0.97004973], Targets [1]
Epoch 1900, Inputs [1 0], Outputs [ 0.96616225], Targets [1]

The dataset I'm using is:

0 0 0
1 0 1
0 1 1
1 1 1

The training set is read using a function in a helper file, but that isn't relevant to the network.

import numpy as np
import helper

FILE_NAME = 'data.txt'
EPOCHS = 2000


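class Classifier:
    def __init__(self, layer_sizes):

        self.activ = helper.tanh
        self.dactiv = helper.dtanh

        # one dict per layer, holding its weights and biases
        network = list()
        for i in range(1, len(layer_sizes)):
            layer = dict()
            layer['weights'] = np.random.randn(layer_sizes[i], layer_sizes[i-1])
            layer['biases'] = np.random.randn(layer_sizes[i])
            network.append(layer)

        self.network = network

    def forward_propagate(self, x):
        for i in range(0, len(self.network)):
            self.network[i]['outputs'] = self.network[i]['weights'].dot(x) + self.network[i]['biases']
            if i != len(self.network)-1:
                # hidden layer: its activation is the input of the next layer
                self.network[i]['outputs'] = x = self.activ(self.network[i]['outputs'])
            else:
                self.network[i]['outputs'] = self.activ(self.network[i]['outputs'])
        return self.network[-1]['outputs']

    def backpropagate_error(self, x, targets):
        self.forward_propagate(x)
        self.network[-1]['deltas'] = (self.network[-1]['outputs'] - targets) * self.dactiv(self.network[-1]['outputs'])
        for i in reversed(range(len(self.network)-1)):
            self.network[i]['deltas'] = self.network[i+1]['deltas'].dot(self.network[i+1]['weights'] * self.dactiv(self.network[i]['outputs']))

    def adjust_weights(self, inputs, learning_rate):
        self.network[0]['weights'] -= learning_rate * np.atleast_2d(self.network[0]['deltas']).T.dot(np.atleast_2d(inputs))
        self.network[0]['biases'] -= learning_rate * self.network[0]['deltas']
        for i in range(1, len(self.network)):
            self.network[i]['weights'] -= learning_rate * np.atleast_2d(self.network[i]['deltas']).T.dot(np.atleast_2d(self.network[i-1]['outputs']))
            self.network[i]['biases'] -= learning_rate * self.network[i]['deltas']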

    def train(self, inputs, targets, epochs, testfreq, lrate):
        for epoch in range(epochs):
            i = np.random.randint(0, len(inputs))
            if epoch % testfreq == 0:
                predictions = self.forward_propagate(inputs[i])
                print('Epoch %s, Inputs %s, Outputs %s, Targets %s' % (epoch, inputs[i], predictions, targets[i]))
            self.backpropagate_error(inputs[i], targets[i])
            self.adjust_weights(inputs[i], lrate)

inputs, outputs = helper.readInput(FILE_NAME, INPUT_SIZE, OUTPUT_SIZE)
print('Input data: {0}'.format(inputs))
print('Output targets: {0}\n'.format(outputs))


nn.train(inputs, outputs, EPOCHS, TESTING_FREQ, LEARNING_RATE)


The main bug is that you are doing the forward pass only 20% of the time, i.e. when epoch % testfreq == 0:

for epoch in range(epochs):
  i = np.random.randint(0, len(inputs))
  if epoch % testfreq == 0:
    predictions = self.forward_propagate(inputs[i])
    print('Epoch %s, Inputs %s, Outputs %s, Targets %s' % (epoch, inputs[i], predictions, targets[i]))
  self.backpropagate_error(inputs[i], targets[i])
  self.adjust_weights(inputs[i], lrate)

When I take predictions = self.forward_propagate(inputs[i]) out of if, I get much better results faster:

Epoch 100, Inputs [0 1], Outputs [ 0.80317447], Targets 1
Epoch 105, Inputs [1 1], Outputs [ 0.96340466], Targets 1
Epoch 110, Inputs [1 1], Outputs [ 0.96057278], Targets 1
Epoch 115, Inputs [1 0], Outputs [ 0.87960599], Targets 1
Epoch 120, Inputs [1 1], Outputs [ 0.97725825], Targets 1
Epoch 125, Inputs [1 0], Outputs [ 0.89433666], Targets 1
Epoch 130, Inputs [0 0], Outputs [ 0.03539024], Targets 0
Epoch 135, Inputs [0 1], Outputs [ 0.92888141], Targets 1

Also, note that the term epoch usually means a single pass over all of your training data, in your case 4 samples. So, in fact, you are running 4 times fewer epochs.


I didn't pay attention to the details and, as a result, missed a few subtle yet important points:

  • the training data in the question represents OR, not XOR, so my results above are for learning OR operation;
  • backward pass executes forward pass as well (so it's not a bug, rather a surprising implementation detail).

Knowing this, I've updated the data and checked the script once again. Running the training for 10000 iterations gave ~0.001 average error, so the model is learning, just not as fast as it could.

A simple neural network (without embedded normalization mechanism) is pretty sensitive to particular hyperparameters, such as initialization and the learning rate. I tried various values manually and here's what I've got:

# slightly bigger learning rate
# slightly bigger init variation of weights
layer['weights'] = np.random.randn(layer_sizes[i], layer_sizes[i-1]) * 2.0

This gives the following performance:

Epoch 960, Inputs [1 1], Outputs [ 0.01392014], Targets 0
Epoch 970, Inputs [0 0], Outputs [ 0.04342895], Targets 0
Epoch 980, Inputs [1 0], Outputs [ 0.96471654], Targets 1
Epoch 990, Inputs [1 1], Outputs [ 0.00084511], Targets 0
Epoch 1000, Inputs [0 0], Outputs [ 0.01585915], Targets 0
Epoch 1010, Inputs [1 1], Outputs [-0.004097], Targets 0
Epoch 1020, Inputs [1 1], Outputs [ 0.01898956], Targets 0
Epoch 1030, Inputs [0 0], Outputs [ 0.01254217], Targets 0
Epoch 1040, Inputs [1 1], Outputs [ 0.01429213], Targets 0
Epoch 1050, Inputs [0 1], Outputs [ 0.98293925], Targets 1
Epoch 1920, Inputs [1 1], Outputs [-0.00043072], Targets 0
Epoch 1930, Inputs [0 1], Outputs [ 0.98544288], Targets 1
Epoch 1940, Inputs [1 0], Outputs [ 0.97682002], Targets 1
Epoch 1950, Inputs [1 0], Outputs [ 0.97684186], Targets 1
Epoch 1960, Inputs [0 0], Outputs [-0.00141565], Targets 0
Epoch 1970, Inputs [0 0], Outputs [-0.00097559], Targets 0
Epoch 1980, Inputs [0 1], Outputs [ 0.98548381], Targets 1
Epoch 1990, Inputs [1 0], Outputs [ 0.97721286], Targets 1

The average accuracy is close to 98.5% after 1000 iterations and 99.1% after 2000 iterations. It's a bit slower than promised, but good enough. I'm sure it can be tuned further, but that's not the goal of this toy exercise. After all, tanh is not the best activation function, and classification problems are better solved with cross-entropy loss (rather than L2 loss). So I wouldn't worry too much about the performance of this particular network and would go on to logistic regression. That will definitely be better in terms of learning speed.


I'm building a neural network for image classification/recognition. There are 1000 images (30x30 greyscale) for each of the 10 classes. Images of different classes are placed in different folders. I'm planning to use the back-propagation algorithm to train the net.

  1. Does the order in which I feed training examples into the net affect its convergence?
  2. Should I feed training examples in random order?


First, I will answer your questions:

  1. Yes, it will affect its convergence.
  2. Yes, it is encouraged to do that; it's called randomized arrangement.

But why?

referenced from here

A common example in most ANN software is the Iris data set, which has 150 instances covering three types of Iris flowers (Setosa, Versicolor, and Virginica). The data set contains measurements of four variables (sepal length and width, and petal length and width). The cases are arranged so that the first 50 cases belong to Setosa, cases 51-100 belong to Versicolor, and the rest belong to Virginica. What you do not want to do is present them to the network in that order. In other words, you do not want the network to see all 50 instances of the Setosa class, then all 50 of the Versicolor class, then all 50 of the Virginica class. Without randomization, the examples seen at any stage of training won't represent all the classes, so training may not converge and the network will fail to generalize.

Another example: in the past I had 100 images for each letter of the alphabet (26 classes). When I trained on them in order (letter by letter), the network failed to converge, but after I randomized the order it converged easily, because the neural network could generalize across the letters.
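
A minimal sketch of the randomized arrangement in plain Python, assuming the samples and labels are held in two parallel lists. The names shuffled_epochs, samples and labels are illustrative only; any training loop would consume the yielded pairs in place of the print call.

```python
import random

def shuffled_epochs(samples, labels, n_epochs, seed=0):
    """Yield (epoch, sample, label) with a fresh random order every epoch."""
    rng = random.Random(seed)          # fixed seed only to make the sketch repeatable
    order = list(range(len(samples)))
    for epoch in range(n_epochs):
        rng.shuffle(order)             # new permutation of the indices for this epoch
        for i in order:
            yield epoch, samples[i], labels[i]

# Toy data stored class by class, as images sorted into folders would be.
samples = ["a1", "a2", "a3", "b1", "b2", "b3"]
labels = ["A", "A", "A", "B", "B", "B"]

for epoch, x, y in shuffled_epochs(samples, labels, n_epochs=2):
    print(epoch, x, y)
```

Shuffling indices rather than the lists themselves keeps each sample paired with its label, and reshuffling every epoch avoids presenting the same sequence repeatedly.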


I've been struggling for some time with building a simplistic NN in Java. I've been working on and off on this project for a few months and I wanna finish it. My main issue is that I dunno how to implement backpropagation correctly (all sources use Python, math jargon, or explain the idea too briefly). Today I tried deducing the ideology by myself and the rule that I'm using is:

the weight update = error * sigmoidDerivative(error) * weight itself; error = output - actual; (last layer) error = sigmoidDerivative(error from previous layer) * weight attaching this neuron to the neuron giving the error (intermediary layer)

My main problems are that the outputs converge towards an average value and my secondary problem is that the weights get updated towards an extremely weird value. (probably the weights issue is causing the convergence)

What I'm trying to train: for inputs 1-9, the expected output is (x*1.2+1)/10. This is just a rule that came to me randomly. I'm using a NN with the structure 1-1-1 (3 layers, 1 neuron per layer). In the link below I attached two runs: one in which the training set follows the rule (x*1.2+1)/10, and another using (x*1.2+1)/100. With the division by 10, the first weight goes towards infinity; with the division by 100, the second weight tends towards 0. I kept trying to debug it, but I have no idea what I should be looking for or what's wrong. Any suggestions are much appreciated. Thank you in advance and a great day to you all!

My training samples are 1->9 with their respective outputs following the rule above, and I run them for 100_000 epochs. I log the error every 100 epochs, since it's easier to plot with fewer data points while still having 1000 data points for each of the 9 expected outputs. Code for backpropagation and weight updates:

    //for each layer in the Dweights array
    for(int k=deltaWeights.length-1; k >= 0; k--)
    {
        for(int i=0; i<deltaWeights[k][0].length; i++)     // for each neuron in the layer
        {
            if(k == network.length-2)      // if we're on the last layer, we calculate the errors directly
            {
                outputErrors[k][i] = outputs[i] - network[k+1][i].result;
                errors[i] = outputErrors[k][i];
            }
            else        // otherwise the error is actually the sum of errors feeding backwards into the neuron currently being processed * their respective weight
            {
                for(int j=0; j<outputErrors[k+1].length; j++)
                {                         // S'(error from previous layer) * weight attached to it
                    outputErrors[k][i] += sigmoidDerivative(outputErrors[k+1][j])[0] * network[k+1][i].emergingWeights[j];
                }
            }
        }

        for (int i=0; i<deltaWeights[k].length; i++)           // for each neuron
        {
            for(int j=0; j<deltaWeights[k][i].length; j++)     // for each weight attached to that respective neuron
            {                        // error                S'(error)                                  weight connected to respective neuron
                deltaWeights[k][i][j] = outputErrors[k][j] * sigmoidDerivative(outputErrors[k][j])[0] * network[k][i].emergingWeights[j];
            }
        }
    }

    // we use the learning rate as an order of magnitude, to scale how drastic the changes in this iteration are
    for(int k=deltaWeights.length-1; k >= 0; k--)       // for each layer
    {
        for (int i=0; i<deltaWeights[k].length; i++)            // for each neuron
        {
            for(int j=0; j<deltaWeights[k][i].length; j++)     // for each weight attached to that respective neuron
            {
                deltaWeights[k][i][j] *=  1;       // previously was learningRate; MSEAvgSlope

                network[k][i].emergingWeights[j] += deltaWeights[k][i][j];
            }
        }
    }

    return errors;

Edit: a quick question that comes to mind: since I'm using sigmoid as my activation function, should my input and output neurons be only between 0-1? My output is between 0-1 but my inputs literally are 1-9.

Edit2: normalized the input values to be 0.1-0.9 and changed:

    outputErrors[k][i] += sigmoidDerivative(outputErrors[k+1][j])[0] * network[k+1][i].emergingWeights[j];

to:

    outputErrors[k][i] = sigmoidDerivative(outputErrors[k+1][j])[0] * network[k+1][i].emergingWeights[j]* outputErrors[k+1][j];

so that I keep the sign of the output error itself. This repaired the Infinity tendency in the first weight. Now, with the /10 run, the first weight tends to 0 and with the /100 run, the second weight tends to 0. Still hoping that someone will butt in to clear things up for me. :(


I've seen several problems with your code; for example, your weight updates are incorrect. I'd also strongly recommend organizing your code more cleanly by introducing methods.

Backpropagation is usually hard to implement efficiently, but the formal definitions are easily translated into any language. I would not recommend studying neural nets by looking at code. Look at the math and try to understand that. This makes you far more flexible when implementing one from scratch.

I can give you some hints by describing the forward and backward pass in pseudo code.

As a matter of notation, I use i for the input, j for the hidden and k for the output layer. The bias of the input layer is then bias_i. The weights are w_mn for the weight connecting node m to node n. The activation is a(x) and its derivative a'(x).

Forward pass:

for each n of j
       dot = 0
       for each m of i
              dot += m*w_mn
       n = a(dot + bias_i)

The same applies for the output layer k and the hidden layer j. Hence, just replace j by k and i by j for this step.

Backward pass:

Calculate delta for output nodes:

for each n of k
       d_n = a'(n)(n - target)

Here, target is the expected output, n the output of the current output node, and d_n the delta of this node. An important note here is that the derivatives of the logistic and the tanh functions can be expressed through the output of the original function, so these values don't have to be re-evaluated. The logistic function is f(x) = 1/(1+e^(-x)) and its derivative is f'(x) = f(x)(1-f(x)). Since the value at each output node n was previously evaluated with f(x), one can simply apply n(1-n) as the derivative. In the case above this would calculate the delta as follows:

d_n = n(1-n)(n - target)

In the same fashion, calculate the deltas for the hidden nodes.

for each n of j
      d_n = 0
      for each m of k
             d_n += d_m*w_nm
      d_n = a'(n)*d_n

Next step is to perform the weight update using the gradients. This is done by an algorithm called gradient descent. Without going into much detail, this can be accomplished as follows:

for each n of j
      for each m of k
            w_nm -= learning_rate*n*d_m

Same applies for the layer above. Just replace j by i and k by j.

To update the biases, just sum up the deltas of the connected nodes, multiply this by the learning rate and subtract this product from the specific bias.
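
The pseudo code above can be turned into a small runnable sketch in plain Python. Several things here are assumptions made for illustration: the logistic activation, one bias per node (rather than one per layer), the 2-2-1 layer sizes, the single training sample, and all names (forward, train_step, w_ij, w_jk, b_j, b_k).

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_ij, b_j, w_jk, b_k):
    """Return (hidden activations, output activations)."""
    hidden = [sigmoid(sum(x[m] * w_ij[m][n] for m in range(len(x))) + b_j[n])
              for n in range(len(b_j))]
    output = [sigmoid(sum(hidden[m] * w_jk[m][n] for m in range(len(hidden))) + b_k[n])
              for n in range(len(b_k))]
    return hidden, output

def train_step(x, target, w_ij, b_j, w_jk, b_k, learning_rate):
    """One gradient-descent step on the squared error; updates the weights in place."""
    hidden, output = forward(x, w_ij, b_j, w_jk, b_k)
    # Output deltas: d_n = a'(n) * (n - target), with a'(n) = n * (1 - n).
    d_k = [o * (1 - o) * (o - t) for o, t in zip(output, target)]
    # Hidden deltas: propagate the output deltas back through w_jk (before updating it).
    d_j = [hidden[n] * (1 - hidden[n]) * sum(d_k[m] * w_jk[n][m] for m in range(len(d_k)))
           for n in range(len(hidden))]
    # Weight updates: w_nm -= learning_rate * (source activation) * (target delta).
    for n in range(len(hidden)):
        for m in range(len(d_k)):
            w_jk[n][m] -= learning_rate * hidden[n] * d_k[m]
    for n in range(len(x)):
        for m in range(len(d_j)):
            w_ij[n][m] -= learning_rate * x[n] * d_j[m]
    # Bias updates: each per-node bias moves against the delta of its own node.
    for m in range(len(d_k)):
        b_k[m] -= learning_rate * d_k[m]
    for m in range(len(d_j)):
        b_j[m] -= learning_rate * d_j[m]
    # Squared error measured before this step's update.
    return sum((o - t) ** 2 for o, t in zip(output, target)) / 2

rng = random.Random(0)
w_ij = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(2)]  # input -> hidden
b_j = [0.0, 0.0]
w_jk = [[rng.uniform(-1, 1)] for _ in range(2)]                    # hidden -> output
b_k = [0.0]

errors = [train_step([0.05, 0.10], [0.9], w_ij, b_j, w_jk, b_k, learning_rate=0.5)
          for _ in range(1000)]
print("first error: %.6f, last error: %.6f" % (errors[0], errors[-1]))
```

Note that the hidden deltas are computed before w_jk is modified; updating the output weights first and then back-propagating through the new values is a common source of subtle bugs.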


I am trying delta rule learning with the AND example, and I have noticed that the learning converges faster and better when I do not apply the derivative of the sigmoid activation in the weight correction.

I am using a bias neuron.

If I understand correctly, the delta rule should consider the derivative of the activation function for the weight adjustment: ΔWk(n) = η∗e(n)∗g′(h)∗x(n),

where e(n) = desired_output - neuron_output.

This is the sigmoid I am using to calculate the output:

public double calc(double sum) {
    return 1 / (1 + Math.pow(Math.E, -sum));
}

According to page 33, step 4 in this delta rule, the weight update should be:

double delta = learningRate * error * estimated * (1 - estimated) * input; 

It works better without:

estimated * (1 - estimated)

This is pretty much code for training with delta rule:

public void train(List<LearningSample> samples, double[] weights, Function<double[], Double> neuronOutput) {

    double[] weightDelta = new double[weights.length];
    for (int i = 0; i < 10000; i++) {
        // Collections.shuffle(samples);
        for (LearningSample sample : samples) {
            // sigmoid of dot product of weights and input vector, including bias
            double estimated = neuronOutput.apply(sample.getInput());
            double error = sample.getDesiredOutput() - estimated;
            // this commented out version actually works better than the one below
            // double delta = learningRate * error;
            double delta = learningRate * error * estimated * (1 - estimated);
            // aggregate delta per weight for each sample in epoch
            deltaUpdate(delta, weightDelta, sample.getInput());
        }

        // batch update weights at the end of training epoch
        for (int weight = 0; weight < weights.length; weight++) {
            weights[weight] += weightDelta[weight];
        }

        weightDelta = new double[weights.length];
    }
}

private void deltaUpdate(double delta, double[] weightsDelta, double[] input) {
    for (int feature = 0; feature < input.length; feature++) {
        weightsDelta[feature] = weightsDelta[feature] + delta * input[feature];
    }
}

Training sample for AND looks like this:

List<LearningSample> samples = new ArrayList<>();
LearningSample sample1 = new LearningSample(new double[] { 0, 0 }, 0);
LearningSample sample2 = new LearningSample(new double[] { 0, 1 }, 0);
LearningSample sample3 = new LearningSample(new double[] { 1, 0 }, 0);
LearningSample sample4 = new LearningSample(new double[] { 1, 1 }, 1);

Bias 1 is injected as 0th component in the constructor.

Order in which output was tested after learning:

System.out.println(neuron.output(new double[] { 1,   1, 1 }));
System.out.println(neuron.output(new double[] { 1,   0, 0 }));
System.out.println(neuron.output(new double[] { 1,   0, 1 }));
System.out.println(neuron.output(new double[] { 1,   1, 0 }));

This is the result when I omit the derivative of the sigmoid from the delta calculation:

10000 iterations

  • 0.9666565909058419
  • 2.05087653022386E-5
  • 0.023803593411627456
  • 0.023803593411627456

35000 iterations

  • 0.9903810162649429
  • 4.6475933225663785E-7
  • 0.006870001301253153
  • 0.006870001301253153

These are the results with the derivative applied:

10000 iterations

  • 0.8446651307271656
  • 0.004030424878725242
  • 0.129178264332045
  • 0.129178264332045

35000 iterations

  • 0.9218773156128204
  • 4.169603485934177E-4
  • 0.06555977437019253
  • 0.06555977437019253

Learning rate is: 0.021, and starting weight of bias is: -2.

The error is smaller and the approximation of the function is much better in the first example, without the derivative. Why is that?


Following the answer by @Umberto, there are a couple of things I would like to verify:

  • Is the accidental experiment, where delta = learningRate * error * input, in fact valid because it minimizes the cross-entropy cost function? Yes.

  • Cross-entropy apparently works better for classification, so when should MSE be used as a cost function? For regression.

As a note, I am running the output through a threshold function (not shown here), so this is binary classification.


The reason is simple: you are minimizing different cost functions. In your case (following the slide) you minimize the squared error. If you use a cross-entropy cost function in the form I describe in my derivation (GitHub link), you get a weight update that learns faster. In classification problems (where you normally use a sigmoid neuron for binary classification) the squared error is usually not a good cost function.

If you use cross entropy, you will need to use learningRate * error * input; (with the right sign, according to how you define your error).
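
The two update rules can be compared with a short derivation. It assumes a single sigmoid neuron with output y = σ(h), pre-activation h = Σ w_i x_i, target t and error e = t − y (the sign convention used in the question); the symbols E_MSE and E_CE are introduced here only for illustration.

```latex
\begin{aligned}
y &= \sigma(h) = \frac{1}{1 + e^{-h}}, \qquad h = \sum_i w_i x_i, \qquad
\frac{\partial y}{\partial h} = y(1 - y) \\[4pt]
E_{\mathrm{MSE}} &= \tfrac{1}{2}(t - y)^2 \\
\frac{\partial E_{\mathrm{MSE}}}{\partial w_i} &= -(t - y)\, y(1 - y)\, x_i
\quad\Longrightarrow\quad \Delta w_i = \eta\, e\, y(1 - y)\, x_i \\[4pt]
E_{\mathrm{CE}} &= -\bigl[\, t \ln y + (1 - t) \ln(1 - y) \,\bigr] \\
\frac{\partial E_{\mathrm{CE}}}{\partial w_i} &= -\frac{t - y}{y(1 - y)} \cdot y(1 - y)\, x_i = -(t - y)\, x_i
\quad\Longrightarrow\quad \Delta w_i = \eta\, e\, x_i
\end{aligned}
```

The factor y(1 − y) cancels in the cross-entropy case. Since that factor is at most 1/4 and shrinks towards zero as the neuron saturates, the squared-error update is always the smaller of the two, which is consistent with the slower learning seen in the question.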

As a side note what you are actually doing is logistic regression...
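
To see the effect numerically, the experiment from the question can be re-created as a plain Python sketch. The learning rate 0.021, the bias weight -2 and the batch update come from the question; the initial value 0.0 for the two input weights and the helper names (train, squared_error) are assumptions, so the numbers will not match the question's output exactly.

```python
import math

def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))

# AND samples; the leading 1.0 is the bias input, as in the question.
SAMPLES = [([1.0, 0.0, 0.0], 0.0), ([1.0, 0.0, 1.0], 0.0),
           ([1.0, 1.0, 0.0], 0.0), ([1.0, 1.0, 1.0], 1.0)]

def train(use_sigmoid_derivative, epochs=10000, learning_rate=0.021):
    weights = [-2.0, 0.0, 0.0]   # bias weight from the question; the other two are assumed
    for _ in range(epochs):
        batch = [0.0] * len(weights)
        for x, target in SAMPLES:
            y = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            delta = learning_rate * (target - y)
            if use_sigmoid_derivative:
                delta *= y * (1.0 - y)          # squared-error gradient
            for i, xi in enumerate(x):
                batch[i] += delta * xi
        weights = [w + d for w, d in zip(weights, batch)]   # batch update per epoch
    return weights

def squared_error(weights):
    """Sum of squared differences between targets and neuron outputs."""
    return sum((t - sigmoid(sum(w * xi for w, xi in zip(weights, x)))) ** 2
               for x, t in SAMPLES)

mse_weights = train(use_sigmoid_derivative=True)
ce_weights = train(use_sigmoid_derivative=False)
print("squared error, squared-error rule:  %.5f" % squared_error(mse_weights))
print("squared error, cross-entropy rule: %.5f" % squared_error(ce_weights))
```

Both variants are scored with the same squared error here, so the comparison does not favour the cross-entropy rule by construction.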

Hope that helps. If you need more information let me know. Check my link, there I do a complete derivation of the mathematics behind it.


I want to fit a recurrent neural network in R using the RSNNS package. The package provides the option to set the maximum number of iterations, as done in the sample code below with maxit = 1000. Is it possible to see how many iterations the algorithm used until convergence?


#Load Lynx data

#Scale data and convert to xts
data  <- as.xts((lynx - min(lynx)) / (max(lynx) - min(lynx)))

#Build xts object with 5 lags to analyze
lags <- 5
for(i in 1:lags){

  feat <- lag(lynx, i)
  data <- merge(data, feat, all = FALSE)

}

#Get features and target
features <- data[,-1]
target   <- data[,1]

#Fit network
rnn      <- elman(features, target, maxit = 1000)


I think it runs for the full maxit number of iterations by default. When you run the code below, the iterations continue even after the error has plateaued in the graph.

rnn <- elman(features, target, maxit = 1000)

#then run this
rnn <- elman(features, target, maxit = 10000)

You can probably use head(which(abs(diff(rnn$IterativeFitError)) < 1e-20), 1) to find the iteration step when it converges.
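
The same plateau check can be written out step by step. This is a language-neutral sketch in Python rather than R; the list errors stands in for rnn$IterativeFitError, and the function name first_plateau and the tolerance are illustrative choices, not part of RSNNS.

```python
def first_plateau(errors, tol):
    """Return the first step whose change in error is below tol, or None.

    The returned number matches the index R reports for
    head(which(abs(diff(errors)) < tol), 1).
    """
    for step in range(1, len(errors)):
        if abs(errors[step] - errors[step - 1]) < tol:
            return step
    return None

errors = [10.0, 4.0, 2.0, 1.5, 1.5, 1.5]   # made-up error curve
print(first_plateau(errors, tol=1e-6))
```

A threshold as strict as 1e-20 may never be met for a slowly decreasing error, so the tolerance is best chosen relative to the scale of the error values.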


Can anyone give an explanation for the convergence test presented in the 8th minute of this lecture by Hugo Larochelle?


These conditions ensure convergence asymptotically. For that, we must be able to update the approximate solution an infinite number of times, so intuitively the learning rate must always stay greater than zero. That is what the first condition implies.

On the other hand, besides updating the approximate solution "infinitely often", we want to get closer and closer to the optimal solution. To achieve this, the learning rate must become smaller and smaller. The second condition means that the alpha parameter should decrease over time.

Both conditions are required not only in SGD but in many other stochastic approximation methods. They are sometimes referred to as the Robbins-Monro conditions, after the Robbins–Monro algorithm.
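
The slide itself is not reproduced here. Assuming it states the Robbins–Monro conditions in their usual form, with α_t the learning rate at step t, they read:

```latex
\sum_{t=1}^{\infty} \alpha_t = \infty
\qquad\text{and}\qquad
\sum_{t=1}^{\infty} \alpha_t^{2} < \infty
```

For example, the schedule α_t = α_0 / t satisfies both: the harmonic series diverges, while the sum of 1/t² converges (to π²/6). A constant learning rate satisfies the first condition but violates the second.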


I am following the book 'Make Your Own Neural Network'. I am building the neural network without numpy, scipy, scikit-learn, etc. I am trying to check whether my algorithm is correct by training the following network on the same input-output combination multiple times. However, no matter how many steps I take or how much I increase the loop counter, the network isn't learning the output at all.

I hope the code is readable; it is meant to be a transliteration of the GitHub source.

Specifically, the value of the first output neuron never changes (ideally, it should converge to 0.01, but it is stuck at 0.5). The code is presented below. I've also tried another implementation, but that code is just wrong [it doesn't scale up to multiple layers].

import math,copy

def initArrZero(num):
    l = []
    for i in range(num):
        l.append(0)
    return l

def initMatrix(rows,cols):
    m = []
    for i in range(rows):
        m.append(initArrZero(cols))
    return m

def sigmoid(x):
    return 1.0/(1.0 + math.e**(-x))

def sigmoid_m(x):
    if isinstance(x,list):
        lst = []
        for i in x:
            lst.append(sigmoid_m(i))
        return lst
    else:
        return sigmoid(x)

def sigmoid_prime(x):
    return sigmoid(x)*(1-sigmoid(x))

def sigmoid_prime_m(x):
    if isinstance(x,list):
        lst = []
        for i in x:
            lst.append(sigmoid_prime_m(i))
        return lst
    else:
        return sigmoid_prime(x)

def transpose(m):
    return [[m[j][i] for j in range(len(m))] for i in range(len(m[0]))]

def matmul(A,B):
    result = initMatrix(len(A),len(B[0]))
    for i in range(len(A)):  
        for j in range(len(B[0])): 
            for k in range(len(B)): 
                result[i][j] += A[i][k] * B[k][j]
    return result

def matadd(X,Y):
    result = copy.deepcopy(X)
    for i in range(len(X)): 
        for j in range(len(X[0])): 
            result[i][j] = X[i][j] + Y[i][j]
    return result

def hadamard(X,Y):
    result = copy.deepcopy(X)
    for i in range(len(X)): 
        for j in range(len(X[0])): 
            result[i][j] = X[i][j] * Y[i][j]
    return result

def scalarmul(A,B):
    if isinstance(A,list) and (isinstance(B,float) or isinstance(B,int)):
        return scalarmul(B,A)
    if isinstance(B,list):
        lst = []
        for i in B:
            lst.append(scalarmul(A,i))
        return lst
    return A*B

def subtract(A,B):
    if isinstance(A,list) and isinstance(B,list) and len(A)==len(B):
        lst = []
        for i in range(len(A)):
            lst.append(subtract(A[i],B[i]))
        return lst
    else:
        return A-B

class NN:
    def __init__(self,arr):
        assert len(arr)>1
        l = len(arr)
        input = initArrZero(arr[0]+1)
        input[-1] = 1
        self.layers = []
        for i in range(1,l-1):
            lst = initArrZero(arr[i])

        self.weights = []
        for i in range(l-1):
            w = initMatrix(len(self.layers[i]),len(self.layers[i+1]))

    def feedforward(self):
        for i in range(0,len(self.layers)-1):
            self.layers[i+1] = sigmoid_m(matmul(self.weights[i],self.layers[i]))

    def backprop(self,actual,alpha):
        self.wd = initArrZero(len(self.weights))
        self.wdm = initArrZero(len(self.weights))

        for i in range(len(self.weights)-1,-1,-1):
            if i == len(self.weights)-1:
                self.wd[i] = hadamard(subtract(self.layers[-1],actual),sigmoid_prime_m(matmul(self.weights[-1],self.layers[-2])))
            else:
                self.wd[i] = hadamard(matmul(transpose(self.weights[i+1]),self.wd[i+1]),sigmoid_prime_m(matmul(self.weights[i],self.layers[i])))

        for i in range(len(self.weights)-1,-1,-1):
            t = transpose(self.layers[i])
            self.wdm[i] = matmul(self.wd[i],t)

        for i in range(len(self.weights)-1,-1,-1):
            self.weights[i] = matadd(self.weights[i],hadamard(scalarmul(-1*alpha,self.weights[i]),self.wdm[i]))

    def show(self):
        print("Layers : ")
        for p in self.layers:
            print(p)

        print("\n\n\n")

        print("Weights : ")
        for i in range(len(self.weights)):
            print(self.weights[i])

        print("\n\n\n")

n = NN([2,2,2])
n.layers[0] = [[0.05],[0.1],[1]]
n.weights[0] = transpose([[0.15,0.25],[0.2,0.3],[0.35,0.6]])
n.weights[1] = transpose([[0.4,0.5],[0.45,0.55]])
for i in range(1000):

Expected : last layer is close to (0.01,0.99)

Output : (0.5, 0.9892866637557137)


Sigmoid activation usually performs badly. If the value you feed in is very high or very low, the slope is almost flat, i.e. there is little to no learning. You can try a couple of solutions:

1) change the initialization of your weights.

2) change activation function (suggested). You can try activations that typically perform much better, such as ReLU or leaky-ReLU.
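
The flat-slope effect is easy to check numerically. The following sketch, in plain Python with illustrative function names, prints the derivative of the sigmoid next to the derivative of ReLU for a few inputs.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_prime(x):
    # Derivative of max(0, x); the value at exactly 0 is a convention.
    return 1.0 if x > 0 else 0.0

for x in (0.0, 2.0, 5.0, 10.0):
    print("x = %4.1f  sigmoid' = %.6f  relu' = %.1f"
          % (x, sigmoid_prime(x), relu_prime(x)))
```

The sigmoid derivative peaks at 0.25 for x = 0 and drops below 1e-4 by x = 10, so every gradient that passes through a saturated neuron is scaled down by that factor, while ReLU passes gradients through unchanged for positive inputs.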

Hope this helps!