Hot questions for Using Neural networks in linear algebra


This is my custom extension of one of Andrew Ng's neural networks from the deep learning course, where instead of producing 0 or 1 for binary classification I'm attempting to classify multiple examples.

Both the inputs and outputs are one hot encoded.
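A minimal sketch of what one-hot encoding means here, using made-up three-class labels (not the actual data from this question):

```python
import numpy as np

# Hypothetical class labels for three samples (classes 0..2)
labels = np.array([0, 2, 1])

# One-hot encode: one row per sample, a single 1 in the class column
one_hot = np.eye(3, dtype=int)[labels]
print(one_hot)
# [[1 0 0]
#  [0 0 1]
#  [0 1 0]]
```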

With not much training I get an accuracy of 'train accuracy: 67.51658067499625 %'.

How can I classify a single training example instead of classifying all training examples?

I think a bug exists in my implementation: the training examples (train_set_x) and output values (train_set_y) must have the same dimensions, otherwise an error about the dimensionality of the matrices is raised. For example, using:

train_set_x = np.array([

train_set_y = np.array([

returns this error:

ValueError                                Traceback (most recent call last)
<ipython-input-11-0d356e8d66f3> in <module>()
     27 print(A)
---> 29 np.multiply(train_set_y,A)
     31 def initialize_with_zeros(numberOfTrainingExamples):

ValueError: operands could not be broadcast together with shapes (3,3) (1,4)

network code :

import numpy as np
import matplotlib.pyplot as plt
import h5py
import scipy
from scipy import ndimage
import pandas as pd
%matplotlib inline

train_set_x = np.array([

train_set_y = np.array([

numberOfFeatures = 4
numberOfTrainingExamples = 3

def sigmoid(z):
    s = 1 / (1 + np.exp(-z))  
    return s

w = np.zeros((numberOfTrainingExamples , 1))
b = 0
A = sigmoid( , train_set_x))    


def initialize_with_zeros(numberOfTrainingExamples):
    w = np.zeros((numberOfTrainingExamples , 1))
    b = 0
    return w, b

def propagate(w, b, X, Y):

    m = X.shape[1]

    A = sigmoid( , X) + b)    

    cost = -(1/m)*np.sum(np.multiply(Y,np.log(A)) + np.multiply((1-Y),np.log(1-A)), axis=1)

    dw =  ( 1 / m ) * X, ( A - Y ).T )    # consumes ( A - Y )
    db =  ( 1 / m ) *   np.sum( A - Y )    # consumes ( A - Y ) again

#     cost = np.squeeze(cost)

    grads = {"dw": dw,
             "db": db}

    return grads, cost

def optimize(w, b, X, Y, num_iterations, learning_rate, print_cost = True):

    costs = []

    for i in range(num_iterations):

        grads, cost = propagate(w, b, X, Y)

        dw = grads["dw"]
        db = grads["db"]

        w = w - (learning_rate * dw)
        b = b - (learning_rate * db)

        if i % 100 == 0:

        if print_cost and i % 10000 == 0:

    params = {"w": w,
              "b": b}

    grads = {"dw": dw,
             "db": db}

    return params, grads, costs

def model(X_train, Y_train, num_iterations, learning_rate = 0.5, print_cost = False):

    w, b = initialize_with_zeros(numberOfTrainingExamples)

    parameters, grads, costs = optimize(w, b, X_train, Y_train, num_iterations, learning_rate, print_cost = True)
    w = parameters["w"]
    b = parameters["b"]
    Y_prediction_train = sigmoid( , X_train) + b) 

    print("train accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction_train - Y_train)) * 100))

model(train_set_x, train_set_y, num_iterations = 20000, learning_rate = 0.0001, print_cost = True)

Update: A bug exists in this implementation in that the training example pairs (train_set_x , train_set_y) must contain the same dimensions. Can point in direction of how linear algebra should be modified?

Update 2 :

I modified @Paul Panzer answer so that learning rate is 0.001 and train_set_x , train_set_y pairs are unique :

train_set_x = np.array([

train_set_y = np.array([

grads = model(train_set_x, train_set_y, num_iterations = 20000, learning_rate = 0.001, print_cost = True)

# To classify single training example : 

print(sigmoid(dw @ [0,0,1,1,0] + db))

This update produces following output :

[[ 0.74043089  0.32851512  0.14776077  0.77970162]
 [ 0.04810012  0.08033521  0.72846174  0.1063849 ]
 [ 0.25956911  0.67148488  0.22029838  0.85223923]]
[[1 0 0 1]
 [0 0 1 0]
 [0 1 0 1]]
train accuracy: 79.84462279013312 %
[[ 0.51309252  0.48853845  0.50945862]
 [ 0.5110232   0.48646923  0.50738869]
 [ 0.51354109  0.48898712  0.50990734]]

Should print(sigmoid(dw @ [0,0,1,1,0] + db)) produce a vector that once rounded matches train_set_y corresponding value : [0,1,0] ?

Modifying to produce a vector with (adding [0,0,1,1,0] to numpy array and taking transpose):

print(sigmoid(dw @ np.array([[0,0,1,1,0]]).T + db))

returns :

array([[ 0.51309252],
       [ 0.48646923],
       [ 0.50990734]])

Again, rounding these values to nearest whole number produces vector [1,0,1] when [0,1,0] is expected.

These are incorrect operations to produce a prediction for single training example ?


Your difficulties come from mismatched dimensions, so let's walk through the problem and try and get them straight.

Your network has a number of inputs, the features, let's call their number N_in (numberOfFeatures in your code). And it has a number of outputs which correspond to different classes let's call their number N_out. Inputs and outputs are connected by the weights w.

Now here is the problem. Connections are all-to-all, so we need a weight for each of the N_out x N_in pairs of outputs and inputs. Therefore in your code the shape of w must be changed to (N_out, N_in). You probably also want an offset b for each output, so b should be a vector of size (N_out,) or rather (N_out, 1) so it plays well with the 2d terms.

I've fixed that in the modified code below and I tried to make it very explicit. I've also thrown a mock data creator into the bargain.

Re the one-hot encoded categorical output, I'm not an expert on neural networks but I think, most people understand it so that classes are mutually exclusive, so each sample in your mock output should have one one and the rest zeros.

Side note:

At one point a competing answer advised you to get rid of the 1-... terms in the cost function. While that looks like an interesting idea to me my gut feeling (Edit Now confirmed using gradient-free minimizer; use activation="hybrid" in code below. Solver will simply maximize all outputs which are active in at least one training example.) is it won't work just like that because the cost will then fail to penalise false positives (see below for detailed explanation). To make it work you'd have to add some kind of regularization. One method that appears to work is using the softmax instead of the sigmoid. The softmax is to one-hot what the sigmoid is to binary. It makes sure the output is "fuzzy one-hot".

Therefore my recommendation is:

  • If you want to stick with sigmoid and not explicitly enforce one-hot predictions. Keep the 1-... term.
  • If you want to use the shorter cost function. Enforce one-hot predictions. For example by using softmax instead of sigmoid.

I've added an activation="sigmoid"|"softmax"|"hybrid" parameter to the code that switches between models. I've also made the scipy general purpose minimizer available, which may be useful when the gradient of the cost is not at hand.

Recap on how the cost function works:

The cost is a sum over all classes and all training samples of the term

-y log (y') - (1-y) log (1-y')

where y is the expected response, i.e. the one given by the "y" training sample for the input (the "x" training sample). y' is the prediction, the response the network with its current weights and biases generates. Now, because the expected response is either 0 or 1 the cost for a single category and a single training sample can be written

-log (y')   if   y = 1
-log(1-y')  if   y = 0

because in the first case (1-y) is zero, so the second term vanishes and in the secondo case y is zero, so the first term vanishes. One can now convince oneself that the cost is high if

  • the expected response y is 1 and the network prediction y' is close to zero
  • the expected response y is 0 and the network prediction y' is close to one

In other words the cost does its job in punishing wrong predictions. Now, if we drop the second term (1-y) log (1-y') half of this mechanism is gone. If the expected response is 1, a low prediction will still incur a cost, but if the expected response is 0, the cost will be zero, regardless of the prediction, in particular, a high prediction (or false positive) will go unpunished.

Now, because the total cost is a sum over all training samples, there are three possibilities.

  • all training samples prescribe that the class be zero: then the cost will be completely independent of the predictions for this class and no learning can take place

  • some training samples put the class at zero, some at one: then because "false negatives" or "misses" are still punished but false positives aren't the net will find the easiest way to minimize the cost which is to indiscriminately increase the prediction of the class for all samples

  • all training samples prescribe that the class be one: essentially the same as in the second scenario will happen, only here it's no problem, because that is the correct behavior

And finally, why does it work if we use softmax instead of sigmoid? False positives will still be invisible. Now it is easy to see that the sum over all classes of the softmax is one. So I can only increase the prediction for one class if at least one other class is reduced to compensate. In particular, there can be no false positives without a false negative, and the false negative the cost will detect.

On how to get a binary prediction:

For binary expected responses rounding is indeed the appropriate procedure. For one-hot I'd rather find the largest value, set that to one and all others to zero. I've added a convenience function, predict, implementing that.

import numpy as np
from scipy import optimize as opt
from collections import namedtuple

# First, a few structures to keep ourselves organized

Problem_Size = namedtuple('Problem_Size', 'Out In Samples')
Data = namedtuple('Data', 'Out In')
Network = namedtuple('Network', 'w b activation cost gradient most_likely')

def get_dims(Out, In, transpose=False):
    """extract dimensions and ensure everything is 2d
    return Data, Dims"""
    # gracefully acccept lists etc.
    Out, In = np.asanyarray(Out), np.asanyarray(In)
    if transpose:
        Out, In = Out.T, In.T
    # if it's a single sample make sure it's n x 1
    Out = Out[:, None] if len(Out.shape) == 1 else Out
    In = In[:, None] if len(In.shape) == 1 else In
    Dims = Problem_Size(Out.shape[0], *In.shape)
    if Dims.Samples != Out.shape[1]:
        raise ValueError("number of samples must be the same for Out and In")
    return Data(Out, In), Dims

def sigmoid(z):
    s = 1 / (1 + np.exp(-z))  
    return s

def sig_cost(Net, data):
    A = process(data.In, Net)
    logA = np.log(A)
    return -(data.Out * logA + (1-data.Out) * (1-logA)).sum(axis=0).mean()

def sig_grad (Net, Dims, data):
    A = process(data.In, Net)
    return dict(dw =  (A - data.Out) @ data.In.T / Dims.Samples,
                db =  (A - data.Out).mean(axis=1, keepdims=True))

def sig_ml(z):
    return np.round(z).astype(int)

def sof_ml(z):
    hot = np.argmax(z, axis=0)
    z = np.zeros(z.shape, dtype=int)
    z[hot, np.arange(len(hot))] = 1
    return z

def softmax(z):
    z = z - z.max(axis=0, keepdims=True)
    z = np.exp(z)
    return z / z.sum(axis=0, keepdims=True)

def sof_cost(Net, data):
    A = process(data.In, Net)
    logA = np.log(A)
    return -(data.Out * logA).sum(axis=0).mean()

sof_grad = sig_grad

def get_net(Dims, activation='softmax'):
    activation, cost, gradient, ml = {
        'sigmoid': (sigmoid, sig_cost, sig_grad, sig_ml),
        'softmax': (softmax, sof_cost, sof_grad, sof_ml),
        'hybrid': (sigmoid, sof_cost, None, sig_ml)}[activation]
    return Network(w=np.zeros((Dims.Out, Dims.In)),
                   b=np.zeros((Dims.Out, 1)),
                   activation=activation, cost=cost, gradient=gradient,

def process(In, Net):
    return Net.activation(Net.w @ In + Net.b)

def propagate(data, Dims, Net):
    return Net.gradient(Net, Dims, data), Net.cost(Net, data)

def optimize_no_grad(Net, Dims, data):
    def f(x):
        Net.w[...] = x[:Net.w.size].reshape(Net.w.shape)
        Net.b[...] = x[Net.w.size:].reshape(Net.b.shape)
        return Net.cost(Net, data)
    x = np.r_[Net.w.ravel(), Net.b.ravel()]
    res = opt.minimize(f, x, options=dict(maxiter=10000)).x
    Net.w[...] = res[:Net.w.size].reshape(Net.w.shape)
    Net.b[...] = res[Net.w.size:].reshape(Net.b.shape)

def optimize(Net, Dims, data, num_iterations, learning_rate, print_cost = True):

    w, b = Net.w, Net.b
    costs = []

    for i in range(num_iterations):

        grads, cost = propagate(data, Dims, Net)

        dw = grads["dw"]
        db = grads["db"]

        w -= learning_rate * dw
        b -= learning_rate * db

        if i % 100 == 0:

        if print_cost and i % 10000 == 0:

    return grads, costs

def model(X_train, Y_train, num_iterations, learning_rate = 0.5, print_cost = False, activation='sigmoid'):

    data, Dims = get_dims(Y_train, X_train, transpose=True)
    Net = get_net(Dims, activation)

    if Net.gradient is None:
        optimize_no_grad(Net, Dims, data)
        grads, costs = optimize(Net, Dims, data, num_iterations, learning_rate, print_cost = True)

    Y_prediction_train = process(data.In, Net)

    print("train accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction_train - data.Out)) * 100))
    return Net

def predict(In, Net, probability=False):
    In = np.asanyarray(In)
    is1d = In.ndim == 1
    if is1d:
        In = In.reshape(-1, 1)
    Out = process(In, Net)
    if not probability:
        Out = Net.most_likely(Out)
    if is1d:
        Out = Out.reshape(-1)
    return Out

def create_data(Dims):
    Out = np.zeros((Dims.Out, Dims.Samples), dtype=int)
    Out[np.random.randint(0, Dims.Out, (Dims.Samples,)), np.arange(Dims.Samples)] = 1
    In = np.random.randint(0, 2, (Dims.In, Dims.Samples))
    return Data(Out, In)

train_set_x = np.array([

train_set_y = np.array([

Net1 = model(train_set_x, train_set_y, num_iterations = 20000, learning_rate = 0.001, print_cost = True, activation='sigmoid')

Net2 = model(train_set_x, train_set_y, num_iterations = 20000, learning_rate = 0.001, print_cost = True, activation='softmax')

Net3 = model(train_set_x, train_set_y, num_iterations = 20000, learning_rate = 0.001, print_cost = True, activation='hybrid')

Dims = Problem_Size(8, 100, 50)
data = create_data(Dims)
model(data.In.T, data.Out.T, num_iterations = 40000, learning_rate = 0.001, print_cost = True, activation='softmax') 
model(data.In.T, data.Out.T, num_iterations = 40000, learning_rate = 0.001, print_cost = True, activation='sigmoid') 


I'm new to the Julia programming language, and still learning it by writing code that I've already written in Python (or, at least, tried out in Python).

There is an article which explains how to make a very simple neural network:

I tried the code in this article out in Python, at it's working fine. However, I haven't used linear algebra things in Python before (like dot). Now I'm trying to translate this code to Julia, but there are some things I can't understand. Here is my Julia code:

using LinearAlgebra

synaptic_weights = [-0.16595599, 0.44064899, -0.99977125]::Vector{Float64}

sigmoid(x) = 1 / (1 + exp(-x))
sigmoid_derivative(x) = x * (1 -x)

function train(training_set_inputs, training_set_outputs, number_of_training_iterations)
    global synaptic_weights
    for (iteration) in 1:number_of_training_iterations
        output = think(training_set_inputs)

        error = training_set_outputs .- output

        adjustment = dot(transpose(training_set_inputs), error * sigmoid_derivative(output))

        synaptic_weights = synaptic_weights .+ adjustment

think(inputs) = sigmoid(dot(inputs, synaptic_weights))

println("Random starting synaptic weights:")

training_set_inputs = [0 0 1 ; 1 1 1 ; 1 0 1 ; 0 1 1]::Matrix{Int64}
training_set_outputs = [0, 1, 1, 0]::Vector{Int64}
train(training_set_inputs, training_set_outputs, 10000)

println("New synaptic weights after training:")

println("Considering new situation [1, 0, 0] -> ?:")
println(think([1 0 0]))

I've already tried to initialize vectors (like synaptic_weights) as:

synaptic_weights = [-0.16595599 ; 0.44064899 ; -0.99977125]

However, the code is not working. More exactly, there are 3 things that is not clear for me:

  1. Do I initialize vectors and matrixes in the right way (is it equal to what the original author does in Python)?
  2. In Python, the original author uses + and - operators where one operand is a vector and the other is a scalar. I'm not sure whether this means element-wise addition or subtraction in Python. For example, is (vector+scalar) in Python equal to (vector.+scalar) in Julia?
  3. When I try to run the Julia code above, I get the following error:

    ERROR: LoadError: DimensionMismatch("first array has length 12 which does not match the length of the second, 3.")
     [1] dot(::Array{Int64,2}, ::Array{Float64,1}) at C:\Users\julia\AppData\Local\Julia-1.0.3\share\julia\stdlib\v1.0\LinearAlgebra\src\generic.jl:702
     [2] think(::Array{Int64,2}) at C:\Users\Viktória\Documents\julia.jl:21
     [3] train(::Array{Int64,2}, ::Array{Int64,1}, ::Int64) at C:\Users\Viktória\Documents\julia.jl:11
     [4] top-level scope at none:0
    in expression starting at C:\Users\Viktória\Documents\julia.jl:28

    This error comes when the funtion think(inputs) tries to compute the dot product of inputs and synaptic_weights. In this case, inputs is a 4x3 matrix and synaptic weights is a 3x1 matrix (vector). I know that they can be multiplied, and the result will become a 4x1 matrix (vector). Doesn't this mean that they dot product can be computed?

    Anyway, that dot product can be computed in Python using the numpy package, so I guess there is a certain way that it can also be computed in Julia.

For the dot product, I also tried to make a function that takes a and b as arguments, and tries to compute their dot product: first, computes the product of a and b, then returns the sum of the result. I'm not sure whether it's a good solution, but the Julia code didn't produce the expected result when I used that function, so I removed it.

Can you help me with this code, please?


Here is the code adjusted to Julia:

sigmoid(x) = 1 / (1 + exp(-x))
sigmoid_derivative(x) = x * (1 -x)
think(synaptic_weights, inputs) = sigmoid.(inputs * synaptic_weights)

function train!(synaptic_weights, training_set_inputs, training_set_outputs,
    for iteration in 1:number_of_training_iterations
        output = think(synaptic_weights, training_set_inputs)
        error = training_set_outputs .- output
        adjustment =  transpose(training_set_inputs) * (error .* sigmoid_derivative.(output))
        synaptic_weights .+= adjustment

synaptic_weights = [-0.16595599, 0.44064899, -0.99977125]
println("Random starting synaptic weights:")

training_set_inputs = Float64[0 0 1 ; 1 1 1 ; 1 0 1 ; 0 1 1]
training_set_outputs = Float64[0, 1, 1, 0]
train!(synaptic_weights, training_set_inputs, training_set_outputs, 10000)

println("New synaptic weights after training:")

println("Considering new situation [1, 0, 0] -> ?:")
println(think(synaptic_weights, Float64[1 0 0]))

There are multiple changes so if some of them are not clear to you please ask and I will expand on them.

The most important things I have changed:

  • do not use global variables as they will significantly slow down the performance
  • make all arrays have Float64 element type
  • in several places you need to do broadcasting with . (e.g. sigmoid and sigmoid_derivative functions are defined in such a way that they expect to get a number as an argument, therefore when we call them . is added after their name to trigger broadcasting)
  • use standard matrix multiplication * instead of dot

The code runs around 30x faster than the original implementation in Python. I have not squeezed maximum performance out of this code (it currently does a lot of allocations which could be avoided), as that would require rewriting its logic a bit, and I guess you wanted a direct reimplementation.


I use it to implement neural networks. I prefer NumPy, because it is more convenient to prepare data with Python; however, I am concerned that NumPy is not as fast as C++ libraries.


NumPy is implemented in C. So most of the time you just call C and for some functionality optimized Fortran functions or subroutines. Therefore, you will get a decent speed with NumPy for many tasks. You need to vectorize your operations. Don't write for loops over NumPy arrays. Of course, hand-optimized C code can be faster. On the other hand, NumPy contains a lot of already optimized algorithms that might be faster than not so optimal C code written by less experienced C programmers.

You can gradually move from Python to C with Cython and/or use Numba for jit-compilation to machine or gpu code.


Suppose that x is a vector with shape (a,) and T is a tensor with shape (b, a, a). If I want to compute (x^T)Tx, I can do it using

For example:

x = np.array([1.,2.,3.,4.,5.])

w = np.array([[[1.,2.,3.,4.,5.],

But what if I want to decompose T into two tensors P and Q (a low-rank representation) with shapes (b,a,r) and (b,r,a), where r<<a, so each a*a matrix in T is decomposed into a*r and r*a, which greatly reduces the data? Then how do I compute (x^T)PQx with numpy?


Your example has problems.

x.shape  (5,)
w.shape  (2,3,5)
   ValueError: matrices are not aligned

But to use your description:

`x` `(a,)`,  `T` `(b,a,a)`; `(x^T)Tx`

I like to use einsum (Einstein summation) when thinking about complex products. I think your x'Tx is:

np.einsum('i,kij,j->k', x, T, x)

T decomposed into: P (b,a,r), Q (b,r,a);

np.einsum('kir,krj->kij', P,Q) == T

together the expressions are:

np.einsum('i,kir,krj,j->k', x, P, Q, x)

einsum isn't the best when the dimensions are large, since the combined iteration space of k,i,j,r may be large. Still it is a useful way to think about the problem.

I think it can be rewritten as 3 dots:

P1 = np.einsum('i,kir->kr', x, P)
Q1 = np.einsum('krj,j->kr', Q, x)
np.einsum('kr,kr->k', P1, Q1)

A sample calculation:

In [629]: a,b,r = 5,3,2
In [630]: x=np.arange(1.,a+1)
In [632]: P=np.arange(b*a*r).reshape(b,a,r)
In [633]: Q=np.arange(b*a*r).reshape(b,r,a)

In [635]: T=np.einsum('kir,krj->kij',P,Q)

In [636]: P
array([[[ 0,  1],
        [ 2,  3],
        [ 4,  5],
        [ 6,  7],
        [24, 25],
        [26, 27],
        [28, 29]]])

In [637]: Q
array([[[ 0,  1,  2,  3,  4],
        [ 5,  6,  7,  8,  9]],
       [[20, 21, 22, 23, 24],
        [25, 26, 27, 28, 29]]])

In [638]: T
array([[[   5,    6,    7,    8,    9],
        [  15,   20,   25,   30,   35],
        [  25,   34,   43,   52,   61],
        [  35,   48,   61,   74,   87],
        [1105, 1154, 1203, 1252, 1301],
        [1195, 1248, 1301, 1354, 1407],
        [1285, 1342, 1399, 1456, 1513]]])

In [639]: T.shape
Out[639]: (3, 5, 5)

In [640]: R1=np.einsum('i,kij,j->k',x,T,x)
In [642]: R1
Out[642]: array([  14125.,  108625.,  293125.])

In [643]: R2=np.einsum('i,kir,krj,j->k',x,P,Q,x)
In [644]: R2
Out[644]: array([  14125.,  108625.,  293125.])

In [645]: P1=np.einsum('i,kir->kr',x,P)
In [646]: Q1=np.einsum('krj,j->kr',Q,x)
In [647]: R3=np.einsum('kr,kr->k',P1,Q1)
In [648]: R3
Out[648]: array([  14125.,  108625.,  293125.])

In [649]: P1
array([[  80.,   95.],
       [ 230.,  245.],
       [ 380.,  395.]])

In [650]: Q1
array([[  40.,  115.],
       [ 190.,  265.],
       [ 340.,  415.]])

The last set of calculations can be done with `dot`:

In [656]:,P)
array([[  80.,   95.],
       [ 230.,  245.],
       [ 380.,  395.]])

In [657]:,x)
array([[  40.,  115.],
       [ 190.,  265.],
       [ 340.,  415.]])

In [658]:,P),,x).T)
array([[  14125.,   40375.,   66625.],
       [  37375.,  108625.,  179875.],
       [  60625.,  176875.,  293125.]])

But we want just the diagonal of the last dot. A simpler sum of products is better:

In [661]: (P1*Q1).sum(axis=1)
Out[661]: array([  14125.,  108625.,  293125.])