## Hot questions for Using Neural networks in mathematical optimization

Question:

As the title states, I am aiming to train a neural network whose weights are complex numbers. Using the default scikit-learn networks as a starting point (editing the source code), the main problem I have encountered is that the optimization functions scikit-learn takes from scipy only support numerical optimization of functions whose inputs are real numbers.

Scikit-learn seems rather poor for neural networks, especially if you wish to fork and edit it; the structure is rather inflexible.

As I have noticed, and read in a paper here, I need to change things such as the error function to ensure that at the top level the error remains in the domain of real numbers, or the problem becomes ill-defined.

My question here is: are there any standard libraries that already do this, or any easy tweaks I could make to Lasagne or TensorFlow to save my life?

P.S.: Sorry for not posting any working code. It is a difficult question to format to the Stack Overflow standards, and I admit it may be off topic, in which case I apologize.

Answer:

The easiest way to do this is to split each feature into its real and imaginary components. I've done similar work with vector input from a Leap Motion, and it significantly simplifies things to divide vectors into their component axes.
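A minimal sketch of that splitting with NumPy (the feature matrix here is made up for illustration): each complex column becomes two real columns, one for the real part and one for the imaginary part, which any standard real-valued library can then consume.

```python
import numpy as np

# Hypothetical complex-valued feature matrix: 4 samples, 3 complex features.
X_complex = np.array([[1 + 2j, 0 - 1j, 3 + 0j],
                      [2 + 1j, 1 + 1j, 0 + 2j],
                      [0 + 0j, 2 - 2j, 1 + 1j],
                      [1 - 1j, 0 + 3j, 2 + 0j]])

# Stack real and imaginary parts side by side: 3 complex features
# become 6 real-valued features usable by any real-input optimizer.
X_real = np.hstack([X_complex.real, X_complex.imag])

print(X_real.shape)  # (4, 6)
```

The same idea applies to the targets if they are complex as well, which also keeps the error function real-valued at the top level.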

Question:

How do Hessian-free (HF) optimization techniques compare against gradient descent techniques (e.g. stochastic gradient descent (SGD), batch gradient descent, adaptive gradient descent) for training deep neural networks (DNNs)?

Under what circumstances should one prefer HF techniques over gradient descent techniques?

Answer:

I think knowing the difference helps one decide when and where to use each method. I will try to shed some light on the concepts.

Gradient descent is a first-order optimization method, and has been the standard for training neural networks, since second-order methods, such as Newton's method, are computationally infeasible. However, second-order methods show much better convergence characteristics than first-order methods, because they also take the curvature of the error surface into account.
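As a toy illustration of the first-order update, here is plain gradient descent on a one-dimensional quadratic (the function and learning rate are made up for the example): the update uses only the gradient, with no curvature information.

```python
# Minimize f(w) = (w - 3)^2 with plain gradient descent.
# The gradient is f'(w) = 2 * (w - 3).
w = 0.0
lr = 0.1  # the step-size parameter that first-order methods need tuned
for _ in range(100):
    grad = 2 * (w - 3)
    w -= lr * grad  # first-order update: step against the gradient

print(round(w, 4))  # converges toward the minimum at w = 3
```

Newton's method would instead divide the gradient by the second derivative, which is exactly the curvature information the text refers to.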

Additionally, first-order methods require a lot of tuning of the step-size parameter, which is application-specific. They also have a tendency to get trapped in local optima and exhibit slow convergence.

The reason for the infeasibility of Newton's method is the computation of the Hessian matrix, which takes prohibitively long. To overcome this issue, "Hessian-free" learning was proposed, in which one can use Newton-style updates without explicitly computing the Hessian matrix.
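The key trick behind the "Hessian-free" name is that these methods only ever need Hessian-vector products, which can be obtained without forming the Hessian. A rough sketch on a toy quadratic, using a finite-difference approximation of the gradient (the function here is an assumed example; practical implementations use exact Hessian-vector products via automatic differentiation):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])

def grad(w):
    # Gradient of the toy quadratic f(w) = 0.5 * w^T A w, whose Hessian is A.
    return A @ w

def hessian_vector_product(grad_fn, w, v, eps=1e-5):
    # H v is approximated as (grad(w + eps*v) - grad(w)) / eps.
    # Cost: two gradient evaluations -- the full Hessian is never formed.
    return (grad_fn(w + eps * v) - grad_fn(w)) / eps

w = np.array([1.0, -1.0])
v = np.array([1.0, 0.0])
print(hessian_vector_product(grad, w, v))  # ≈ A @ v, i.e. [2, 1]
```

HF methods feed such products into an inner conjugate-gradient solve, so each outer step approximates a Newton step at the cost of a handful of gradient evaluations.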

I don't want to go into more detail, but as far as I know, for deep networks it is highly recommended either to use HF optimization (there are many improvements over the original HF approach as well), since it takes much less training time, or to use SGD with momentum.

Question:

Unlike linear and logistic regression, an ANN's cost function is not convex, and is thus susceptible to local optima. Can anyone provide an intuition as to why this is the case for ANNs, and why the hypothesis cannot be modified to produce a convex function?

Answer:

I found a sufficient explanation here:

https://stats.stackexchange.com/questions/106334/cost-function-of-neural-network-is-non-convex

Basically, since hidden units within a layer can be permuted (along with their weights) without changing the network's output, any minimum corresponds to multiple distinct weight settings that achieve the same result, and thus the cost function cannot be convex (or concave, either).
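This permutation symmetry is easy to demonstrate on a tiny one-hidden-layer network (the weights below are random, for illustration only): reordering the hidden units, i.e. permuting the rows of the first weight matrix and the columns of the second consistently, leaves the output unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny one-hidden-layer network: y = W2 @ tanh(W1 @ x).
W1 = rng.normal(size=(4, 3))  # input (3) -> hidden (4)
W2 = rng.normal(size=(2, 4))  # hidden (4) -> output (2)
x = rng.normal(size=3)

def forward(W1, W2, x):
    return W2 @ np.tanh(W1 @ x)

# Permute the hidden units: reorder rows of W1 and columns of W2 together.
perm = [2, 0, 3, 1]
W1_p = W1[perm, :]
W2_p = W2[:, perm]

# Both weight settings compute the exact same function, so the cost surface
# has multiple distinct points with identical loss -- it cannot be convex.
print(np.allclose(forward(W1, W2, x), forward(W1_p, W2_p, x)))  # True
```

A convex function has a single connected set of minimizers, so the existence of these separated, equally good weight settings already rules out convexity.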