## Hot questions for Using Neural networks in calculus

Question:

Are there any known approaches of making a machine learn calculus?

I've learnt that teaching a machine to compute derivatives is fairly simple, because differentiation can be implemented as an algorithm.

Integration can also be implemented, but it is rarely or never implemented in full because of its algorithmic complexity.

I am curious whether there has been any academic success in using machine learning to evaluate and calculate integrals.

##### Edit

I am interested in teaching a computer to integrate **using neural networks or similar methods**.

Answer:

In my personal opinion, it is not possible to feed enough integration rules into a neural network. Why? Because neural networks are good at linear regression (AKA approximation) or logistic regression (AKA classification). Integration is neither of these; it is a calculation carried out according to strict algorithms. From this perspective, it is a better idea to use mathematical methods to integrate.
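As one illustration of such a mathematical method, here is a minimal sketch of composite Simpson's rule, a standard numerical quadrature technique. It assumes that a numerical value of a definite integral is what is needed (it does not produce a symbolic antiderivative); the function name `simpson` and its parameters are chosen here purely for illustration.

```python
import math


def simpson(f, a, b, n=100):
    """Approximate the definite integral of f over [a, b] with composite Simpson's rule."""
    if n <= 0 or n % 2 != 0:
        raise ValueError("n must be a positive even integer")
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        # Odd interior nodes get weight 4, even interior nodes get weight 2.
        total += (4 if i % 2 == 1 else 2) * f(a + i * h)
    return total * h / 3


# Two integrals with known exact values, used only as a sanity check.
area_parabola = simpson(lambda x: x * x, 0.0, 3.0)  # exact value is 9
area_sine = simpson(math.sin, 0.0, math.pi)         # exact value is 2
```

No training data is involved: the accuracy is controlled by the number of subintervals `n` rather than by a learned model.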

Question:

I am trying to build a neural network model that selects a mathematical operation based on a scalar input. The operation is selected according to the softmax output produced by the network, and it then has to be applied to the scalar input to produce the final output. So far I've come up with applying argmax and one-hot to the softmax output to produce a mask, which is then applied to the matrix of concatenated values from all the possible operations (as shown in the pseudocode below). The issue is that neither argmax nor one-hot appears to be differentiable. I am new to this, so any help would be highly appreciated. Thanks in advance.

    # perform softmax
    logits = tf.matmul(current_input, W) + b
    softmax = tf.nn.softmax(logits)

    # perform all possible operations on the input
    op_1_val = tf_op_1(current_input)
    op_2_val = tf_op_2(current_input)
    op_3_val = tf_op_3(current_input)
    values = tf.concat([op_1_val, op_2_val, op_3_val], 1)

    # create a mask
    argmax = tf.argmax(softmax, 1)
    mask = tf.one_hot(argmax, num_of_operations)

    # produce the output by masking out the operation results that were not selected
    output = values * mask

Answer:

I believe that this is not possible. This is similar to Hard Attention described in this paper. Hard attention is used in image captioning to allow the model to focus on only a certain part of the image at each step. Hard attention is not differentiable, but there are two ways to get around this:

1- Use Reinforcement Learning (RL): RL is designed to train models that make decisions. Even though the loss function won't back-propagate any gradients to the softmax used for the decision, you can use RL techniques to optimize the decision. As a simplified example, you can treat the loss as a penalty and send the node with the maximum value in the softmax layer a policy gradient proportional to that penalty, so that the score of the decision decreases if it was bad (i.e. it resulted in a high loss).

2- Use something like soft attention: instead of picking only one operation, mix them with weights based on the softmax. So instead of:

    output = values * mask

use:

    output = values * softmax

Now each operation's result is scaled toward zero according to how strongly the softmax does **not** select it. This is easier to train than RL, but it won't work if you must remove the non-selected operations from the final result entirely (set them to exactly zero).
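To make the difference concrete, here is a minimal plain-Python sketch (not TensorFlow) that compares the gradient with respect to the logits for hard and soft selection, using finite differences. The three operations in `OPS`, the example logits, and the choice to sum the weighted values into a single scalar are assumptions made for illustration only.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidate operations applied to the scalar input.
OPS = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: x * x]


def hard_select(logits, x):
    """argmax + one-hot mask: only the winning operation contributes."""
    probs = softmax(logits)
    k = probs.index(max(probs))
    return OPS[k](x)


def soft_select(logits, x):
    """Soft mixing: every operation contributes, weighted by the softmax."""
    probs = softmax(logits)
    return sum(p * op(x) for p, op in zip(probs, OPS))


def numeric_grad(f, logits, x, eps=1e-6):
    """Central finite-difference gradient of f with respect to each logit."""
    grads = []
    for i in range(len(logits)):
        up = list(logits)
        down = list(logits)
        up[i] += eps
        down[i] -= eps
        grads.append((f(up, x) - f(down, x)) / (2 * eps))
    return grads


logits = [0.2, 1.0, -0.5]
x = 3.0
hard_grad = numeric_grad(hard_select, logits, x)  # zero: small changes never flip the argmax
soft_grad = numeric_grad(soft_select, logits, x)  # non-zero: the logits can be trained
```

With hard selection, a small change in the logits leaves the chosen operation unchanged, so no learning signal reaches `W` and `b`; with soft mixing, every logit receives a gradient.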

This is another answer that talks about Hard and Soft attention that you may find helpful: https://stackoverflow.com/a/35852153/6938290
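The reinforcement-learning route (option 1 in the answer above) can be sketched in the same spirit. The following is a minimal score-function (REINFORCE-style) update on a toy problem, not a full RL setup. It assumes that the operation is sampled from the softmax rather than always taken as the argmax, that the per-operation penalties in `losses` are fixed hypothetical numbers, and that no baseline is used.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce(losses, steps=2000, lr=0.1, seed=0):
    """Learn selection logits from penalties alone, without back-propagating through the choice."""
    rng = random.Random(seed)
    logits = [0.0] * len(losses)
    for _ in range(steps):
        probs = softmax(logits)
        # Make a hard decision by sampling one operation.
        k = rng.choices(range(len(probs)), weights=probs)[0]
        penalty = losses[k]
        for i in range(len(logits)):
            # Gradient of log(probs[k]) with respect to logits[i].
            grad_log_prob = (1.0 if i == k else 0.0) - probs[i]
            # Lower the score of the sampled decision in proportion to its penalty.
            logits[i] -= lr * penalty * grad_log_prob
    return softmax(logits)


# Hypothetical penalties: the operation at index 1 is the best choice.
final_probs = reinforce(losses=[1.0, 0.0, 2.0])
```

In this toy setting the probability mass should drift toward the operation with the lowest penalty, even though the decision itself is never differentiated.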

Question:

I have data pairs (x,y) which are created by a cubic function

y = g(x) = ax^3 − bx^2 − cx + d

plus some random noise. Now, I want to fit a model (parameters a,b,c,d) to this data using gradient descent.

My implementation:

    import numpy as np

    # initial parameter guesses
    param = {}
    param["a"] = 0.02
    param["b"] = 0.001
    param["c"] = 0.002
    param["d"] = -0.04

    def model(param, x, y, derivative=False):
        x2 = np.power(x, 2)
        x3 = np.power(x, 3)
        y_hat = param["a"]*x3 + param["b"]*x2 + param["c"]*x + param["d"]
        if derivative == False:
            return y_hat
        derv = {}  # derivatives of the cost function w.r.t. the parameters
        m = len(y_hat)
        derv["a"] = (2/m)*np.sum((y_hat-y)*x3)
        derv["b"] = (2/m)*np.sum((y_hat-y)*x2)
        derv["c"] = (2/m)*np.sum((y_hat-y)*x)
        derv["d"] = (2/m)*np.sum((y_hat-y))
        return derv

    def cost(y_hat, y):
        assert(len(y) == len(y_hat))
        return (np.sum(np.power(y_hat-y, 2)))/len(y)

    def optimizer(param, x, y, lr=0.01, epochs=100):
        for i in range(epochs):
            y_hat = model(param, x, y)
            derv = model(param, x, y, derivative=True)
            param["a"] = param["a"] - lr*derv["a"]
            param["b"] = param["b"] - lr*derv["b"]
            param["c"] = param["c"] - lr*derv["c"]
            param["d"] = param["d"] - lr*derv["d"]
            if i % 10 == 0:
                #print (y,y_hat)
                #print(param,derv)
                print(cost(y_hat, y))

    # x and y are the lists of values loaded from the data linked below
    X = np.array(x)
    Y = np.array(y)
    optimizer(param, X, Y, 0.01, 100)

When run, the cost seems to be increasing:

    36.140028646153525
    181.88127675295928
    2045.7925570171055
    24964.787906199843
    306448.81623701524
    3763271.7837247783
    46215271.5069297
    567552820.2134454
    6969909237.010273
    85594914704.25394

Did I compute the gradients wrong? I don't know why the cost is exploding.

Here is the data: https://pastebin.com/raw/1VqKazUV.

Answer:

If I run your code with e.g. `lr=1e-4`, the cost decreases.

Check your gradients (just print the result of `model(..., True)`): you will see that they are quite large. As your learning rate is also not too small, you are likely oscillating away from the minimum (see any ML textbook for example plots of this; you should also be able to see this if you just print your parameters after every iteration).
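The effect can be reproduced without the original data. The sketch below is a plain-Python re-implementation of the same gradient descent on synthetic data; the `x` range of [-5, 5], the "true" coefficients, and the noise level are assumptions made for illustration and are not the data from the question. Only the initial parameters and the two learning rates are taken from the thread.

```python
import random


def predict(p, x):
    a, b, c, d = p
    return a * x**3 + b * x**2 + c * x + d


def mse(p, xs, ys):
    total = 0.0
    for x, y in zip(xs, ys):
        err = predict(p, x) - y
        total += err * err
    return total / len(xs)


def grad(p, xs, ys):
    """Gradient of the mean squared error with respect to (a, b, c, d)."""
    m = len(xs)
    g = [0.0, 0.0, 0.0, 0.0]
    for x, y in zip(xs, ys):
        err = predict(p, x) - y
        for k, feat in enumerate((x**3, x**2, x, 1.0)):
            g[k] += 2.0 * err * feat / m
    return g


def descend(xs, ys, lr, epochs):
    """Run gradient descent and return the cost after every step."""
    p = [0.02, 0.001, 0.002, -0.04]  # initial parameters from the question
    history = [mse(p, xs, ys)]
    for _ in range(epochs):
        g = grad(p, xs, ys)
        p = [pk - lr * gk for pk, gk in zip(p, g)]
        history.append(mse(p, xs, ys))
    return history


# Synthetic data (hypothetical coefficients and noise).
rng = random.Random(0)
xs = [-5.0 + 0.2 * i for i in range(51)]
ys = [predict((0.5, -1.0, 2.0, 3.0), x) + rng.gauss(0.0, 1.0) for x in xs]

cost_large_lr = descend(xs, ys, lr=0.01, epochs=20)   # step too large for x**3 features
cost_small_lr = descend(xs, ys, lr=1e-4, epochs=100)  # learning rate suggested in the answer
```

Because the `x**3` feature reaches magnitudes above 100 on this range, the gradient for `a` is large, and a step of `0.01` overshoots the minimum by more on every iteration. Rescaling the inputs before fitting is another common way to keep the gradients manageable.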