## Hot questions on using neural networks in calculus

Question:

Are there any known approaches of making a machine learn calculus?

I've learnt that teaching a machine to calculate derivatives is quite simple, because it is possible to implement it directly as an algorithm.

Meanwhile, symbolic integration is possible in principle, but is rarely (if ever) fully implemented because of its algorithmic complexity.

I am curious whether there are any academic successes in using machine learning to evaluate and calculate integrals.

##### Edit

I am interested in teaching a computer to integrate using neural networks or similar methods.

My personal opinion is that it is not possible to feed a neural network enough rules for integrating. Why? Because neural networks are good at linear regression (i.e. approximation) or logistic regression (i.e. classification), and integration is neither. It is a calculation task that follows strict algorithms, so from this perspective it seems better to use symbolic mathematical methods to integrate.

Question:

I am trying to build a mathematical-operation selection NN model that works on a scalar input. The operation is selected based on the softmax result produced by the NN, and the selected operation is then applied to the scalar input to produce the final output. So far I've come up with applying argmax and one-hot encoding to the softmax output to produce a mask, which is then applied to the concatenated values matrix of all the possible operations (as shown in the pseudocode below). The issue is that neither argmax nor one-hot appears to be differentiable. I am new to this, so any help would be highly appreciated. Thanks in advance.

```
# perform softmax over the operation logits
logits  = tf.matmul(current_input, W) + b
softmax = tf.nn.softmax(logits)

# perform all possible operations on the input
op_1_val = tf_op_1(current_input)
op_2_val = tf_op_2(current_input)
op_3_val = tf_op_3(current_input)
values = tf.concat([op_1_val, op_2_val, op_3_val], 1)

# pick the operation with the highest softmax score
argmax = tf.argmax(softmax, 1)

# produce the output by masking out those operation results
# which have not been selected (this is the non-differentiable part)
mask   = tf.one_hot(argmax, depth=3)
output = tf.reduce_sum(values * mask, axis=1)
```

I believe that this is not possible directly. This is similar to the hard attention described in this paper. Hard attention is used in image captioning to let the model focus on only a certain part of the image at each step. Hard attention is not differentiable, but there are two ways to get around this:

1- Use Reinforcement Learning (RL): RL is designed to train models that make decisions. Even though the loss function won't back-propagate any gradients to the softmax used for the decision, you can use RL techniques to optimize it. As a simplified example, you can treat the loss as a penalty, and send the node with the maximum value in the softmax layer a policy gradient proportional to the penalty, in order to decrease the score of the decision if it was bad (i.e. resulted in a high loss).
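
As a toy illustration of that idea, here is a minimal NumPy sketch of a score-function (REINFORCE-style) update that learns to select the operation with the lowest penalty. The penalties, learning rate, and number of steps are all made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# hypothetical average loss (penalty) incurred by each of 3 operations
penalties = np.array([3.0, 0.5, 2.0])
theta = np.zeros(3)  # logits of the operation selector

for _ in range(2000):
    p = softmax(theta)
    a = rng.choice(3, p=p)                       # sample a decision
    reward = -(penalties[a] - penalties.mean())  # baseline-subtracted reward
    grad = -p
    grad[a] += 1.0                               # gradient of log p(a) w.r.t. theta
    theta += 0.1 * reward * grad                 # REINFORCE update

print(int(np.argmax(theta)))  # the selector learns to prefer op 1 (lowest penalty)
```

The baseline subtraction is just a standard variance-reduction trick; without it the updates still work in expectation but converge more noisily.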

2- Use something like soft attention: instead of picking only one operation, mix them with weights based on the softmax. So instead of:

```
output = values * mask
```

Use:

```
output = values * softmax
```

Now the non-selected operations are scaled toward zero according to how little weight the softmax gives them. This is easier to train than RL, but it won't work if you must completely remove the non-selected operations from the final result (i.e. set them to exactly zero).
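
As a sanity check of the soft variant, here is a tiny NumPy sketch (the three candidate operations and logits are made up for the example): the output is a differentiable softmax-weighted mixture of all operation results, dominated by the operation the selector prefers:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = 2.0
# hypothetical candidate operations: square, negate, identity
values = np.array([x**2, -x, x])    # results of all operations
logits = np.array([5.0, 0.0, 0.0])  # selector strongly prefers op 0
weights = softmax(logits)           # soft "mask": every entry is non-zero
output = float(values @ weights)    # close to x**2, but still a mixture
```

Because every weight stays strictly positive, gradients flow to all operation branches and to the selector, which is exactly what the hard argmax/one-hot mask prevents.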

Here is another answer that discusses hard and soft attention, which you may find helpful: https://stackoverflow.com/a/35852153/6938290

Question:

I have data pairs (x,y) which are created by a cubic function

```
y = g(x) = ax^3 − bx^2 − cx + d
```

plus some random noise. Now, I want to fit a model (parameters a,b,c,d) to this data using gradient descent.

My implementation:

```
import numpy as np

param = {}
param["a"] = 0.02
param["b"] = 0.001
param["c"] = 0.002
param["d"] = -0.04

def model(param, x, y, derivative=False):
    x2 = np.power(x, 2)
    x3 = np.power(x, 3)
    y_hat = param["a"]*x3 + param["b"]*x2 + param["c"]*x + param["d"]
    if not derivative:
        return y_hat
    derv = {}  # derivatives of the cost function w.r.t. the parameters
    m = len(y_hat)
    derv["a"] = (2/m)*np.sum((y_hat-y)*x3)
    derv["b"] = (2/m)*np.sum((y_hat-y)*x2)
    derv["c"] = (2/m)*np.sum((y_hat-y)*x)
    derv["d"] = (2/m)*np.sum(y_hat-y)
    return derv

def cost(y_hat, y):
    assert len(y) == len(y_hat)
    return np.sum(np.power(y_hat-y, 2))/len(y)

def optimizer(param, x, y, lr=0.01, epochs=100):
    for i in range(epochs):
        y_hat = model(param, x, y)
        derv = model(param, x, y, derivative=True)
        param["a"] = param["a"] - lr*derv["a"]
        param["b"] = param["b"] - lr*derv["b"]
        param["c"] = param["c"] - lr*derv["c"]
        param["d"] = param["d"] - lr*derv["d"]
        if i % 10 == 0:
            #print(y, y_hat)
            #print(param, derv)
            print(cost(y_hat, y))

# x, y are the data points loaded from the pastebin link below
X = np.array(x)
Y = np.array(y)
optimizer(param, X, Y, 0.01, 100)
```

When run, the cost seems to be increasing:

```36.140028646153525
181.88127675295928
2045.7925570171055
24964.787906199843
306448.81623701524
3763271.7837247783
46215271.5069297
567552820.2134454
6969909237.010273
85594914704.25394
```

Did I compute the gradients wrong? I don't know why the cost is exploding.

Here is the data: https://pastebin.com/raw/1VqKazUV.

If I run your code with e.g. `lr=1e-4`, the cost decreases. The gradients themselves are correct.
Check your gradients (just print the result of `model(..., derivative=True)`); you will see that they are quite large. Since your learning rate is not small enough, you are likely oscillating away from the minimum (see any ML textbook for example plots of this; you should also be able to see it if you print your parameters after every iteration).
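
To illustrate the point about the step size, here is a self-contained sketch using synthetic cubic data (the coefficients and noise are made up here, since the pastebin data is not reproduced): the same gradient-descent update diverges with a large learning rate and converges with a small one:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = 0.5*x**3 - 0.2*x**2 - 0.1*x + 1.0 + rng.normal(0, 0.1, 200)

def fit(lr, epochs=50):
    # gradient descent on the mean-squared error of a cubic model
    X = np.stack([x**3, x**2, x, np.ones_like(x)], axis=1)
    p = np.zeros(4)                       # parameters [a, b, c, d]
    costs = []
    for _ in range(epochs):
        r = X @ p - y                     # residuals y_hat - y
        p -= lr * (2/len(y)) * (X.T @ r)  # same update rule as in the question
        costs.append(float(np.mean(r**2)))
    return costs

print(fit(5e-2)[-1] > fit(5e-2)[0])  # large step size: cost explodes
print(fit(1e-4)[-1] < fit(1e-4)[0])  # small step size: cost decreases
```

The cubic features `x**3` make the gradients large in magnitude, which is why a learning rate that looks harmless for linear features already overshoots here.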