Hot questions on using MSE with neural networks


I have a generative adversarial network, where the discriminator is trained to minimize the MSE and the generator should maximize it, since both are opponents pursuing opposite goals.

generator = Sequential()
generator.add(Dense(units=50, activation='sigmoid', input_shape=(15,)))
generator.add(Dense(units=1, activation='sigmoid'))
generator.compile(loss='mse', optimizer='adam')

generator.train_on_batch(x_data, y_data)

What do I have to adapt to get a generator model that benefits from a high MSE value?



The original MSE implementation looks as follows:

def mean_squared_error(y_true, y_pred):
    if not K.is_tensor(y_pred):
        y_pred = K.constant(y_pred)
    y_true = K.cast(y_true, y_pred.dtype)
    return K.mean(K.square(y_pred - y_true), axis=-1)

I think the correct maximizing loss function is:

def mean_squared_error_max(y_true, y_pred):
    if not K.is_tensor(y_pred):
        y_pred = K.constant(y_pred)
    y_true = K.cast(y_true, y_pred.dtype)
    # note: this diverges as y_pred approaches y_true (division by a vanishing error)
    return K.mean(K.square(1 / (y_pred - y_true)), axis=-1)

This way we always get a positive loss value, as with the MSE function, but with the reverse effect.
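The reversed effect is easy to check numerically; a minimal sketch using numpy as a stand-in for the Keras backend ops, comparing both losses on predictions near and far from the target:

```python
import numpy as np

def mse(y_true, y_pred):
    # plain MSE: small when predictions are close to the target
    return np.mean(np.square(y_pred - y_true), axis=-1)

def mse_max(y_true, y_pred):
    # inverted variant: small when predictions are far from the target
    return np.mean(np.square(1.0 / (y_pred - y_true)), axis=-1)

y_true = np.array([[0.0, 0.0]])
close = np.array([[0.1, 0.1]])   # predictions near the target
far   = np.array([[2.0, 2.0]])   # predictions far from the target

print(mse(y_true, close)[0], mse(y_true, far)[0])          # ~0.01 vs 4.0
print(mse_max(y_true, close)[0], mse_max(y_true, far)[0])  # ~100.0 vs 0.25
```

Minimizing mse_max therefore pushes predictions away from the targets, which is the behaviour wanted for the generator.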

UPDATE 2: Initially I wrote that the intuitive first thought, simply negating the loss, would NOT give the expected result because of how the optimization methods work (you can read an interesting discussion here). After double-checking both methods head to head on a particular learning task (note: I didn't do an exhaustive test), both methods achieved loss maximization, though the -loss approach converged a bit faster. I am not sure whether it always gives the best solution, or any solution, because of the possible issue described here. If someone has other experience, please let me know.

So if somebody wants to give -loss a try too:

def mean_squared_error(y_true, y_pred):  # drop-in replacement: note the leading minus
    if not K.is_tensor(y_pred):
        y_pred = K.constant(y_pred)
    y_true = K.cast(y_true, y_pred.dtype)
    return -K.mean(K.square(y_pred - y_true), axis=-1)

Additional details:

OP wrote:

I have a generative adversarial network, where the discriminator is trained to minimize the MSE and the generator should maximize it, since both are opponents pursuing opposite goals.

From the link provided by Ibragil:

Meanwhile, the generator is creating new, synthetic images that it passes to the discriminator. It does so in the hopes that they, too, will be deemed authentic, even though they are fake. The goal of the generator is to generate passable hand-written digits: to lie without being caught. The goal of the discriminator is to identify images coming from the generator as fake.

So this is an ill-posed problem:

In a GAN our final goal is to train our two counterparties, the discriminator and the generator, to perform as well as possible against each other. That means the two base learning algorithms have different tasks, but the loss function with which they can achieve the optimal solution is the same, i.e. binary_crossentropy, so each model's task is to minimize this loss.

A discriminator model's compile method:

self.discriminator.compile(loss='binary_crossentropy', optimizer=optimizer)

A generator model's compile method:

self.generator.compile(loss='binary_crossentropy', optimizer=optimizer)

It is like two runners whose goal is to minimize their time to the finish line, even though they are competitors in the race.

So the "opposite goal" doesn't mean opposite task i.e. minimizing the loss (i.e. minimizing the time in the runner example).

I hope it helps.


I stumbled across the definition of mse in Keras and I can't seem to find an explanation.

def mean_squared_error(y_true, y_pred):
    return K.mean(K.square(y_pred - y_true), axis=-1)

I was expecting the mean to be taken across the batches, which is axis=0, but instead, it is axis=-1.

I also played around with it a little to see if K.mean actually behaves like numpy.mean. I must have misunderstood something. Can somebody please clarify?

I can't actually look inside the cost function at run time, right? As far as I know, the function is called at compile time, which prevents me from evaluating concrete values.

I mean... imagine doing regression and having a single output neuron and training with a batch size of ten.

>>> import numpy as np
>>> a = np.ones((10, 1))
>>> a
array([[ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.],
       [ 1.]])
>>> np.mean(a, axis=-1)
array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])

All it does is flatten the array instead of taking the mean of all the predictions.


K.mean(a, axis=-1) and also np.mean(a, axis=-1) is just taking the mean across the final dimension. Here a is an array with shape (10, 1) and in this case, taking the mean across the final dimension happens to be the same as flattening it to a 1d array of shape (10,). Implementing it like so supports the more general case of e.g. multiple linear regression.
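That behaviour can be confirmed with a multi-output example; a short numpy sketch (the shapes below are hypothetical):

```python
import numpy as np

# Batch of 4 samples, each with 3 outputs (e.g. multivariate regression).
y_true = np.zeros((4, 3))
y_pred = np.tile(np.array([1.0, 2.0, 3.0]), (4, 1))

# axis=-1 averages over the outputs, leaving one loss value per sample.
per_sample = np.mean(np.square(y_pred - y_true), axis=-1)
print(per_sample.shape)   # (4,)
print(per_sample)         # each entry is (1 + 4 + 9) / 3
print(per_sample.mean())  # the scalar mean Keras ultimately optimizes
```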

Also, you can inspect the value of nodes in the computation graph at run-time using keras.backend.print_tensor. See answer: Is there any way to debug a value inside a tensor while training on Keras?

Edit: Your question appears to be about why the loss doesn't return a single scalar value but instead returns a scalar value for each data point in the batch. To support sample weighting, Keras losses are expected to return a scalar for each data point in the batch. See the losses documentation and the sample_weight argument of fit for more information. Note specifically: "The actual optimized objective is the [weighted] mean of the output array across all data points."


Is it possible that the MSE increases during training?

I'm currently calculating the MSE of the validation set per epoch, and at a certain point the MSE starts to increase instead of decreasing. Does someone have an explanation for this behavior?


Answering your question: Yes, it is possible.

If you are using regularization or stochastic training, some ups and downs in the MSE during training are normal.

Some possible reasons for the problem:

  1. You are using a learning rate that is too high, which leads to overshooting the local minima of the cost function.

  2. The neural network is overfitting: training too much and losing its ability to generalize.

What you can try:

  1. When this starts to happen, reduce your learning rate.

  2. Apply some kind of regularization on your network, like dropout, to avoid overfitting.
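The learning-rate point can be reproduced on a toy problem; a sketch running plain gradient descent on f(w) = w² (the rates below are illustrative):

```python
# Gradient descent on f(w) = w**2, whose gradient is 2*w.
def final_loss(lr, steps=10, w=1.0):
    for _ in range(steps):
        w -= lr * 2 * w   # standard gradient-descent update
    return w ** 2         # loss after training

print(final_loss(lr=0.1))  # ~0.0115: a moderate rate converges toward 0
print(final_loss(lr=1.1))  # ~38.3: too high a rate overshoots and diverges
```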


I am training a model using Keras on a regression problem. When I investigate the loss and metrics during training, sometimes the mean absolute error (mae) decreases at the end of an epoch while the mean squared error (mse) increases. I set mae as the loss and mse as a metric.

Is this OK, or is there a problem with the setting? Thanks.


MSE and MAE are different metrics; a decrease in one does not imply a decrease in the other. Consider the following toy example for the size-2 output of a network with target value [0, 0]:

  • Timestep 1: Output: [2,2], MAE: 2, MSE: 4
  • Timestep 2: Output: [0,3], MAE: 1.5, MSE: 4.5

So MAE decreased while MSE increased. Given that you are optimizing for MAE and only monitoring MSE, your observation is perfectly fine and does not imply any problem.
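The toy example checks out in a couple of lines of numpy:

```python
import numpy as np

target = np.array([0.0, 0.0])
t1 = np.array([2.0, 2.0])   # timestep 1 output
t2 = np.array([0.0, 3.0])   # timestep 2 output

mae = lambda y: np.mean(np.abs(y - target))
mse = lambda y: np.mean(np.square(y - target))

print(mae(t1), mse(t1))  # 2.0 4.0
print(mae(t2), mse(t2))  # 1.5 4.5 -- MAE fell, MSE rose
```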


I use the following code:

import numpy as np
import math
import keras
from keras.models import Model, Sequential
from keras.layers import Input, Dense, Activation
from keras import regularizers
from keras import backend as K


def my_regularizer(inputs):
    # 'a' and 'means' are defined elsewhere in the original snippet
    return a * K.sum(means)**2

# ... (model definition, training producing `hist`, and prediction producing `y_pred` omitted) ...

print('MSE from Keras: ', hist.history['val_loss'][-1])
print('Calculated MSE: ', np.mean((y_pred - x_test)**2))

The output is:

MSE from Keras:  0.1555381715297699
Calculated MSE:  0.12031101597786406

If I remove activity_regularizer=my_regularizer, then they will be closer, but still different:

MSE from Keras:  0.09773887693881989
Calculated MSE:  0.09773887699599623


Well, the answer is clear: you have a regularizer. The role of the regularizer is to add a term to the loss function, so a larger reported loss is the expected behavior.

As for the other small difference, it's just precision: perhaps float32 vs float64, or GPU vs CPU calculations using different algorithms. I would not worry about that difference.
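The decomposition can be sketched by hand; the arrays and the 0.01 factor below are hypothetical, not taken from the question:

```python
import numpy as np

# The loss Keras reports is the data term plus the regularization term,
# so it is larger than the MSE recomputed from predictions alone.
y_true = np.array([0.0, 1.0, 2.0])
y_pred = np.array([0.1, 0.9, 2.2])
activations = np.array([0.5, -0.3, 0.8])   # hypothetical layer activations

mse = np.mean((y_pred - y_true) ** 2)   # the "Calculated MSE" part
reg = 0.01 * np.sum(activations ** 2)   # hypothetical activity-regularizer term
reported_loss = mse + reg               # what would show up as 'loss'

print(mse, reg, reported_loss)  # reported_loss > mse
```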


I am a Python beginner, so this may be more obvious than I think. I'm using Matplotlib to graphically present my predicted data vs. the actual data from a neural network. I am able to calculate r-squared and plot my data, but now I want to display the value on the graph itself, since it changes with every new run.

My NN uses at least 4 different inputs, and gives one output. This is my end code for that:

y_predicted = model.predict(X_test)

This is how I calculate R2:

# Using sklearn
from sklearn.metrics import r2_score
print(r2_score(y_test, y_predicted))

and this is my graph:

from sklearn.linear_model import LinearRegression  # needed for the regression line below

fig, ax = plt.subplots()
ax.scatter(y_test, y_predicted)
ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=4)
# regression line
y_test, y_predicted = y_test.reshape(-1, 1), y_predicted.reshape(-1, 1)
ax.plot(y_test, LinearRegression().fit(y_test, y_predicted).predict(y_test))

It gives something like the attached graph, and R2 varies every time I change the epochs, the number of layers, the type of data, etc. The red line is my regression line, which I will label later. Since R2 comes from a function, I can't simply hard-code it into the legend or a text call.

I would also like to display MSE.

Can anyone help me out?



If I understand correctly, you want to show R2 in the graph. You can add it to the graph title:

ax.set_title('R2: ' + str(r2_score(y_test, y_predicted)))
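To also show the MSE, and to control placement, you can build one label string and draw it with ax.text; a sketch with hypothetical stand-in arrays, computing both metrics by hand so it works without sklearn:

```python
import numpy as np

# Hypothetical arrays standing in for the OP's y_test / y_predicted.
y_test = np.array([1.0, 2.0, 3.0])
y_predicted = np.array([1.1, 1.9, 3.2])

mse = np.mean((y_test - y_predicted) ** 2)
r2 = 1 - np.sum((y_test - y_predicted) ** 2) / np.sum((y_test - y_test.mean()) ** 2)

label = 'R2: {:.3f}  MSE: {:.3f}'.format(r2, mse)
print(label)  # R2: 0.970  MSE: 0.020

# In the plotting code above, either:
# ax.set_title(label)
# or draw it inside the axes:
# ax.text(0.05, 0.95, label, transform=ax.transAxes, va='top')
```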



I am very new to TensorFlow. I notice that there is tf.losses.mean_squared_error, which implements the mean squared error loss function.

Before using it, I played around with TF and I wrote

tf.reduce_mean(tf.reduce_sum(tf.square(tf.subtract(y, y_))))

However, this gives different results. To me it looks like the same formula. What is going wrong?

Are the two formulations different? (and what about tf.nn.l2_loss?)

Also, I am building an MLP and using an MSE loss as input to tf.train.GradientDescentOptimizer(0.5).minimize(mse). Can this function (mse = tf.losses.mean_squared_error(y, y_)) also be used (in a regression problem) as "accuracy" on the test set, via, feed_dict={x: X_test, y: y_test})? Or what is the difference?


It is because you sum before taking the mean, so you get the sum of squared errors rather than their mean. Change tf.reduce_mean(tf.reduce_sum(tf.square(tf.subtract(y, y_)))) to tf.reduce_mean(tf.square(tf.subtract(y, y_))).

import tensorflow as tf
import numpy as np

y = tf.get_variable("y", [1, 5])
x = tf.get_variable("x", [1, 5])
sess = tf.Session()
feed = {x: np.ones((1, 5)), y: np.zeros((1, 5))}

t = tf.reduce_mean(tf.reduce_sum(tf.square(tf.subtract(y, x))))
t2 = tf.losses.mean_squared_error(x, y)
t3 = tf.reduce_mean(tf.square(tf.subtract(y, x)))

print(, feed))   # 5 -- sum of squared errors
print(, feed))  # 1 -- mean squared error
print(, feed))  # 1 -- mean squared error
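The same relationship can be checked without TensorFlow; a quick numpy sketch:

```python
import numpy as np

y  = np.zeros((1, 5))
y_ = np.ones((1, 5))

sq = np.square(y - y_)
print(sq.sum())           # 5.0 -- sum of squared errors (reduce_sum)
print(sq.mean())          # 1.0 -- mean squared error
# reduce_mean applied to the scalar sum just returns the sum unchanged:
print(np.mean(sq.sum()))  # 5.0
```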


Does anyone have an idea why 'dropout_rate' and 'learning_rate' return only 0, and the search does not cover the ranges I gave, when I run RandomizedSearchCV on the hyperparameters?

Here is my code for an ANN using keras/tensorflow:

# Create the model

def create_model(neurons=1, init_mode='uniform', activation='relu', inputDim=8792,
                 dropout_rate=0.7, learn_rate=0.01, momentum=0, weight_constraint=0):
    model = Sequential()
    model.add(Dense(neurons, input_dim=inputDim, kernel_initializer=init_mode,
                    activation=activation, kernel_constraint=maxnorm(weight_constraint),
                    kernel_regularizer=regularizers.l2(0.001)))  # one inner layer
    # model.add(Dense(neurons, input_dim=inputDim, activation=activation))  # second inner layer
    model.add(Dense(1, activation='sigmoid'))
    optimizer = RMSprop(lr=learn_rate)
    # compile model
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

# model

model = KerasClassifier(build_fn=create_model, verbose=0)

# Define K-fold cross validation test harness

kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
for train, test in kfold.split(X_train, Y_train):
    print("TRAIN:", train, "VALIDATION:", test)

# Define Hyperparameters

# specify parameters and distributions to sample from
from scipy.stats import randint as sp_randint

param_dist = {'neurons': sp_randint(300, 360),
              'learn_rate': sp_randint(0.001, 0.01),
              'batch_size': sp_randint(50, 60),
              'epochs': sp_randint(20, 30),
              'dropout_rate': sp_randint(0.2, 0.8),
              'weight_constraint': sp_randint(3, 8)}
# run randomized search
n_iter_search = 100

print("[INFO] Starting training digits")
print("[INFO] Tuning hyper-parameters for accuracy")
grid = RandomizedSearchCV(estimator=model, param_distributions=param_dist,
                                   n_iter=n_iter_search, n_jobs=10, cv=kfold)
start = time.time()
grid_result =, Y_train)
print("[INFO] GridSearch took {:.2f} seconds".format(time.time() - start))

My output:

[INFO] GridSearch took 1164.39 seconds
[INFO] GridSearch best score 1.000000 using parameters: {'batch_size': 54, 'dropout_rate': 0, 'epochs': 20, 'learn_rate': 0, 'neurons': 331, 'weight_constraint': 7}
[INFO] Grid scores on development set:
0.614679 (0.034327) with: {'batch_size': 54, 'dropout_rate': 0, 'epochs': 29, 'learn_rate': 0, 'neurons': 354, 'weight_constraint': 6}
0.883792 (0.008650) with: {'batch_size': 53, 'dropout_rate': 0, 'epochs': 27, 'learn_rate': 0, 'neurons': 339, 'weight_constraint': 7}
0.256881 (0.012974) with: {'batch_size': 59, 'dropout_rate': 0, 'epochs': 27, 'learn_rate': 0, 'neurons': 308, 'weight_constraint': 4}

Thanks for helping.


0.2 and 0.8 are not integers, so when you use sp_randint(0.2, 0.8) they are truncated to integers, making it the same as sp_randint(0, 0). You have to use an equivalent function that generates floating-point numbers, not integers.

For example, you can use a uniform distribution (uniform from scipy.stats) to generate real numbers.
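A sketch of the corrected distribution dict, reusing the OP's ranges (uniform(loc, scale) samples real numbers from [loc, loc + scale)):

```python
from scipy.stats import randint as sp_randint, uniform

param_dist = {
    'neurons': sp_randint(300, 360),       # integers: randint is appropriate
    'learn_rate': uniform(0.001, 0.009),   # real numbers in [0.001, 0.01)
    'dropout_rate': uniform(0.2, 0.6),     # real numbers in [0.2, 0.8)
}

print(param_dist['dropout_rate'].rvs(5))  # five floats between 0.2 and 0.8
```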