Numerically stable softmax


Is there a numerically stable way to compute the softmax function below? I am getting values that become NaNs in my neural network code.


The softmax exp(x)/sum(exp(x)) is actually numerically well-behaved. It has only positive terms, so we needn't worry about loss of significance, and the denominator is at least as large as the numerator, so the result is guaranteed to fall between 0 and 1.

The only accident that might happen is overflow or underflow in the exponentials. Overflow of a single element, or underflow of all elements, of exp(x) will render the output more or less useless.

But it is easy to guard against that by using the identity softmax(x) = softmax(x + c), which holds for any scalar c. Subtracting max(x) from x leaves a vector that has only non-positive entries, which rules out overflow, and at least one entry that is exactly zero, which rules out a vanishing denominator (underflow in some but not all entries is harmless).
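As a minimal sketch of the identity above (the names naive_softmax and shifted_softmax are illustrative, not from the original answer), the following compares the direct formula with the shifted one on an input large enough to overflow exp in float64:

```python
import numpy as np

def naive_softmax(x):
    # Direct formula exp(x) / sum(exp(x)); overflows for large entries.
    e = np.exp(x)
    return e / np.sum(e)

def shifted_softmax(x):
    # Same formula after subtracting max(x); the largest term is exp(0) = 1.
    e = np.exp(x - np.max(x))
    return e / np.sum(e)

x = np.array([1.0, 2.0, 3.0])
# For moderate inputs, shifting by a scalar c leaves the result unchanged.
same = np.allclose(naive_softmax(x), shifted_softmax(x + 100.0))

big = np.array([1000.0, 1001.0, 1002.0])
with np.errstate(over="ignore", invalid="ignore"):
    naive_big = naive_softmax(big)   # exp overflows to inf, inf / inf gives nan
shifted_big = shifted_softmax(big)   # finite, equal to softmax([1, 2, 3])
```

Here naive_big is all NaN, while shifted_big matches the softmax of [1, 2, 3], since both inputs reduce to [-2, -1, 0] after the shift.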

Footnote: theoretically, catastrophic accidents in the sum are possible, but you'd need a ridiculous number of terms. For example, even using 16-bit floats, which can only resolve 3 decimals (compared to 15 decimals for a "normal" 64-bit float), we'd need between 2^1431 (~6 x 10^430) and 2^1432 terms to get a sum that is off by a factor of two.
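The decimal resolutions quoted in the footnote can be checked with NumPy's finfo. A small sketch (the variable names are illustrative):

```python
import numpy as np

# Approximate number of decimal digits each float type can resolve.
digits16 = np.finfo(np.float16).precision   # 3
digits64 = np.finfo(np.float64).precision   # 15
```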

How to implement the Softmax function in Python: They're both correct, but yours is preferred from the point of view of numerical stability. You start with e^(x - max(x)) / sum(e^(x - max(x))). By using the fact that…

The softmax function is prone to two issues: overflow and underflow.

Overflow: occurs when numbers too large to represent are approximated as infinity.

Underflow: occurs when very small numbers (near zero on the number line) are rounded to zero.
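Both effects are easy to reproduce with a single exponential. A small sketch in float64 (the inputs 1000 and -1000 are arbitrary illustrative values):

```python
import numpy as np

with np.errstate(over="ignore", under="ignore"):
    overflowed = np.exp(np.float64(1000.0))    # too large for float64: inf
    underflowed = np.exp(np.float64(-1000.0))  # too small for float64: 0.0

# Largest finite float64, for reference (about 1.8e308).
largest = np.finfo(np.float64).max
```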

To combat these issues when doing softmax computation, a common trick is to shift the input vector by subtracting the maximum element in it from all elements. For the input vector x, define z such that:

z = x - max(x)

And then take the softmax of the new (stable) vector z.


import numpy as np

def stable_softmax(x):
    # Shift so the largest entry is 0; exp can then never overflow.
    z = x - np.max(x)
    numerator = np.exp(z)
    denominator = np.sum(numerator)
    softmax = numerator / denominator

    return softmax

# input vector
In [267]: vec = np.array([1, 2, 3, 4, 5])
In [268]: stable_softmax(vec)
Out[268]: array([ 0.01165623,  0.03168492,  0.08612854,  0.23412166,  0.63640865])

# input vector with really large number, prone to overflow issue
In [269]: vec = np.array([12345, 67890, 99999999])
In [270]: stable_softmax(vec)
Out[270]: array([ 0.,  0.,  1.])

In the above case, we safely avoided the overflow problem by using stable_softmax().

For more details, see the chapter "Numerical Computation" in the Deep Learning book.

Softmax and Cross Entropy Loss: To make our softmax function numerically stable, we simply normalize the values in the vector by multiplying the numerator and denominator by a constant C. … It's a cost function that is used as the loss for machine learning models, telling us how badly the model is performing; the lower the better. Negative: obviously means multiplying by -1. …

Thanks to Paul Panzer for the explanation, but I was wondering why we need to subtract max(x). I found more detailed information and hope it will be helpful to people who have the same question. See the section "What's up with that max subtraction?" in the article at the following link.

Exp-normalize trick (Graduate Descent): def sigmoid(x): "Numerically stable sigmoid function." … called "softmax," which is unfortunate because log-sum-exp is also called "softmax." … In mathematics, the softmax function, also known as softargmax or normalized exponential function, is a function that takes as input a vector z of K real numbers and normalizes it into a probability distribution consisting of K probabilities proportional to the exponentials of the input numbers.

Extending @kmario23's answer to support 1- or 2-dimensional numpy arrays or lists (common if you're passing a batch of results through the softmax function):

import numpy as np

def stable_softmax(x):
    z = x - np.max(x, axis=-1, keepdims=True)
    numerator = np.exp(z)
    denominator = np.sum(numerator, axis=-1, keepdims=True)
    softmax = numerator / denominator
    return softmax

test1 = np.array([12345, 67890, 99999999])  # 1D
test2 = np.array([[12345, 67890, 99999999], [123, 678, 88888888]])  # 2D
test3 = [12345, 67890, 999999999]
test4 = [[12345, 67890, 999999999]]

for test in (test1, test2, test3, test4):
    print(stable_softmax(test))

Output:

[0. 0. 1.]

[[0. 0. 1.]
 [0. 0. 1.]]

[0. 0. 1.]

[[0. 0. 1.]]

The numerical stability of Softmax and Cross Entropy (GitHub): import numpy as np; from scipy.special import expit; import math; EPS = 1e-9. … Intuitively, the softmax function is a "soft" version of the maximum function. Instead of just selecting one maximal element, softmax breaks the vector up into parts of a whole (1.0), with the maximal input element getting a proportionally larger chunk but the other elements getting some of it as well.

There is nothing wrong with calculating the softmax function as it is in your case. The problem seems to come from exploding gradients or similar issues with your training method. Focus on those matters, either by clipping values or by choosing the right initial distribution of weights.

Numerical instability in deep learning with softmax (All about cool …): Softmax is defined as f(x_i) = exp(x_i) / sum_j exp(x_j) and it returns probability for … As seen from the two examples with stable syntax, it saved more… Hmm, okay. In the original code the input isn't constant and I still have a numerical stability issue. I guess this is because the softmax is in a scan and the categorical_crossentropy is outside the scan, like this: …

Part 2: Softmax Regression: Numerical Stability of Softmax function. The Softmax function takes an N-dimensional vector of real values and returns a new N-dimensional vector that sums up… My softmax function. After years of copying one-off softmax code between scripts, I decided to make things a little DRY-er: I sat down and wrote a darn softmax function. The goal was to support X of any dimensionality, and to allow the user to softmax over an arbitrary axis. …

Logsoftmax stability: def log_sum_exp(x): x_max = … ; return torch.log(torch.sum(torch.exp(x - x_max), 1, keepdim=True)) + x_max. In this function, if we remove x_max, the output of the function is just the same, so why should we use x_max?
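One way to see why x_max matters is to evaluate log-sum-exp with and without the shift on large inputs. The sketch below is a NumPy port of the snippet, not the original PyTorch code. It assumes x_max is the maximum along the reduced axis (the snippet elides its definition), and log_softmax is an illustrative helper built on top of it:

```python
import numpy as np

def log_sum_exp(x, axis=-1):
    # Assumption: x_max is the maximum along `axis` (elided in the snippet).
    x_max = np.max(x, axis=axis, keepdims=True)
    return np.log(np.sum(np.exp(x - x_max), axis=axis, keepdims=True)) + x_max

def log_softmax(x, axis=-1):
    # log softmax(x) = x - log(sum(exp(x))), without forming softmax first.
    return x - log_sum_exp(x, axis=axis)

x = np.array([[1000.0, 1001.0, 1002.0]])
with np.errstate(over="ignore"):
    unshifted = np.log(np.sum(np.exp(x), axis=-1, keepdims=True))  # inf
shifted = log_sum_exp(x)  # finite, about 1002.41
```

Mathematically the two expressions agree, which is why the output looks the same for small inputs. The difference only shows up once exp(x) overflows.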

Underflow while evaluating softmax, despite using exp-normalize: I understand that softmax is given by exp(z_i) / sum_j exp(z_j). For numerical stability, the exp-normalize trick is used: exp(z_i - max(z)) / sum_j exp(z_j - max(z)). The softmax function is used in the activation function of the neural network.

  • The answers here show the better way to calculate the softmax:
  • @ajcr The accepted answer at this link is actually poor advice. Abhishek, the thing the OP does, even though they first didn't seem to understand why, is the right thing to do. There are no numerically difficult steps in the softmax except overflow. So shifting all inputs to the left, while mathematically equivalent, removes the possibility of overflow and is therefore a numerical improvement.
  • Yes, although the author of that accepted answer acknowledges in the comments that subtracting the maximum does not introduce an "unnecessary term" but actually improves numerical stability (perhaps that answer should be edited...). In any case, the question of numerical stability is addressed in several of the other answers there. @AbhishekBhatia: do you think the link answers your question satisfactorily, or would a new answer here be beneficial?
  • This is still underflowing for me
  • I've been using this for quite some time without an issue. Are you sure you didn't have NaNs or Infs going in as input?
  • I see now - np.seterr(all='raise') will complain about underflows for large values even though the function works correctly. This is indeed the best solution.
  • "There is nothing wrong with calculating the softmax function as it is in your case." Try computing softmax(800) with it.
  • Doing anything at that scale would cause "inf"; anything in Python is unstable if you are trying to work at that scale.
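The underflow complaint mentioned in the comments can be explored with np.errstate, which scopes the error settings to a block instead of changing them globally as np.seterr does. This is a rough sketch: whether the underflow is actually reported may depend on the NumPy build, so the code records the outcome rather than assuming it.

```python
import numpy as np

def stable_softmax(x):
    z = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

x = np.array([0.0, -1000.0])

# With underflow set to "raise", exp(-1000) may trigger FloatingPointError
# even though rounding it to 0.0 is harmless here.
try:
    with np.errstate(under="raise"):
        stable_softmax(x)
    raised = False
except FloatingPointError:
    raised = True

# Ignoring underflow locally gives the expected result.
with np.errstate(under="ignore"):
    result = stable_softmax(x)   # [1., 0.]
```

Keeping the "raise" setting for overflow and invalid operations while ignoring underflow is one way to keep useful diagnostics without false alarms from this function.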