## Generic way of handling fused-multiply-add floating-point inaccuracies

Yesterday I was tracking a bug in my project, which - after several hours - I've narrowed down to a piece of code which more or less was doing something like this:

```cpp
#include <iostream>
#include <cmath>
#include <cassert>

volatile float r = -0.979541123;
volatile float alpha = 0.375402451;

int main()
{
    float sx = r * cosf(alpha); // -0.911326
    float sy = r * sinf(alpha); // -0.359146
    float ex = r * cosf(alpha); // -0.911326
    float ey = r * sinf(alpha); // -0.359146
    float mx = ex - sx;         // should be 0
    float my = ey - sy;         // should be 0
    float distance = sqrtf(mx * mx + my * my) * 57.2958f; // should be 0, gives 1.34925e-06

    // std::cout << "sv: {" << sx << ", " << sy << "}" << std::endl;
    // std::cout << "ev: {" << ex << ", " << ey << "}" << std::endl;
    // std::cout << "mv: {" << mx << ", " << my << "}" << std::endl;
    std::cout << "distance: " << distance << std::endl;

    assert(distance == 0.f);
    // assert(sx == ex && sy == ey);
    // assert(mx == 0.f && my == 0.f);
}
```

After compilation and execution:

```
$ g++ -Wall -Wextra -Wshadow -march=native -O2 vfma.cpp && ./a.out
distance: 1.34925e-06
a.out: vfma.cpp:23: int main(): Assertion `distance == 0.f' failed.
Aborted (core dumped)
```

From my point of view something is wrong here: I've asked for two subtractions of bitwise-identical pairs (expecting two zeroes), then for squaring them (two zeroes again) and adding them together (zero).

It turns out that the root cause of the problem is the use of the fused multiply-add operation, which somewhere along the line makes the result inexact (from my point of view). Generally I have nothing against this optimization, as it promises results that are *more* exact, but in this case 1.34925e-06 is really far from the 0 I was expecting.

The test case is very "fragile" - if you enable more prints or more asserts, it stops asserting, because the compiler no longer uses fused multiply-add. For example, if I uncomment all the lines:

```
$ g++ -Wall -Wextra -Wshadow -march=native -O2 vfma.cpp && ./a.out
sv: {-0.911326, -0.359146}
ev: {-0.911326, -0.359146}
mv: {0, 0}
distance: 0
```

As I considered this to be a bug in the compiler, I reported it, but the report was closed with the explanation that this is correct behaviour:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79436

So I'm wondering - how should one code such calculations to avoid the problem? I was thinking about a generic solution, but something better than:

```cpp
mx = ex != sx ? ex - sx : 0.f;
```

I would like to fix or improve my code - if there's anything to fix or improve - instead of setting `-ffp-contract=off` for my whole project. Fused multiply-add is used internally in the compiler libraries anyway (I see a lot of it in `sinf()` and `cosf()`), so the flag would be a "partial work-around", not a solution... I would also like to avoid solutions like "don't use floating-point" (;

In general, no: this is exactly the price you pay for using `-ffp-contract=fast`.

(Coincidentally, it is precisely this example that William Kahan notes among the problems with automatic contraction.)

Theoretically, if you were using C (not C++), and your compiler supported the C99 pragmas (i.e. not GCC, which ignores them), you could use

```c
#pragma STDC FP_CONTRACT OFF
// non-contracted code
#pragma STDC FP_CONTRACT ON
```

Interestingly, thanks to FMA, the floats `mx` and `my` give you the rounding error that was made when multiplying `r` and `cos`:

```
fma(r, cos, -r*cos) = theoretical(r*cos) - float(r*cos)
```

So the result you get somehow indicates how far the computed (sx, sy) was from the theoretical (sx, sy), due to the rounding of the float multiplications (though it does not account for the rounding errors in the computation of cos and sin themselves).

So the question becomes: how can your program rely on a difference (ex-sx, ey-sy) that lies within the uncertainty interval of floating-point rounding?

I can see this question has been around for a while, but in case other folks come across it looking for an answer, I figured I'd mention a couple of points.

First, it's difficult to tell exactly without analyzing the resulting assembly code, but I suspect the reason the FMA gives a result so far outside expectations is not just the FMA itself, but also the assumption that all of the calculations are being done in the order you specified them; with optimizing C/C++ compilers this is often not the case. This is also likely why uncommenting the print statements changes the results.

If `mx` and `my` were being calculated as the comments suggest, then even if the final `mx*mx + my*my` were done with an FMA, it would still produce the expected 0 result. The problem is that since none of the `sx`/`sy`/`ex`/`ey`/`mx`/`my` variables are used by anything else, there is a good possibility that the compiler never actually evaluates them as independent variables at all, and simply mushes all the math together into one big mass of multiplications, adds, and subtracts to calculate `distance` in a single step, which can then be represented any number of different ways in machine code (in any order, potentially with multiple FMAs, etc.), however it figures it will get the best performance for that one big calculation.

However, if something else (like a print statement) references `mx` and `my`, then it's much more likely the compiler will calculate them separately, before calculating `distance` as a second step. In that case, the math does work out the way the comments suggest, and even an FMA in the final `distance` calculation doesn't change the result (because the inputs are all exactly 0).

**The Answer**

But that doesn't actually answer the real question. In answer to that, the most robust (and generally recommended) way to avoid this sort of problem in general is: never assume floating-point operations will *ever* produce an exact number, even if that number is 0. This means that, in general, it's a bad idea to ever use `==` to compare floating-point numbers. Instead, you should choose a small number (often referred to as epsilon), which is larger than any possible/likely accumulated error, but still smaller than any significant result (for example, if you know that the distances you care about are only really significant to a couple of decimal places, you could choose `EPSILON = 0.01`, which would mean "any difference less than 0.01 we'll consider to be the same as zero"). Then, instead of saying:

`assert(distance == 0.f);`

you would say:

`assert(distance < EPSILON);`

(the exact value for your epsilon will likely depend on the application, and may even be different for different types of calculations, of course)

Likewise, instead of saying something like `if (a == b)` for floating-point numbers, you would instead say something like `if (abs(a - b) < EPSILON)`, etc.

Another way to reduce (but not necessarily eliminate) this problem is to implement "fail-fast" logic in your application. For example, in the above code, instead of going all the way through, calculating `distance`, and then seeing whether it's 0 at the end, you could "short-circuit" some of the math by testing `if (fabs(mx) < EPSILON && fabs(my) < EPSILON)` before you even get to the point of calculating `distance`, and skip the rest if both are effectively zero (since you know the result will be zero in that case). The quicker you catch the situation, the less opportunity there is for errors to accumulate (and sometimes you can also avoid doing some more costly calculations in cases where you don't need to).

##### Comments

- Floating-point arithmetic is potentially inexact. This is a well-known "feature" of floating point.
- @barny - I know, but for things like subtracting two identical numbers or multiplying by zero, floating-point arithmetic was perfectly accurate. "Was" - because with fused multiply-add that's no longer the case... Also, I think the scale of the error here is quite big. If I were getting something like 1e-64, I would not be asking this question...
- GCC (and I think ICC as well) contracts by default, but Clang does not. I asked a question about this because I was surprised that GCC does it; apparently many people did not expect it either. It turns out contraction does not violate IEEE 754, so GCC is still conforming.
- I think there are two possible solutions to consider. 1) Only use FMA explicitly: compile with `-ffp-contract=off -mfma` and then use the `fma` functions or intrinsics to get FMA only where you want it. 2) Design your code so it tolerates floating-point errors both with and without FMA, so that it's not sensitive to FMA at all.
- You could add `float mx_fma = fmaf(r, cosf(alpha), -r*cosf(alpha))` to the tests in your question; it should produce the same results. You could then compile with `-ffp-contract=off` and see what you get. You probably won't learn anything you don't already expect, but nevertheless I think it's interesting to try.
- I can disable contraction for the whole file with `-ffp-contract=off` or implement some form of work-around, but the problem is that hunting down bugs like this takes a long time. That's why I'm wondering whether there's a way to avoid the problem in the first place.
- Well, the code does not care about the precision of the calculations down to the single bit, but in this particular case it really cares whether it gets zero or "something else". That's because this distance is then used in a calculation of time, and instead of a "0-length move in 0 time" I get a "very small move in some absurd time".
- So you can't convert a float to an int, ever? Or do any boolean test, ever? It seems like fp math is broken and shouldn't be used for any serious purpose, ever.