How to find the (optimal) ratio of two integers of different precision?


If I have a variable m of type uint32 and r of type uint16, as well as a constant float64 value, e.g. f = 0.5820766091346741, how do I find m, r which satisfy f = r/m?

Similar to Fraction.limit_denominator from Python.

This GitHub repo contains various best-rational-approximation algorithms, but it only limits the denominator.
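For reference, a small Python sketch of the Fraction.limit_denominator approach mentioned above; the wrapper function and its bound handling are illustrative (not from any library), and it simply rejects results whose numerator does not also fit the narrower type:

```python
from fractions import Fraction

def limited_ratio(f, max_num=(1 << 16) - 1, max_den=(1 << 32) - 1):
    """Best rational approximation with denominator <= max_den; returns
    None when the numerator does not also fit (illustrative wrapper)."""
    approx = Fraction(f).limit_denominator(max_den)
    if approx.numerator > max_num:
        return None  # limit_denominator alone cannot bound the numerator
    return approx.numerator, approx.denominator

print(limited_ratio(0.5))                 # small numerator fits
print(limited_ratio(0.5820766091346741))  # may overflow the uint16 numerator
```

This shows why limiting only the denominator is not enough when the numerator lives in a narrower type than the denominator.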

The straightforward answer would be:

     ROUND(f * 10^8)
f = -----------------
          10^8

Then, you can implement a small loop that attempts to divide both numerator and denominator by prime numbers (starting from 2 and up). Something like (code not checked of course):

var r = Math.round(f * 1e8) ; // numerator   (question's convention: f = r/m)
var m = 1e8                 ; // denominator (note: ^ is XOR in JavaScript, so 1e8, not 10^8)
var Prime_Numbers = [2,3,5,7,11,13,17,19 /* ,... extend as needed */] ;

for (var I = 0 ; I < Prime_Numbers.length ; I++) {

    // Exit condition: a common divisor cannot exceed either number
    if ((Prime_Numbers[I] > m) ||
        (Prime_Numbers[I] > r)    ) {
        break ;
    }

    // Divide out this prime as long as it divides both
    while (((m % Prime_Numbers[I]) == 0) &&
           ((r % Prime_Numbers[I]) == 0)    ) {
        m = m / Prime_Numbers[I] ;
        r = r / Prime_Numbers[I] ;
    }
}

console.log("Best m is: " + m) ;
console.log("Best r is: " + r) ;

Now, the question would be how many prime numbers should be included in the list?

Hard to say, but intuitively not too many... I would say it depends on how rigorous you are about OPTIMAL.

Hope this gives you some direction.



Thinking a little bit further: to always get the ABSOLUTE OPTIMAL values, you need to include all prime numbers up to half the max value you wish as precision. For instance, if your precision needs to be 8 digits (99999999), you need to include all prime numbers up to (99999999/2).


Added an exit condition in the loop.
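For what it's worth, trial division by every prime in the list is equivalent to dividing once by the greatest common divisor, so the whole loop collapses to a single gcd call. A Python sketch using the same 10^8 scaling and the question's convention f = r/m:

```python
from math import gcd

f = 0.5820766091346741
r = round(f * 10**8)  # numerator: ROUND(f * 10^8)
m = 10**8             # denominator
g = gcd(r, m)         # one gcd call replaces the prime-by-prime loop
r, m = r // g, m // g
print(r, m)
```

Since the Euclidean gcd runs in logarithmic time, no prime table is needed at all.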


There is a paper by David T. Ashley et al. which proposes an algorithm for finding a rational approximation with two integers of different precision.

I implemented a basic version which does not contain the whole complexity of the referenced paper [1].

The basic idea is to convert the float into a continued fraction and then look for the highest-order convergent that satisfies the constraints. See the wiki article on continued fractions for an introduction to convergents.

However, the referenced paper describes a more sophisticated approach to applying constraints to the integer ratios (see section 5), which uses an analogy to lattice structures [1].
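A minimal sketch of that idea in Python (a plain convergent scan, not the Ashley et al. algorithm itself): unwind the continued fraction of f, build each convergent with the standard recurrence, and keep the last one whose numerator and denominator both satisfy their bounds:

```python
from math import floor

def best_convergent(f, max_num, max_den):
    """Return the highest-order continued-fraction convergent p/q of f
    with p <= max_num and q <= max_den, or None. A sketch only."""
    p0, q0, p1, q1 = 0, 1, 1, 0   # seeds for the convergent recurrence
    x = f
    best = None
    for _ in range(64):           # 64 partial quotients is plenty for a float64
        a = floor(x)
        # p_k = a_k * p_{k-1} + p_{k-2}, likewise for q
        p0, q0, p1, q1 = p1, q1, a * p1 + p0, a * q1 + q0
        if p1 > max_num or q1 > max_den:
            break                 # constraint violated; keep previous best
        best = (p1, q1)
        frac = x - a
        if frac == 0:
            break                 # expansion terminated: f reached exactly
        x = 1 / frac
    return best

print(best_convergent(0.5820766091346741, (1 << 16) - 1, (1 << 32) - 1))
```

The paper's lattice-based refinement can do better, because the best bounded approximation is not always a convergent (it can be a semiconvergent).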


How do I find m,r which satisfy f=r/m?

= implies exact.

To do this exactly, if possible, see below. This approach does not attempt a best fit when an exact solution is not possible, as that would not satisfy f=r/m.

All finite floating point values are exact. Saving "0.5820766091346741" into f may give f a nearby value, yet the value in f is exact.

Given the base of the floating point number (very commonly 2), they can all be represented exactly as integer/(base^exponent).

With binary64, the largest exponent needed is about (1023 + 53).

As OP wants the result to fit in a 32-bit r and a 16-bit m, it is readily understandable that most float64 values (64 bits) will not have an exact solution - there are just not enough bit combinations to represent the result.

Algorithm below in commented C assuming base 2.

#include <math.h>     // isnan()
#include <stdbool.h>
#include <stdint.h>

// return true on success
bool fraction(double d, uint32_t *r, uint16_t *m) {
  if (d < 0.0 || isnan(d) || d > UINT32_MAX) {
    return false;
  }
  // Scale d to extract, hopefully, a 32+15 bit integer
  uint16_t power_of_2 = 32768; // largest power-of-2 in m
  d *= power_of_2;
  uint64_t ipart = (uint64_t) d;
  // Even after scaling, `d` may have a fractional part.
  if (d != ipart) {
    return false;  // value has unrepresentable precision.
  }
  // While big and even, reduce the fraction
  while (ipart > UINT32_MAX && (ipart % 2 == 0)) {
    power_of_2 /= 2;
    ipart /= 2;
  }
  // If reduction was insufficient ...
  if (ipart > UINT32_MAX) {
    return false; // value has unrepresentable precision.
  }
  *r = (uint32_t) ipart;
  *m = power_of_2;
  return true;  // Success!
}
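The same decomposition is available directly in Python, where float.as_integer_ratio returns the exact integer/(2^exponent) form. Below is my own sketch mirroring the C routine above, not the answerer's code:

```python
def fraction(d):
    """Exact r/m with r a uint32 and m a power-of-two uint16, or None.
    A Python sketch mirroring the C routine above."""
    if d < 0.0 or d != d or d > 0xFFFFFFFF:   # d != d tests for NaN
        return None
    num, den = d.as_integer_ratio()  # exact: d == num/den, den a power of 2
    if den > 32768:
        return None                  # needs more precision than m can hold
    # Scale so the denominator is exactly 32768 (largest uint16 power of 2)
    num *= 32768 // den
    den = 32768
    # While too big and still even, reduce the fraction
    while num > 0xFFFFFFFF and num % 2 == 0:
        num //= 2
        den //= 2
    if num > 0xFFFFFFFF:
        return None                  # reduction was insufficient
    return num, den

print(fraction(0.5))     # exact: 0.5 == 16384/32768
print(fraction(0.1))     # None: 0.1 needs a 2^55 denominator
```

As with the C version, most doubles fail the denominator test, which is the "not enough combinations" point made above.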

First, divide the two numbers to get the ratio in floating point. Then run the continued fraction algorithm until it terminates. If it doesn't terminate, the ratio is irrational and there is no solution. If it terminates, evaluate the resulting continued fraction back into a single fraction, and that will be the answer.

I don't give you an algorithm because, IMO, continued fractions are the right path.

But I wanted to illustrate how well this r/m representation can fit 64-bit IEEE 754 floats. So I've played a bit with the concept in Smalltalk (64-bit Squeak).

There are only 48 bits for the r/m representation, with many combinations representing the same value (1/1=2/2=..., 1/2=2/4=3/6=...), while there are already 2^52 different 64-bit floats in the interval [0.5,1.0). So we can say that most of the time, we are not going to match f exactly. The problem is then to find a pair (r, m) such that r/m rounds nearest to f.

I can't reasonably play with 48 bits, but I can with half the widths, gathering all the uint8/uint16 combinations:

v := Array new: 1<<24.
0 to: 1<<8-1 do: [:r |
    0 to: 1<<16-1 do: [:m |
        v at: (m<<8+r+1) put: ([r asFloat/m asFloat]
            on: ZeroDivide do: [:exc | exc return: Float infinity])]].
s := v asSet sorted.
s size-2.

Excluding 0 and infinity, that's 10,173,377 distinct values out of 16,777,216 combinations.

I'm interested in the gap between two consecutive representable floats:

x := s copyFrom: 2 to: s size - 1.
y := (2 to: s size-1) collect: [:i |  (s at: i) - (s at: i-1) / (s at: i) ulp].

the minimum is

u := y detectMin: #yourself.

about 2.71618435e8 ulp.

Let's see how the numerator and denominator are formed:

p := y indexOf: u.
{((v  indexOf: (x at: p)) - 1) hex.
 ((v  indexOf: (x at: p-1)) - 1) hex}.

The result is #('16rFDFFFE' '16rFEFFFF'); the first four hex digits encode the denominator (m), the last two the numerator (r).

So the minimum gap is obtained for

s1 := (1<<8-1) / (1<<8-1<<8-1).
s2 := (1<<8-2) / (1<<8-2<<8-1).
s2 asFloat - s1 asFloat / s2 asFloat ulp = u.

That is near the value 1/256.

We can conjecture that the minimum gap for the 48-bit representation is

s1 := (1<<16-1) / (1<<16-1<<16-1).
s2 := (1<<16-2) / (1<<16-2<<16-1).
s2 asFloat - s1 asFloat / s2 asFloat ulp.

That is around 16 ulp, not bad at all, and the region of maximum density is near 1/65536.

What will the density be near 0.5, as in your example? For the 24-bit representation:

h := x indexOf: 0.5.

is 10133738. Let's inspect the precision in the neighbourhood:

k := (h to: h +512) detectMin: [:i | (y at: i)].
u2 := y at: k.

That's 3.4903102168e10 ulp (about 128 times less density). It is obtained for:

s1 := (1<<8-1) / (1<<8-1<<1-1).
s2 := (1<<8-2) / (1<<8-2<<1-1).
s2 asFloat- s1 asFloat / s2 asFloat ulp = u2.

So, with 48bits, we can expect a density of about

s1 := (1<<16-1) / (1<<16-1<<1-1).
s2 := (1<<16-2) / (1<<16-2<<1-1).
s2 asFloat- s1 asFloat / s2 asFloat ulp.

that is 524320 ulp, or a precision of approximately 5.821121362714621e-11.

Edit: What about the worst precision?

In the zone of best density:

q := (p-512 to:p+512) detectMax: [:i | y at: i].
{((v  indexOf: (x at: q)) - 1) hex.
 ((v  indexOf: (x at: q-1)) - 1) hex.}.

That is #('16rFEFFFF' '16r10001'); in other words, just before the best precision we locally have the worst: w := y at: q, which is 6.8990021713e10 ulp for these numbers:

s2 := (1<<8-1) / (1<<8-1<<8-1).
s1 := (1) / (1<<8).
s2 asFloat - s1 asFloat / s2 asFloat ulp = w.

Translated to 48 bits, that is about 1.048592e6 ulp:

s2 := (1<<16-1) / (1<<16-1<<16-1).
s1 := (1) / (1<<16).
s2 asFloat - s1 asFloat / s2 asFloat ulp.

And near 0.5, the worst is about 8.847936399549e12 ulp for 24 bits:

j := (h-512 to: h +512) detectMax: [:i | (y at: i)].
w2 := y at: j.
s2 := (1<<8-1) / (1<<8-1<<1-1).
s1 := (1) / (1<<1).
s2 asFloat- s1 asFloat / s2 asFloat ulp = w2.

or translated to 48 bits, 3.4360524818e10 ulp:

s2 := (1<<16-1) / (1<<16-1<<1-1).
s1 := (1) / (1<<1).
s2 asFloat- s1 asFloat / s2 asFloat ulp.

That's an absolute precision of about 3.814784579114772e-6, not that good.

Before adopting such a representation, it would be good to know the domain of f, and the average and worst-case precision achievable in that domain.


  • What would be the level of precision that would be enough in your case (i.e. number of digits after the .)?
  • @FDavidov 8-10 should probably be enough
  • OK. I'll add one idea as an answer in a couple of minutes. Stay tuned...
  • Related:…
  • 8 to 10 digits is not achievable for this r/m representation, see counter example near 0.5 in my answer.
  • this is an interesting approach and looks simple to implement; however, I will look for a more rigorous solution
  • No problem. Indeed, it is very simple and WILL give you the optimal values for m and r (after the two edits I added). Still, if you happen to find a good solution, it would be nice to read about it. So please post a comment here so I get an alert. Good Luck!!!!
  • (You do know that you can delete your comment.)
  • Yes, but I don't feel any shame whenever I need to apologize. Moreover, the comment contains a part that I wouldn't wish to delete, so...
  • Please include the gist of the algorithm tested and found sufficient here.