## C floating point precision

You can only represent numbers exactly in IEEE754 (at least for the single and double precision binary formats) if they can be constructed from adding together inverted powers of two (i.e., `2^-n`, like `1`, `1/2`, `1/4`, `1/65536` and so on), subject to the number of bits available for precision.

There is no combination of inverted powers of two that will get you exactly to 101.1, within the precision provided by floats (23 bits of mantissa) *or* doubles (52 bits of mantissa).

If you want a quick tutorial on how this inverted-power-of-two stuff works, see this answer.

Applying the knowledge from that answer to your `101.1` number (as a single precision float):

```
s eeeeeeee mmmmmmmmmmmmmmmmmmmmmmm          1/n
0 10000101 10010100011001100110011
           |  | |   ||  ||  ||  |+-- 8388608
           |  | |   ||  ||  ||  +--- 4194304
           |  | |   ||  ||  |+------  524288
           |  | |   ||  ||  +-------  262144
           |  | |   ||  |+----------   32768
           |  | |   ||  +-----------   16384
           |  | |   |+--------------    2048
           |  | |   +---------------    1024
           |  | +-------------------      64
           |  +---------------------      16
           +------------------------       2
```

The mantissa part of that actually continues forever for `101.1`:

```
mmmmmmmmm mmmm mmmm mmmm mm
100101000 1100 1100 1100 11|00 1100 (and so on)
```

Hence it's not a matter of precision: no finite number of bits will represent that number exactly in IEEE754 format.

Using the bits to calculate the *actual* number (closest approximation), the sign is positive. The exponent is 128 + 4 + 1 = 133; subtracting the bias of 127 gives 6, so the multiplier is `2^6`, or 64.

The mantissa consists of 1 (the implicit base) plus, for all the set bits (each being worth `1/2^n`, with n starting at 1 on the left and increasing to the right), the fractions `{1/2, 1/16, 1/64, 1/1024, 1/2048, 1/16384, 1/32768, 1/262144, 1/524288, 1/4194304, 1/8388608}`.

When you add all these up, you get `1.57968747615814208984375`.

When you multiply that by the multiplier previously calculated, `64`, you get `101.09999847412109375`.

All numbers were calculated with `bc` using a scale of 100 decimal digits, resulting in a lot of trailing zeros, so the numbers *should* be very accurate. Doubly so, since I checked the result with:

```c
#include <stdio.h>

int main (void) {
    float f = 101.1f;
    printf ("%.50f\n", f);
    return 0;
}
```

which *also* gave me `101.09999847412109375000...`.


You need to read more about how floating-point numbers work, especially the part on representable numbers.

You're not giving much of an explanation as to why you think that "32 bits should be enough for 101.1", so it's hard to refute directly.

Binary floating-point numbers don't work well for all decimal numbers, since they basically store the number in, wait for it, base 2. As in binary.

This is a well-known fact, and it's the reason why e.g. money should never be handled in floating-point.


Your number `101.1` in base `10` is `1100101.0(0011)` in base `2`. The `0011` part repeats. Thus, no matter how many digits you have, the number cannot be represented exactly in the computer.

Looking at the IEEE754 standard for floating point, you can find out why the `double` version seemed to show it entirely.

PS: Derivation of `101.1` in base `10` as `1100101.0(0011)` in base `2`:

```
101 = 64 + 32 + 4 + 1
101 -> 1100101

.1 * 2 = .2  -> 0
.2 * 2 = .4  -> 0
.4 * 2 = .8  -> 0
.8 * 2 = 1.6 -> 1
.6 * 2 = 1.2 -> 1
.2 * 2 = .4  -> 0
.4 * 2 = .8  -> 0
.8 * 2 = 1.6 -> 1
.6 * 2 = 1.2 -> 1
.2 * 2 = ...    (and so on)
```

PPS: It's the same as if you wanted to store exactly the result of `1/3` in base `10`.



If you printed the `double` with more digits, you'd see that even `double` cannot represent it exactly:

```c
printf ("b: %.16f\n", b);
```
which prints:
```
b: 101.0999999999999943
```

The thing is, `float` and `double` use a binary format, and not all floating-point numbers can be represented exactly in binary.





##### Comments

- "... only represent numbers exactly in IEEE754 if they can be constructed from adding ... inverted powers of two ... " seems incomplete in that IEEE754 also defines floating point numbers with inverted powers of ten. Certainly IEEE754 binary formats are more common though.
- @chux, that's a very valid point, adjusted the answer to make that clear.
- can you illustrate by example how 101.1 is stored in computer?
- 101.1 can certainly be represented in 32 bits. Just not with any of the usual floating point formats supported by hardware.
- @Jeremy It depends on the system. I'd recommend the Wikipedia article "floating point" for starters, although it doesn't give you enough information to actually start using them. The article [What Every Computer Scientist Should Know About Floating-Point Arithmetic](docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html) is about the best introduction I know.
- For example, it can be represented in fixed-point `999V9` BCD format in 16 bits as `0001 0000 0001 0001`.
- More a question of vocabulary, but I'd say "most real numbers cannot be accurately represented in (machine) floating point", or "most decimal floating point numbers cannot be accurately represented in (machine) floating point". (The latter is obviously only true if machine floating point isn't decimal. But while I've used machines with decimal floating point in the past, I think today only bases 2, 8 and 16 are still around.)
- @James, edited the answer, thanks ;)