## Fast modulo 10 in C

I am looking for a fast modulo 10 algorithm because I need to speed up my program, which performs many modulo operations inside loops.

I have checked out this page, which compares some alternatives. As far as I understand, T3 was the fastest of all. My question is: what would `x % y` look like using the T3 technique?

I have copied the T3 technique here for simplicity, in case the link goes down.

```
for (int x = 0; x < max; x++)
{
    if (y > (threshold - 1))
    {
        y = 0; //reset
        total += x;
    }
    y += 1;
}
```

Regarding the comments: if this is not really faster than the regular mod, I am looking for a modulo that is at least 2 times faster than `%`. I have seen many examples that use a power of two, but since 10 is not one, how can I get it to work?

Edit:

For my program, let's say I have two nested `for` loops where `n = 1000000` and `m = 1000`.

Looks like this:

```
for (i = 1; i <= n; i++) {
    D[(i%10)*m] = i;
    for (j = 1; j <= m; j++) {
        ...
    }
}
```

Here's the fastest modulo-10 function you can write:

```
unsigned mod10(unsigned x)
{
    return x % 10;
}
```

And here's what it looks like once compiled:

```
movsxd rax, edi
imul rcx, rax, 1717986919
mov rdx, rcx
shr rdx, 63
sar rcx, 34
add ecx, edx
add ecx, ecx
lea ecx, [rcx + 4*rcx]
sub eax, ecx
ret
```

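Note the lack of division/modulus instructions, the mysterious constants, the use of an instruction (`lea`) which was originally intended for complex array indexing, etc. Needless to say, the compiler knows a lot of tricks to make your program as fast as possible. You'll rarely beat it on tasks like this. The exact output depends on the compiler, its version and flags, and the signedness of the operand (the `movsxd`/`sar` sequence shown here is the kind of code generated for a signed operand).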
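The "mysterious constant" is a fixed-point reciprocal of 10. Here is a minimal sketch of the same idea for the unsigned 32-bit case, assuming a 64-bit intermediate product is available. The names `mod10_mulshift` and `check_mod10_mulshift` are made up for illustration, and the constant `0xCCCCCCCD` is ceil(2^35 / 10). This is roughly what an optimizing compiler already does for `x % 10`, so writing it by hand is unlikely to be faster.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: x % 10 without a division instruction.
 * (x * ceil(2^35 / 10)) >> 35 equals x / 10 for every 32-bit unsigned x. */
static uint32_t mod10_mulshift(uint32_t x)
{
    uint32_t q = (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35); /* q = x / 10 */
    return x - q * 10u;                                         /* x - 10*q  */
}

/* Hypothetical self-check: compare against the built-in operator. */
static void check_mod10_mulshift(void)
{
    const uint32_t samples[] = {0u, 9u, 10u, 99u, 12345u, 0x7FFFFFFFu, 0xFFFFFFFFu};
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(mod10_mulshift(samples[i]) == samples[i] % 10u);
}
```

Calling `check_mod10_mulshift()` once at startup confirms that the hand-written version agrees with `%` on the sampled values.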

The code isn’t a direct substitute for modulo; it only replaces modulo in that particular situation. You can write your own `mod` by analogy (for `a`, `b` > 0):

```
int mod(int a, int b) {
    while (a >= b) a -= b;
    return a;
}
```

… but whether that’s faster than `%` is highly questionable.

You likely can't beat the compiler.

Debug build

```
//     int foo = x % 10;
010341C5  mov         eax,dword ptr [x]
010341C8  cdq
010341C9  mov         ecx,0Ah
010341CE  idiv        eax,ecx
010341D0  mov         dword ptr [foo],edx
```

Retail build (doing some ninja math there...)

```
//    int foo = x % 10;
00BD100E  mov         eax,66666667h
00BD1013  imul        esi
00BD1015  sar         edx,2
00BD1018  mov         ecx,edx
00BD101A  shr         ecx,1Fh
00BD101D  add         ecx,edx
00BD101F  lea         eax,[ecx+ecx*4]
00BD1022  add         eax,eax
00BD1024  sub         esi,eax
```

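This will also work for (multiword) values larger than the machine word, assuming a binary computer. It processes the value one hex digit at a time and relies on the fact that 16^k leaves remainder 6 modulo 10 for every k ≥ 1: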

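```
#include <stdio.h>

/* Compute val % 10 one hex digit (nibble) at a time. */
unsigned long mod10(unsigned long val)
{
    unsigned res = 0;

    res = val & 0xf;                    /* lowest nibble has weight 1 */
    while (res >= 10) { res -= 10; }

    for (val >>= 4; val; val >>= 4) {
        res += 6 * (val & 0xf);         /* every higher nibble has weight 6 */
        while (res >= 10) { res -= 10; }
    }

    return res;
}

int main(int argc, char **argv)
{
    unsigned long val;
    unsigned res;

    /* Expect one numeric argument on the command line. */
    if (argc < 2 || sscanf(argv[1], "%lu", &val) != 1) {
        fprintf(stderr, "usage: %s <number>\n", argv[0]);
        return 1;
    }

    res = mod10(val);
    printf("%lu -->%u\n", val, res);

    return 0;
}
```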
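To illustrate the multiword claim, here is a hedged sketch that applies the same nibble technique to a number stored in several machine words. The representation (an array of 32-bit limbs, least significant first) and the name `mod10_multiword` are assumptions made for this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: remainder modulo 10 of a big number stored as
 * n 32-bit limbs, least significant limb first. */
static unsigned mod10_multiword(const uint32_t *limbs, size_t n)
{
    unsigned res = 0;

    for (size_t i = 0; i < n; i++) {
        uint32_t w = limbs[i];
        for (int j = 0; j < 8; j++, w >>= 4) {
            unsigned digit = w & 0xFu;
            /* Only the very first hex digit has weight 1; all others weigh 6. */
            unsigned weight = (i == 0 && j == 0) ? 1u : 6u;
            res += weight * digit;              /* at most 9 + 6*15 = 99 */
            while (res >= 10) { res -= 10; }
        }
    }
    return res;
}
```

With the limbs `{0, 1}` (the value 2^32 = 4294967296) the function returns 6, and with `{0xFFFFFFFF, 0xFFFFFFFF}` (2^64 - 1) it returns 5.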

UPDATE: With some extra effort, you can make the algorithm free of multiplications, and with the proper amount of optimisation the recursive call even gets inlined:

```
static unsigned long mod10_1(unsigned long val)
{
    unsigned char res = 0; // just to show that we don't need a big accumulator

    res = val & 0xf; // res can never be > 15
    if (res >= 10) { res -= 10; }

    for (val >>= 4; val; val >>= 4) {
        // 6 * nibble as 4*nibble + 2*nibble; '+' is needed here because
        // '|' loses carries when the two shifted values share set bits
        res += ((val & 0xf) << 2) + ((val & 0xf) << 1);
        res = mod10_1(res); // the recursive call
    }

    return res;
}
```

And the result for mod10_1 appears to be mul/div free and almost without branches:

```
mod10_1:
.LFB25:
.cfi_startproc
movl    %edi, %eax
andl    $15, %eax
leal    -10(%rax), %edx
cmpb    $10, %al
cmovnb  %edx, %eax
movq    %rdi, %rdx
shrq    $4, %rdx
testq   %rdx, %rdx
je      .L12
pushq   %r12
.cfi_def_cfa_offset 16
.cfi_offset 12, -16
pushq   %rbp
.cfi_def_cfa_offset 24
.cfi_offset 6, -24
pushq   %rbx
.cfi_def_cfa_offset 32
.cfi_offset 3, -32
.L4:
movl    %edx, %ecx
andl    $15, %ecx
leal    (%rcx,%rcx,2), %ecx
leal    (%rax,%rcx,2), %eax
movl    %eax, %ecx
movzbl  %al, %esi
andl    $15, %ecx
leal    -10(%rcx), %r9d
cmpb    $9, %cl
cmovbe  %ecx, %r9d
shrq    $4, %rsi
leal    (%rsi,%rsi,2), %ecx
leal    (%r9,%rcx,2), %ecx
movl    %ecx, %edi
movzbl  %cl, %ecx
andl    $15, %edi
testq   %rsi, %rsi
setne   %r10b
cmpb    $9, %dil
leal    -10(%rdi), %eax
seta    %sil
testb   %r10b, %sil
cmove   %edi, %eax
shrq    $4, %rcx
andl    $1, %r10d
leal    (%rcx,%rcx,2), %r8d
movl    %r10d, %r11d
leal    (%rax,%r8,2), %r8d
movl    %r8d, %edi
andl    $15, %edi
testq   %rcx, %rcx
setne   %sil
leal    -10(%rdi), %ecx
andl    %esi, %r11d
cmpb    $9, %dil
seta    %bl
testb   %r11b, %bl
cmovne  %ecx, %edi
andl    $1, %r11d
andl    $240, %r8d
leal    6(%rdi), %ebx
setne   %cl
movl    %r11d, %r8d
andl    %ecx, %r8d
leal    -4(%rdi), %ebp
cmpb    $9, %bl
seta    %r12b
testb   %r8b, %r12b
cmovne  %ebp, %ebx
andl    $1, %r8d
cmovne  %ebx, %edi
xorl    $1, %ecx
andl    %r11d, %ecx
orb     %r8b, %cl
cmovne  %edi, %eax
xorl    $1, %esi
andl    %r10d, %esi
orb     %sil, %cl
cmove   %r9d, %eax
shrq    $4, %rdx
testq   %rdx, %rdx
jne     .L4
popq    %rbx
.cfi_restore 3
.cfi_def_cfa_offset 24
popq    %rbp
.cfi_restore 6
.cfi_def_cfa_offset 16
movzbl  %al, %eax
popq    %r12
.cfi_restore 12
.cfi_def_cfa_offset 8
ret
.L12:
movzbl  %al, %eax
ret
.cfi_endproc
.LFE25:
.size   mod10_1, .-mod10_1
.p2align 4,,15
.globl  mod10
.type   mod10, @function
```

• Do you really believe it will be faster than `x % y` ?
• First check: Has your compiler writer perhaps also read this and already implemented an optimization for `x % 10`?
• @mrRobot Consider optimizing the loop and not just the `%` calculation.
• How about breaking up the outer loop? `for (ii = 1; ii <= n; ii += 10) { for (i=0; i< 10; i++) { D[i*m] = ii + i;...` or some variation that skips the use of `%`? Is that allowable?
• That might be (if the compiler is optimal) the fastest code one could write for `x % 10` in general for `unsigned x`. However, given constraints for a specific program, optimizations may be possible. We should not assert this is the fastest possible solution without qualification.
• @EricPostpischil Fair enough. Particularly in the case of wrapping in a loop, I would expect `if(i >= 10) i = 0;` to be superior unless the optimizer was exceptionally clever.
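The wrapping-counter idea from the comments can be sketched for the question's outer loop as follows. This is only an illustration under stated assumptions: the function name `fill_with_counter` is hypothetical, `D` is assumed to hold at least `9*m + 1` elements, and the inner loop is left out because the question elides it. Whether it is actually faster than `i % 10` has to be measured on the target compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the question's outer loop: a counter r that
 * wraps at 10 replaces the expression i % 10. */
static void fill_with_counter(int *D, int n, int m)
{
    int r = 1;                          /* equals i % 10, since i starts at 1 */

    assert(m > 0);
    for (int i = 1; i <= n; i++) {
        D[(size_t)r * (size_t)m] = i;   /* same slot as D[(i % 10) * m] */
        /* ... inner loop over j would go here ... */
        if (++r == 10) { r = 0; }       /* wrap instead of dividing */
    }
}
```

The counter costs one increment and one compare per iteration, and the comparison is easy for the branch predictor because it is taken only every tenth pass.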