## Fast modulo 10 in C

I am looking for a fast modulo 10 algorithm because I need to speed up a program that performs many modulo operations inside loops.

I have checked out this page which compares some alternatives.
If I understand it correctly, T3 was the fastest of all.
My question is: what would `x % y` look like using the T3 technique?

I copied the T3 technique here for simplicity, in case the link goes down.

    for (int x = 0; x < max; x++) {
        if (y > (threshold - 1)) {
            y = 0; // reset
            total += x;
        }
        y += 1;
    }

Regarding the comments: if this is not really faster than the regular mod, I am looking for a modulo that is at least 2 times faster than `%`.
I have seen many examples that use a power of two, but since 10 is not one, how can I get it to work?

**Edit:**

For my program, let's say I have two `for` loops where `n=1 000 000` and `m=1000`.

It looks like this:

    for (i = 1; i <= n; i++) {
        D[(i%10)*m] = i;
        for (j = 1; j <= m; j++) {
            ...
        }
    }

Here's the fastest modulo-10 function you can write:

    unsigned mod10(unsigned x)
    {
        return x % 10;
    }

And here's what it looks like once compiled:

    movsxd rax, edi
    imul rcx, rax, 1717986919
    mov rdx, rcx
    shr rdx, 63
    sar rcx, 34
    add ecx, edx
    add ecx, ecx
    lea ecx, [rcx + 4*rcx]
    sub eax, ecx
    ret

Note the lack of division/modulus instructions, the mysterious constants, the use of an instruction which was originally intended for complex array indexing, etc. Needless to say, the compiler knows a lot of tricks to make your program as fast as possible. You'll rarely beat it on tasks like this.
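To make the "mysterious constants" concrete, here is a hedged sketch of the same multiply-and-shift trick written by hand. The names `mod10_signed` and `mod10_unsigned` are made up for illustration. The signed version mirrors the listing above (1717986919 is `0x66666667`, shifted by 34) and assumes that `>>` on a negative `int64_t` is an arithmetic shift, which the C standard leaves implementation-defined. The unsigned version uses `0xCCCCCCCD` with a shift of 35, the usual reciprocal constant for unsigned 32-bit division by 10.

```c
#include <stdint.h>

/* Signed x % 10 without a division instruction (illustrative name).
   Assumption: >> on a negative int64_t is an arithmetic shift. */
int32_t mod10_signed(int32_t x)
{
    int64_t prod = (int64_t)x * 1717986919;   /* x * ceil(2^34 / 10) */
    int32_t q = (int32_t)(prod >> 34);        /* rounds toward -infinity */
    q += (int32_t)((uint64_t)prod >> 63);     /* add 1 if x < 0: round toward zero */
    return x - q * 10;                        /* remainder = x - 10 * (x / 10) */
}

/* Unsigned x % 10 with the unsigned reciprocal constant (illustrative name). */
uint32_t mod10_unsigned(uint32_t x)
{
    /* 0xCCCCCCCD == ceil(2^35 / 10); the 64-bit product cannot overflow */
    uint32_t q = (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
    return x - q * 10u;
}
```

An optimizing compiler already turns `x % 10` into essentially this, so writing it by hand is unlikely to gain anything; the sketch is only meant to explain the listing.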

The code isn’t a direct substitute for modulo, it substitutes modulo *in that situation*. You can write your own `mod` by analogy (for `a`, `b` > 0):

    int mod(int a, int b)
    {
        while (a >= b)
            a -= b;
        return a;
    }

… but whether that’s faster than `%` is *highly* questionable.
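Applied to the loop in the question, "substituting modulo in that situation" could look like the sketch below: a counter `r` is kept equal to `i % 10` and reset when it reaches 10, so `%` is never evaluated. The function name `fill_running_remainder` and the assumption that `D` holds at least `10 * m` elements are illustrative and not taken from the question.

```c
#include <stddef.h>

/* Performs the same writes as D[(i % 10) * m] = i for i = 1..n,
   tracking the remainder in a running counter instead of using %.
   Assumption: D has room for at least 10 * m ints. */
void fill_running_remainder(int *D, int n, int m)
{
    int r = 1;                      /* invariant: r == i % 10 (1 % 10 == 1) */
    for (int i = 1; i <= n; i++) {
        D[(size_t)r * (size_t)m] = i;
        /* ... the inner loop over j would go here ... */
        if (++r == 10)              /* wrap instead of computing i % 10 */
            r = 0;
    }
}
```

Whether this beats the compiler's multiply-and-shift code depends on the rest of the loop body, so it is worth measuring before adopting it.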

You likely can't beat the compiler.

Debug build

    // int foo = x % 10;
    010341C5  mov         eax,dword ptr [x]
    010341C8  cdq
    010341C9  mov         ecx,0Ah
    010341CE  idiv        eax,ecx
    010341D0  mov         dword ptr [foo],edx

Retail build (doing some ninja math there...)

    // int foo = x % 10;
    00BD100E  mov         eax,66666667h
    00BD1013  imul        esi
    00BD1015  sar         edx,2
    00BD1018  mov         ecx,edx
    00BD101A  shr         ecx,1Fh
    00BD101D  add         ecx,edx
    00BD101F  lea         eax,[ecx+ecx*4]
    00BD1022  add         eax,eax
    00BD1024  sub         esi,eax

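This will work for (multiword) values larger than the machine word (but assuming a binary computer ...):

    #include <stdio.h>

    unsigned long mod10(unsigned long val)
    {
        unsigned res = 0;

        res = val & 0xf;                /* lowest nibble has weight 1 */
        while (res >= 10) { res -= 10; }

        /* every higher nibble has weight 16^k, and 16^k % 10 == 6 for k >= 1 */
        for (val >>= 4; val; val >>= 4) {
            res += 6 * (val & 0xf);
            while (res >= 10) { res -= 10; }
        }
        return res;
    }

    int main(int argc, char **argv)
    {
        unsigned long val;
        unsigned res;

        /* guard against a missing or non-numeric argument */
        if (argc < 2 || sscanf(argv[1], "%lu", &val) != 1) {
            fprintf(stderr, "usage: %s <number>\n", argv[0]);
            return 1;
        }
        res = mod10(val);
        printf("%lu -->%u\n", val, res);
        return 0;
    }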
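The "multiword" remark rests on the fact that 16 % 10 == 6 and 6 * 6 == 36, which is again 6 modulo 10, so every power of 16 above the zeroth (including 2^32) is congruent to 6. A minimal sketch of how that could extend to a number stored as several 32-bit words follows; the name `mod10_limbs` and the little-endian word layout are assumptions made for illustration, and each word is reduced with plain `%` to keep the example short.

```c
#include <stddef.h>
#include <stdint.h>

/* Remainder mod 10 of a multiword unsigned integer (illustrative name).
   Assumption: limbs[0] is the least significant 32-bit word. */
unsigned mod10_limbs(const uint32_t *limbs, size_t count)
{
    unsigned res = 0;
    for (size_t k = 0; k < count; k++) {
        unsigned digit = limbs[k] % 10u;
        /* word k has weight 2^(32k), which is 1 for k == 0 and 6 (mod 10) otherwise */
        res += (k == 0) ? digit : 6u * digit;
        res %= 10u;
    }
    return res;
}
```

For example, the two words `{0xFFFFFFFF, 0xFFFFFFFF}` represent 2^64 - 1, and the function combines 5 + 6 * 5 = 35 into a remainder of 5.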

UPDATE:
With some extra effort, you could get the algorithm free of multiplications, and *with the proper amount of optimisation* we can even get the recursive call inlined:

    static unsigned long mod10_1(unsigned long val)
    {
        unsigned char res = 0; // just to show that we don't need a big accumulator

        res = val & 0xf; // res can never be > 15
        if (res >= 10) { res -= 10; }

        for (val >>= 4; val; val >>= 4) {
            // 6 * nibble as 4 * nibble + 2 * nibble; this must be +, not |,
            // because the two shifted values can share set bits
            res += ((val & 0xf) << 2) + ((val & 0xf) << 1);
            res = mod10_1(res); // the recursive call
        }
        return res;
    }

And the result for mod10_1 appears to be mul/div free and almost without branches:

    mod10_1:
    .LFB25:
        .cfi_startproc
        movl    %edi, %eax
        andl    $15, %eax
        leal    -10(%rax), %edx
        cmpb    $10, %al
        cmovnb  %edx, %eax
        movq    %rdi, %rdx
        shrq    $4, %rdx
        testq   %rdx, %rdx
        je      .L12
        pushq   %r12
        .cfi_def_cfa_offset 16
        .cfi_offset 12, -16
        pushq   %rbp
        .cfi_def_cfa_offset 24
        .cfi_offset 6, -24
        pushq   %rbx
        .cfi_def_cfa_offset 32
        .cfi_offset 3, -32
    .L4:
        movl    %edx, %ecx
        andl    $15, %ecx
        leal    (%rcx,%rcx,2), %ecx
        leal    (%rax,%rcx,2), %eax
        movl    %eax, %ecx
        movzbl  %al, %esi
        andl    $15, %ecx
        leal    -10(%rcx), %r9d
        cmpb    $9, %cl
        cmovbe  %ecx, %r9d
        shrq    $4, %rsi
        leal    (%rsi,%rsi,2), %ecx
        leal    (%r9,%rcx,2), %ecx
        movl    %ecx, %edi
        movzbl  %cl, %ecx
        andl    $15, %edi
        testq   %rsi, %rsi
        setne   %r10b
        cmpb    $9, %dil
        leal    -10(%rdi), %eax
        seta    %sil
        testb   %r10b, %sil
        cmove   %edi, %eax
        shrq    $4, %rcx
        andl    $1, %r10d
        leal    (%rcx,%rcx,2), %r8d
        movl    %r10d, %r11d
        leal    (%rax,%r8,2), %r8d
        movl    %r8d, %edi
        andl    $15, %edi
        testq   %rcx, %rcx
        setne   %sil
        leal    -10(%rdi), %ecx
        andl    %esi, %r11d
        cmpb    $9, %dil
        seta    %bl
        testb   %r11b, %bl
        cmovne  %ecx, %edi
        andl    $1, %r11d
        andl    $240, %r8d
        leal    6(%rdi), %ebx
        setne   %cl
        movl    %r11d, %r8d
        andl    %ecx, %r8d
        leal    -4(%rdi), %ebp
        cmpb    $9, %bl
        seta    %r12b
        testb   %r8b, %r12b
        cmovne  %ebp, %ebx
        andl    $1, %r8d
        cmovne  %ebx, %edi
        xorl    $1, %ecx
        andl    %r11d, %ecx
        orb     %r8b, %cl
        cmovne  %edi, %eax
        xorl    $1, %esi
        andl    %r10d, %esi
        orb     %sil, %cl
        cmove   %r9d, %eax
        shrq    $4, %rdx
        testq   %rdx, %rdx
        jne     .L4
        popq    %rbx
        .cfi_restore 3
        .cfi_def_cfa_offset 24
        popq    %rbp
        .cfi_restore 6
        .cfi_def_cfa_offset 16
        movzbl  %al, %eax
        popq    %r12
        .cfi_restore 12
        .cfi_def_cfa_offset 8
        ret
    .L12:
        movzbl  %al, %eax
        ret
        .cfi_endproc
    .LFE25:
        .size   mod10_1, .-mod10_1
        .p2align 4,,15
        .globl  mod10
        .type   mod10, @function

##### Comments

- Do you really believe it will be faster than `x % y`?
- First check: Has your compiler writer perhaps also read this and already implemented an optimization for `x % 10`?
- Have you measured, benchmarked, and profiled to confirm that this is indeed a bottleneck in your program? Have you checked the (optimized) generated code? Perhaps your problem is less of a modulo problem and more of a cache problem?
- @mrRobot Consider optimizing the loop and not just the `%` calculation.
- How about breaking up the outer loop? `for (ii = 1; ii <= n; ii += 10) { for (i=0; i< 10; i++) { D[i*m] = ii + i;...` or some variation, and skip the use of `%`? Is that allowable?
- I was about to write a comment saying "...I'd be surprised if a DIV instruction doesn't figure prominently in the computation of remainder". Well - I'm surprised that a DIV instruction doesn't figure prominently in the computation of the remainder. :-)
- @BobJarvis Not the DIV instruction itself, but looking closely at that assembly, it's built from a div-by-constant, a mul-by-constant, and a subtraction.
- That might (if the compiler is optimal) be the fastest code one could write for `x % 10` in general for `unsigned x`. However, given constraints for a specific program, optimizations may be possible. We should not assert this is the fastest possible solution without qualification.
- The constant 1717986919 is 0x66666667. So, the number of the beast, expressed in base 16, extended to 32 bits by duplicating the top-most hex digit, plus one. Clearly a sign of the upcoming End Of The World!
- @EricPostpischil Fair enough. Particularly in the case of wrapping in a loop, I would expect `if(i >= 10) i = 0;` to be superior unless the optimizer was *exceptionally* clever.
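The "break up the outer loop" suggestion from the comments could be sketched as follows. This is one possible variation, not the commenter's exact code: the outer loop steps in blocks of 10 starting at a multiple of 10, so the inner index `r` equals `i % 10` by construction and the writes land in the same slots as in the question. The name `fill_blocked` and the assumption that `D` holds at least `10 * m` elements are illustrative.

```c
#include <stddef.h>

/* Performs the same writes as D[(i % 10) * m] = i for i = 1..n,
   with the outer loop split into blocks of 10 so that % is not needed.
   Assumptions: D has room for at least 10 * m ints, and n is small
   enough that base + 10 does not overflow int. */
void fill_blocked(int *D, int n, int m)
{
    for (int base = 0; base <= n; base += 10) {   /* base is a multiple of 10 */
        for (int r = 0; r < 10; r++) {
            int i = base + r;                     /* hence i % 10 == r */
            if (i < 1)
                continue;                         /* the original loop starts at 1 */
            if (i > n)
                return;                           /* last block may be partial */
            D[(size_t)r * (size_t)m] = i;
            /* ... the inner loop over j would go here ... */
        }
    }
}
```

As the comments point out, it is worth profiling before and after, since the inner loop over `m` elements may dominate the running time regardless of how the remainder is computed.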