Fast implementation of a large integer counter (in C/C++)

My goal is the following:

Generate successive values such that each new value has never been generated before, until all possible values have been generated. At that point, the counter starts the same sequence again. The main point is that all possible values are generated without repetition (until the period is exhausted). It does not matter whether the sequence is simply 0, 1, 2, 3, ..., or in some other order.

For example, if the range can be represented simply by an unsigned, then

void increment (unsigned &n) {++n;}

is enough. However, the integer range is larger than 64 bits. For example, in one place I need to generate a 256-bit sequence. A simple implementation looks like the following, just to illustrate what I am trying to do:

typedef std::array<uint64_t, 4> ctr_type;
static constexpr uint64_t max = ~((uint64_t) 0);
void increment (ctr_type &ctr)
{
    if (ctr[0] < max) {++ctr[0]; return;}
    if (ctr[1] < max) {++ctr[1]; return;}
    if (ctr[2] < max) {++ctr[2]; return;}
    if (ctr[3] < max) {++ctr[3]; return;}
    ctr[0] = ctr[1] = ctr[2] = ctr[3] = 0;
}

So if ctr starts with all zeros, ctr[0] is first increased one by one until it reaches max, then ctr[1], and so on. Once all 256 bits are set, we reset the counter to all zeros and start again.

The problem is that this implementation is surprisingly slow. My current improved version is roughly equivalent to the following:

void increment (ctr_type &ctr)
{
    std::size_t k = (!(~ctr[0])) + (!(~ctr[1])) + (!(~ctr[2])) + (!(~ctr[3]));
    if (k < 4)
        ++ctr[k];
    else
        memset(ctr.data(), 0, 32);

}

If the counter is only manipulated with the above increment function and always starts at zero, then ctr[k] == 0 whenever ctr[k - 1] has not yet reached the maximum. Thus the value k will be the index of the first element that is less than the maximum.

I expected the first to be faster, since a branch misprediction should happen only once in every 2^64 iterations. In the second, a misprediction happens only once every 2^256 iterations, but that should not make a difference. Apart from the branch, it needs four bitwise negations, four boolean negations, and three additions, which might cost much more than the first.

However, clang, gcc, and Intel icpc all generate binaries in which the second is much faster.

My main question: does anyone know a faster way to implement such a counter? It does not matter whether the counter starts by increasing the first integer, or whether it is implemented as an array of integers at all, as long as the algorithm generates all 2^256 combinations of 256 bits.

To make things more complicated, I also need non-uniform increments. For example, each time the counter is incremented by K where K > 1, but K almost always remains constant. My current implementation is similar to the above.

To provide some more context, one place I use the counters is as input to the AES-NI aesenc instruction. Each distinct 128-bit integer (loaded into an __m128i), after going through 10 (or 12 or 14, depending on the key size) rounds of the instruction, yields a distinct 128-bit integer. If I generate one __m128i integer at a time, the cost of the increment matters little. However, since aesenc has quite a bit of latency, I generate integers in blocks. For example, I might have 4 blocks, ctr_type block[4], initialized equivalently to the following:

block[0]; // initialized to zero
block[1] = block[0]; increment(block[1]);
block[2] = block[1]; increment(block[2]);
block[3] = block[2]; increment(block[3]);

Each time I need new output, I increment each block[i] by 4 and generate 4 __m128i outputs at once. By interleaving instructions, I was able to increase the overall throughput and reduce the cycles per byte of output (cpB) from 6 to 0.9 when using two 64-bit integers as the counter and 8 blocks. However, if I instead use four 32-bit integers as the counter, the throughput, measured in bytes per second, is cut in half. I know for a fact that on x86-64, 64-bit integers can be faster than 32-bit ones in some situations, but I did not expect such a simple increment operation to make such a big difference. I have carefully benchmarked the application, and the increment is indeed what slows down the program. Since loading into __m128i and storing the __m128i output into usable 32-bit or 64-bit integers are done through aligned pointers, the only difference between the 32-bit and 64-bit versions is how the counter is incremented. I expected that the AES-NI instructions, after loading the integers into __m128i, would dominate the performance. But when using 4 or 8 blocks, that was clearly not the case.

So to summarize, my main question is whether anyone knows a way to improve the above counter implementation.

It's not only slow, but impossible. The total energy of the universe is insufficient for 2^256 bit changes. And that would require a Gray code counter.

The next thing to do before optimizing is to fix the original implementation:

void increment (ctr_type &ctr)
{
    if (++ctr[0] != 0) return;
    if (++ctr[1] != 0) return;
    if (++ctr[2] != 0) return;
    ++ctr[3];
}

If each ctr[i] were not allowed to overflow to zero, the period would be just 4*(2^64), as in 0-9, 19, 29, 39, 49, ..., 99, 199, 299, ..., and 1999, 2999, 3999, ..., 9999.

As a reply to the comment: it takes 2^64 iterations to reach the first overflow. Being generous, up to 2^32 iterations could take place in a second, meaning that the program would have to run for 2^32 seconds to see the first carry out. That's about 136 years.
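
The difference between the two behaviours can be checked on a scaled-down model. The sketch below is an illustration only: it assumes two 8-bit words instead of four 64-bit ones so that the whole period can be enumerated, and the names small_ctr, inc_saturating, inc_carry and period are made up for this example.

```cpp
#include <array>
#include <cstdint>

using small_ctr = std::array<std::uint8_t, 2>;  // scaled-down stand-in for ctr_type

// Same logic as the question's first version: words stop at max, never wrap.
inline void inc_saturating(small_ctr &c)
{
    if (c[0] < 0xFF) { ++c[0]; return; }
    if (c[1] < 0xFF) { ++c[1]; return; }
    c[0] = c[1] = 0;
}

// Carry-propagating version: a word wraps to zero and carries into the next.
inline void inc_carry(small_ctr &c)
{
    if (++c[0] != 0) return;
    ++c[1];
}

// Number of increments until the counter returns to all zeros.
template <typename Inc>
unsigned long period(Inc inc)
{
    small_ctr c{};
    unsigned long n = 0;
    do { inc(c); ++n; } while (c[0] != 0 || c[1] != 0);
    return n;
}
```

Under these assumptions, period(inc_saturating) comes out as 2*255 + 1 = 511, while period(inc_carry) is 2^16 = 65536, which mirrors the gap between roughly 2^66 and 2^256 states for the full-size counter.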

EDIT

If the original implementation with 2^66 states is really what is wanted, then I'd suggest to change the interface and the functionality to something like:

  (*counter) += 1;
  while (*counter == 0)
  {
     counter++;  // Move to next word
     if (counter > tail_of_array) {
        counter = head_of_array;
        memset(counter,0, 16);
        break;
     }
  }

The point being, that the overflow is still very infrequent. Almost always there's just one word to be incremented.

If you're using GCC or a compiler with __int128 support, like Clang or ICC:

unsigned __int128 H = 0, L = 0;
L++;
if (L == 0) H++;
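
As a rough sketch, the two halves could be wrapped in a small type so that the increment is a single function call. The names ctr256, lo and hi are hypothetical, and unsigned __int128 is a compiler extension rather than standard C++.

```cpp
// unsigned __int128 is a GCC/Clang/ICC extension, not standard C++.
struct ctr256 {
    unsigned __int128 lo = 0;  // least significant 128 bits
    unsigned __int128 hi = 0;  // most significant 128 bits
};

// Increment the 256-bit value by one, wrapping to zero after 2^256 - 1.
inline void increment(ctr256 &c)
{
    ++c.lo;
    if (c.lo == 0) ++c.hi;  // carry out of the low half
}
```

The carry branch is taken only when the low half wraps, so in practice it is almost never executed.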

On systems where __int128 isn't available:

std::array<uint64_t, 4> c{};
c[0]++;
if (c[0] == 0)
{
    c[1]++;
    if (c[1] == 0)
    {
        c[2]++;
        if (c[2] == 0)
        {
            c[3]++;
        }
    }
}

In inline assembly it's much easier to do this using the carry flag. Unfortunately, most high-level languages have no means to access it directly. Some compilers do have intrinsics for adding with carry, like __builtin_uaddll_overflow in GCC and __builtin_addcll in Clang.

Anyway, this is rather a waste of time, since the total number of particles in the universe is only about 10^80 and you cannot even count through a 64-bit counter in your lifetime.
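
A minimal sketch of how such an intrinsic might be used for the increment-by-K case follows. It assumes GCC 5 or later, or Clang, and uses __builtin_add_overflow, the type-generic relative of __builtin_uaddll_overflow, because uint64_t is not necessarily unsigned long long. The function name add_u64 is hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using ctr_type = std::array<std::uint64_t, 4>;

// Add k to a 256-bit counter stored least significant word first.
// __builtin_add_overflow is a GCC/Clang extension, not standard C++;
// it stores the wrapped sum and returns true if the addition overflowed.
inline void add_u64(ctr_type &c, std::uint64_t k)
{
    bool carry = __builtin_add_overflow(c[0], k, &c[0]);
    // Propagate the carry only while there is one; usually the loop body never runs.
    for (std::size_t i = 1; i < c.size() && carry; ++i)
        carry = __builtin_add_overflow(c[i], std::uint64_t{1}, &c[i]);
}
```

For example, adding 3 to a counter whose two low words are all ones leaves 2 in the lowest word, zero in the second, and carries 1 into the third.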

Neither of your counter versions increments correctly. Instead of counting up to UINT256_MAX, you are actually just counting up to UINT64_MAX 4 times and then starting back at 0 again. This is apparent from the fact that you do not clear any of the words that have reached the max value until all of them have reached the max value. If you are measuring performance based on how often the counter reaches all bits 0, then this is why. Thus your algorithms do not generate all combinations of 256 bits, which is a stated requirement.

Multi-word addition can easily be accomplished in a portable fashion by using three macros that mimic three types of addition instructions found on many processors:

  • ADDcc adds two words, and sets the carry if there was unsigned overflow
  • ADDC adds two words plus carry (from a previous addition)
  • ADDCcc adds two words plus carry, and sets the carry if there was unsigned overflow

A multi-word addition with two words uses ADDcc of the least significant words followed by ADDC of the most significant words. A multi-word addition with more than two words forms the sequence ADDcc, ADDCcc, ..., ADDC. The MIPS architecture is a processor architecture without condition codes and therefore without a carry flag. The macro implementations shown below basically follow the techniques used on MIPS processors for multi-word additions.

The ISO-C99 code below shows the operation of a 32-bit counter and a 64-bit counter based on 16-bit "words". I chose arrays as the underlying data structure, but one might also use struct, for example. Use of a struct will be significantly faster if each operand only comprises a few words, as the overhead of array indexing is eliminated. One would want to use the widest available integer type for each "word" for best performance. In the example from the question that would likely be a 256-bit counter comprising four uint64_t components.

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>

#define ADDCcc(a,b,cy,t0,t1) \
  (t0=(b)+cy, t1=(a), cy=t0<cy, t0=t0+t1, t1=t0<t1, cy=cy+t1, t0=t0)

#define ADDcc(a,b,cy,t0,t1) \
  (t0=(b), t1=(a), t0=t0+t1, cy=t0<t1, t0=t0)

#define ADDC(a,b,cy,t0,t1) \
  (t0=(b)+cy, t1=(a), t0+t1)

typedef uint16_t T;

/* increment a multi-word counter comprising n words */
void inc_array (T *counter, const T *increment, int n)
{
    T cy, t0, t1;
    counter [0] = ADDcc (counter [0], increment [0], cy, t0, t1);
    for (int i = 1; i < (n - 1); i++) {
        counter [i] = ADDCcc (counter [i], increment [i], cy, t0, t1);
    }
    counter [n-1] = ADDC (counter [n-1], increment [n-1], cy, t0, t1);
}

#define INCREMENT (10)
#define UINT32_ARRAY_LEN (2)
#define UINT64_ARRAY_LEN (4)

int main (void)
{
    uint32_t count32 = 0, incr32 = INCREMENT;
    T count_arr2 [UINT32_ARRAY_LEN] = {0};
    T incr_arr2  [UINT32_ARRAY_LEN] = {INCREMENT};
    do {
        count32 = count32 + incr32;
        inc_array (count_arr2, incr_arr2, UINT32_ARRAY_LEN);
    } while (count32 < (0U - INCREMENT - 1));
    printf ("count32 = %08x  arr_count = %08x\n", 
            count32, (((uint32_t)count_arr2 [1] << 16) +
                      ((uint32_t)count_arr2 [0] <<  0)));

    uint64_t count64 = 0, incr64 = INCREMENT;
    T count_arr4 [UINT64_ARRAY_LEN] = {0};
    T incr_arr4  [UINT64_ARRAY_LEN] = {INCREMENT};
    do {
        count64 = count64 + incr64;
        inc_array (count_arr4, incr_arr4, UINT64_ARRAY_LEN);
    } while (count64 < 0xa987654321ULL);
    printf ("count64 = %016llx  arr_count = %016llx\n", 
            count64, (((uint64_t)count_arr4 [3] << 48) + 
                      ((uint64_t)count_arr4 [2] << 32) +
                      ((uint64_t)count_arr4 [1] << 16) +
                      ((uint64_t)count_arr4 [0] <<  0)));
    return EXIT_SUCCESS;
}

Compiled with full optimization, the 32-bit example executes in about a second, while the 64-bit example runs for about a minute on a modern PC. The output of the program should look like so:

count32 = fffffffa  arr_count = fffffffa
count64 = 000000a987654326  arr_count = 000000a987654326

Non-portable code that is based on inline assembly or proprietary extensions for wide integer types may execute about two to three times as fast as the portable solution presented here.
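
For the 256-bit case from the question, the same comparison-based carry detection could be written directly over four uint64_t words. This is only a sketch in C++ rather than C99, and the function name add256 is made up for illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using ctr_type = std::array<std::uint64_t, 4>;

// c += inc modulo 2^256, least significant word first.
// The carry is detected portably: an unsigned sum wrapped around
// exactly when the result is smaller than one of the operands.
inline void add256(ctr_type &c, const ctr_type &inc)
{
    std::uint64_t carry = 0;
    for (std::size_t i = 0; i < c.size(); ++i) {
        std::uint64_t t = c[i] + carry;    // add the incoming carry
        std::uint64_t c1 = (t < carry);    // 1 if that addition wrapped
        std::uint64_t s = t + inc[i];      // add the increment word
        std::uint64_t c2 = (s < t);        // 1 if that addition wrapped
        c[i] = s;
        carry = c1 | c2;                   // at most one of the two can be set
    }
}
```

Unlike the macro version, the last word here still computes a carry that is then discarded, which keeps the loop uniform at the cost of two unused comparisons.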

Comments
  • How quickly do you expect a 64 bit counter to overflow? By my math, it would take ~300 years if you could update it at 2GHz. A 256 bit counter is mind bogglingly huge.
  • Are you attempting a brute force attack on AES256???
  • Why did someone downvote this question?
  • Is it really 2^256 combinations you want, or 2^66, which is what you have written? Anyway, during anybody's lifetime, an early exit dominates the performance.
  • The speed of light isn't fast enough to let you brute force AES-256. E=MC^2 just isn't enough energy even when you throw in the entire mass-energy of the observable universe. (dark matter included)
  • Traditionally one uses an add-with-carry instruction to efficiently implement multi-precision add, but this is sadly not available in C or C++ without inline asm.
  • Thanks for highlighting 128-bit implementation in GCC, available since ... 2003 (more than 10 years ago). GCC generates nearly-optimal branch-free code from H += (++L == 0)
  • @Pierre officially it was announced in GCC 4.4 which was released in 2009, but 128-bit int support was already available from GCC 4.1 around 2006-2007 although I don't know the first version that introduced this feature
  • No, it won't. The first carry out may not happen for hundreds of years.
  • suppose you can reach 2^64 in 1 month, then you'll need 2^192 months to reach 2^256 which is ~5.23e56 years