## Decimal concatenation of two integers using bit operations

I want to concatenate two integers using only bit operations since i need efficiency as much as possible.There are various answers available but they are not fast enough what I want is implementation that uses only bit operations like left shift,or and etc. Please guide me how to do it.

example

int x=32; int y=12; int result=3212;

I am having and FPGA implentation of AES.I need this on my system to reduce time consumption of some task

The most efficient way to do it, is likely something similar to this:

uint32_t uintcat (uint32_t ms, uint32_t ls) { uint32_t mult=1; do { mult *= 10; } while(mult <= ls); return ms * mult + ls; }

Then let the compiler worry about optimization. There's likely not a lot it can improve since this is base 10, which doesn't get along well with the various instructions of the computer, like shifting.

**EDIT : BENCHMARK TEST**

Intel i7-3770 2 3,4 GHz OS: Windows 7/64 Mingw, GCC version 4.6.2 gcc -O3 -std=c99 -pedantic-errors -Wall 10 million random values, from 0 to 3276732767.

Result (approximates):

Algorithm 1: 60287 micro seconds Algorithm 2: 65185 micro seconds

Benchmark code used:

#include <stdint.h> #include <stdio.h> #include <windows.h> #include <time.h> uint32_t uintcat (uint32_t ms, uint32_t ls) { uint32_t mult=1; do { mult *= 10; } while(mult <= ls); return ms * mult + ls; } uint32_t myConcat (uint32_t a, uint32_t b) { switch( (b >= 10000000) ? 7 : (b >= 1000000) ? 6 : (b >= 100000) ? 5 : (b >= 10000) ? 4 : (b >= 1000) ? 3 : (b >= 100) ? 2 : (b >= 10) ? 1 : 0 ) { case 1: return a*100+b; break; case 2: return a*1000+b; break; case 3: return a*10000+b; break; case 4: return a*100000+b; break; case 5: return a*1000000+b; break; case 6: return a*10000000+b; break; case 7: return a*100000000+b; break; default: return a*10+b; break; } } static LARGE_INTEGER freq; static void print_benchmark_results (LARGE_INTEGER* start, LARGE_INTEGER* end) { LARGE_INTEGER elapsed; elapsed.QuadPart = end->QuadPart - start->QuadPart; elapsed.QuadPart *= 1000000; elapsed.QuadPart /= freq.QuadPart; printf("%lu micro seconds", elapsed.QuadPart); } int main() { const uint32_t TEST_N = 10000000; uint32_t* data1 = malloc (sizeof(uint32_t) * TEST_N); uint32_t* data2 = malloc (sizeof(uint32_t) * TEST_N); volatile uint32_t* result_algo1 = malloc (sizeof(uint32_t) * TEST_N); volatile uint32_t* result_algo2 = malloc (sizeof(uint32_t) * TEST_N); srand (time(NULL)); // Mingw rand() apparently gives numbers up to 32767 // worst case should therefore be 3,276,732,767 // fill up random data in arrays for(uint32_t i=0; i<TEST_N; i++) { data1[i] = rand(); data2[i] = rand(); } QueryPerformanceFrequency(&freq); LARGE_INTEGER start, end; // run algorithm 1 QueryPerformanceCounter(&start); for(uint32_t i=0; i<TEST_N; i++) { result_algo1[i] = uintcat(data1[i], data2[i]); } QueryPerformanceCounter(&end); // print results printf("Algorithm 1: "); print_benchmark_results(&start, &end); printf("\n"); // run algorithm 2 QueryPerformanceCounter(&start); for(uint32_t i=0; i<TEST_N; i++) { result_algo2[i] = myConcat(data1[i], data2[i]); } QueryPerformanceCounter(&end); // print results printf("Algorithm 2: "); print_benchmark_results(&start, &end); printf("\n\n"); // sanity check both algorithms against each other for(uint32_t i=0; i<TEST_N; i++) { if(result_algo1[i] != result_algo2[i]) { printf("Results mismatch for %lu %lu. Expected: %lu%lu, algo1: %lu, algo2: %lu\n", data1[i], data2[i], data1[i], data2[i], result_algo1[i], result_algo2[i]); } } // clean up free((void*)data1); free((void*)data2); free((void*)result_algo1); free((void*)result_algo2); }

**How to concatenate two Integer values into one?,** Given two integers n1 and n2, the task is to concatenate these two integers into one integer. Convert both numbers to string; Concatenate both strings into one, as this is comparatively easy; Convert using namespace std; Check if a Number is Odd or Even using Bitwise Operators · Program to print half Diamond star Given two integers M and N the task is to find the number formed by concatenating the binary equivalents of M and N i.e. M + N. Examples: Input: M = 4, N = 5 Output: 37 Binary equivalent of 4 is 100 and for 5 it is 101 after concatenation, the resultant binary number formed is 100101 whose decimal equivalent is 37. Input: M = 3, N = 4 Output: 28

Bit operations use the binary representation of the numbers. However what you try to achieve is to concatenate the numbers in decimal notation. Please note that concatenating the decimal representations has little to do with concatenating the binary representations. Though it is theoretically possible to solve the problem using binary operations I am sure it will be far from the most efficient way.

**Find the number obtained after concatenation of binary ,** Given two integers M and N the task is to find the number formed by concatenating the Approach: Convert both the numbers M and N into there binary equivalents then concatenate these numbers as M + N and print the decimal equivalent of the resulting binary number formed after concatenation. using namespace std;. About Bitwise Calculator . The Bitwise Calculator is used to perform bitwise AND, bitwise OR, bitwise XOR (bitwise exclusive or) operations on two integers. It is also possible to perform bit shift operations on integral types.

We need to calculate a*10^N + b very fast.

Bit operations isn't a best idea to optimize it (even using tricks like a := (a<<1) + (a<<3) <=> a := a*10 as compiler can make it himself).

The first problem is to calculate 10^N, but there is no need to calculate it, there is just 9 possible values.

The second problem is to calculate N from b (length of 10 representation). If your data have uniform distribution you can minimize operations count in average case.

Check b <= 10^9, b <= 10^8, ..., b <= 10 with ()?: (it's faster than if( ) after optimizations, it has much simplier grammar and functionality), call the result N. Next, make switch(N) with lines "return a*10^N + b" (where 10^N is constant). As I know, switch() with 3-4 "case" is faster than same if( ) construction after optimizations.

unsigned int myConcat(unsigned int& a, unsigned int& b) { switch( (b >= 10000000) ? 7 : (b >= 1000000) ? 6 : (b >= 100000) ? 5 : (b >= 10000) ? 4 : (b >= 1000) ? 3 : (b >= 100) ? 2 : (b >= 10) ? 1 : 0 ) { case 1: return a*100+b; break; case 2: return a*1000+b; break; case 3: return a*10000+b; break; case 4: return a*100000+b; break; case 5: return a*1000000+b; break; case 6: return a*10000000+b; break; case 7: return a*100000000+b; break; default: return a*10+b; break; // I don't really know what to do here //case 8: return a*1000*1000*1000+b; break; //case 9: return a*10*1000*1000*1000+b; break; } }

As you can see, there is 2-3 operations in average case + optimisations is very effective here. I've benchmarked it in comparison with Lundin's suggestion, here is the result. 0ms vs 100ms

**Sams Teach Yourself Microsoft Access 2002 Programming in 24 Hours,** Bitwise operations only use the integer value of numbers. Converting 5 to The two concatenation operators are two of the most commonly used operators. ans = '1011'. Another common use of bitset is to convert a vector of binary digits into decimal format. For example, use a loop to set the individual bits of the integer 11001101. data = [1 1 0 0 1 1 0 1]; n = length (data); dec = 0b0u8; for k = 1:n dec = bitset (dec,n+1-k,data (k)); end dec. dec = uint8 205.

If you care about decimal digit concatenation, you might want to simply do that as you're printing, and convert both numbers to a sequence of digits sequentially. e.g. How do I print an integer in Assembly Level Programming without printf from the c library? shows an efficient C function, as well as asm. Call it twice into the same buffer.

@Lundin's answer loops increasing powers of 10 to find the right decimal-shift, i.e. a linear search for the right power of 10. If it's called very frequently so a lookup table can stay hot in cache, a speedup maybe be possible.

**If you can use GNU C __builtin_clz (Count Leading Zeros) or some other way of quickly finding the MSB position of the right-hand input (ls, the least-significant part of the resulting concatenation), you can start the search for the right mult from a 32-entry lookup table.** (And you only have to check at most one more iteration, so it's not a loop.)

Most of the common modern CPU architectures have a HW instruction that the compiler can use directly or with a bit of processing to implement clz. https://en.wikipedia.org/wiki/Find_first_set#Hardware_support. (And on all but x86, the result is well-defined for an input of 0, but unfortunately GNU C doesn't portably give us access to that.)

** If the table stays hot in L1d cache, this can be pretty good**. The extra latency of a

`clz`

and a table lookup are comparable to a couple iterations of the loop (on a modern x86 like Skylake or Ryzen for example, where `bsf`

or `tzcnt`

is 3 cycle latency, L1d latency is 4 or 5 cycles, `imul`

latency is 3 cycles.)Of course, on many architectures (including x86), multiplying by 10 is cheaper than by a runtime variable, using shift and add. 2 LEA instructions on x86, or an `add`

+`lsl`

on ARM/AArch64 using a shifted input to do `tmp = x + x*4`

with the add. So on Intel CPUs, we're only looking at a 2-cycle loop-carried dependency chain, not 3. But AMD CPUs have slower LEA when using a scaled index.

This doesn't sound great for small numbers. But **it can reduce branch mispredictions by needing at most one iteration. It even makes a branchless implementation possible**. And it means less total work for large lower parts (large powers of 10). But large integers will easily overflow unless you use a wider result type.

Unfortunately, 10 is not a power of 2, so the MSB position alone can't give us the exact power of 10 to multiply by. e.g. all numbers from 64 to 127 all have MSB = `1<<7`

, but some of them have 2 decimal digits and some have 3. Since we want to avoid division (because it requires a multiplication by a magic constant and shifting the high half), we want to always start with the lower power of 10 and see if that's big enough.

**But fortunately, a bitscan does get us within one power of 10 so we no longer need a loop.**

I probably wouldn't have written the part with `_lzcnt_u32`

or ARM `__clz`

if I'd learned of the `clz(a|1)`

trick for avoiding problems with input=0 beforehand. But I did, and played around with the source a bit to try to get nicer asm from gcc and clang. Index the table on clz or BSR depending on target platform makes it a bit of a mess.

#include <stdint.h> #include <limits.h> #include <assert.h> // builtin_clz matches Intel's docs for x86 BSR: garbage result for input=0 // actual x86 HW leaves the destination register unmodified; AMD even documents this. // but GNU C doesn't let us take advantage with intrinsics. // unless you use BMI1 _lzcnt_u32 // if available, use an intrinsic that gives us a leading-zero count // *without* an undefined result for input=0 #ifdef __LZCNT__ // x86 CPU feature #include <immintrin.h> // Intel's intrinsics #define HAVE_LZCNT32 #define lzcnt32(a) _lzcnt_u32(a) #endif #ifdef __ARM__ // TODO: do older ARMs not have this? #define HAVE_LZCNT32 #define lzcnt32(a) __clz(a) // builtin, no header needed #endif // Some POWER compilers define `__cntlzw`? // index = msb position, or lzcnt, depending on which the HW can do more efficiently // defined later; one or the other is unused and optimized out, depending on target platform // alternative: fill this at run-time startup // with a loop that does mult*=10 when (x<<1)-1 > mult, or something //#if INDEX_BY_MSB_POS == 1 __attribute__((unused)) static const uint32_t catpower_msb[] = { 10, // 1 and 0 10, // 2..3 10, // 4..7 10, // 8..15 100, // 16..31 // 2 digits even for the low end of the range 100, // 32..63 100, // 64..127 1000, // 128..255 // 3 digits 1000, // 256..511 1000, // 512..1023 10000, // 1024..2047 10000, // 2048..4095 10000, // 4096..8191 10000, // 8192..16383 100000, // 16384..32767 100000, // 32768..65535 // up to 2^16-1, enough for 16-bit inputs // ... // fill in the rest yourself }; //#elif INDEX_BY_MSB_POS == 0 // index on leading zeros __attribute__((unused)) static const uint32_t catpower_lz32[] = { // top entries overflow: 10^10 doesn't fit in uint32_t // intentionally wrong to make it easier to spot bad output. 4000000000, // 2^31 .. 2^32-1 2*10^9 .. 4*10^9 2000000000, // 1,073,741,824 .. 2,147,483,647 // first correct entry 1000000000, // 536,870,912 .. 1,073,741,823 // ... fill in the rest // for testing, skip until 16 leading zeros [16] = 100000, // 32768..65535 // up to 2^16-1, enough for 16-bit inputs 100000, // 16384..32767 10000, // 8192..16383 10000, // 4096..8191 10000, // 2048..4095 10000, // 1024..2047 1000, // 512..1023 1000, // 256..511 1000, // 128..255 100, // 64..127 100, // 32..63 100, // 16..31 // low end of the range has 2 digits 10, // 8..15 10, // 4..7 10, // 2..3 10, // 1 // lzcnt32(0) == 32 10, // 0 // treat 0 as having one significant digit. }; //#else //#error "INDEX_BY_MSB_POS not set correctly" //#endif //#undef HAVE_LZCNT32 // codegen for the other path, for fun static inline uint32_t msb_power10(uint32_t a) { #ifdef HAVE_LZCNT32 // 0-safe lzcnt32 macro available #define INDEX_BY_MSB_POS 0 // a |= 1 would let us shorten the table, in case 32*4 is a lot nicer than 33*4 bytes unsigned lzcnt = lzcnt32(a); // 32 for a=0 return catpower_lz32[lzcnt]; #else // only generic __builtin_clz available static_assert(sizeof(uint32_t) == sizeof(unsigned) && UINT_MAX == (1ULL<<32)-1, "__builtin_clz isn't 32-bit"); // See also https://foonathan.net/blog/2016/02/11/implementation-challenge-2.html // for C++ templates for fixed-width wrappers for __builtin_clz #if defined(__i386__) || defined(__x86_64__) // x86 where MSB_index = 31-clz = BSR is most efficient #define INDEX_BY_MSB_POS 1 unsigned msb = 31 - __builtin_clz(a|1); // BSR return catpower_msb[msb]; //return unlikely(a==0) ? 10 : catpower_msb[msb]; #else // use clz directly while still avoiding input=0 // I think all non-x86 CPUs with hardware CLZ do define clz(0) = 32 or 64 (the operand width), // but gcc's builtin is still documented as not valid for input=0 // Most ISAs like PowerPC and ARM that have a bitscan instruction have clz, not MSB-index // set the LSB to avoid the a==0 special case unsigned clz = __builtin_clz(a|1); // table[32] unused, could add yet another #ifdef for that #define INDEX_BY_MSB_POS 0 //return unlikely(a==0) ? 10 : catpower_lz32[clz]; return catpower_lz32[clz]; // a|1 avoids the special-casing #endif // optimize for BSR or not #endif // HAVE_LZCNT32 } uint32_t uintcat (uint32_t ms, uint32_t ls) { // if (ls==0) return ms * 10; // Another way to avoid the special case for clz uint32_t mult = msb_power10(ls); // catpower[clz(ls)]; uint32_t high = mult * ms; #if 0 if (mult <= ls) high *= 10; return high + ls; #else // hopefully compute both and then select // because some CPUs can shift and add at the same time (x86, ARM) // so this avoids having an ADD *after* the cmov / csel, if the compiler is smart uint32_t another10 = high*10 + ls; uint32_t enough = high + ls; return (mult<=ls) ? another10 : enough; #endif }

**From the Godbolt compiler explorer**, this compiles efficiently for x86-64 with and without BSR:

# clang7.0 -O3 for x86-64 SysV, -march=skylake -mno-lzcnt uintcat(unsigned int, unsigned int): mov eax, esi or eax, 1 bsr eax, eax # 31-clz(ls|1) mov ecx, dword ptr [4*rax + catpower_msb] imul edi, ecx # high = mult * ms lea eax, [rdi + rdi] lea eax, [rax + 4*rax] # retval = high * 10 cmp ecx, esi cmova eax, edi # if(mult>ls) retval = high (drop the *10 result) add eax, esi # retval += ls ret

Or *with* lzcnt, (enabled by `-march=haswell`

or later, or some AMD uarches),

uintcat(unsigned int, unsigned int): # clang doesn't try to break the false dependency on EAX; gcc uses xor eax,eax lzcnt eax, esi # C source avoids the |1, saving instructions mov ecx, dword ptr [4*rax + catpower_lz32] imul edi, ecx # same as above from here on lea eax, [rdi + rdi] lea eax, [rax + 4*rax] cmp ecx, esi cmova eax, edi add eax, esi ret

Factoring the last `add`

out of both sides of the ternary is a missed optimization, adding 1 cycle of latency after the `cmov`

. We can multiply by 10 and add just as cheaply as multiplying by 10 alone, on Intel CPUs:

... same start # hand-optimized version that clang should use imul edi, ecx # high = mult * ms lea eax, [rdi + 4*rdi] # high * 5 lea eax, [rsi + rdi*2] # retval = high * 10 + ls add edi, esi # tmp = high + ls cmp ecx, esi cmova eax, edi # if(mult>ls) retval = high+ls ret

So the `high + ls`

latency would run in parallel with the `high*10 + ls`

latency, both needed as inputs for `cmov`

.

GCC branches instead of using CMOV for the last condition. GCC also makes a mess of `31-clz(a|1)`

, calculating `clz`

with `BSR`

and `XOR`

with 31. But then subtracting that from 31. And it has some extra `mov`

instructions. Strangely, gcc seems to do better with that BSR code when `lzcnt`

is available, even though it chooses not to use it.

clang has no trouble optimizing away the `31-clz`

double-inversion and just using BSR directly.

For PowerPC64, clang also makes branchless asm. gcc does something similar, but with a branch like on x86-64.

uintcat: .Lfunc_gep0: addis 2, 12, .TOC.-.Lfunc_gep0@ha addi 2, 2, .TOC.-.Lfunc_gep0@l ori 6, 4, 1 # OR immediate addis 5, 2, catpower_lz32@toc@ha cntlzw 6, 6 # CLZ word addi 5, 5, catpower_lz32@toc@l # static table address rldic 6, 6, 2, 30 # rotate left and clear immediate (shift and zero-extend the CLZ result) lwzx 5, 5, 6 # Load Word Zero eXtend, catpower_lz32[clz] mullw 3, 5, 3 # mul word cmplw 5, 4 # compare mult, ls mulli 6, 3, 10 # mul immediate isel 3, 3, 6, 1 # conditional select high vs. high*10 add 3, 3, 4 # + ls clrldi 3, 3, 32 # zero extend, clearing upper 32 bits blr # return

##### Compressing the table

Using `clz(ls|1) >> 1`

or that +1 should work, because 4 < 10. The table always takes at least 3 entries to gain another digit. I haven't investigated this. (And have already spent longer than I meant to on this. :P)

**Or right-shift a lot more to just get a starting point for the loop.** e.g. `mult = clz(ls) >= 18 ? 100000 : 10;`

, or a 3 or 4-way chain of `if`

.

Or loop on `mult *= 100`

, and after exiting that loop sort out whether you want `old_mult * 10`

or `mult`

. (i.e. check if you went too far). This cuts the iteration count in half for even numbers of digits.

(**Watch out for a possible infinite loop on large ls that would overflow the result.** If

`mult *= 100`

wraps to 0, it will always stay `<= ls`

for `ls = 1000000000`

, for example.)**Bit Manipulation,** For characters, we use ASCII representation, which are in the form of integers AND ( & ): Bitwise AND is a binary operator that operates on two equal-length bit shift the some number of bits, in the given bit pattern, to the left and append 0 In C, the following 6 operators are bitwise operators (work at bit-level) The & (bitwise AND) in C or C++ takes two numbers as operands and does AND on every bit of two numbers. The result of AND is 1 only if both bits are 1.

**Bitwise And Bit-Shift Operators,** All of the bit-manipulation operators require two integer arguments, which we will often The result of combining two bits with a bitwise operator is a single bit, so the easiest way 0) { // then configure the file for writing in append mode } . the left hand operand in binary but will now view the right hand operand in decimal. In words, this operation takes the binary value 0000 0000 1110 0011 1000 1110 0011 1000 (14913080 is the equivalent decimal value - notice that it's just a series of 3 0's and 3 1's repeated a few times) and shifts it 50 places left. But since an Integer is only 32 bits long, shifting it 50 places is meaningless.

**How to concatenate two integer numbers in Turbo C,** You can concatenate by converting them into string using sprintf() in C. Here is the How do I convert from an integer in decimal to an int in binary without using an of operating on big integers (at least 50 decimal digits) using C language? On a 16 bit DSP two Q15 numbers are multiplied to get a Q30 number with two sign bits. On the 320C50 there are two ways to accomplish this. The first is to use the p-scaler immediately after the multiplier, or the postscaler after the accumulator. to shift the result to the left by one . Floating Point Arithmetic. This section is not complete

**Built-in Types,** Two more operations with the same syntactic priority, in and not in , are Numeric literals containing a decimal point or an exponent sign yield floating point numbers. Appending 'j' or 'J' to a numeric literal yields an imaginary number (a Python fully supports mixed arithmetic: when a binary arithmetic operator has Solution for Homework 2 Problem 1 a. What is the minimum number of bits that are required to uniquely represent the characters of English alphabet? (Consider upper case characters alone) The number of unique bit patterns using i bits is 2i. We need at least 26 unique bit patterns. The cleanest approach is to compute log 2