How to get the CPU cycle count in x86_64 from C++?

I saw this post on SO which contains C code to get the latest CPU Cycle count:

CPU Cycle count based profiling in C/C++ Linux x86_64

Is there a way I can use this code in C++ (Windows and Linux solutions welcome)? Although it is written in C (and C being a subset of C++), I am not certain whether this code would work in a C++ project and, if not, how to translate it.

I am using x86-64

EDIT2:

Found this function but cannot get VS2010 to recognise the assembler. Do I need to include anything? (I believe I have to swap uint64_t for long long on Windows...?)

static inline uint64_t get_cycles()
{
  uint64_t t;
  __asm volatile ("rdtsc" : "=A"(t));
  return t;
}

EDIT3:

From above code I get the error:

"error C2400: inline assembler syntax error in 'opcode'; found 'data type'"

Could someone please help?

MSVC never supported inline asm for x86-64. The Time Stamp Counter (TSC) is a high-resolution counter inside the CPU on x86/Intel 64 architectures. It can be read through an assembler instruction (rdtsc), so the overhead is much lower than gettimeofday().

VC++ uses an entirely different syntax for inline assembly -- but only in the 32-bit versions. The 64-bit compiler doesn't support inline assembly at all.

In this case, that's probably just as well -- rdtsc has (at least) two major problems when it comes to timing code sequences. First (like most instructions) it can be executed out of order, so if you're trying to time a short sequence of code, the rdtsc before and after that code might both be executed before it, or both after it, or what have you (I am fairly sure the two will always execute in order with respect to each other though, so at least the difference will never be negative).

Second, on a multi-core (or multiprocessor) system, one rdtsc might execute on one core/processor and the other on a different core/processor. In such a case, a negative result is entirely possible.

Generally speaking, if you want a precise timer under Windows, you're going to be better off using QueryPerformanceCounter.
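
As a rough sketch of that advice, the helper below reads a monotonic high-resolution clock: QueryPerformanceCounter on Windows, and std::chrono::steady_clock elsewhere as a stand-in. The name now_ns is made up for illustration, and the <chrono> branch assumes a C++11 compiler, which is newer than VS2010.

```cpp
#include <chrono>
#include <cstdint>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: nanoseconds from a monotonic high-resolution clock.
// Only differences between two readings are meaningful.
inline std::int64_t now_ns() {
#ifdef _WIN32
    LARGE_INTEGER freq, count;
    QueryPerformanceFrequency(&freq);  // ticks per second
    QueryPerformanceCounter(&count);   // current tick count
    // Convert whole seconds and the remainder separately to avoid overflow.
    const std::int64_t whole = count.QuadPart / freq.QuadPart;
    const std::int64_t rest = count.QuadPart % freq.QuadPart;
    return whole * 1000000000LL + rest * 1000000000LL / freq.QuadPart;
#else
    using namespace std::chrono;
    return duration_cast<nanoseconds>(
               steady_clock::now().time_since_epoch()).count();
#endif
}
```

Reading now_ns() before and after the code under test and subtracting gives elapsed wall-clock time rather than a cycle count.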

If you really insist on using rdtsc, I believe you'll have to do it in a separate module written entirely in assembly language (or use a compiler intrinsic), then link it with your C or C++ code. I've never written that code for 64-bit mode, but in 32-bit mode it looks something like this:

   xor eax, eax
   cpuid
   xor eax, eax
   cpuid
   xor eax, eax
   cpuid
   rdtsc
   ; save eax, edx

   ; code you're going to time goes here

   xor eax, eax
   cpuid
   rdtsc

I know this looks strange, but it's actually right. You execute CPUID because it's a serializing instruction (can't be executed out of order) and is available in user mode. You execute it three times before you start timing because Intel documents the fact that the first execution can/will run at a different speed than the second (and what they recommend is three, so three it is).

Then you execute your code under test, another cpuid to force serialization, and the final rdtsc to get the time after the code finished.
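
For 64-bit builds, the same cpuid-then-rdtsc pairing could be expressed from C++ roughly as follows. This is only a sketch assuming x86-64: it uses the __cpuid and __rdtsc intrinsics from <intrin.h> under MSVC and GNU-style inline asm under GCC/Clang, and the name serialized_rdtsc is hypothetical.

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

// Hypothetical helper: execute cpuid (a serializing instruction), then rdtsc.
inline std::uint64_t serialized_rdtsc() {
#if defined(_MSC_VER)
    int regs[4];
    __cpuid(regs, 0);  // serialize; the returned values are ignored
    return __rdtsc();
#else
    std::uint32_t lo, hi;
    __asm__ __volatile__(
        "xor %%eax, %%eax\n\t"  // cpuid leaf 0
        "cpuid\n\t"             // wait for earlier instructions to finish
        "rdtsc"                 // low half in eax, high half in edx
        : "=a"(lo), "=d"(hi)
        :
        : "rbx", "rcx", "memory");  // cpuid also overwrites ebx and ecx
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
#endif
}
```

Following the answer's recipe, one would call serialized_rdtsc() a few times as a warm-up, then once before and once after the code under test, and subtract the two readings.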

Along with that, you want to use whatever means your OS supplies to force this all to run on one processor/core. In most cases, you also want to force the code alignment -- changes in alignment can lead to fairly substantial differences in execution speed.
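
A minimal sketch of pinning the current thread to one core, assuming SetThreadAffinityMask on Windows and sched_setaffinity on Linux with glibc (g++ defines _GNU_SOURCE by default, which the CPU_SET macros need). The helper name pin_to_cpu is an assumption for illustration; code alignment is not addressed here.

```cpp
#ifdef _WIN32
#include <windows.h>
#else
#include <sched.h>
#endif

// Hypothetical helper: restrict the calling thread to a single logical CPU.
// Returns true on success, false if the request was rejected.
inline bool pin_to_cpu(int cpu) {
    if (cpu < 0) return false;
#ifdef _WIN32
    if (cpu >= static_cast<int>(sizeof(DWORD_PTR) * 8)) return false;
    const DWORD_PTR mask = static_cast<DWORD_PTR>(1) << cpu;
    return SetThreadAffinityMask(GetCurrentThread(), mask) != 0;
#else
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // pid 0 means "the calling thread".
    return sched_setaffinity(0, sizeof(set), &set) == 0;
#endif
}
```

Calling pin_to_cpu(0) before taking any readings keeps both rdtsc executions on the same core, provided the OS allows the process to run there.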

Finally you want to execute it a number of times -- and it's always possible it'll get interrupted in the middle of things (e.g., a task switch), so you need to be prepared for the possibility of an execution taking quite a bit longer than the rest -- e.g., 5 runs that take ~40-43 clock cycles apiece, and a sixth that takes 10000+ clock cycles. Clearly, in the latter case, you just throw out the outlier -- it's not from your code.
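
One simple way to keep an interrupted run from skewing the result is to report the median of the samples instead of the mean. This swaps in a median for the manual "throw out the outlier" step described above; it is a sketch, and the name median_ticks is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: median of a set of tick counts.
// For an even number of samples this returns the upper of the two
// middle values. Returns 0 for an empty input.
inline std::uint64_t median_ticks(std::vector<std::uint64_t> samples) {
    if (samples.empty()) return 0;
    const std::size_t mid = samples.size() / 2;
    // Partially sort so that the element at index mid is in its final place.
    std::nth_element(samples.begin(), samples.begin() + mid, samples.end());
    return samples[mid];
}
```

With samples such as 40, 41, 42, 43, 41 and one interrupted run of 10000, the median stays in the 40s while a mean would be dominated by the outlier.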

Summary: managing to execute the rdtsc instruction itself is (almost) the least of your worries. There's quite a bit more you need to do before you can get results from rdtsc that will actually mean anything.

For Windows, Visual Studio provides a convenient "compiler intrinsic" (i.e. a special function, which the compiler understands) that executes the RDTSC instruction for you and gives you back the result:

unsigned __int64 __rdtsc(void);
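
A small wrapper along these lines should work in a C++ project on both platforms. It is a sketch that assumes MSVC provides __rdtsc() in <intrin.h> and that GCC/Clang provide it in <x86intrin.h>; the name read_tsc is made up for illustration.

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>     // MSVC: declares __rdtsc()
#else
#include <x86intrin.h>  // GCC/Clang: declares __rdtsc()
#endif

// Hypothetical helper: read the time stamp counter via the compiler intrinsic.
// No serialization is done here, so the read may be reordered by the CPU.
inline std::uint64_t read_tsc() {
    return __rdtsc();
}
```

Because the intrinsic is an ordinary function call from the compiler's point of view, no inline assembly is needed, which avoids the C2400 error on 64-bit MSVC.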

The processor time stamp records the number of clock cycles since the last reset; __rdtsc returns it as a 64-bit unsigned integer representing a tick count.

rdtsc counts reference cycles, not CPU core clock cycles. It counts at a fixed frequency regardless of turbo / power-saving, so if you want uops-per-clock analysis, use performance counters. rdtsc is exactly correlated with wall-clock time (except for system clock adjustments), so it's basically steady_clock.
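
Since the counter ticks at a fixed reference frequency, that frequency can be estimated by comparing rdtsc against std::chrono::steady_clock over a short window. The sketch below is one way to do that, not a calibrated method; estimate_tsc_hz and the 20 ms default window are assumptions.

```cpp
#include <chrono>
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <x86intrin.h>
#endif

// Hypothetical helper: estimate TSC ticks per second by busy-waiting for
// `window` and dividing the tick delta by the elapsed steady_clock time.
inline double estimate_tsc_hz(
    std::chrono::milliseconds window = std::chrono::milliseconds(20)) {
    using clock = std::chrono::steady_clock;
    const clock::time_point t0 = clock::now();
    const std::uint64_t c0 = __rdtsc();
    while (clock::now() - t0 < window) {
        // spin until the window has elapsed
    }
    const std::uint64_t c1 = __rdtsc();
    const clock::time_point t1 = clock::now();
    const double seconds = std::chrono::duration<double>(t1 - t0).count();
    return static_cast<double>(c1 - c0) / seconds;
}
```

Dividing a tick delta by this estimate converts it to seconds; it does not yield core clock cycles when the CPU is running at a turbo or power-saving frequency.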

Comments
  • "C++ being a subset of C" - did you mean that the other way around?
  • Visual Studio does not support assembly on x86-64.
  • @MarkRansom I presume you mean MSVC? I think I have the ICC compiler installed too and just to be sure I am just installing MinGW
  • To get uint64_t you should #include <stdint.h> (actually <cstdint> but your compiler is probably too old to have that one.)
  • @user997112, yes I meant MSVC. I completely forgot that you can substitute compilers in it since I've never tried it.
  • That's a nice way to package it.
  • FWIW, gcc 4.5 and newer include __rdtsc() -- #include <x86intrin.h> to get it. That header also includes many other Intel intrinsics found in Microsoft's <intrin.h>, and it gets included by default these days when you include most any of the SIMD headers -- emmintrin.h, xmmintrin.h, etc.
  • std::uint64_t x; asm volatile ("rdtsc" : "=A"(x)); is another way to read EAX and EDX together.
  • @Orient: only in 32-bit mode. In 64-bit mode, "=A" will pick either RAX or RDX.
  • Any reason you prefer inline asm for GNU compilers? <x86intrin.h> defines __rdtsc() for compilers other than MSVC, so you can just #ifdef _MSC_VER. I added an answer on this question, since it looks like a good place for a canonical about rdtsc intrinsics, and gotchas on how to use rdtsc.
  • The tsc doesn't necessarily tick at the "sticker frequency", but rather at the tsc frequency. On some machines these are the same, but on many recent machines (like Skylake client and derived uarchs) they are often not. For example, my i7-6700HQ sticker frequency is 2600 MHz, but the tsc frequency is 2592 MHz. They are probably not the same in cases where the different clocks they are based on can't be made to line up to exactly the same frequency when scaling the frequency by an integer. Many tools don't account for this difference, leading to small errors.
  • @BeeOnRope: Thanks, I hadn't realized that. That probably explains some not-quite-4GHz results I've seen from RDTSC stuff on my machine, like 4008 MHz vs. the sticker frequency of 4.0 GHz.
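
For GNU-style inline asm in 64-bit mode, the "=A" problem raised in the comments can be sidestepped by reading EAX and EDX through separate "=a" and "=d" outputs and combining them. This is a sketch for GCC/Clang on x86-64 only (MSVC x64 has no inline asm), and the function name is hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper for GCC/Clang: rdtsc leaves the low 32 bits in EAX
// and the high 32 bits in EDX, so bind each register to its own output.
inline std::uint64_t rdtsc_inline_asm() {
    std::uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return (static_cast<std::uint64_t>(hi) << 32) | lo;
}
```

This form gives the same value in 32-bit and 64-bit builds, whereas "=A" only names the EDX:EAX pair in 32-bit mode.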