Prefetching Examples?


Can anyone give an example or a link to an example which uses __builtin_prefetch in GCC (or just the asm instruction prefetcht0 in general) to gain a substantial performance advantage? In particular, I'd like the example to meet the following criteria:

  1. It is a simple, small, self-contained example.
  2. Removing the __builtin_prefetch instruction results in performance degradation.
  3. Replacing the __builtin_prefetch instruction with the corresponding memory access results in performance degradation.

That is, I want the shortest example showing __builtin_prefetch performing an optimization that couldn't be managed without it.

Here's an actual piece of code that I've pulled out of a larger project. (Sorry, it's the shortest one I can find that had a noticeable speedup from prefetching.) This code performs a very large in-place data transpose.

This example uses the SSE prefetch intrinsic (_mm_prefetch), which may emit the same instruction that GCC's __builtin_prefetch does.
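
For reference, the same hint can be expressed with GCC's builtin; here's a minimal sketch of the correspondence as I understand it (the helper name prefetch_t0 is just for illustration):

#include <xmmintrin.h>

//  Rough sketch of the correspondence, based on my reading of the GCC docs:
//  rw = 0 means the data will only be read, locality = 3 means "keep in all
//  cache levels", which should map to prefetcht0 / _MM_HINT_T0 on x86.
inline void prefetch_t0(const void *p){
    __builtin_prefetch(p, 0, 3);                     //  GCC builtin
    //_mm_prefetch((const char*)p, _MM_HINT_T0);     //  SSE intrinsic equivalent
}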

To run this example, you will need to compile it for x64 and have more than 4 GB of memory. You can run it with a smaller data size, but the run will be too short to time accurately.
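
The timings below are from Visual Studio, but if you want to try it with GCC instead, something along these lines should build it (assuming the source is saved as transpose.cpp; the exact flags weren't part of the original test):

g++ -O2 -m64 transpose.cpp -o transpose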

#include <iostream>
using std::cout;
using std::endl;

#include <emmintrin.h>
#include <malloc.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>

#define ENABLE_PREFETCH


#define f_vector    __m128d
#define i_ptr       size_t
inline void swap_block(f_vector *A,f_vector *B,i_ptr L){
    //  To be super-optimized later.

    f_vector *stop = A + L;

    do{
        f_vector tmpA = *A;
        f_vector tmpB = *B;
        *A++ = tmpB;
        *B++ = tmpA;
    }while (A < stop);
}
void transpose_even(f_vector *T,i_ptr block,i_ptr x){
    //  Transposes T.
    //  T contains x columns and x rows.
    //  Each unit is of size (block * sizeof(f_vector)) bytes.

    //Conditions:
    //  - 0 < block
    //  - 1 < x

    i_ptr row_size = block * x;
    i_ptr iter_size = row_size + block;

    //  End of entire matrix.
    f_vector *stop_T = T + row_size * x;
    f_vector *end = stop_T - row_size;

    //  Iterate each row.
    f_vector *y_iter = T;
    do{
        //  Iterate each column.
        f_vector *ptr_x = y_iter + block;
        f_vector *ptr_y = y_iter + row_size;

        do{

#ifdef ENABLE_PREFETCH
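            //  Prefetch the block that the next iteration of this inner loop will swap (one row further down the column).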
            _mm_prefetch((char*)(ptr_y + row_size),_MM_HINT_T0);
#endif

            swap_block(ptr_x,ptr_y,block);

            ptr_x += block;
            ptr_y += row_size;
        }while (ptr_y < stop_T);

        y_iter += iter_size;
    }while (y_iter < end);
}
int main(){

    i_ptr dimension = 4096;
    i_ptr block = 16;

    i_ptr words = block * dimension * dimension;
    i_ptr bytes = words * sizeof(f_vector);

    cout << "bytes = " << bytes << endl;
//    system("pause");

    f_vector *T = (f_vector*)_mm_malloc(bytes,16);
    if (T == NULL){
        cout << "Memory Allocation Failure" << endl;
        system("pause");
        exit(1);
    }
    memset(T,0,bytes);

    //  Perform in-place data transpose
    cout << "Starting Data Transpose...   ";
    clock_t start = clock();
    transpose_even(T,block,dimension);
    clock_t end = clock();

    cout << "Done" << endl;
    cout << "Time: " << (double)(end - start) / CLOCKS_PER_SEC << " seconds" << endl;

    _mm_free(T);
    system("pause");
}

When I run it with ENABLE_PREFETCH enabled, this is the output:

bytes = 4294967296
Starting Data Transpose...   Done
Time: 0.725 seconds
Press any key to continue . . .

When I run it with ENABLE_PREFETCH disabled, this is the output:

bytes = 4294967296
Starting Data Transpose...   Done
Time: 0.822 seconds
Press any key to continue . . .

So there's a 13% speedup from prefetching.

EDIT:

Here's some more results:

Operating System: Windows 7 Professional/Ultimate
Compiler: Visual Studio 2010 SP1
Compile Mode: x64 Release

Intel Core i7 860 @ 2.8 GHz, 8 GB DDR3 @ 1333 MHz
Prefetch   : 0.868
No Prefetch: 0.960

Intel Core i7 920 @ 3.5 GHz, 12 GB DDR3 @ 1333 MHz
Prefetch   : 0.725
No Prefetch: 0.822

Intel Core i7 2600K @ 4.6 GHz, 16 GB DDR3 @ 1333 MHz
Prefetch   : 0.718
No Prefetch: 0.796

2 x Intel Xeon X5482 @ 3.2 GHz, 64 GB DDR2 @ 800 MHz
Prefetch   : 2.273
No Prefetch: 2.666


Binary search is a simple example that could benefit from explicit prefetching. The access pattern in a binary search looks pretty much random to the hardware prefetcher, so there is little chance that it will accurately predict what to fetch.

In this example, I prefetch the two possible 'middle' locations of the next loop iteration in the current iteration. One of the prefetches will probably never be used, but the other will (unless this is the final iteration).

#include <time.h>
#include <stdio.h>
#include <stdlib.h>

int binarySearch(int *array, int number_of_elements, int key) {
    int low = 0, high = number_of_elements - 1, mid;
    while (low <= high) {
        mid = (low + high) / 2;
#ifdef DO_PREFETCH
        // prefetch the next mid if we descend into the upper half (low = mid + 1)
        __builtin_prefetch(&array[(mid + 1 + high) / 2], 0, 1);
        // prefetch the next mid if we descend into the lower half (high = mid - 1)
        __builtin_prefetch(&array[(low + mid - 1) / 2], 0, 1);
#endif

        if (array[mid] < key)
            low = mid + 1;
        else if (array[mid] == key)
            return mid;
        else if (array[mid] > key)
            high = mid - 1;
    }
    return -1;
}

int main() {
    int SIZE = 1024*1024*512;
    int *array = malloc(SIZE * sizeof(int));
    for (int i = 0; i < SIZE; i++) {
        array[i] = i;
    }
    int NUM_LOOKUPS = 1024*1024*8;
    srand(time(NULL));
    int *lookups = malloc(NUM_LOOKUPS * sizeof(int));
    for (int i = 0; i < NUM_LOOKUPS; i++) {
        lookups[i] = rand() % SIZE;
    }
    for (int i = 0; i < NUM_LOOKUPS; i++) {
        int result = binarySearch(array, SIZE, lookups[i]);
        (void)result;  // keep the result "used" so the compiler doesn't warn
    }
    free(array);
    free(lookups);
}

When I compile and run this example with DO_PREFETCH enabled, I see a 20% reduction in runtime:

 $ gcc c-binarysearch.c -DDO_PREFETCH -o with-prefetch -std=c11 -O3
 $ gcc c-binarysearch.c -o no-prefetch -std=c11 -O3

 $ perf stat -e L1-dcache-load-misses,L1-dcache-loads ./with-prefetch 

  Performance counter stats for './with-prefetch':

    356,675,702      L1-dcache-load-misses     #   41.39% of all L1-dcache hits  
   861,807,382      L1-dcache-loads                                             

   8.787467487 seconds time elapsed

 $ perf stat -e L1-dcache-load-misses,L1-dcache-loads ./no-prefetch 

 Performance counter stats for './no-prefetch':

   382,423,177      L1-dcache-load-misses     #   97.36% of all L1-dcache hits  
   392,799,791      L1-dcache-loads                                             

  11.376439030 seconds time elapsed

Notice that we are doing roughly twice as many L1 cache loads in the prefetch version. We're actually doing a lot more work, but the memory access pattern is friendlier to the pipeline. This also shows the tradeoff: while this block of code runs faster in isolation, we have loaded a lot of junk into the caches, and this may put more pressure on other parts of the application.
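
If that cache pollution turns out to matter, one option (a hypothetical variant I haven't benchmarked here) is to stop issuing prefetches once the remaining range spans only a few cache lines:

// Hypothetical variant: only prefetch while the remaining range still covers
// more than a handful of 64-byte cache lines, so the final iterations don't
// drag extra junk into L1. The "~4 cache lines" threshold is arbitrary.
int binarySearch_throttled(int *array, int number_of_elements, int key) {
    int low = 0, high = number_of_elements - 1, mid;
    while (low <= high) {
        mid = (low + high) / 2;
        if (high - low > 4 * 64 / (int)sizeof(int)) {
            __builtin_prefetch(&array[(mid + 1 + high) / 2], 0, 1);  // if we go right
            __builtin_prefetch(&array[(low + mid - 1) / 2], 0, 1);   // if we go left
        }
        if (array[mid] < key)
            low = mid + 1;
        else if (array[mid] == key)
            return mid;
        else
            high = mid - 1;
    }
    return -1;
}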


I learned a lot from the excellent answers provided by @JamesScriven and @Mystical. However, their examples give only a modest boost - the objective of this answer is to present a (I must confess, somewhat artificial) example where prefetching has a bigger impact (about a factor of 4 on my machine).

There are three possible bottlenecks on modern architectures: CPU speed, memory bandwidth, and memory latency. Prefetching is all about reducing the latency of memory accesses.

In a perfect scenario, where the latency corresponds to X calculation steps, we would have an oracle that tells us which memory we will access X calculation steps from now; the prefetch of this data could be launched and it would arrive just in time, X calculation steps later.

For a lot of algorithms we are (almost) in this perfect world. For a simple for-loop it is easy to predict which data will be needed X steps later. Out-of-order execution and other hardware tricks do a very good job here, concealing the latency almost completely.
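
As a quick illustration (a sketch, not one of the benchmarks in this answer): in a plain streaming loop like the one below, an explicit prefetch usually buys little or nothing, because the address PREFETCH_DIST elements ahead is trivially predictable and the hardware prefetcher hides the latency on its own. The function name and the distance of 16 elements are just illustrative choices.

//  Sequential access: the hardware prefetcher already recognizes this pattern,
//  so the explicit prefetch below is mostly redundant. Prefetching past the
//  end of the array is harmless - prefetch instructions do not fault.
double sum_all(const double *a, int n){
    const int PREFETCH_DIST = 16;   // arbitrary distance, in elements
    double sum = 0.0;
    for(int i = 0; i < n; i++){
        __builtin_prefetch(a + i + PREFETCH_DIST, 0, 1);
        sum += a[i];
    }
    return sum;
}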

That is the reason why there is such a modest improvement for @Mystical's example: the prefetcher is already pretty good - there is just not much room for improvement. The task is also memory-bound, so probably not much bandwidth is left, and bandwidth could be becoming the limiting factor. I could see at best around an 8% improvement on my machine.

The crucial insight from @JamesScriven's example: neither we nor the CPU knows the next access address before the current data is fetched from memory - this dependency is pretty important, because otherwise out-of-order execution could look ahead and the hardware would be able to prefetch the data. However, because we can speculate only one step ahead, there is not that much potential. I was not able to get more than 40% on my machine.

So let's rig the competition and prepare the data in such a way that we know which address will be accessed X steps ahead, but make it impossible for the hardware to find out, due to dependencies on not-yet-accessed data (see the whole program at the end of the answer):

//making random accesses to memory:
unsigned int next(unsigned int current){
   return (current*10001+328)%SIZE;
}

//the actual work is happening here
void operator()(){

    //set up the oracle - let it see oracle_offset steps into the future
    unsigned int prefetch_index=0;
    for(int i=0;i<oracle_offset;i++)
        prefetch_index=next(prefetch_index);

    unsigned int index=0;
    for(int i=0;i<STEP_CNT;i++){
        //use oracle and prefetch memory block used in a future iteration
        if(prefetch){
            __builtin_prefetch(mem.data()+prefetch_index,0,1);    
        }

        //actual work, the less the better
        result+=mem[index];

        //prepare next iteration
        prefetch_index=next(prefetch_index);  //update oracle
        index=next(mem[index]);               //dependency on `mem[index]` is VERY important to prevent hardware from predicting future accesses
    }
}

Some remarks:

  1. The data is prepared in such a way that the oracle is always right.
  2. Maybe surprisingly, the less CPU-bound the task, the bigger the speed-up: we are able to hide the latency almost completely, so the speed-up is (CPU-time + original-latency-time) / CPU-time (see the worked numbers below).
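
To put made-up round numbers on point 2: if the CPU work per iteration took about 20 ns and the memory latency about 70 ns, the ideal speed-up would be (20 + 70) / 20 = 4.5, while heavier per-iteration work of, say, 100 ns with the same latency would only give (100 + 70) / 100 = 1.7.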

Compiling and executing gives the following output:

>>> g++ -std=c++11 prefetch_demo.cpp -O3 -o prefetch_demo
>>> ./prefetch_demo
#preloops   time no prefetch    time prefetch   factor
...
7   1.0711102260000001  0.230566831 4.6455521002498408
8   1.0511602149999999  0.22651144600000001 4.6406494398521474
9   1.049024333 0.22841439299999999 4.5926367389641687
....

that is, a speed-up between 4 and 5.


Listing of prefetch_demo.cpp:

//prefetch_demo.cpp

#include <vector>
#include <iostream>
#include <iomanip>
#include <chrono>

const int SIZE=1024*1024*1;
const int STEP_CNT=1024*1024*10;

unsigned int next(unsigned int current){
   return (current*10001+328)%SIZE;
}


template<bool prefetch>
struct Worker{
   std::vector<int> mem;

   double result;
   int oracle_offset;

   void operator()(){
        unsigned int prefetch_index=0;
        for(int i=0;i<oracle_offset;i++)
            prefetch_index=next(prefetch_index);

        unsigned int index=0;
        for(int i=0;i<STEP_CNT;i++){
            //prefetch memory block used in a future iteration
            if(prefetch){
                __builtin_prefetch(mem.data()+prefetch_index,0,1);    
            }
            //actual work:
            result+=mem[index];

            //prepare next iteration
            prefetch_index=next(prefetch_index);
            index=next(mem[index]);
        }
   }

   Worker(std::vector<int> &mem_):
       mem(mem_), result(0.0), oracle_offset(0)
   {}
};

template <typename Worker>
double timeit(Worker &worker){
    auto begin = std::chrono::high_resolution_clock::now();
    worker();
    auto end = std::chrono::high_resolution_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(end-begin).count()/1e9;
}


int main() {
    //set up the data in special way!
    std::vector<int> keys(SIZE);
    for (int i=0;i<SIZE;i++){
        keys[i] = i;
    }

    Worker<false> without_prefetch(keys);
    Worker<true> with_prefetch(keys);

    std::cout<<"#preloops\ttime no prefetch\ttime prefetch\tfactor\n";
    std::cout<<std::setprecision(17);

    for(int i=0;i<20;i++){
        //let oracle see i steps in the future:
        without_prefetch.oracle_offset=i;
        with_prefetch.oracle_offset=i;

        //calculate:
        double time_with_prefetch=timeit(with_prefetch);
        double time_no_prefetch=timeit(without_prefetch);

        std::cout<<i<<"\t"
                 <<time_no_prefetch<<"\t"
                 <<time_with_prefetch<<"\t"
                 <<(time_no_prefetch/time_with_prefetch)<<"\n";
    }
}


From the GCC documentation for __builtin_prefetch:

      for (i = 0; i < n; i++)
        {
          a[i] = a[i] + b[i];
          __builtin_prefetch (&a[i+j], 1, 1);
          __builtin_prefetch (&b[i+j], 0, 1);
          /* ... */
        }

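That snippet from the documentation is only a fragment; a self-contained version might look like the following (the function name add_arrays and the prefetch distance j = 32 are made-up, and for a purely sequential pattern like this the hardware prefetcher may well make the explicit prefetches redundant):

//  Stream through two arrays and prefetch j elements ahead.
//  rw = 1 for a[] because it is written back, rw = 0 for b[] (read only).
//  Prefetching past the end of the arrays is harmless - prefetches don't fault.
void add_arrays(double *a, const double *b, int n){
    const int j = 32;   // assumed prefetch distance, in elements
    for(int i = 0; i < n; i++){
        a[i] = a[i] + b[i];
        __builtin_prefetch(&a[i + j], 1, 1);
        __builtin_prefetch(&b[i + j], 0, 1);
    }
}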

Pre-fetching can be tuned to the cache line size, which for most modern 64-bit processors is 64 bytes, so that, for example, a uint32_t[16] is pre-loaded with one instruction.

For example, on ARMv8 I discovered through experimentation that casting the memory pointer to a uint32_t 4x4 matrix vector (which is 64 bytes in size) halved the number of prefetch instructions required. Before, I had to increment the offset by 8, as only half the data was being loaded, even though my understanding was that a prefetch fetches a full cache line.

Pre-fetching a uint32_t[32], original code example...

int addrindex = B[0];   // index into V, derived from B in the real code
__builtin_prefetch(&V[addrindex]);
__builtin_prefetch(&V[addrindex + 8]);
__builtin_prefetch(&V[addrindex + 16]);
__builtin_prefetch(&V[addrindex + 24]);

After...

int addrindex = B[0];   // index into V, derived from B in the real code
__builtin_prefetch((uint32x4x4_t *) &V[addrindex]);
__builtin_prefetch((uint32x4x4_t *) &V[addrindex + 16]);

For some reason, using the int datatype for the address index/offset gave better performance. Tested with GCC 8 on a Cortex-A53. Using an equivalent 64-byte vector on other architectures might give the same improvement if you find that, as in my case, not all the data is being pre-fetched. In my application, with a one-million-iteration loop, this alone improved performance by 5%. There were further requirements for the improvement:

The 128-megabyte "V" memory allocation had to be aligned to 64 bytes:

uint32_t *V __attribute__((__aligned__(64))) = (uint32_t *)(((uintptr_t)(__builtin_assume_aligned((unsigned char*)aligned_alloc(64,size), 64)) + 63) & ~ (uintptr_t)(63));

Also, I had to use C operators instead of NEON intrinsics, since those require regular datatype pointers (in my case uint32_t *); otherwise the new built-in prefetch method caused a performance regression.

My real-world example can be found at https://github.com/rollmeister/veriumMiner/blob/main/algo/scrypt.c in scrypt_core() and its internal function, which are easy to read. The hard work is done by GCC 8. The overall improvement to performance was 25%.


Comments
  • Interesting. Unfortunately on the two machines I tested (Macbook Pro with "Core 2 Duo" and a Linux machine with a "Quad-Core AMD Opteron Processor 2376") I didn't get a significant difference between the two versions. I suspect it has to do with cache size -- it looks like you have a better machine than those two. What do you think?
  • My machine is a Core i7 920 @ 3.5 GHz. 8MB L3 cache. This 10% speedup is more or less consistent on 3 other machines that I've tested: Core i7 2600K @ 4.6 GHz, and 2 x Xeon X5482 @ 3.2 GHz. But I'll admit that I've never tested it on a laptop or an AMD machine.
  • I just edited my answer with the benchmarks on all 4 machines that I tested. They're all Intel desktops/workstations. So that could be the reason. I didn't test if your 3rd point holds. It could be that substituting it with a memory access could produce the same result.
  • The third point is tricky to test due to out-of-order execution. In order for the third point to hold, you would need some 100 - 200 instructions between the load and the point where it's actually used. A stalled load will block the pipeline only after the re-order buffer fills up, but a prefetch won't. The only time you'll see the penalty of the stalled load is when you actually have enough instructions to overflow the re-order buffer... If you just replace my prefetch with a normal load, the compiler will probably optimize out the load as dead code... (which satisfies your last point, lol)
  • Yes, you'd have to add some sort of "dummy" thing to take in the memory access and then print its value so it wouldn't be optimized away - that's what I have been doing. Can you give me a link to more information about the stalled loads and re-order buffers you're describing? I think that might do me a world of good.
  • I expect that the CPU's hardware prefetcher would have prefetched this anyway. This is usually the cause of people discovering that "prefetch does nothing" - it really requires that the access pattern be something that a reasonably simple piece of logic analyzing access patterns cannot predict.
  • @Crowley9 I disagree that this is a bad answer. The OP wanted a simple example (probably to learn how to use it), and this answer provides that.
  • Older CPUs with less smart hardware prefetching benefited from software prefetching in more cases. I think even P4 would have been smart enough to HW prefetch sequential accesses to contiguous data, though. This is a terrible example because it's a case where the extra prefetch instructions just slow things down. @a3mlord: The OP wanted a performance win, not just the syntax.
  • This example is too short to answer the question.