OpenMP Tutorial


Previous | Top | Next

3. Data Reduction

Consider the following piece of code from integrate.c . It integrates a function (in this case x2) by dividing the area under the function into lots of little rectangles, calculating their areas and then adding them all up.

    21:   /* time starts now */
    22:   start = clock();
    23: 
    24:   sum = 0;
    25:   for (i = 0; i<NUM_STEPS; i++) {
    26:     x = 2.0 * (double)i / (double)(NUM_STEPS); /* value of x */
    27:     sum += x * x / NUM_STEPS;
    28:   }
    30: 
    31:   /* we're done so stop the timer */
    32:   stop = clock();

The variable sum keeps a running tally of the sums of the areas of all of the rectangles. Now, we would like to parallelize this loop, but it is not clear yet what we should do with sum. It clearly cannot be private to each thread because each thread would end up with it's own version of the sum, neither of which is correct. It cannot be shared either. The reason for this is a little more complicated.

The above for loop is a pretty tight loop. The variable sum is likely kept in a register and never written to memory in well optimized code because those memory operations are quite slow compared to register operations. However, parallel processors or even parallel cores may share memory but they do not share registers. Thus, if sum were to be shared, each cycle through the loop, each thread would have to read sum from memory, add x * x / NUM_STEPS to it and write it back. That is two more memory operations than necessary.

In addition to the execution speed problem above, there is another issue that will cause two threads operating on a shared sum to reach the wrong result. It is necessary for each thread to access sum sequentially or very strange things can happen. As a simple example consider the following:

Before the iteration, sum contains the value 1, thread A wants to add the value 2, and thread B wants to add the value 3. If all goes well, sum should end up with a value of 6.

sum in shared memory sum in core A sum in core B Operation
1undefinedundefinedInitial configuration
11undefinedThread A fetches sum from memory
111Thread B fetches sum from memory
131Thread A adds 2 to sum
134Thread B adds 3 to sum
334Thread A writes sum back to memory
434Thread B writes sum back to memory

As you can see, the parallel process give the value 4 when it should give the value 6. How do we solve this problem? With data reduction as shown in the following code snippet from integrate_mp.c.

    24:   sum = 0;
    25: #pragma omp parallel for reduction(+: sum) 
    26:   for (i = 0; i<NUM_STEPS; i++) {
    27:     x = 2.0 * (double)i / (double)(NUM_STEPS); /* value of x */
    28:     sum += x * x / NUM_STEPS;
    29:   }

Note the reduction argument. Data reductions take the form reduction(operator: variable). What the reduction above (reduction(+: sum)) means is that each thread should get its own private copy of sum, but at the end of the loop, the two private copies should be added together. This solves both the speed and concurrent access problems described above.

Try it.

   C:\openmp>integrate
   found result 1.333333 in 2.015 seconds
   C:\openmp>integrate_mp
   found result 1.333333 in 1.171 seconds


Next: Putting It All Together