OpenMP Tutorial
Previous | Top | Next
3. Data Reduction
Consider the following piece of code from integrate.c . It integrates a function (in this case x2) by dividing the area under the function into lots of little rectangles, calculating their areas and then adding them all up.
21: /* time starts now */
22: start = clock();
23:
24: sum = 0;
25: for (i = 0; i<NUM_STEPS; i++) {
26: x = 2.0 * (double)i / (double)(NUM_STEPS); /* value of x */
27: sum += x * x / NUM_STEPS;
28: }
30:
31: /* we're done so stop the timer */
32: stop = clock();
The variable sum keeps a running tally of the sums of the areas of all of the rectangles. Now, we would like to parallelize this loop, but it is not clear yet what we should do with sum. It clearly cannot be private to each thread because each thread would end up with it's own version of the sum, neither of which is correct. It cannot be shared either. The reason for this is a little more complicated.
The above for loop is a pretty tight loop. The variable sum is likely kept in a register and never written to memory in well optimized code because those memory operations are quite slow compared to register operations. However, parallel processors or even parallel cores may share memory but they do not share registers. Thus, if sum were to be shared, each cycle through the loop, each thread would have to read sum from memory, add x * x / NUM_STEPS to it and write it back. That is two more memory operations than necessary.
In addition to the execution speed problem above, there is another issue that will cause two threads operating on a shared sum to reach the wrong result. It is necessary for each thread to access sum sequentially or very strange things can happen. As a simple example consider the following:
Before the iteration, sum contains the value 1, thread A wants to add the value 2, and thread B wants to add the value 3. If all goes well, sum should end up with a value of 6.
| sum in shared memory | sum in core A | sum in core B | Operation |
|---|---|---|---|
| 1 | undefined | undefined | Initial configuration |
| 1 | 1 | undefined | Thread A fetches sum from memory |
| 1 | 1 | 1 | Thread B fetches sum from memory |
| 1 | 3 | 1 | Thread A adds 2 to sum |
| 1 | 3 | 4 | Thread B adds 3 to sum |
| 3 | 3 | 4 | Thread A writes sum back to memory |
| 4 | 3 | 4 | Thread B writes sum back to memory |
As you can see, the parallel process give the value 4 when it should give the value 6. How do we solve this problem? With data reduction as shown in the following code snippet from integrate_mp.c.
24: sum = 0;
25: #pragma omp parallel for reduction(+: sum)
26: for (i = 0; i<NUM_STEPS; i++) {
27: x = 2.0 * (double)i / (double)(NUM_STEPS); /* value of x */
28: sum += x * x / NUM_STEPS;
29: }
Note the reduction argument. Data reductions take the form reduction(operator: variable). What the reduction above (reduction(+: sum)) means is that each thread should get its own private copy of sum, but at the end of the loop, the two private copies should be added together. This solves both the speed and concurrent access problems described above.
Try it.
C:\openmp>integrate found result 1.333333 in 2.015 seconds C:\openmp>integrate_mp found result 1.333333 in 1.171 seconds
Next: Putting It All Together
|
|
Copyright 2006 kallipolis.com |