OpenMP Tutorial


Previous | Top | Next

2. Loops and Shared Data

Next we consider an algorithm useful for solving systems of linear equations. The algorithm is known as LU decomposition. The program LU.c generates a random matrix and then performs an LU decomposition on it. If you know what LU decomposition is, that's great. If you don't, the important algorithm is shown below and it is not important to understand its innerworkings in order to understand the example.

    28: 
    29:   /* The core algorithm */
    30:   for (k = 0; k<SIZE-1; k++) {
    31:     /* set col values to column k of A */
    32:     for (n = k; n<SIZE; n++) {
    33:       col[n] = A[n][k];
    34:     }
    35: 
    36:     /* scale values of A by multiplier */
    37:     for (n = k+1; n<SIZE; n++) {
    38:       A[k][n] /= col[k];
    39:     }
    40: 
    41:     /* set row values to row k of A */
    42:     for (n = k+1; n<SIZE; n++) {
    43:       row[n] = A[k][n];
    44:     }
    45: 
    46:     /* Here we update A by subtracting the appropriate values from row
    47:        and column.  Note that these adjustments to A can be done in
    48:        any order */
    49:     for (i = k+1; i<SIZE; i++) {
    50:       for (j = k+1; j<SIZE; j++) {
    51:         A[i][j] = A[i][j] - row[i] * col[j];
    52:       }
    53:     }
    54:   }
    55: 
    56:   /* we're done so stop the timer */
    57:   stop = clock();

Most of our computing time is spent in that last O(n2) inner loop. We can dramatically improve performance by adding one line. Or, if you are really lazy, you can download LU_mp.c.

    47:     /* Here we update A by subtracting the appropriate values from row
    48:        and column.  Note that these adjustments to A can be done in
    49:        any order */
    50: #pragma omp parallel for shared(A, row, col) 
    51:     for (i = k+1; i<SIZE; i++) {
    52:       for (j = k+1; j<SIZE; j++) {
    53:         A[i][j] = A[i][j] - row[i] * col[j];
    54:       }
    55:     }

Now instead of a parallel sections directive we used a parallel for directive. When the program reaches this directive, it will split into multiple threads. Each thread will take an iteration of the for loop. When a thread finishes an iteration it will get the next iteration until there are none left. At that point, the thread will wait until all threads are caught up and then execution will continue normally.

Note that the for loop must be in what is known as "canonical form." Basically, what this means is that the iterated variable must be a signed integer type and you can't do anything funny. The code below is bad.

    x=21;
    for (;--x;) {
      do something;
    }

There are a few other differences to take note of. Before, the two threads were acting completely independently of each other. Go back and look at the code again if you need to. When the two threads split, each thread got a private copy of each variable. That way each thread was able to iterate on the variable i without interfering with the other. One thread modified e and one modified pi, again without interfering with each other. This time the two threads need to modify the same array A[][]. To enable the sharing of variables between threads we declare them shared in the parallel for directive.

Another difference is that we did not define num_threads as we did last time. This time, we let the compiler decide for us. The number of parallel threads to run is determined by the following rules in order of priority.

  1. The number requested in the num_threads clause of the declaration.
  2. The number requested in the most recent call to the omp_set_num_threads() library function. (You will need to #include <omp.h> for this one)
  3. The value of the OMP_NUM_THREADS environment variable.
  4. Determined by implementation.

Thus, by not declaring num_threads() we have enabled the user to set the number of threads with the OMP_NUM_THREADS environment variable at run time, or let the compiler decide. Since our compiler is icl optimizing for a dual core processor, it will select 2.

Ok, now it's time to try running them (selected output shown):

  C:\openmp>icl /O2 /Qopenmp /F50000000 LU.c
  C:\openmp>icl /O2 /Qopenmp /F50000000 LU_mp.c
  LU_mp.c(52) : (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
  C:\openmp>LU
  Completed decomposition in 12.937 seconds
  C:\openmp>LU_mp
  Completed decomposition in 6.563 seconds

This time we have cut our execution time almost exactly in half. Note that because of the large array defined in the main method, we had to enlarge our stack above the default with the /F option.

One more thing before we go on to the next lesson. The compiler will define the macro _OPENMP when it is compiling using OpenMP. So, if you want your code to still compile when OpenMP is not available you can protect the OpenMP directives as follows:

    49: #ifdef _OPENMP
    50: #pragma omp parallel for shared(A, row, col) 
    51: #endif
    52:     for (i = k+1; i<SIZE; i++) {
    53:       for (j = k+1; j<SIZE; j++) {
    54:         A[i][j] = A[i][j] - row[i] * col[j];
    55:       }
    56:     }


Next: Data Reduction