Previous | Contents | Index |
For your convenience, the following examples have been adapted from the
ANSI X3H5 Parallel Extensions for FORTRAN document.
C.1 DO: A Simple Difference Operator
This example shows a simple parallel loop where the amount of work in each iteration is different. We used dynamic scheduling to get good load balancing. The end do has a nowait because there is an implicit barrier at the end parallel . Alternately, using the option -optimize=1 would have also eliminated the barrier .
subroutine do_1 (a,b,n) real a(n,n), b(n,n) !$omp parallel !$omp& shared(a,b,n) !$omp& private(i,j) !$omp do schedule(dynamic,1) do i = 2, n do j = 1, i b(j,i) = ( a(j,i) + a(j,i-1) ) / 2 enddo enddo !$omp end do nowait !$omp end parallel end |
Shows two parallel regions fused to reduce fork/join overhead. The first end do has a nowait because all the data used in the second do is different than all the data used in the first do .
subroutine do_2 (a,b,c,d,m,n) real a(n,n), b(n,n), c(m,m), d(m,m) !$omp parallel !$omp& shared(a,b,c,d,m,n) !$omp& private(i,j) !$omp do schedule(dynamic,1) do i = 2, n do j = 1, i b(j,i) = ( a(j,i) + a(j,i-1) ) / 2 enddo enddo !$omp end do nowait !$omp do schedule(dynamic,1) do i = 2, m do j = 1, i d(j,i) = ( c(j,i) + c(j,i-1) ) / 2 enddo enddo !$omp end do nowait !$omp end parallel end |
Routines do_3a and do_3b perform numerically equivalent computations, but because the parallel directive in routine do_3b is outside the do j loop, routine do_3b probably forms teams less often, and thus reduces overhead.
subroutine do_3a (a,b,m,n) real a(n,m), b(n,m) do j = 2, m !$omp parallel !$omp& shared(a,b,n,j) !$omp& private(i) !$omp do do i = 1, n a(i,j) = b(i,j) / a(i,j-1) enddo !$omp end do nowait !$omp end parallel enddo end subroutine do_3b (a,b,m,n) real a(n,m), b(n,m) !$omp parallel !$omp& shared(a,b,m,n) !$omp& private(i,j) do j = 2, m !$omp do do i = 1, n a(i,j) = b(i,j) / a(i,j-1) enddo !$omp end do nowait enddo !$omp end parallel end |
This example is identical to Section C.2 but uses sections instead of do . Here the speedup is limited to 2 because there are only two units of work, whereas in Section C.2 there are n-1 + m-1 units of work.
subroutine sections_1 (a,b,c,d,m,n) real a(n,n), b(n,n), c(m,m), d(m,m) !$omp parallel !$omp& shared(a,b,c,d,m,n) !$omp& private(i,j) !$omp sections !$omp section do i = 2, n do j = 1, i b(j,i) = ( a(j,i) + a(j,i-1) ) / 2 enddo enddo !$omp section do i = 2, m do j = 1, i d(j,i) = ( c(j,i) + c(j,i-1) ) / 2 enddo enddo !$omp end sections nowait !$omp end parallel end |
This example demonstrates how to use a single construct to update an element of the shared array a . The optional end do nowait after the first do is omitted because we need to wait at the end of the do before proceeding into the single .
subroutine sp_1a (a,b,n) real a(n), b(n) !$omp parallel !$omp& shared(a,b,n) !$omp& private(i) !$omp do do i = 1, n a(i) = 1.0 / a(i) enddo !$omp single a(1) = min( a(1), 1.0 ) !$omp end single !$omp do do i = 1, n b(i) = b(i) / a(i) enddo !$omp end do nowait !$omp end parallel end |
This example is identical to Section C.5 but uses different directives.
subroutine sections_sp_1 (a,b,n) real a(n), b(n) !$omp parallel !$omp& shared(a,b,n) !$omp& private(i) !$omp do do i = 1, n a(i) = 1.0 / a(i) enddo !$omp sections a(1) = min( a(1), 1.0 ) !$omp end sections !$omp do do i = 1, n b(i) = b(i) / a(i) enddo !$omp end do nowait !$omp end parallel end |
This example is identical to Section C.5 but uses different directives.
subroutine do_sp_1 (a,b,n) real a(n), b(n) !$omp parallel !$omp& shared(a,b,n) !$omp& private(i) !$omp do do i = 1, n a(i) = 1.0 / a(i) enddo !$omp end do !$omp do do i = 1, 1 a(1) = min( a(1), 1.0 ) enddo !$omp end do !$omp do do i = 1, n b(i) = b(i) / a(i) enddo !$omp end do nowait !$omp end parallel end |
This example is identical to Section C.1 but uses different directives.
subroutine paralleldo_1 (a,b,n) real a(n,n), b(n,n) !$omp parallel do !$omp& shared(a,b,n) !$omp& private(i,j) !$omp& schedule(dynamic,1) do i = 2, n do j = 1, i b(j,i) = ( a(j,i) + a(j,i-1) ) / 2 enddo enddo end |
This example is identical to Section C.4 but uses different directives. The maximum performance improvement is limited to the number of sections run in parallel, so this example has a maximum parallelism of 2.
subroutine sections_2 (a,b,c,d,m,n) real a(n,n), b(n,n), c(m,m), d(m,m) !$omp parallel sections !$omp& shared(a,b,c,d,m,n) !$omp& private(i,j) !$omp section do i = 2, n do j = 1, i b(j,i) = ( a(j,i) + a(j,i-1) ) / 2 enddo enddo !$omp section do i = 2, m do j = 1, i d(j,i) = ( c(j,i) + c(j,i-1) ) / 2 enddo enddo !$omp end parallel sections end |
This example demonstrates how to perform a reduction using partial sums while avoiding synchronization in the loop body.
subroutine reduction_1 (a,m,n,sum) real a(m,n) !$omp parallel !$omp& shared(a,m,n,sum) !$omp& private(i,j,local_sum) local_sum = 0.0 !$omp do do i = 1, n do j = 1, m local_sum = local_sum + a(j,i) enddo enddo !$omp end do nowait !$omp critical sum = sum + local_sum !$omp end critical !$omp end parallel end |
The above reduction could also use the REDUCTION () clause as follows:
subroutine reduction_2 (a,m,n,sum) real a(m,n) !$omp parallel do !$omp& shared(a,m,n) !$omp& private(i,j) !$omp& reduction(+:sum) do i = 1, n do j = 1, m local_sum = local_sum + a(j,i) enddo enddo end |
This example demonstrates the use of taskcommon privatizable common blocks.
subroutine tc_1 (n) common /shared/ a real a(100,100) common /private/ work real work(10000) !$omp threadprivate (/private/) ! this privatizes the ! common /private/ !$omp parallel !$omp& shared(n) !$omp& private(i) !$omp do do i = 1, n call construct_data() ! fills in array work() call use_data() ! uses array work() enddo !$omp end do nowait !$omp end parallel end |
In this example, the value 2 is printed because the master thread's copy of a variable in a threadprivate privatizable common block is accessed within a master section or in serial code sections. If a single was used in place of the master section, some single thread, but not necessarily the master thread, would set j to 2 and the printed result would be indeterminate.
subroutine tc_2 common /blk/ j !$omp threadprivate (/blk/) j = 1 !$omp parallel !$omp master j = 2 !$omp end master !$omp end parallel print *, j end |
This example demonstrates the use of instance parallel privatizable common blocks.
subroutine ip_1 (n) common /shared/ a real a(100,100) common /private/ work real work(10000) !$omp instance parallel (/private/) !$omp parallel !$omp& shared(n) !$omp& private(i) !$omp new (/private/) ! this privatizes the !$omp do ! common /private/ do i = 1, n call construct_data()! fills in array work() call use_data() ! uses array work() enddo !$omp end do nowait !$omp end parallel end |
This example demonstrates the use of an instance parallel common block first as a shared common block and then as a private common block. This would not be possible with taskcommon blocks because taskcommon blocks are always private.
subroutine ip_2 (n,m) common /shared/ a,b real a(100,100), b(100,100) common /private/ work real work(10000) !$omp instance parallel (/private/) !$omp parallel ! common /private/ is !$omp& shared(a,b,n) ! shared here since !$omp& private(i) ! no new appears !$omp do do i = 1, n work(i) = b(i,i) / 4.0 enddo !$omp end do nowait !$omp end parallel do i = 1, n do j = 1, m a(j,i) = work(i) * ( a(j-1,i) + a(j+1,i) x + a(j,i-1) + a(j,i+1) ) enddo enddo !$omp parallel !$omp& shared(m) !$omp& private(i) !$omp new (/private/) ! this privatizes the !$omp do ! common /private/ do i = 1, m call construct_data() ! fills in array work() call use_data() ! uses array work() enddo !$omp end do nowait !$omp end parallel end |
This example demonstrates two coding styles for reductions, one using the external routines omp_get_max_threads() and omp_get_thread_num() and the other using only OpenMP directives.
subroutine reduction_3a (n) real gx( 0:7 ) ! assume 8 processors do i = 0, omp_get_max_threads()-1 gx(i) = 0 enddo !$omp parallel !$omp& shared(a) !$omp& private(i,lx) lx = 0 !$omp do do i = 1, n lx = lx + a(i) enddo !$omp end do nowait gx( omp_get_thread_num() ) = lx !$omp end parallel x = 0 do i = 0, omp_get_max_threads()-1 x = x + gx(i) enddo print *, x end |
As is shown below, this example could have been written without the external routines:
subroutine reduction_3b (n) x = 0 !$omp parallel !$omp& shared(a,x) !$omp& private(i,lx) lx = 0 !$omp do do i = 1, n lx = lx + a(i) enddo !$omp end do nowait !$omp critical x = x + lx !$omp end critical !$omp end parallel print *, x end |
This example could have also been written more simply using the reduction() clause as follows:
subroutine reduction_3c (n) x = 0 !$omp parallel !$omp& shared(a) !$omp& private(i) !$omp do reduction(+:x) do i = 1, n x = x + a(i) enddo !$omp end do nowait !$omp end parallel print *, x end |
This example demonstrates three coding styles for temporary storage, one using the external routine omp_get_thread_num() and the other two using only directives.
subroutine local_1a (n) dimension a(100) common /cmn/ t( 100, 0:7 ) ! assume 8 processors ! max. !$omp parallel do !$omp& shared(a,t) !$omp& private(i) do i = 1, n do j = 1, n t(j, omp_get_thread_num()) = a(i) ** 2 enddo call work( t(1,omp_get_thread_num()) ) enddo end |
If t is not global, then the above could be accomplished by putting t in the private clause:
subroutine local_1b (n) dimension t(100) !$omp parallel do !$omp& shared(a) !$omp& private(i,t) do i = 1, n do j = 1, n t(j) = a(i) ** 2 enddo call work( t ) enddo end |
If t is global, then the instance parallel and new directives can be used instead:
subroutine local_1c (n) dimension t(100) common /cmn/ t !$omp instance parallel (/cmn/) !$omp parallel do !$omp& shared(a) !$omp& private(i) !$omp new (/cmn/) do i = 1, n do j = 1, n t(j) = a(i) ** 2 enddo call work ! access t from common /cmn/ enddo end |
Not all of the values of a and b are initialized in the loop before they are used (the rest of the values are produced by init_a and init_b ). Using firstprivate for a and b causes the initialization values produced by init_a and init_b to be copied into private copies of a and b for use in the loops:
subroutine dsq3_b (c,n) integer n real a(100), b(100), c(n,n), x, y call init_a( a, n ) call init_b( b, n ) !$omp parallel do shared(c,n) private(i,j,x,y) firstprivate(a,b) do i = 1, n do j = 1, i a(j) = calc_a(i) b(j) = calc_b(i) enddo do j = 1, n x = a(i) - b(i) y = b(i) + a(i) c(j,i) = x * y enddo enddo !$omp end parallel do print *, x, y end |
This example is similar to Section C.17 except it uses threadprivate common blocks. For threadprivate , copyin is used instead of firstprivate to copy initialization values from the shared (master) copy of /blk/ to the private copies:
subroutine dsq3_b_tc (c,n) integer n real a(100), b(100), c(n,n), x, y common /blk/ a,b !$omp threadprivate (/blk/) call init_a( a, n ) call init_b( b, n ) !$omp parallel do shared(c,n) private(i,j,x,y) copyin(a,b) do i = 1, n do j = 1, i a(j) = calc_a(i) b(j) = calc_b(i) enddo do j = 1, n x = a(i) - b(i) y = b(i) + a(i) c(j,i) = x * y enddo enddo !$omp end parallel do print *, x, y end |
This example is similar to Section C.17 except is uses instance parallel privatizable common blocks. For instance parallel , copy new is used instead of firstprivate to privatize the common block and to copy initialization values from the shared (master) copy of /blk/ to the private copies:
subroutine dsq3_b_ip (c,n) integer n real a(100), b(100), c(n,n), x, y common /blk/ a,b !$omp instance parallel (/blk/) call init_a( a, n ) call init_b( b, n ) !$omp parallel do shared(c,n) private(i,j,x,y) !$omp copy new (/blk/) do i = 1, n do j = 1, i a(j) = calc_a(i) b(j) = calc_b(i) enddo do j = 1, n x = a(i) - b(i) y = b(i) + a(i) c(j,i) = x * y enddo enddo !$omp end parallel do print *, x, y end |
Previous | Next | Contents | Index |