Programming for multi-core environments

CPUs with multiple cores are now the norm, but getting optimal performance out of them is challenging. I recently read Parallel Programming in C with MPI and OpenMP by Michael Quinn, a book that, although released in 2004, remains relevant and current.

Dr. Quinn introduces two technologies available in C (and in Fortran as well), MPI and OpenMP. Both aim at taking advantage of multiple cores, but they offer very different levels of abstraction and performance.

MPI

MPI (Message Passing Interface) is a framework with roots in the supercomputing world of the 1980s. MPI can be used to take advantage not only of multi-core CPUs, but also of commodity computer clusters and supercomputers. From a programmer's perspective, it consists of a library of plain C functions, all named MPI_*, that specify low-level communication between workers: separate processes that play a role somewhat similar to threads, but each with its own memory space.

Here’s an example that approximates \pi through the definite integral:

\int_0^1 \frac{4}{1+x^2}\,dx = \pi
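
Concretely, the code below evaluates this integral with the midpoint rule: with n subintervals of width h = 1/n,

\pi \approx h \sum_{i=1}^{n} \frac{4}{1 + \big( (i - 0.5)\, h \big)^2}

and each worker accumulates every numprocs-th term of this sum before the partial results are combined.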

This translates into the C program:

/*  example from MPICH  */
#include "mpi.h"
#include <stdio.h>
#include <math.h>

int main(int argc,char *argv[])
{
    int done = 0, n, myid, numprocs, i;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;
    double startwtime = 0.0, endwtime;
    int  namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD,&myid);
    MPI_Get_processor_name(processor_name,&namelen);

    fprintf(stdout,"Process %d of %d on %s\n",
	    myid, numprocs, processor_name);

    n = 0;
    while (!done)
    {
        if (myid == 0)
        {
            /*
            printf("Enter the number of intervals: (0 quits) ");
            scanf("%d",&n);
            */
            /* interactive prompt disabled: run once with n = 10000, then quit */
            if (n == 0) n = 10000; else n = 0;

            startwtime = MPI_Wtime();
        }
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (n == 0)
            done = 1;
        else
        {
            h   = 1.0 / (double) n;
            sum = 0.0;
	    /* A slightly better approach starts from large i and works back */
            for (i = myid + 1; i <= n; i += numprocs)
            {
                x = h * ((double)i - 0.5);
                sum += 4.0 / (1.0 + x*x);
            }
            mypi = h * sum;

            MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

            if (myid == 0)
            {
                printf("pi is approximately %.16f, Error is %.16f\n",
                       pi, fabs(pi - PI25DT));
                endwtime = MPI_Wtime();
                printf("wall clock time = %f\n", endwtime - startwtime);
                fflush(stdout);
            }
        }
    }
    MPI_Finalize();
    return 0;
}

This code, the classic cpi example distributed with MPICH, is taken from the National Center for Computer Sciences. It illustrates several features of MPI.

At the start of the program, MPI_Init is called with the command-line arguments as parameters. The workers are not created by MPI_Init itself: the program is launched as several processes (for example with mpiexec -n 4), MPI_Init sets up the MPI environment in each of them, and MPI_Comm_rank gives each worker its own ID (its rank). Worker 0 is the one meant to ask the user for the number of intervals to use for the integration; in this version the prompt is commented out and n is simply set to 10000.

MPI_Bcast acts differently depending on whether it is executed by the root worker (worker 0 here, given as the 4th argument to MPI_Bcast) or by any of the others. For the other workers, execution blocks until the value sent by worker 0 has arrived. After the call, every worker has its own copy of n. A broadcast is necessary because each worker lives in its own memory space (recall that the workers need not even be running on the same computer). MPI supports many different kinds of communication: broadcasts, one-to-one messages, scatters, gathers, and so on.
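
To give a flavor of the one-to-one style, here is a minimal sketch (not part of the original example) of how worker 0 could have sent n to every other worker with MPI_Send/MPI_Recv instead of broadcasting it, assuming myid, numprocs and n are set up as above:

/* sketch only: replace the MPI_Bcast call with explicit one-to-one messages */
if (myid == 0)
{
    int dest;
    for (dest = 1; dest < numprocs; dest++)
        MPI_Send(&n, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);      /* tag 0 */
}
else
{
    MPI_Status status;
    MPI_Recv(&n, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);    /* from worker 0 */
}

In practice MPI_Bcast is the better choice here: a good implementation broadcasts in roughly log2(numprocs) communication steps rather than the numprocs - 1 sends above.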

One convenient piece of functionality is the ability to perform reductions easily. A reduction is the result of cascading a binary associative operation; addition (+) is the standard example. To add all the elements of an 800-element vector, you might send 100 elements to each of 8 cores and have each core sum its chunk; you are then left with 8 partial sums, one per core, which must be combined into the total, and that last step can be implemented as a reduction. There are tricks to speed up reductions, and they are hardware-specific. In a high-latency environment, you might gather all 8 partial sums onto one core in a single communication step and add them there in 7 additions. In a low-latency environment, you could instead send the last 4 partial sums to the first 4 cores and add, then send the 3rd and 4th results to the 1st and 2nd cores and add, and finally send the 2nd result to the 1st core and add (3 communication steps, 3 computation steps). How MPI_Reduce is performed is left to the implementation, so different flavors of MPI can use reduction strategies adapted to their target deployment environment.
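
As a rough serial sketch of that low-latency (tree) strategy (not from the book or the original post), suppose partial[] holds the 8 per-core partial sums; pairwise combining then finishes in log2(8) = 3 rounds:

/* sketch: pairwise (tree) combination of 8 partial sums; in a real MPI run
   the additions within each round would happen concurrently on different cores */
double partial[8] = {1, 2, 3, 4, 5, 6, 7, 8};   /* hypothetical per-core partial sums */
int stride, k;
for (stride = 4; stride >= 1; stride /= 2)      /* 3 rounds: stride = 4, 2, 1 */
    for (k = 0; k < stride; k++)
        partial[k] += partial[k + stride];      /* core k adds the value sent by core k + stride */
/* partial[0] now holds the total (36 for the values above) */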

One thing I should mention: I have not been able to figure out how to use MPI in a Matlab mex file. If somebody figures it out, give me a shout.

OpenMP

MPI gives some hardware independence, and it can be very fast because the programmer works quite close to the metal, but it requires significant effort on the programmer's part. For the lazier among us, OpenMP offers the possibility of taking advantage of multiple processors with very little effort; the drawbacks are that the implementation is rather opaque, the control over communication is less fine-grained, and the results are typically not as fast as with MPI.

OpenMP uses a form of meta-programming through compiler directives: the parallelism is expressed with #pragma omp annotations layered on top of ordinary C. Here is some C that performs a similar computation to the previous MPI example:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define NUM_STEPS 100000000
/*main program*/
int main(int argc, char *argv[]) {
   int i;
   double x, pi;
   double sum = 0.0;
   double step = 1.0/(double) NUM_STEPS;
   int nthreads;
   /* do computation -- using all available threads */
   #pragma omp parallel
   {
       #pragma omp master
       {
           nthreads = omp_get_num_threads();
       }
       #pragma omp for private(x) reduction(+:sum) schedule(runtime)
       for (i=0; i < NUM_STEPS; ++i) {
           x = (i+0.5)*step;
           sum = sum + 4.0/(1.0+x*x);
       }
       #pragma omp master
       {
           pi = step * sum;
       }
   }
   /* print results */
   printf("parallel program results with %d threads:\n", nthreads);
   printf("pi = %g  (%17.15f)\n",pi, pi);
   return EXIT_SUCCESS;
}

I found this code here. The parallelization is done by declaring a parallel region and then using #pragma omp for private(x) reduction(+:sum) schedule(runtime), which splits the loop across threads and performs a reduction, similar to the previous MPI example. Because the parallelization hints live in pragmas, the resulting C code is still (mostly) valid as a single-threaded program; it is much less obtrusive than MPI.
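
To see that concretely, the same kind of loop can be compiled with OpenMP enabled (for example with gcc's -fopenmp flag) or as a plain serial program, in which case the pragma is simply ignored; only calls into the OpenMP runtime need a guard. Here is a minimal sketch, not taken from the original source:

/* sketch: the pragma is ignored by a compiler without OpenMP support,
   so only the omp_get_max_threads() call needs the _OPENMP guard */
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif
#define NUM_STEPS 100000000

int main(void) {
    int i, nthreads = 1;                   /* serial fallback */
    double sum = 0.0, step = 1.0/(double) NUM_STEPS;
#ifdef _OPENMP
    nthreads = omp_get_max_threads();
#endif
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < NUM_STEPS; ++i) {
        double x = (i + 0.5) * step;       /* declared inside the loop, hence private */
        sum += 4.0 / (1.0 + x*x);
    }
    printf("pi = %.15f (%d threads)\n", step * sum, nthreads);
    return 0;
}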

The way the for loop is split is left to OpenMP by the schedule(runtime) clause, which defers the choice of schedule to run time (it can then be set through the OMP_SCHEDULE environment variable); other scheduling options such as static, dynamic and guided are available, but the control is not as fine-grained as in MPI. By default, variables are shared, and private variables must be marked explicitly; this is the opposite of MPI, where everything is private and any exchange between workers must be made explicit.
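
For illustration, here is a small self-contained loop (not from the original post) that makes both choices explicit: a static schedule with chunks of 1000 iterations, and default(none) so that the sharing status of every variable must be spelled out:

/* sketch: explicit scheduling and data-sharing clauses */
#include <stdio.h>
#define N 1000000

int main(void) {
    static double a[N], b[N];   /* static storage so the arrays do not live on the stack */
    double tmp;
    int j;

    for (j = 0; j < N; ++j)
        b[j] = (double) j;

    #pragma omp parallel for schedule(static, 1000) default(none) \
            shared(a, b) private(tmp)
    for (j = 0; j < N; ++j) {   /* the loop index j is private automatically */
        tmp = 2.0 * b[j];
        a[j] = tmp + 1.0;
    }

    printf("a[10] = %f\n", a[10]);   /* expect 21.0 */
    return 0;
}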

I tried OpenMP recently in a mex project, and it took only a few minutes to obtain a 2x speedup. Since this was on a quad-core machine, that speedup does not fully exploit the available cores, but the ratio of speedup to effort was high nonetheless.

Conclusion

I hope these simple examples give you a flavor of the different options for C code parallelization in multi-core environments. Darren Wilkinson has some examples of how to implement parallel Monte Carlo with MPI. On the OpenMP side, there’s an intro to OpenMP usage in mex files over at Walking Randomly. Perhaps in another post I will explore how to program in many-core environments, especially GPUs.
