Previously, I discussed how you can take advantage of multiple cores in C. In day-to-day research, however, it’s more common to work with high-level languages like Matlab and Python. Although Matlab has been multithreaded for several years now, it’s not very good at making full use of all the cores in a computer. You can verify this yourself by running the `bench` function in Matlab and using `maxNumCompThreads` to manipulate the number of cores it uses; I also tried disabling all but one core in the BIOS, and the results were the same.
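As a sketch, here’s how you can restrict Matlab to a single computational thread before re-running the benchmark; `maxNumCompThreads` returns the previous setting, so it’s easy to restore afterwards:

```matlab
% Limit Matlab to one computational thread, re-run the benchmark,
% then restore the original thread count.
nOld = maxNumCompThreads(1);  % returns the previous number of threads
bench;                        % run the standard Matlab benchmark
maxNumCompThreads(nOld);      % restore
```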

Here are the results from my new computer with all 6 cores enabled:

Clearly the new computer is pretty good, since it beats the Mathworks’ highest-rated test machine. But how does it look with only one core?

The difference in speed is marginal except for the LU decomposition. WTF? Some of this is likely due to the choice of tests, so I also tried `gbench`, which uses tests that parallelize better (or so I assume, since it’s a showcase for GPU computing). The LU, BLAS, and equations tests saw speedups on the order of 4x with multiple cores enabled, while the FFT, 3D conv, and for/gfor tests saw no difference at all. So in short, multithreading in Matlab is hit-and-miss.

What gives? Matlab is closed-source, so I cannot say much about how it works internally, but my experience with the software leads me to think that it uses fine-grained parallelism at the function level. Many linear algebra operations can be parallelized to some extent, for example matrix multiplication and Fourier transforms, so Matlab provides multithreaded implementations of the `mtimes` and `fft` functions, among others.

One issue with this fine-grained strategy is that it tends to suffer from high communication overhead. The best parallel performance occurs when workers are completely independent, never communicate, and finish their jobs at the same time. Multithreaded matrix multiplication and FFTs are nothing like that; they can have uneven loads and require tight coupling and frequent communication.

So you might see your N-core CPU running at 100% while a script runs. That does not mean Matlab is using all the cores effectively; for all you know, most of that activity could be the result of shuffling data around, not doing actual computation. There is much room for improvement.

### Enter parfor

There are many circumstances, however, where coarse-grained parallelism is easy to implement and much more effective than Matlab’s default strategy. Oftentimes, we deal with problems that have an obvious latent parallel component. For example, you might run the same receptive field estimation algorithm on all of your cells. All these analyses are independent of each other, so they can be run in parallel. Because there’s no communication between threads, it’s possible to obtain speedups on the order of the number of cores by mapping one cell to one thread.

Matlab has a parallel computing toolbox which can be used to run analyses on computer clusters if you buy the corresponding server license. The same toolbox can be used to run multiple threads independently on multiple cores of a single computer, and that doesn’t require a special license. The simplest and most useful construct in the toolbox is the `parfor` statement. `parfor` replaces the regular `for` loop, and can be used when loop iterations are completely independent of each other.
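To make that concrete, here’s a minimal sketch of the pattern; `expensiveAnalysis` and `ncells` are hypothetical stand-ins for whatever independent computation you run on each item:

```matlab
matlabpool open 6   % start the workers; only needs to be done once per session

results = zeros(1, ncells);
parfor ii = 1:ncells
    % Iterations never touch each other's data, so Matlab is free
    % to farm them out to separate workers.
    results(ii) = expensiveAnalysis(ii);   % hypothetical function
end
```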

Behind the scenes, Matlab creates a duplicate of the workspace for each thread, runs one iteration per core until all iterations are done, gathers the results, and returns them to the initial workspace. `parfor` has a number of limitations, but thankfully the built-in editor knows them and will show red squiggly lines under the offending statements. Note that because workspace variables are duplicated, `parfor` may require boatloads of RAM to be effective, especially on a 6-core CPU. This is one of the reasons I got 24GB of RAM for the new computer.
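For instance, a loop-carried dependency is one thing the editor will flag, since iterations can then no longer run in an arbitrary order. A sketch of what’s allowed and what isn’t:

```matlab
total = 0;
parfor ii = 1:10
    total = total + ii;   % fine: Matlab recognizes this as a reduction
end

x = zeros(1, 10);
parfor ii = 2:10
    x(ii) = x(ii-1) + 1;  % not fine: iteration ii depends on iteration ii-1
end
```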

Here’s a real-life example of how `parfor` can be useful. k-fold cross-validation is a standard technique for estimating regularization hyperparameters when fitting a model. In 5-fold CV, you take 4/5ths of your data, fit your model to it, and use the model to predict the remaining 1/5th. You do this for 5 different splits of the data. The resulting CV goodness-of-fit is then an estimate of the quality of the model fit that penalizes overparametrized models, which overfit to noise and generalize poorly.
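In Matlab terms, the procedure looks something like this; `fitModel` and `evalModel` are hypothetical stand-ins for your estimation and prediction code:

```matlab
nfolds = 5;
n = numel(y);                            % y holds the responses, X the regressors
foldId = mod(randperm(n), nfolds) + 1;   % randomly assign each point to a fold

cvfit = zeros(1, nfolds);
for kk = 1:nfolds
    test = (foldId == kk);               % hold out 1/5th of the data
    model = fitModel(X(~test,:), y(~test));            % fit on the other 4/5ths
    cvfit(kk) = evalModel(model, X(test,:), y(test));  % predict the held-out fold
end
cvgof = mean(cvfit);                     % cross-validated goodness-of-fit
```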

To parallelize cross-validation, we simply map folds to cores. I typically use 5-fold cross-validation, which is perfect for a 6-core computer (the unused core is left for Matlab to coordinate things). Here’s how I reworked `fitcvbgam`, part of my boosted generalized additive model (bgam) package, so that it can take advantage of multiple cores.

The non-parallelized version of the main loop looks like this:

```matlab
for ii = 1:nfolds
    [thefits{ii},cvdeviancep,etas{ii},minDeviance,maxDeviance] = ...
        fitAFold(ii,jj,folds,refFitParams,thefits{ii},y,X,etas{ii},...
                 minDeviances(ii),maxDeviances(ii),trainer);
    minDeviances(ii) = minDeviance;
    maxDeviances(ii) = maxDeviance;
    cvdeviances(rg,ii) = cvdeviancep;
    if jj > 0
        tgt = cvdeviances(jj+fitParams.blockSize,ii);
    else
        tgt = maxDeviances(ii) - minDeviances(ii);
    end
    fitParams.displayBlockFun(tgt);
end
```

Each fold is fit sequentially, variables are saved, and results are displayed. The parallel version of the same code is:

```matlab
parfor ii = 1:nfolds
    [thefits{ii},cvdeviancep,etas{ii},minDeviance,maxDeviance] = ...
        fitAFold(ii,jj,folds,refFitParams,thefits{ii},y,X,etas{ii},...
                 minDeviances(ii),maxDeviances(ii),trainer);
    minDeviances(ii) = minDeviance;
    maxDeviances(ii) = maxDeviance;
    cvdeviances(rg,ii) = cvdeviancep;
end
for ii = 1:nfolds
    if jj > 0
        tgt = cvdeviances(jj+fitParams.blockSize,ii);
    else
        tgt = maxDeviances(ii) - minDeviances(ii);
    end
    fitParams.displayBlockFun(tgt);
end
```

Notice that I’ve parallelized the model fitting but not the display, which sits in a normal `for` loop. That’s because the order in which `parfor` iterations run is not deterministic; if the display had been kept inside the `parfor` loop, the cross-validation scores of each fold would have come out jumbled.

Apart from that, the changes are minimal. The `parfor` loop works as is, as long as the workers have been started through the `matlabpool open` statement; this only needs to be done once per session.

On some test data I have, the regular `for` loop running on one core took a total of 58 seconds, or 11.7 seconds per fold. Using all 6 cores with Matlab’s fine-grained parallelization led to no measurable improvement in performance. None whatsoever! This is despite the fact that all the code is fully vectorized and there are plenty of parallelizable operations in it (pointwise operations, matrix multiplications, the `\` operator, etc.).

You might think the lack of improvement is due to Matlab not even trying to parallelize things, because the matrices are too small or some other excuse. Actually, the task manager shows all 6 cores at practically 100% while running this particular piece of code. So that means 5 cores are dicking around waiting for syncs and messages, while one actually does the job. Pretty sad, really.

The `parfor` version, however, took a total of 15.1 seconds. That’s a 3.8x speedup, compared with a theoretical maximum of 5x.

A few important lessons:

- If the CPU is running at 100%, that doesn’t mean it’s actually doing useful work
- Matlab is not the smartest at parallelization
- `parfor` is a quick and dirty way to utilize more cores, assuming you have lots and lots of RAM
- Ergo, buy more RAM

Matlab Coder just came out with `parfor` support: no workspace duplication for each thread, everything in shared memory.