I’ve been transferring some code from Matlab to NumPy, and my first tests on my Ubuntu 12.10 installation were very disappointing: slow as molasses. Specifically, numpy.dot was taking forever. I figured this might have something to do with BLAS libraries, a subject that mere mortals were never meant to understand.

### Is something wrong with BLAS config?

Following some comments on StackOverflow, I wrote a little script that compares the performance of numpy.dot with that of the same matrix product computed by calling the BLAS s/dgemm routines directly. Here’s the code:

```python
import ctypes
from ctypes import byref, c_char, c_int, c_float, c_double
import numpy as np
import cProfile

_blaslib = ctypes.cdll.LoadLibrary("libblas.so")

def lldot(A, B):
    """lldot(A,B) performs the matrix multiplication A*B using
    low-level (BLAS library) functions"""
    if A.shape[1] != B.shape[0]:
        raise ValueError("Incompatible array dimensions")

    # Upconvert what needs to be upconverted. Do this *before* inspecting
    # the memory layout, since astype() returns a C-contiguous copy.
    if A.dtype != np.float32 and A.dtype != np.float64:
        A = A.astype(np.float64)
    if B.dtype != np.float32 and B.dtype != np.float64:
        B = B.astype(np.float64)
    if A.dtype != B.dtype:
        A = A.astype(np.float64)
        B = B.astype(np.float64)

    # gemm expects Fortran (column-major) arrays; rather than copying,
    # ask it to transpose C-contiguous inputs on the fly.
    if A.flags.c_contiguous:
        tA = c_char('t')
        lda = c_int(A.shape[1])
    else:
        tA = c_char('n')
        lda = c_int(A.shape[0])
    if B.flags.c_contiguous:
        tB = c_char('t')
        ldb = c_int(B.shape[1])
    else:
        tB = c_char('n')
        ldb = c_int(B.shape[0])

    M = c_int(A.shape[0])
    N = c_int(B.shape[1])
    K = c_int(B.shape[0])
    ldc = c_int(A.shape[0])

    C = np.zeros((A.shape[0], B.shape[1]), dtype=A.dtype, order='F')

    # Pick the double- or single-precision routine to match the inputs
    if A.dtype == np.float64:
        thefun = _blaslib.dgemm_
        one = c_double(1.0)
        zero = c_double(0.0)
    else:
        thefun = _blaslib.sgemm_
        one = c_float(1.0)
        zero = c_float(0.0)

    # C = 1.0 * op(A) * op(B) + 0.0 * C
    thefun(byref(tA), byref(tB), byref(M), byref(N), byref(K),
           byref(one),
           A.ctypes.data_as(ctypes.c_void_p), byref(lda),
           B.ctypes.data_as(ctypes.c_void_p), byref(ldb),
           byref(zero),
           C.ctypes.data_as(ctypes.c_void_p), byref(ldc))
    return C

n = 1000
A = np.asfortranarray(np.random.randn(n, n).astype(np.float64))
B = np.asfortranarray(np.random.randn(n, n).astype(np.float64))

C1 = np.dot(A, B)
C2 = lldot(A, B)

cProfile.run('np.dot(A,B)')
cProfile.run('lldot(A,B)')

print 'Max discrepancy: %e' % abs(C1 - C2).max()
```

In my case, the second version, calling libblas directly, was fully *10 times slower* than np.dot! The two calls were evidently not hitting the same BLAS implementation; clearly, there was something wrong with the BLAS configuration in NumPy.
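For anyone who prefers plain timings to profiler output, here is a minimal sketch using timeit instead of cProfile; it assumes A, B, and lldot from the script above are defined in the __main__ namespace.

```python
import timeit

# Minimal timing sketch: assumes A, B and lldot from the script above
# are defined in the __main__ namespace.
t_np = timeit.timeit('np.dot(A, B)',
                     setup='from __main__ import np, A, B', number=10)
t_ll = timeit.timeit('lldot(A, B)',
                     setup='from __main__ import lldot, A, B', number=10)
print 'np.dot: %.3f s   lldot: %.3f s' % (t_np, t_ll)
```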

### Which BLAS library is numpy using?

Running numpy.show_config() prints information about how NumPy was configured at build time:
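```python
import numpy as np

# Prints the BLAS/LAPACK build configuration to stdout
np.show_config()
```

On a stock installation linked against the reference BLAS, the output looks something like this: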

```
blas_info:
    libraries = ['blas']
    library_dirs = ['/usr/lib']
    language = f77
lapack_info:
    libraries = ['lapack']
    library_dirs = ['/usr/lib']
    language = f77
atlas_threads_info:
  NOT AVAILABLE
blas_opt_info:
    libraries = ['blas']
    library_dirs = ['/usr/lib']
    language = f77
    define_macros = [('NO_ATLAS_INFO', 1)]
atlas_blas_threads_info:
  NOT AVAILABLE
lapack_opt_info:
    libraries = ['lapack', 'blas']
    library_dirs = ['/usr/lib']
    language = f77
    define_macros = [('NO_ATLAS_INFO', 1)]
atlas_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
atlas_blas_info:
  NOT AVAILABLE
mkl_info:
  NOT AVAILABLE
```

In my case, unlike the sample output above, the configuration referenced ATLAS rather than plain BLAS. In theory, ATLAS should be the faster of the two, but for whatever reason there was something off about my ATLAS installation.
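On NumPy versions of that era (before 1.10), there is a second quick check: the numpy.core._dotblas module is only built when NumPy was compiled against an accelerated BLAS, and without it numpy.dot falls back to a much slower internal routine. A minimal sketch:

```python
# NumPy < 1.10 only: _dotblas is built only when NumPy was compiled
# against an accelerated BLAS (ATLAS, MKL, ...). If the import fails,
# numpy.dot is using the slow internal fallback.
try:
    import numpy.core._dotblas
    print 'numpy.dot is BLAS-accelerated'
except ImportError:
    print 'numpy.dot is using the slow fallback'
```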

### Removing ATLAS and reinstalling SciPy

The solution was pretty straightforward: uninstall ATLAS via:

```
sudo apt-get remove libatlas*
```
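On Debian/Ubuntu, the libblas.so.3 that programs load is managed by the alternatives system, so it is worth checking which implementation is active before and after removing ATLAS. A small sketch; note that the alternative name is an assumption to verify: it is registered as libblas.so.3 on recent releases but was libblas.so.3gf on older ones.

```python
import subprocess

# Ask the alternatives system which library currently provides BLAS.
# The alternative name varies by release ('libblas.so.3' vs the older
# 'libblas.so.3gf') -- adjust if the query fails.
print subprocess.check_output(
    ['update-alternatives', '--display', 'libblas.so.3'])
```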

Then reinstall SciPy via pip:

```
sudo pip install scipy --upgrade
```

I ran the test script again and, voilà, the timings were now similar for both calls.

### Alternatives

These might be useful (via Twitter):

- Building NumPy with GotoBLAS [1, 2] (auto-translated from Japanese)
- Building NumPy with OpenBLAS (see the linkage-check sketch after this list for verifying the result)
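Whichever build you end up with, a quick way to see which BLAS shared library NumPy’s compiled extensions actually pull in is to run ldd over them. This is a sketch under the assumption of a Linux system with NumPy’s extensions living under numpy/core; the module layout varies across versions:

```python
import os
import subprocess
import numpy.core

# Run ldd over NumPy's compiled extensions and print any BLAS linkage.
# Assumes Linux; the set of .so files under numpy/core varies by version.
core_dir = os.path.dirname(numpy.core.__file__)
for name in os.listdir(core_dir):
    if name.endswith('.so'):
        out = subprocess.check_output(['ldd', os.path.join(core_dir, name)])
        for line in out.splitlines():
            if 'blas' in line:
                print name, '->', line.strip()
```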