I’ve been transferring some code from Matlab to NumPy and my first tests on my Ubuntu 12.10 installation were very disappointing – slow as molasses. Specifically, numpy.dot was taking forever. I figured this might have something to do with BLAS libraries, a subject which is completely beyond mere mortals to understand.
Is something wrong with the BLAS config?
Following some comments on StackOverflow, I wrote a little script that compares the performance of numpy.dot with that of the same matrix product computed by calling the BLAS sgemm/dgemm routines directly through ctypes. Here’s the code:
import ctypes
from ctypes import byref, c_char, c_int, c_float, c_double
import numpy as np
import cProfile

_blaslib = ctypes.cdll.LoadLibrary("libblas.so")

def lldot(A, B):
    """lldot(A,B) performs the matrix multiplication A*B using
    low-level (blas lib) functions"""
    if A.shape[1] != B.shape[0]:
        raise Exception("Incompatible array dimensions")
    if A.flags.c_contiguous:
        tA = c_char('t')
        lda = c_int(A.shape[1])
    else:
        tA = c_char('n')
        lda = c_int(A.shape[0])
    if B.flags.c_contiguous:
        tB = c_char('t')
        ldb = c_int(B.shape[1])
    else:
        tB = c_char('n')
        ldb = c_int(B.shape[0])
    M = c_int(A.shape[0])
    N = c_int(B.shape[1])
    K = c_int(B.shape[0])
    ldc = c_int(A.shape[0])
    # upconvert what needs to be upconverted
    if A.dtype != np.float32 and A.dtype != np.float64:
        print "A1"
        A = A.astype(np.float64)
    if B.dtype != np.float32 and B.dtype != np.float64:
        print "A2"
        B = B.astype(np.float64)
    # common stuff
    if A.dtype != B.dtype:
        print "A3"
        A = A.astype(np.float64)
        B = B.astype(np.float64)
    C = np.zeros((A.shape[0], B.shape[1]), dtype=A.dtype, order='F')
    if A.dtype == np.float64:
        thefun = _blaslib.dgemm_
        one = c_double(1.0)
        zero = c_double(0.0)
    else:
        thefun = _blaslib.sgemm_
        one = c_float(1.0)
        zero = c_float(0.0)
    thefun(byref(tA), byref(tB), byref(M), byref(N), byref(K), byref(one),
           A.ctypes.data_as(ctypes.c_void_p), byref(lda),
           B.ctypes.data_as(ctypes.c_void_p), byref(ldb), byref(zero),
           C.ctypes.data_as(ctypes.c_void_p), byref(ldc))
    return C

n = 1000
A = np.asfortranarray(np.random.randn(n, n).astype(np.float64))
B = np.asfortranarray(np.random.randn(n, n).astype(np.float64))
C1 = np.dot(A, B)
C2 = lldot(A, B)
cProfile.run('np.dot(A,B)')
cProfile.run('lldot(A,B)')
print 'Max discrepancy: %e' % abs(C1 - C2).max()
In my case, np.dot was fully 10 times slower than the direct call through libblas! Clearly, there was something wrong with the BLAS configuration NumPy was using.
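If you prefer plain wall-clock numbers to cProfile output, the same comparison can be done with timeit. This is just a minimal sketch; it assumes the lldot function and the A and B arrays from the script above are already defined:

import timeit

# Time each implementation on the same 1000x1000 matrices; take the best of 3 runs.
t_numpy = min(timeit.repeat(lambda: np.dot(A, B), number=1, repeat=3))
t_blas = min(timeit.repeat(lambda: lldot(A, B), number=1, repeat=3))
print "np.dot: %.3f s, direct BLAS call: %.3f s" % (t_numpy, t_blas)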
Which BLAS library is numpy using?
Running numpy.show_config() will tell you which BLAS and LAPACK libraries NumPy was built against, for instance:
blas_info:
    libraries = ['blas']
    library_dirs = ['/usr/lib']
    language = f77
lapack_info:
    libraries = ['lapack']
    library_dirs = ['/usr/lib']
    language = f77
atlas_threads_info:
  NOT AVAILABLE
blas_opt_info:
    libraries = ['blas']
    library_dirs = ['/usr/lib']
    language = f77
    define_macros = [('NO_ATLAS_INFO', 1)]
atlas_blas_threads_info:
  NOT AVAILABLE
lapack_opt_info:
    libraries = ['lapack', 'blas']
    library_dirs = ['/usr/lib']
    language = f77
    define_macros = [('NO_ATLAS_INFO', 1)]
atlas_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
atlas_blas_info:
  NOT AVAILABLE
mkl_info:
  NOT AVAILABLE
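A related sanity check is to ask NumPy where its accelerated dot actually lives. On NumPy versions of that era (before 1.10), the fast numpy.dot was provided by the internal numpy.core._dotblas extension, which only gets built when a BLAS library was found at compile time. The sketch below relies on that internal detail (the module no longer exists in modern NumPy), so treat it as a diagnostic hack rather than a supported API:

import numpy as np

try:
    # _dotblas is only built when NumPy was compiled against a BLAS library;
    # its file path shows which extension module backs the fast dot.
    import numpy.core._dotblas as _dotblas
    print "numpy.dot is BLAS-accelerated:", _dotblas.__file__
except ImportError:
    print "numpy.dot is using the slow built-in fallback"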
In my case, the output referenced ATLAS rather than the plain BLAS shown above. In theory, ATLAS should be the faster of the two, but for whatever reason there was something off about my ATLAS installation.
Removing ATLAS and reinstalling SciPy
The solution was pretty straightforward: uninstall ATLAS via:
sudo apt-get remove libatlas*
Then reinstall SciPy via pip:
sudo pip install scipy --upgrade
I ran the test script again, and voilà! The timing is now similar for both calls.
Alternatives
These alternatives (suggested via Twitter) might also be useful:
- Building NumPy with GotoBLAS [1, 2] (auto-translated from Japanese)
- Building NumPy with OpenBLAS (a minimal site.cfg sketch follows below)
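For the OpenBLAS route, the usual approach is to point NumPy's site.cfg at the OpenBLAS installation before building from source. This is only a sketch: newer NumPy releases recognize an [openblas] section, and the /opt/OpenBLAS paths are merely an assumption about where OpenBLAS was installed on your machine:

[openblas]
libraries = openblas
library_dirs = /opt/OpenBLAS/lib
include_dirs = /opt/OpenBLAS/include

After editing site.cfg, NumPy has to be rebuilt from source so that it links against the new library.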