I’ve been transferring some code from Matlab to NumPy and my first tests on my Ubuntu 12.10 installation were very disappointing – slow as molasses. Specifically, numpy.dot was taking forever. I figured this might have something to do with BLAS libraries, a subject which is completely beyond mere mortals to understand.
Is something wrong with the BLAS config?
Following some comments on StackOverflow, I wrote a little script that compares the performance of numpy.dot with that of the same matrix product computed by calling the BLAS sgemm/dgemm routines directly through ctypes. Here’s the code:
import ctypes
from ctypes import byref, c_char, c_int, c_float, c_double
import numpy as np
import cProfile

_blaslib = ctypes.cdll.LoadLibrary("libblas.so")

def lldot(A, B):
    """lldot(A,B) performs the matrix multiplication A*B using
    low-level (blas lib) functions"""
    if A.shape[1] != B.shape[0]:
        raise Exception("Incompatible array dimensions")
    if A.flags.c_contiguous:
        tA = c_char('t')
        lda = c_int(A.shape[1])
    else:
        tA = c_char('n')
        lda = c_int(A.shape[0])
    if B.flags.c_contiguous:
        tB = c_char('t')
        ldb = c_int(B.shape[1])
    else:
        tB = c_char('n')
        ldb = c_int(B.shape[0])
    M = c_int(A.shape[0])
    N = c_int(B.shape[1])
    K = c_int(B.shape[0])
    ldc = c_int(A.shape[0])
    # upconvert what needs to be upconverted
    if A.dtype != np.float32 and A.dtype != np.float64:
        print "A1"
        A = A.astype(np.float64)
    if B.dtype != np.float32 and B.dtype != np.float64:
        print "A2"
        B = B.astype(np.float64)
    # common stuff
    if A.dtype != B.dtype:
        print "A3"
        A = A.astype(np.float64)
        B = B.astype(np.float64)
    C = np.zeros((A.shape[0], B.shape[1]), dtype=A.dtype, order='F')
    if A.dtype == np.float64:
        thefun = _blaslib.dgemm_
        one = c_double(1.0)
        zero = c_double(0.0)
    else:
        thefun = _blaslib.sgemm_
        one = c_float(1.0)
        zero = c_float(0.0)
    thefun(byref(tA), byref(tB), byref(M), byref(N), byref(K), byref(one),
           A.ctypes.data_as(ctypes.c_void_p), byref(lda),
           B.ctypes.data_as(ctypes.c_void_p), byref(ldb), byref(zero),
           C.ctypes.data_as(ctypes.c_void_p), byref(ldc))
    return C

n = 1000
A = np.asfortranarray(np.random.randn(n, n).astype(np.float64))
B = np.asfortranarray(np.random.randn(n, n).astype(np.float64))
C1 = np.dot(A, B)
C2 = lldot(A, B)
cProfile.run('np.dot(A,B)')
cProfile.run('lldot(A,B)')
print 'Max discrepancy: %e' % abs(C1 - C2).max()
In my case, np.dot was fully 10 times slower than the direct call through libblas! Clearly, there was something wrong with the BLAS configuration NumPy was using.
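If you prefer plain wall-clock numbers to cProfile output, the same comparison can be done with timeit. This is just a minimal sketch; it assumes the lldot function and the A and B arrays from the script above are already defined:

import timeit

# Time each implementation on the same 1000x1000 matrices; take the best of 3 runs.
t_numpy = min(timeit.repeat(lambda: np.dot(A, B), number=1, repeat=3))
t_blas = min(timeit.repeat(lambda: lldot(A, B), number=1, repeat=3))
print "np.dot: %.3f s, direct BLAS call: %.3f s" % (t_numpy, t_blas)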
Which BLAS library is numpy using?
Running numpy.show_config() will tell you which BLAS and LAPACK libraries NumPy was built against, for instance:
blas_info:
    libraries = ['blas']
    library_dirs = ['/usr/lib']
    language = f77
lapack_info:
    libraries = ['lapack']
    library_dirs = ['/usr/lib']
    language = f77
atlas_threads_info:
  NOT AVAILABLE
blas_opt_info:
    libraries = ['blas']
    library_dirs = ['/usr/lib']
    language = f77
    define_macros = [('NO_ATLAS_INFO', 1)]
atlas_blas_threads_info:
  NOT AVAILABLE
lapack_opt_info:
    libraries = ['lapack', 'blas']
    library_dirs = ['/usr/lib']
    language = f77
    define_macros = [('NO_ATLAS_INFO', 1)]
atlas_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
blas_mkl_info:
  NOT AVAILABLE
atlas_blas_info:
  NOT AVAILABLE
mkl_info:
  NOT AVAILABLE
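A related sanity check is to ask NumPy where its accelerated dot actually lives. On NumPy versions of that era (before 1.10), the fast numpy.dot was provided by the internal numpy.core._dotblas extension, which only gets built when a BLAS library was found at compile time. The sketch below relies on that internal detail (the module no longer exists in modern NumPy), so treat it as a diagnostic hack rather than a supported API:

import numpy as np

try:
    # _dotblas is only built when NumPy was compiled against a BLAS library;
    # its file path shows which extension module backs the fast dot.
    import numpy.core._dotblas as _dotblas
    print "numpy.dot is BLAS-accelerated:", _dotblas.__file__
except ImportError:
    print "numpy.dot is using the slow built-in fallback"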
In my case, the output referenced ATLAS rather than the plain BLAS shown above. In theory, ATLAS should be the faster of the two, but for whatever reason there was something off about my ATLAS installation.
Removing ATLAS and reinstalling SciPy
The solution was pretty straightforward: uninstall ATLAS via:
sudo apt-get remove libatlas*
Then reinstall SciPy via pip:
sudo pip install scipy --upgrade
I ran the test script again, and voilà! The timing is now similar for both calls.
Alternatives
These alternatives (suggested via Twitter) might also be useful:
- Building NumPy with GotoBLAS [1, 2] (auto-translated from Japanese)
- Building NumPy with OpenBLAS (a minimal site.cfg sketch follows below)
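For the OpenBLAS route, the usual approach is to point NumPy's site.cfg at the OpenBLAS installation before building from source. This is only a sketch: newer NumPy releases recognize an [openblas] section, and the /opt/OpenBLAS paths are merely an assumption about where OpenBLAS was installed on your machine:

[openblas]
libraries = openblas
library_dirs = /opt/OpenBLAS/lib
include_dirs = /opt/OpenBLAS/include

After editing site.cfg, NumPy has to be rebuilt from source so that it links against the new library.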