I needed a fast method of binning 1D and 2D data in Matlab – that is, to compute the mean of z conditional on x being in a given range (1d binning) or the mean z of conditional on x and y being in given ranges (2d binning). I stumbled upon a clever method using a combination of histc and sparse which involves no looping and adapted the method to the 2d case. Here are two functions that do the trick:
function [ym,yb] = bindata(y,x,xrg) %function [ym,yb] = bindata(y,x,xrg) %Computes ym(ii) = mean(y(x>=xrg(ii) & x < xrg(ii+1)) for every ii %using a fast algorithm which uses no looping %If a bin is empty it returns nan for that bin %Also returns yb, the approximation of y using binning (useful for r^2 %calculations). Example: % %x = randn(100,1); %y = x.^2 + randn(100,1); %xrg = linspace(-3,3,10)'; %[ym,yb] = bindata(y,x,xrg); %X = [xrg(1:end-1),xrg(2:end)]'; %Y = [ym,ym]' %plot(x,y,'.',X(:),Y(:),'r-'); % %By Patrick Mineault %Refs: https://xcorr.net/?p=3326 % http://www-pord.ucsd.edu/~matlab/bin.htm [~,whichedge] = histc(x,xrg(:)'); bins = min(max(whichedge,1),length(xrg)-1); xpos = ones(size(bins,1),1); ns = sparse(bins,xpos,1); ysum = sparse(bins,xpos,y); ym = full(ysum)./(full(ns)); yb = ym(bins); end
And for the 2d case:
function [ym,yb] = bindata2(y,x1,x2,x1rg,x2rg) %function [ym,yb] = bindata2(y,x1,x2,x1rg,x2rg) %Computes: %ym(ii,jj) = mean(y(x1>=x1rg(ii) & x1 < x1rg(ii+1) & x2>=x2rg(jj) & x2 < x2rg(jj+1)) %for every ii, jj %If a bin is empty it returns nan for that bin %using a fast algorithm which uses no looping %Also returns yb, the approximation of y using binning (useful for r^2 %calculations). Example: % %x = randn(500,2); %y = sum(x.^2,2) + randn(500,1); %xrg = linspace(-3,3,10)'; %[ym,yb] = bindata2(y,x(:,1),x(:,2),xrg,xrg); %subplot(1,2,1);plot3(x(:,1),x(:,2),y,'.'); %subplot(1,2,2);h = imagesc(xrg,xrg,ym); %set(h,'AlphaData',~isnan(ym)); box off; % %By Patrick Mineault %Refs: https://xcorr.net/?p=3326 % http://www-pord.ucsd.edu/~matlab/bin.htm [~,whichedge1] = histc(x1,x1rg(:)'); [~,whichedge2] = histc(x2,x2rg(:)'); bins1 = min(max(whichedge1,1),length(x1rg)-1); bins2 = min(max(whichedge2,1),length(x2rg)-1); bins = (bins2-1)*(length(x1rg)-1)+bins1; xpos = ones(size(bins,1),1); ns = sparse(bins,xpos,1,(length(x1rg)-1)*(length(x2rg)-1),1); ysum = sparse(bins,xpos,y,(length(x1rg)-1)*(length(x2rg)-1),1); ym = full(ysum)./(full(ns)); yb = ym(bins); ym = reshape(ym,length(x1rg)-1,length(x2rg)-1); end
What about in Python? It turns out there’s a built-in way of doing this: using the weights
property of numpy.histogram
.
6 responses to “Fast 1D and 2D data binning in Matlab & Python”
There’s a little bug here. For the 1D case (same logic applies in 2D):
bins = min(max(whichedge,1),length(xrg)-1);
this takes values of whichedge that are less than 1 and turns them into 1, and values of whichedge that are greater than numel(xrg)-1 and turns them into that. This would (maybe) make sense if matlab labeled edges BIN as 0 when it’s outside on the low end, and numel(xrg) when it’s outside on the high end. However, according to [help histc]:
“BIN is zero for out of range values.”
So xxrg(end) all end up squished into bin 1, along with everything that belongs there. There are a few options for what to do with out of range values, I think code intended to lump them in to the extreme bins, though I think a more appropriate solution is to remove them:
bins=whichedge(whichedge>0);
y=y(whichedge>0);
An elegant strategy for lumping them in would be to simply replace xrg(1) and xrg(end) with -inf and inf. Or, if you wanted to return averages for those values falling outside of xrg, you might append a -inf and an inf to xrg, making the output ym([1 end]) the the means of the out of range values, and ym(2:end-1) the mean for y where x>=xrng(n-1) and x<xrng(n).
ooh yes thanks for pointing this error out. this code will throw all data outside the input bin range into the first/last bin!
For those of us with lesser intellectual powers, if it’s not too much trouble, do you know how to get the function to also return the number of points in each bin (2D case)?
Actually, it’s simply the variable ns isn’t it…fantastic function by the way! I had done the 2d one with a loop and it was crazy slow even with a low number of bins
I think accumarray is even faster than using sparse, at least in the 1-D case. Also it generalizes to other functions.
Good to know, I didn’t know about that function.