We’ve been recording neurons on Utah arrays for ~3 years now. That creates a lot of data, and we keep buying more drives (above – there’s about 30 TB of storage on the hub computer). Drives themselves are cheap; it’s backing up the data that becomes expensive in the long run, both in bandwidth and in the cost of buying and maintaining backup servers.
The files that take up the largest proportion of the space – about 65% – contain wideband signals. As discussed previously, the wideband signals are extracted from the .plx files via a Python script, which yields 96 files, each containing a stream of signed 16-bit values – 12-bit values padded to 16, actually – sampled at 10 kHz, so 20 KB/s per channel. For an hour’s worth of recordings, that’s 20 × 3600 × 96 KB ≈ 7 GB of data.
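As a sanity check on that figure, the arithmetic works out as follows (decimal gigabytes, using only the numbers above):

```python
sample_rate_hz = 10000   # per channel
bytes_per_sample = 2     # int16: 12-bit samples padded to 16 bits
channels = 96
seconds = 3600           # one hour of recording

per_channel_rate = sample_rate_hz * bytes_per_sample    # 20,000 bytes/s
hourly_total = per_channel_rate * seconds * channels    # bytes per hour, all channels
print(hourly_total / 1e9)  # ~6.9 GB
```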
You can’t significantly compress a stream of int16’s using a generic tool like zip or 7z, since it’s unlikely that the same subsequence of int16’s will ever appear twice within a file. Nevertheless, since the signals are highly non-Gaussian and correlated in time, there must exist a way of compressing them.
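To see why the temporal correlation is the thing to exploit, here’s a minimal sketch using synthetic data and only the standard library. A generic byte-level compressor does poorly on the raw correlated samples, but much better once a trivial predictor – just the previous sample – has been subtracted out, which is essentially the first step of the linear prediction that FLAC performs:

```python
import random
import struct
import zlib

random.seed(0)

# Synthetic stand-in for a wideband trace: a bounded int16 random walk,
# i.e. a signal that is strongly correlated in time
x = [0]
for _ in range(20000):
    x.append(max(-32768, min(32767, x[-1] + random.randint(-30, 30))))

raw = struct.pack('<%dh' % len(x), *x)

# Residuals after an order-1 predictor (the previous sample):
# small values, tightly peaked around zero
diffs = [x[0]] + [x[i] - x[i - 1] for i in range(1, len(x))]
residual = struct.pack('<%dh' % len(diffs), *diffs)

raw_ratio = len(zlib.compress(raw)) / len(raw)
residual_ratio = len(zlib.compress(residual)) / len(raw)
print(raw_ratio, residual_ratio)  # the residuals compress substantially better
```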
I turned to FLAC – the Free Lossless Audio Codec – which compresses audio signals using a linear prediction model, Golomb-Rice coding, and run-length encoding. When you think about it, a wideband signal, as a 1-dimensional stream of correlated samples, is a lot like an audio signal. The command-line flac utility can compress raw streams of int16’s given the right flags.
It works pretty well, achieving compression ratios hovering around 60% – that is, the encoded wideband signals take up 40% of the space that they did in the original encoding. FLAC is widely supported – it can be read by Matlab and Python – and automation is trivial thanks to the command-line util. It takes about a day to go through a single drive and compress the data, freeing up about 35% of the hard drive space along the way.
Obviously, freeing up ~10 TB without having to throw away any data is an attractive prospect. Here’s the Python code that I’m using to do this:
# -*- coding: utf-8 -*-
"""
Created on Sun Feb 23 18:43:14 2014

@author: Patrick Mineault
"""
import getopt
import glob
import os
import os.path
import sys

def do_plx(d):
    """Remove .plx files whose extracted /mat/*.mat counterpart exists."""
    files = glob.glob(d + '/*.plx')
    for f in files:
        fname_root = os.path.splitext(os.path.basename(f))[0]
        dir_name = os.path.dirname(os.path.dirname(f))
        mat_name = '%s/mat/%s.mat' % (dir_name, fname_root)
        if os.path.isfile(mat_name):
            print "plx file can be eliminated"
            os.remove(f)

def do_flac(d):
    """Does the actual compression via flac"""
    files = glob.glob(d + '*_ch0*')
    print "found something to flac"
    for f in files:
        # Only compress the files with no extension
        if os.path.splitext(f)[1] == '':
            os.system("flac -f --endian=little --channels=1 --bps=16 "
                      "--sample-rate=10000 --sign=signed %s" % f)
            # Make sure that the .flac file actually exists
            # before removing the original!
            if os.path.isfile(f + ".flac"):
                print "Removing %s" % f
                os.remove(f)

def recursive_flac(d):
    dirs = glob.glob(d + '*/')
    for d in dirs:
        dirname = os.path.basename(os.path.dirname(d))
        if dirname == 'mat':
            print "found dir %s" % d
            do_flac(d)
        elif dirname == 'plx':
            do_plx(d)
        else:
            # Recurse into subdirectories
            recursive_flac(d)

def usage():
    print """python great_compressor.py -d directory_name

    Recursively looks for /mat/*_ch* files and compresses them with flac,
    then deletes the original files. Also removes spurious plx files if any"""

def main():
    try:
        opts, args = getopt.getopt(sys.argv[1:], "d:")
    except getopt.GetoptError as err:
        # Print help information and exit, e.g. "option -a not recognized"
        print str(err)
        usage()
        sys.exit(2)
    thedir = "."
    for o, a in opts:
        if o == "-d":
            thedir = a
        else:
            assert False, "unhandled option"
    recursive_flac(thedir + '/')

if __name__ == "__main__":
    main()
And here’s a Matlab function I’m using to read the number of samples in a FLAC file without having to decode the whole file:
function a = get_flac_length(fname)
    % Read the total sample count from a FLAC file's STREAMINFO metadata
    % block. The 36-bit count sits in the low nibble of byte 22 and in
    % bytes 23-26 of the file (1-indexed).
    f = fopen(fname, 'rb');
    header = fread(f, 30, 'uint8');
    fclose(f);
    % Assemble the count: 4 high bits, then four full bytes, big-endian
    a = sum([bitand(header(22), 15), header(23:26)'] .* (2.^(32:-8:0)));
end
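The same header trick works in Python. Here’s a sketch of an equivalent parser – the function name and the synthetic test header are mine, but the byte layout follows the FLAC STREAMINFO spec: after the 4-byte ‘fLaC’ marker and a 4-byte metadata block header, the 36-bit total-sample count occupies the low nibble of byte 22 and bytes 23–26 (1-indexed, as in the Matlab code):

```python
def flac_total_samples(header):
    """Extract the 36-bit total-sample count from the first 30 bytes
    of a FLAC file (the STREAMINFO metadata block)."""
    b = bytearray(header)
    # 0-indexed here: low nibble of byte 21, then bytes 22-25, big-endian
    return ((b[21] & 0x0F) << 32) | (b[22] << 24) | (b[23] << 16) \
           | (b[24] << 8) | b[25]

# Synthetic header with a known count: one hour at 10 kHz = 36,000,000 samples
header = bytearray(30)
header[0:4] = b'fLaC'
header[21] = (36000000 >> 32) & 0x0F
header[22:26] = (36000000 & 0xFFFFFFFF).to_bytes(4, 'big')
print(flac_total_samples(header))  # 36000000
```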