We’ve been recording neurons on Utah arrays for ~3 years now. That creates a lot of data, and we keep buying more drives (above – there’s about 30 TB of storage on the hub computer). Drives themselves are cheap; it’s backing up the data that becomes expensive in the long run, both in terms of bandwidth and the cost of buying and maintaining backup servers.
The files that take up the largest proportion of space, about 65%, contain wideband signals. As discussed previously, the wideband signals are extracted from the .plx files via a Python script which yields 96 files, each containing a stream of signed 16-bit values (actually 12-bit values padded to 16) sampled at 10 kHz, so 20 KB/s per channel. For an hour’s worth of recordings, that’s 20 × 3600 × 96 KB ≈ 6.9 GB of data.
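The arithmetic above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch; the constant and function names (`raw_bytes` and friends) are made up for illustration and are not part of the extraction script.

```python
# Back-of-the-envelope size of raw wideband recordings.
BYTES_PER_SAMPLE = 2       # one int16 per sample
SAMPLE_RATE_HZ = 10000     # 10 kHz wideband
N_CHANNELS = 96            # one file per electrode
SECONDS_PER_HOUR = 3600


def raw_bytes(hours, n_channels=N_CHANNELS):
    """Uncompressed size, in bytes, of `hours` of recording."""
    return int(BYTES_PER_SAMPLE * SAMPLE_RATE_HZ * SECONDS_PER_HOUR
               * hours * n_channels)


per_hour = raw_bytes(1)
# prints: 6.91 GB per hour (6.44 GiB)
print("%.2f GB per hour (%.2f GiB)" % (per_hour / 1e9, per_hour / 2.0 ** 30))
```

Whether that rounds to 6 or 7 depends on whether you count in decimal gigabytes or in GiB.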
You can’t significantly compress a stream of int16’s using e.g. zip or 7z, since it’s unlikely that you’ll ever see the same subsequence of int16’s twice within a file. Nevertheless, since the signals are highly non-Gaussian and correlated in time, there must exist a way of compressing them.
I turned to FLAC, the Free Lossless Audio Codec, which compresses audio signals by using a linear prediction model, Golomb-Rice coding, and run-length encoding. When you think about it, a wideband signal, as a 1-dimensional stream of correlated samples, is a lot like an audio signal. The command-line flac utility can compress raw streams of int16’s, given the right flags.
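To get a feel for why prediction plus Rice coding helps on a correlated signal, here is a toy sketch. It is not FLAC’s actual encoder: it uses only a second-order polynomial ("fixed") predictor and a single Rice parameter for the whole signal, and the test signal is synthetic (a noisy sine wave, an assumption for illustration). All function names are hypothetical.

```python
import math
import random


def fixed_residual(samples, order=2):
    """Residuals of a polynomial predictor: difference the signal `order` times.

    For order 2 this is x[n] - 2*x[n-1] + x[n-2]; the first `order`
    samples (the warm-up) are not included in the result.
    """
    res = list(samples)
    for _ in range(order):
        res = [b - a for a, b in zip(res, res[1:])]
    return res


def rice_bits(residuals, k):
    """Total bits needed to Rice-code the residuals with parameter k."""
    total = 0
    for r in residuals:
        u = 2 * r if r >= 0 else -2 * r - 1   # zigzag: signed -> unsigned
        total += (u >> k) + 1 + k             # unary quotient, stop bit, k low bits
    return total


def best_rice_bits(residuals, max_k=14):
    """Smallest total over a range of Rice parameters."""
    return min(rice_bits(residuals, k) for k in range(max_k + 1))


# Synthetic stand-in for a wideband channel: slow oscillation plus noise.
random.seed(0)
signal = [int(round(300 * math.sin(2 * math.pi * n / 200.0)
                    + random.gauss(0, 5))) for n in range(10000)]
res = fixed_residual(signal, order=2)
print("about %.1f bits/sample instead of 16"
      % (best_rice_bits(res) / float(len(res))))
```

The point is only that the residuals are small numbers, and small numbers are cheap to Rice-code. The real encoder chooses among several predictors and adapts the Rice parameter per partition.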
It works pretty well, achieving space savings hovering around 60%; that is, the encoded wideband signals take up 40% of the space that they did in the original encoding. FLAC is widely supported (it can be read by Matlab and Python) and automation is trivial thanks to the command-line utility. It takes about a day to go through a single drive and compress the data, freeing up about 35% of the hard drive space along the way.
Obviously, saving 10 TB without having to throw away any data is an attractive prospect. Here’s some Python code that I’m using to do this:
```python
# -*- coding: utf-8 -*-
"""
Created on Sun Feb 23 18:43:14 2014

@author: Patrick Mineault
"""
import getopt, sys, glob
import os
import os.path
import subprocess

def do_plx(d):
    """Removes .plx files that already have a matching .mat file"""
    files = glob.glob(d + '/*.plx')
    for f in files:
        fname_root = os.path.basename(os.path.splitext(f)[0])
        dir_name = os.path.dirname(os.path.dirname(f))
        mat_name = '%s/mat/%s.mat' % (dir_name, fname_root)
        if os.path.isfile(mat_name):
            print("plx file can be eliminated")
            os.remove(f)

def do_flac(d):
    """Does the actual compression via flac"""
    files = glob.glob(d + '*_ch0*')
    if files:
        print("found something to flac")
    for f in files:
        #only compress the files with no extensions
        if os.path.splitext(f)[1] == '':
            #Argument list instead of a shell string, so odd file names are safe
            status = subprocess.call(["flac", "-f", "--endian=little",
                                      "--channels=1", "--bps=16",
                                      "--sample-rate=10000",
                                      "--sign=signed", f])
            #Make sure that flac succeeded and the .flac file actually exists
            #before removing the original!
            if status == 0 and os.path.isfile(f + ".flac"):
                print("Removing %s" % f)
                os.remove(f)

def recursive_flac(d):
    dirs = glob.glob(d + '*/')
    for sub in dirs:
        dirname = os.path.basename(os.path.dirname(sub))
        if dirname == 'mat':
            print("found dir %s" % sub)
            do_flac(sub)
        elif dirname == "plx":
            do_plx(sub)
        else:
            #recurse into any other directory
            recursive_flac(sub)

def usage():
    print("""python great_compressor.py -d directory_name
    Recursively looks for /mat/*_ch* files and compresses them with flac,
    then deletes the original files. Also removes spurious plx files if any""")

def main():
    try:
        opts, args = getopt.getopt(sys.argv[1:], "d:")
    except getopt.GetoptError as err:
        # print help information and exit:
        print(str(err)) # will print something like "option -a not recognized"
        usage()
        sys.exit(2)
    thedir = "."
    for o, a in opts:
        if o == "-d":
            thedir = a
        else:
            assert False, "unhandled option"
    recursive_flac(thedir + '/')

if __name__ == "__main__":
    main()
```
And here’s a script I’m using to read the number of samples in a FLAC file in Matlab without having to actually read in the file:
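```matlab
function a = get_flac_length(fname)
    % Total number of samples, read from the STREAMINFO block in the header
    f = fopen(fname,'rb');
    header = fread(f,30,'uint8');
    fclose(f);
    % The count is a 36-bit big-endian integer:
    % the low 4 bits of byte 22, followed by bytes 23 to 26
    a = sum([bitand(header(22),15),header(23:26)'].*(2.^(32:-8:0)));
end
```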
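For reference, the same header trick can be sketched in Python. This is a rough port of the Matlab function above, not something taken from an existing library. It assumes the file starts with the `fLaC` marker followed directly by the STREAMINFO block, and the helper name `parse_flac_length` is made up for the example.

```python
import struct


def parse_flac_length(header):
    """Total sample count from the first 26 bytes of a FLAC file."""
    if len(header) < 26 or header[:4] != b"fLaC":
        raise ValueError("not a FLAC stream")
    # Bytes 18..25 (0-based) pack, big-endian: 20-bit sample rate,
    # 3-bit (channels - 1), 5-bit (bits per sample - 1), 36-bit total samples.
    (packed,) = struct.unpack(">Q", header[18:26])
    return packed & ((1 << 36) - 1)


def get_flac_length(fname):
    """Number of samples in a FLAC file, without decoding any audio."""
    with open(fname, "rb") as f:
        return parse_flac_length(f.read(26))
```

A returned value of 0 means the encoder did not record the total, which can happen when the stream was written to a pipe.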
3 responses to “Compressing wideband signals with FLAC”
If I understood this correctly, the data is in fact 12-bit, which makes me wonder: why store it as 16-bit? FLAC can store 12-bit data as well. If you want the ease of working with 16-bit, it might be an option to alter the padding: if you pad the data in a way that FLAC detects, FLAC automatically uses its wasted bits mechanism, which makes it output that padded 16-bit data but store it internally as 12-bit.
It might save you another 30%, something to consider I’d think!
I’m pretty sure that doesn’t matter, although it might be worth trying. Plexon stores data as 16-bit unsigned integers while it actually only uses a dynamic range of 12 bits. FLAC uses a linear prediction model to predict each sample and then entropy-codes the residuals, so it wouldn’t actually use 16 bits to encode a residual if it needs fewer than 16 bits. Maybe there are some internals that would be slightly more efficient if 12 bits were specified, though. Hard to say, but I don’t think you’ll get 25% extra compression, because of the entropy coding mechanism already in place.
I can assure you it does: I tried. If I take some noisy music, make it 12-bit and pad it with zeros, the file size shrinks by 32%. However, if I pad in a way that doesn’t trigger the wasted bits mechanism (for example, I pad with 1111), I don’t get any compression benefit.
The entropy coding stage assumes small random values. It doesn’t use a table approach or reordering like general-purpose compressors to reduce the range of occurring values, because that usually doesn’t happen in music signals, which FLAC was made to handle. The lower bits in music usually contain noise.
Except OptimFROG, all lossless audio codecs I know of don’t look for a reduced number of used values (as they expect noise), but quite a few of them handle the special case in which the last x bits are zero, because certain systems, for example DVD-Audio and LossyWAV use this to store data that is not the usual 8, 16 or 24 bit without end users having to support for example playing 19-bit audio.
So, in short, FLAC can only benefit if the last x bits are zero. If you’d like the extra 25%-30% space saving and are sure those last 4 bits don’t contain any information, just set them to zero and FLAC will do the rest.
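As a rough illustration of what "the last x bits are zero" means in practice, the sketch below counts the low-order bits that are zero in every sample and shows a reversible way of moving 12-bit values to the top of a 16-bit word. It assumes the 12-bit values are stored sign-extended in int16 (range -2048 to 2047), which may not match how the Plexon data is actually padded. The function names are hypothetical, and the detection logic only approximates what the encoder does internally.

```python
def wasted_bits(samples):
    """Number of low-order bits that are zero in every sample."""
    combined = 0
    for s in samples:
        combined |= s
    if combined == 0:
        return 0            # all-zero block: nothing to measure
    n = 0
    while combined & 1 == 0:
        combined >>= 1
        n += 1
    return n


def pad_low(samples, shift=4):
    """Shift 12-bit values up so the bottom `shift` bits are all zero."""
    return [s << shift for s in samples]


def unpad_low(samples, shift=4):
    """Undo pad_low (arithmetic shift, so negative values survive)."""
    return [s >> shift for s in samples]


raw = [-2048, -3, 0, 7, 2047]        # sign-extended 12-bit samples (assumed)
padded = pad_low(raw)
print(wasted_bits(raw), wasted_bits(padded))   # prints: 0 4
assert unpad_low(padded) == raw
```

The shifted values still fit in int16, and shifting back on read recovers the original samples exactly, so the change stays lossless.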