Setting up a cluster for the analysis of neural array data

As I’ve mentioned before, we’ve been recording from a Utah array for more than a year now. We use a custom Plexon MAP system that allows for the recording of wideband data at 10 kHz over 96 channels. This custom configuration was requested so that we can get spike-free LFPs by preprocessing wideband data with the spike-removal algorithm derived in Zanos, Mineault & Pack (2011). Such an arrangement generates an immense amount of data that needs to be moved around, backed up, filtered, and sorted.

Early on, we decided to set up a cluster of computers that could be used by anybody working on the Utah array data. Each computer in this cluster would be indistinguishable from the next; it would be accessed almost exclusively remotely; it would have local access to all the array data; and it would have a decent CPU and a bucket of RAM. Getting all the pieces working well together was a challenge, and I hope you can learn a thing or two from my experiences.

Hardware and OS setup

Here’s a picture of the setup:

At the bottom, you can see the 4 identical computers. Each has an Intel Core i7 920 CPU, 12 GB of RAM (6×2 GB, upgradeable to 24 GB), a very small internal hard drive, an on-board video card, and that’s pretty much it. Drone 1 (on the left) is connected to an external housing that can hold up to 4 hard drives (the small case with the blue lights on top of the computer); it currently holds 3×2 TB drives for a total of 6 TB. This external stack is connected to drone 1 through eSATA, a type of connection that gives essentially the same access speed as if the hard drives were connected internally (up to 3 Gbit/s of bandwidth).

To give the other computers access to the data in the housing, it was essential to get them all on the same fast network. The MNI’s internal network is capped at 100 Mbit/s (~12 MB/s), so I decided to connect them through their own fast internal network. This was done through the Gigabit (1 Gbit/s, or ~120 MB/s) router sitting on top of the third computer. Due to internal policies of the computing services at McGill, it was also necessary to connect each of these computers individually to the McGill network; hence an extra Gigabit Ethernet card was installed in each computer.

I installed Ubuntu on drone 1, as well as the relevant scientific software: Matlab, R, RKWard, etc. I also installed NX, a wonderful remote desktop server that lets one access these Linux computers externally from the platform of one’s choice, whether Windows, Mac, or Linux. It compresses X11 traffic and is remarkably efficient; one can quite comfortably work on these computers from home despite limited bandwidth. I used no-ip.com to give the drones simple internet addresses instead of hard-to-remember IPs.

Once everything was installed on drone 1, I cloned its drive three times to set up the 3 other computers, which saved a lot of time.

To give access to the data from the other computers, I used sshfs, which lets one mount a remote hard drive over ssh in a completely automatic manner. Using symlinks, I arranged it so that every computer accesses the data from the same location (~/ArrayData). Thus, from the perspective of a remote user, all drones have the same software and see the same data. A similar system was used for documents (i.e., .m files).
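
For concreteness, here’s a minimal sketch of what the mount-and-link step could look like in Python; the hostname, export path, and mount point below are made up, and the real setup may just as well use an fstab entry instead.

    import os
    import subprocess

    # Hypothetical names: "drone1" and the paths below stand in for the real ones.
    remote = "drone1:/media/array_stack"             # data housing attached to drone 1
    mountpoint = os.path.expanduser("~/mnt/drone1")  # local mount point
    link = os.path.expanduser("~/ArrayData")         # common access path on every drone

    if not os.path.ismount(mountpoint):
        if not os.path.isdir(mountpoint):
            os.makedirs(mountpoint)
        # sshfs mounts the remote directory over ssh; -o reconnect keeps it alive
        subprocess.check_call(["sshfs", "-o", "reconnect", remote, mountpoint])

    # Every drone then sees the data at the same ~/ArrayData location
    if not os.path.islink(link):
        os.symlink(mountpoint, link)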

Automated fetching and backup pipeline

Next, I sought to automate the process of getting the data from the recording computers. Two computers are used to record the data: a Mac which does the presentation through homebrew Psychtoolbox-based software, and a PC which records .PLX files from the Plexon MAP box. We agreed to use a consistent naming scheme for experiments: a letter (representing which physical hard drive the data will eventually be put in), three digits (for the recording day), and another letter (or letters) for each experiment on this recording day.
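
As an illustration, a small helper to parse this naming scheme might look like the following (the function and field names are mine, not from the actual pipeline):

    import re

    # letter (target drive) + three digits (recording day) + letter(s) (experiment)
    EXPERIMENT_RE = re.compile(r"^([a-z])(\d{3})([a-z]+)$")

    def parse_experiment(name):
        """Split an experiment name like 'c012b' into its parts."""
        m = EXPERIMENT_RE.match(name.lower())
        if m is None:
            raise ValueError("not a valid experiment name: %r" % name)
        drive, day, experiment = m.groups()
        return {"drive": drive, "day": day, "experiment": experiment}

    print(parse_experiment("c012b"))
    # {'drive': 'c', 'day': '012', 'experiment': 'b'}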

SFTP servers were installed on both the Mac and the PC to allow external access to the data from the drones. I then wrote a Python script which fetches the data for a particular experiment (say, c012b) from both machines: it reads the metadata in .mat files from the Mac and the .plx file from the PC, and spits out usable files. I used the .plx reader modified from OpenElectrophy that I discussed earlier, as well as the .mat-file reading routines that are part of SciPy (scipy.io).
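
The .plx side relies on the OpenElectrophy-derived reader mentioned above, which I won’t reproduce here; the .mat metadata side is just SciPy. A minimal sketch (the file name and its variables are placeholders):

    from scipy import io as sio

    # Hypothetical metadata file fetched from the Mac; variable names are made up.
    meta = sio.loadmat("c012b_params.mat", squeeze_me=True, struct_as_record=False)

    # loadmat returns a dict keyed by Matlab variable name (plus '__header__', etc.);
    # here we just list what the presentation software saved.
    for name, value in meta.items():
        if not name.startswith("__"):
            print(name, type(value))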

Using this script is pretty simple; one simply needs to type in the command sudo python fetchAndConvert.py -e c012a to fetch and convert experiment c012a. It’s also possible to fetch and convert an entire day with the same software.
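
The command-line handling itself is nothing exotic; here’s a rough sketch of how the -e flag could be parsed with argparse (the actual script’s options may differ, and I’ve left out the whole-day mode):

    import argparse

    parser = argparse.ArgumentParser(description="Fetch and convert array data")
    parser.add_argument("-e", "--experiment", required=True,
                        help="experiment name, e.g. c012a")
    args = parser.parse_args()

    print("would fetch and convert %s" % args.experiment)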

The software first creates a fixed directory structure; for experiment c012a it first creates the root folder ~/c/c012/c012a. Under this folder it creates several subfolders, including /plx, which contains the original .plx file; /mat, which contains the .plx data after it’s been processed into a format that Matlab can read; and /ml, which contains the data from the Mac.
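
For illustration, that fixed layout could be created with a few lines of Python (the helper below is hypothetical, but the paths match the description above):

    import os

    def make_experiment_dirs(data_root, name):
        """Create the fixed layout for one experiment, e.g. ~/c/c012/c012a/{plx,mat,ml}."""
        drive, day = name[0], name[:4]              # 'c' and 'c012' for 'c012a'
        root = os.path.join(data_root, drive, day, name)
        for sub in ("plx", "mat", "ml"):
            path = os.path.join(root, sub)
            if not os.path.isdir(path):
                os.makedirs(path)
        return root

    print(make_experiment_dirs(os.path.expanduser("~"), "c012a"))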

Each of the mat/expName_chxxx files is a sequence of 16-bit integers describing the 10 kHz wideband signal; these can be read directly in Baudline, as previously discussed, and in Matlab using fread.
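
The same channel files can also be read in Python with NumPy, mirroring what fread does in Matlab; a minimal sketch, assuming native byte order and an illustrative file name:

    import numpy as np

    # Illustrative file name; one raw int16 stream per channel.
    samples = np.fromfile("mat/c012a_ch001", dtype=np.int16)

    fs = 10000.0   # wideband sampling rate (10 kHz)
    print("%d samples, %.1f s of wideband data" % (samples.size, samples.size / fs))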

The Python script also creates notes.txt files, derived from the Mac metadata, which describe the experimental parameters. These notes can be viewed in a plain text editor or in the Experiment Notes app, a very simple app made with PyQt that shows the notes, allows one to view sorted spikes (more on this later), and so forth. The data is loaded directly from the directory structure and notes.txt files rather than from a database. Looking back, this wasn’t the smartest idea; it takes a while for the app to traverse the directory structure, and it would be easier and more flexible to store the information in a database (MySQL or SQLite, for example) rather than in flat files.
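
For what it’s worth, the SQLite route wouldn’t take much code; here’s a hedged sketch of what a minimal experiment table could look like (the table and column names are mine, not from the app):

    import sqlite3

    conn = sqlite3.connect("experiments.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS experiments (
            name     TEXT PRIMARY KEY,   -- e.g. 'c012a'
            day      TEXT,               -- e.g. 'c012'
            notes    TEXT,               -- contents of notes.txt
            n_trials INTEGER
        )""")
    conn.execute("INSERT OR REPLACE INTO experiments VALUES (?, ?, ?, ?)",
                 ("c012a", "c012", "example notes", 120))
    conn.commit()

    for row in conn.execute("SELECT name, n_trials FROM experiments ORDER BY name"):
        print(row)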

Using a fixed directory structure makes it straightforward to automate backup tasks. I have a cron job set up on drone 1 that runs the Unison syncing tool twice a week, backing up the data and analysis files to a remote Terastation.
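
The crontab entry for this kind of job is a one-liner; the schedule and the Unison profile name below are made up:

    # min hour dom mon dow  command  (hypothetical 'arraydata' Unison profile, Mon & Thu at 3 am)
    0 3 * * 1,4  unison arraydata -batch >> "$HOME/unison-backup.log" 2>&1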

Automated spike sorting

Once the data is fetched and reformatted so that Matlab can understand it, the raw wideband data is filtered and spikes are detected and sorted with the help of Wave_clus. Kyler Brown heavily modified the Wave_clus codebase and created a secondary GUI to enable this. The user first opens the GUI and selects a .mat file in the mat directory of the experiment to be sorted:

Pressing the sort button starts the sorting process. This can take a long time, hence an option was added that sends an email to the relevant people whenever this first (automated) sorting phase is finished. Obviously, it would be feasible to notify the user through other means, like text messaging, if desired.
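
The sorting GUI itself lives in Matlab, but to give an idea of what the notification step involves, here’s a rough Python equivalent using smtplib (the addresses and SMTP host are placeholders, not the actual implementation):

    import smtplib
    from email.mime.text import MIMEText

    def notify_done(experiment, recipients, smtp_host="localhost"):
        """Send a short 'sorting finished' email; addresses and host are placeholders."""
        msg = MIMEText("Automated spike sorting for %s has finished." % experiment)
        msg["Subject"] = "[cluster] sorting done: %s" % experiment
        msg["From"] = "drone1@example.org"
        msg["To"] = ", ".join(recipients)

        server = smtplib.SMTP(smtp_host)
        server.sendmail(msg["From"], recipients, msg.as_string())
        server.quit()

    # notify_done("c012a", ["someone@example.org"])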

Once the file has been automatically sorted, the user can select the review option to tweak the sorting through an interface adapted from Wave_clus. The interface takes screenshots whenever the user views a channel and places them in the mat folder; these can then be accessed through the Experiment Notes app.

Once the sorting has been reviewed, the user returns to the original GUI and selects another option to use the currently selected experiment as a template for sorting other experiments on the same day.

LFPs are also derived automatically in a separate step after spike sorting, using the software I wrote for Zanos, Mineault & Pack (2011).
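
The spike-removal algorithm itself is described in the paper and I won’t reproduce it here, but to give a feel for the final step, here’s a generic low-pass-and-decimate sketch (filter order, cutoff, and rates are illustrative; this is not the Zanos et al. method):

    import numpy as np
    from scipy import signal

    fs_wide, fs_lfp = 10000, 1000      # wideband rate and an illustrative LFP rate (Hz)

    # Fake wideband trace standing in for one (spike-removed) channel
    wideband = np.random.randn(10 * fs_wide)

    # Conventional LFP extraction: low-pass filter, then decimate.
    # (NOT the spike-removal step from Zanos et al. 2011 -- just what follows it.)
    b, a = signal.butter(4, 250.0 / (fs_wide / 2.0), btype="low")
    lfp = signal.filtfilt(b, a, wideband)[:: fs_wide // fs_lfp]

    print(lfp.shape)                   # 10 seconds at 1 kHz -> (10000,)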

Conclusion

Dealing with large amounts of neurophysiological data is a challenge. I’ve shared with you how we dealt with it by building a cluster. I’m pretty happy with the chosen solution, although alternative setups would be workable. One possibility is that instead of setting up a cluster, a single computer is dedicated to fetching and storing the data. This is a workable arrangement if a fast enough link is available between the analysis computers and the storage computer. Once the data is sorted (for spikes) or downsampled (for LFPs), the amount of data that needs to be transferred between computers is small enough to work with comfortably over a standard 100 Mbit/s network. I do think, however, that it’s not a good idea to use an entirely decentralized system where sorting is done on the analysis computers; too much data needs to be moved around, and sharing the data between analysts is a pain.

Creating scripts to automate fetching and preprocessing was, I think, a very good idea; it guarantees that the data structure is the same for every experiment, which makes it easier to automate other aspects of preprocessing, like spike sorting.

People use different methods to keep track of the experiments they perform, ranging from experimental notebooks to Excel worksheets to more sophisticated solutions like the one provided by OpenElectrophy. The Experiment Notes app allows central tracking of the experiments performed. OpenElectrophy has similar functionality, and it would probably have been less trouble to use this ready-made solution than to roll our own.

As far as spike sorting goes, it should be possible to use proprietary software like Plexon’s spike sorter rather than a homebrew solution. There are several issues with this: it requires Windows; it costs a lot of money; everything is GUI-driven, so it seems like it would be a pain to automate; and I wouldn’t have the kind of fine-grained control over filters and processing that I have with a homebrew solution. I like the algorithm implemented in Wave_clus; it’s semi-supervised, and manual adjustments are rarely necessary.

3 responses to “Setting up a cluster for the analysis of neural array data”

  1. Hey! Thanks for this blog post. I am recording from multiple Utah Arrays and we have mountains of data. Do you think I can get hold of this modified Wave_Clus by any chance? Would be great. Thanks!

  2. […] We’ve been recording neurons on Utah arrays for ~3 years now. That creates a lot of data, and we keep buying more drives (above – there’s about 30TB of storage on the hub computer). Drives themselves are cheap; it’s backing up the data which becomes expensive in the long run, both in terms of bandwidth and cost of buying and maintaining backup servers. […]

  3. How do you get the MAP system to store the whole 10kHz signal? In my experience, it refuses to store anything but waveforms through Rasputin. Are you using a different program?
