Tips on using crcns datasets

I’ve mentioned before that the CRCNS web site has a number of neural datasets available for download. To save you some time, here’s some tips to get you up and running for specific datasets.

V2-1 data

Jack Gallant’s V2 dataset is really interesting; I think it’s fair to say that we know very little about what V2 does. The prediction accuracy reported in Willmore et al. (2010) – r ~ 0.30 on a validation dataset – and the relatively short recording times – about 8 to 16k frames – might lead you to think that the data is too noisy to fit fancy models.

Actually, the data is quite good, but there’s some processing steps missing from the Willmore paper that will help you get better scores and fit fancier models. There’s two things you need to know:

  1. The stimuli are not very well centered in the receptive fields of the V2 cells, which can be verified by using STA in a Gabor basis
  2. Many of the cells are selective for very high spatial frequencies, which can be verified by looking at the grating data (DGrat)

The result is that if you represent the stimulus at a modest spatial resolution – 27×27 in the case of Willmore et al. – you will tend to miss the highest frequencies, for which there is significant tuning. On the other hand, you will miss the location of the receptive field if you do a tighter crop around the center of the stimulus. So you need to use e.g. STA in a Gabor basis to get a better estimate of the receptive field locations, do a relatively tight crop, and map the receptive field at high spatial resolution.

By simply re-cropping the stimulus, and using a similar model to the one in Willmore et al. – boosted GLM in rectified BWT basis – I got the prediction accuracy to an average of 0.38 – which brings us much closer to the score in V1 (~.4-.5). With a fancier model of V2, which I reported briefly on earlier and will publish soon, I got to an average validated r ~ 0.52. In other words, there is a lot of publication juice left in this dataset – don’t let the 0.3 score scare you away.

One thing which has yet to be done is to look at the fine-grained temporal structure of the spike trains. The PSTHs are sampled at the frame rate of the monitor. It’s possible to reconstruct the PSTHs at an arbitrary frame rate, however, using the timestamps of the spikes. They are in the raw data files, but they are given in a weird format, which AFAIK, is undocumented.

It’s not too difficult to reverse engineer, though. Here’s a script you can use to sample the data at some number times the framerate of the monitor (by default, 4x):

function Y = getHighResResp(respfile,collapse,upsample)
    if nargin < 2
        collapse = 0;
    if nargin < 3
        upsample = 4;
    d = load(respfile);
    kHzdata = d.kHzdata;
    psth = d.psth;

    thestart = find(kHzdata.stim==0,1);

    da = mean(diff(find(diff(kHzdata.stim(thestart:end))~=0)));

    nreps = 1;
    if da > 20
        thespk = find(diff(kHzdata.stim(thestart:end))>0);
        nreps = round(trimmean(diff(thespk),40)/(1/60.166)/1000);
        fprintf('Nframes per stim (>1):: %d\n',nreps);

    trialstarts = d.kHzdata.trialstartidx;
    trialstarts = trialstarts(~isnan(trialstarts));
    trialstarts(end+1) = length(kHzdata.stim)+1;

    Y = nan(upsample,nreps,(max(kHzdata.stim)+1),length(trialstarts)-1);

    if ~(length(kHzdata.resp)==length(kHzdata.stim))
        rs = -7e-5;
        ry = (find(kHzdata.resp)-1)/(1-rs);
        y = histc(ry,0:length(kHzdata.stim));
        y = y(1:end-1)';
        %kHzdata.resp = y;

        delta = length(kHzdata.resp)-length(kHzdata.stim);
        %kHzdata.resp = kHzdata.resp(delta+1:end);

        fprintf('Warning:: resampling y because of mismatch b/w kHzdata.stim and kHzdata.resp\n');

    delta = 201;

    thegoods = d.xpos + d.ypos;
    nope = isnan(thegoods);

    %nope = convn(nope,ones(3+floor(8/nreps),1),'same')>=1;
    nope(1:end-1,:)= nope(2:end,:);
    %if nreps > 1
    %    nope = convn(nope,ones(3,1),'same')>=1;
    nope = reshape(nope,nreps,size(nope,1)/nreps,size(nope,2));

    for ii = 1:length(trialstarts)-1
        %Trial range
        therg = (trialstarts(ii)+delta):(trialstarts(ii+1)-1);
        if isempty(therg)

        bounds = delta+trialstarts(ii)+[0;find(diff(kHzdata.stim(therg))~=0)+1;length(therg)-1]-1;
        bounds = bounds(1:end);
        if bounds(1) == bounds(2)
        for jj = 1:length(bounds)-1
            thesubrg = bounds(jj):bounds(jj+1);

            theframe = kHzdata.stim(thesubrg(1));

            %sample the spikes in this range
            spktimes = (find(kHzdata.resp(thesubrg))+.2)/length(thesubrg)*upsample*nreps;
            %if isempty(spktimes)
            %    continue
            y_ = histc(spktimes,0:upsample*nreps);

            y_ = reshape(y_(1:end-1),upsample,nreps);
            mask = nope(:,theframe+1,ii);
            y_(:,mask) = nan;

            y_(y_>=3) = nan;

            Y(:,:,theframe+1,ii) = y_;

    Y = reshape(Y,size(Y,1)*size(Y,2)*size(Y,3),size(Y,4));

    if collapse
        Y = nanmean(Y,2);

    %Compute the correlation between Y and psth
    Yp = nanmean(Y,2);
    Yp = sum(reshape(Yp,upsample,length(Yp)/upsample))';
    if nnz(isnan(Yp))>length(Yp)*.1
        fprintf('That''s a lot of nans: %d/%d\n',nnz(isnan(psth)),nnz(isnan(Yp)));
    gidx = ~isnan(Yp) & ~isnan(psth);
    gidx(1) = 0;
    thecorr = corr(Yp(gidx),psth(gidx))

    if (nreps == 1 && thecorr < .99) || (nreps > 1 && thecorr < .95)
        if nansum(psth) < 250
            fprintf('Correlation between psth and Y too low, going to let it slide because there are so few spikes\n');
            error('Correlation between psth and Y too low, perhaps mis-interpreting the file');

It worked for every cell I tried it on except about 4, which you can reject from the analysis – you still have 82 cells with validation data available.

MT data

The MT dataset, which was recorded by Farhan and Dave at my lab, must be cleaned up a little bit before it can be put to good use. Although this isn’t indicated, the cells from 1 to 64 contain single-electrode data, while the cells from 65-86 are recorded from a linear array. For whatever reason, the mtdata.spkbinned variable is not valid for the linear array data – for example, the  variables on cells 65 and 66 are exactly the same. I told Yuwei about this and he should upload the corrected version soon, but in the meantime you will have to resample the data yourself using mtdata.spkt if you want to use the linear array data.

That being said, I’m not sure how well sorted the linear array data is, so it might be better to just use cells 1 to 64. There’s a couple of cells in there, 13 and 14, that seem to contain mostly noise; in addition, for some of the early recordings, the recording quality drops after a certain number of frames. From inspecting the temporal filters estimated by STA in chunks of the data, I recommend truncating the recordings for these cells:

  • Cell 3: truncate after 48,000 frames
  • Cell 4: 36k
  • Cell 6: 30k
  • Cell 11: 30k
  • Cell 24: 66k
  • Cell 30: 42k
  • Cell 32: 48k
  • Cell 36: 48k
  • Cell 41: 24k
  • Cell 43: 42k

*Update Apr 1st 2014: the previous numbers I recommended were off by a factor 3, because I was downsampling the data by a factor 3 and I didn’t take that into account here previously. Numbers have been corrected. Sorry about that.

After the cleanup you should still have more than enough data to perform a good analysis.

V1 Ringach data

Dario Ringach – my new supervisor! – has a ton of V1 data on the CRCNS website. I wrote about how to get a simple analysis running on this dataset, and the advice is still valid (if a bit dated).

One response to “Tips on using crcns datasets”

Leave a comment

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s