How many pixels make an object? Like, 30

There’s a neat paper on the psychophysics of scene and object recognition in super-low resolution scenarios in Visual Neuroscience by A. Torralba (2009). The author sought to answer a rather interesting question: what image resolution is needed to support scene and object recognition? He took images from databases and created several different versions of them, differing in resolution from 4×4 (!) to 128×128 pixels.

In the first experiments observers were asked to identify a scene (“bedroom”, “beach”, “forest”, etc.) based on such images. Even at the lowest resolution (4×4), people were frequently above chance. At 16×16, 75% accuracy was reached for outdoor scenes. To keep things in perspective, 16×16 pixels is the size of a favicon, the tiny icon that is used to visually identify a website in a browser; it sits to the left of the address bar (for this site, it’s a big black X on a white background). For indoor environments higher resolution was required to reach this accuracy, yet even at 16×16 many scenes were clearly recognizable. Here’s an example:

Did you figure it out yet? If you saw a bedroom, you are correct. To be clear, the image was created by downsampling the original image to 16×16 and then presented after upsampling through interpolation. It contains the same info as a 16×16 image without the distracting appearance of square pixels. Here’s a tougher one:
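For the curious, here is a minimal sketch of that downsample-then-interpolate step. The post only says "downsampling" and "interpolation", so the block averaging and bilinear interpolation below are my assumptions, not necessarily the filters used in the paper. The function names are hypothetical, and a random array stands in for a real photo.

```python
import numpy as np

def downsample(img, size):
    """Block-average a square grayscale image down to size x size.

    Assumption: plain block averaging; the paper may use a different filter.
    """
    h, w = img.shape
    assert h == w and h % size == 0, "expects a square image divisible by size"
    f = h // size  # side length of each averaged block
    return img.reshape(size, f, size, f).mean(axis=(1, 3))

def upsample_bilinear(small, out_size):
    """Bilinearly interpolate a square image up to out_size x out_size."""
    n = small.shape[0]
    # Positions of the output samples in the small image's coordinates.
    coords = np.linspace(0, n - 1, out_size)
    i0 = np.floor(coords).astype(int)   # neighbour below
    i1 = np.minimum(i0 + 1, n - 1)      # neighbour above, clamped at the edge
    t = coords - i0                     # fractional distance from i0
    # Interpolate along rows first, then along columns.
    rows = small[i0, :] * (1 - t)[:, None] + small[i1, :] * t[:, None]
    return rows[:, i0] * (1 - t)[None, :] + rows[:, i1] * t[None, :]

# Stand-in for a real 128x128 grayscale photo.
rng = np.random.default_rng(0)
original = rng.random((128, 128))

low = downsample(original, 16)        # the actual 16x16 stimulus
shown = upsample_bilinear(low, 128)   # smooth version, no visible square pixels
```

The smooth `shown` image is computed entirely from the 256 values in `low`, so it carries no extra information; it just looks less blocky.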

That’s a car in front of an office building (street scene). Now, as I mentioned, indoor images need higher resolution to be categorized. Part of the reason is that color is much more informative about scene identity in outdoor scenes. Green is usually associated with forest environments while blue and beige are associated with beaches. Outdoor scenes tend to contain fewer surfaces, as well.

Okay, so 16×16 or 32×32 seems like pretty low resolution to identify a scene, but what about identifying an object? If a scene has been identified, it seems likely that at least a few objects have been recognized correctly within it. Yet at these resolutions the objects must be tiny! In a second set of experiments, observers were asked to identify both a scene and the objects within it. Here’s an example of how a psychophysical observer tagged a bedroom at 16×16:

That’s 6 objects right there in a measly 256 pixels. The area of the headboard is probably not more than 30 pixels, and yet it can still be recognized! Presented alone, of course, such an object is unrecognizable; context furnishes the additional information. Here’s a dramatic example of this:

What is that? Context gives the answer:

It’s a sink, and it was correctly tagged by an observer despite being only about 8×4 pixels. Of course, if you already know that the scene is a bathroom, prior information tells you that there is probably a sink in there. But the observers here were asked both to classify the scene and tag its objects. So the scene must be classified despite its objects being unrecognizable, and the objects somehow tagged despite the scene identity (and thus context) being unavailable. An explanation involving a hierarchical Bayesian scheme is almost too tempting. A few relevant papers: Yuille & Kersten (2006), Bengio & LeCun (2007), Rao & Ballard (1999).
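To make the Bayesian intuition concrete, here is a toy sketch of joint inference over a scene and one object. Every number in it is invented for illustration, and the two-scene, two-object model is my own simplification, not anything taken from the paper or from the references above. The point is only that two weak cues (a coarse layout and an ambiguous blob) can support each other.

```python
import numpy as np

scenes = ["bathroom", "bedroom"]
objects = ["sink", "pillow"]

# All probabilities below are made up for this toy example.
p_scene = np.array([0.5, 0.5])              # prior over scenes
p_obj_given_scene = np.array([[0.9, 0.1],   # bathroom: mostly sinks
                              [0.1, 0.9]])  # bedroom: mostly pillows
lik_layout = np.array([0.6, 0.4])  # P(coarse layout | scene), weak evidence
lik_blob = np.array([0.6, 0.4])    # P(blob pixels | object), weak evidence

# Joint posterior over (scene, object) given both cues, via Bayes' rule:
# P(s, o | data) is proportional to P(s) P(layout | s) P(o | s) P(blob | o).
joint = (p_scene * lik_layout)[:, None] * p_obj_given_scene * lik_blob[None, :]
joint /= joint.sum()

post_scene = joint.sum(axis=1)  # marginal posterior over scenes
post_obj = joint.sum(axis=0)    # marginal posterior over objects

for name, p in zip(scenes, post_scene):
    print(f"P({name} | data) = {p:.3f}")
for name, p in zip(objects, post_obj):
    print(f"P({name} | data) = {p:.3f}")
```

With these made-up numbers, each cue alone favors its hypothesis at only 0.6, yet both the "bathroom" and the "sink" posteriors come out higher than that once the cues are combined through the scene-object prior.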

In any case, pretty thought provoking, and a zippy read.

Torralba A (2009). How many pixels make an image? Visual Neuroscience, 26 (1), 123-131. PMID: 19216820

6 responses to “How many pixels make an object? Like, 30”

  1. Apologies for diving right in, thanks for providing a quick overview of dopamine! There is plenty of research out there showing that DN can respond differently to stimuli associated with different reward values, which is largely what led to the reward hypothesis of dopamine function. As you say, the inhibition after an absent reward has been used to extend the theory to reward prediction error.

    The argument from the dopamine as novelty/salience camp has been that the reward related response is the result of sensitisation of sensory inputs, and that it is an artefact of the experimental training, rather than a genuine ability. However, it’s bugged me as to what the normal function is of this sensory system that has been co-opted into signalling reward.

    If saccades can be directed to what the sensory systems assume is a stimulus of interest in a particular context, then perhaps DNs may be able to bias a signal according to an assumed value by a similar mechanism. This wouldn’t be a prediction error signal, because it would be a signal based on assumptions from prior experience, which doesn’t seem to get updated post-saccadically (i.e. post-recognition and post-valuation). It also wouldn’t make dopamine a reliable indicator of reward, because it would be impossible to distinguish reward responses from purely sensory responses. If this is the case, then DNs would be signalling the appearance of a novel/salient stimulus, but in different contexts different stimuli could be tweaked to be more or less salient.

    That isn’t to say that there isn’t some value related function that uses dopamine on a longer time base, say activity over a few seconds, or through synaptic modulation, but the phasic response (80-110 ms latency, ~200 ms duration) doesn’t seem to indicate a current estimate of the value of a stimulus.

  2. Neuromancy: this isn’t my area of expertise so I quickly looked at Redgrave, Prescott and Gurney (1999) to gain some perspective. For the benefit of other readers of this blog, let me state your question as I understand it.

    Dopaminergic neurons from the ventral midbrain and the VTA project to the striatum. They seem to provide reward-related signals that could support reinforcement learning. Like, you give the animal a reward, and 100 ms later there’s a big burst of spikes coming from these dopaminergic neurons (DN). On the other hand, if the animal expects a reward after a conditioning stimulus, and the reward doesn’t come, the DNs are inhibited 100 ms after the stimulus is expected to happen. So this could provide an error signal for reinforcement learning.

    But this DN activity also occurs for novelty in general. So if you present unexpectedly a stimulus outside the fovea, you’ll also get the DN burst. Most object recognition occurs near the fovea because the resolution is (so they say) too coarse outside. So it can’t be the “goodness” of the stimulus that’s triggering the burst, it’s just general novelty, and thus the DN signal is not purely reward-related.

    Now, as far as we know, directing saccades doesn’t require high resolution (try “bayesian location saccades” in Google Scholar). But what about object recognition (OR)? The previous study does go against the idea that you need the fovea to do half-decent OR. Another good hint of that is that some insects, which have way less resolution than the primate fovea, do perfectly good OR (see “The Visual Neurosciences” chapter on bee vision).

    A quick straw poll of the postdocs in the lab reveals that some object recognition is probably possible in the periphery. It’s just a question of degree. Like even far out in the periphery you might be able to tell some things about a face (race, gender for example), yet identity might elude you. We’re recording in V4 in the periphery (10-40 degrees eccentricity) and the neurons still seem to be tuned for Connor-style high-order features. IT receptive fields are pretty huge, also. The OR in the periphery will be limited by resolution and crowding, however.

    Whether this has any bearing on the DN debate is unclear, though.

  3. Wow, great stuff – it’s amazing how much we can do on so little information. I guessed that the sink was probably some bathroom item, but that was perhaps just because it was white and potentially indoors.

    I wonder how quickly we can identify these objects? One of the criticisms of interpretations of dopamine release as value related is that objects appearing outside of the fovea don’t have sufficient spatial resolution to be identified. Neither can the objects be brought on to the fovea to be recognised, and their value identified, at latencies short enough to trigger the release of dopamine. If we can identify these objects at such low spatial resolution, then perhaps we provide a rough identity/value signal to dopamine neurons at sufficiently short latencies.
