Place and scene recognition from video
While navigating in an
environment, a vision system has to be able to recognize where
it is and what the main objects in the scene are. We present a
context-based vision system for place and object
recognition. The goal is to identify familiar locations (e.g.,
office 610, conference room 941, Main Street), to categorize new
environments (office, corridor, street) and to use that
information to provide contextual priors for object recognition
(e.g., table, chair, car, computer). We have trained a system to
recognize over 60 locations (indoors and outdoors) and to
suggest the presence and locations of more than 20 different
object types. The algorithm has been integrated into a mobile
system that provides real-time feedback to the user.
As a test-bed for the proposed approach, we use a helmet-mounted mobile system. The system is composed of a web-cam set to capture 4 images/second at a resolution of 120x160 pixels (color). The web-cam is mounted on a helmet so that it follows the head movements while the user explores the environment. The user receives feedback about system performance through a head-mounted display.
Kevin Murphy and Antonio Torralba
We use a low-dimensional global image representation that captures the "gist" of the scene. This can be used as input to a Bayes net / hidden Markov model (HMM), as shown below. (See our ICCV03 paper for details.)
Below we show the performance of place recognition for a sequence that starts indoors and then goes outdoors (ICCV03, Figure 3).
Top. The solid line represents the true location, and the dots represent the posterior probability associated with each location. There are 63 possible locations, but we only show those with non-negligible probability mass.
Middle. Estimated category of each location.
Bottom. Estimated probability of being indoors or outdoors.
Some images from the dataset.
Publications
- A. Torralba, K. P. Murphy, W. T. Freeman and M. A. Rubin. Context-based vision system for place and object recognition. Intl. Conference on Computer Vision (ICCV), 2003.
Movies
- AVI of place recognition using the wearable camera.
If P(place-category(t) | vG(1:t)) > threshold, we print the category of the place (office, kitchen, etc.) in the top right corner (black = correct, red = incorrect).
If P(place(t) | vG(1:t)) > threshold, we print the name of the specific place (office 101, kitchen #3, etc.) in the bottom right corner (black = correct, red = incorrect). This thresholded labeling is sketched in code after this list.
- AVI of place recognition using the wearable camera. This one shows the HMM belief state superimposed on a topological map. The text output is the same as in the movie above.
The bottom half shows a map of the 9th floor of the AI lab (NE43).
The solid blue circle indicates P(place(t)|vG(1:t)) as computed using the HMM; the hollow black circle indicates P(place(t)|vG(t)) as computed using the instantaneous gist; the red/green cross marks the true location.
The size of each circle is proportional to the probability.
Notice how the HMM provides temporal smoothing. Nevertheless, there are discontinuous jumps, which apparently violate topological constraints; these occur because we apply Dirichlet smoothing to the transition matrix, which assigns non-zero probability to every transition, including topologically impossible ones. This effect can be reduced (at the cost of increased latency upon moving to a new location) by down-weighting the likelihood by an exponential factor (see the equation for \tilde{b}_t on p. 4 of the ICCV paper, and the sketch after this list).
- WMV movie which shows how Dan Roth
ported our place recognition system to an
ER1 mobile robot.
Data
- The video data used to generate the results in Figure 3 of the ICCV03 paper is available as part of the MIT CSAIL database of objects and scenes. Look for the folder called "paperSequence".
- The matlab file here contains the 80-dimensional gist vectors for the video sequence, together with the place numbers and names:
placeNames: {1x20 cell}
placeNums: [1x3430 double]
gists: [80x3430 double]
If you load the file into a struct foo and type
plot(foo.placeNums,'o-')
the result looks slightly different from Figure 3, since the names of the places were changed somewhat; but it is qualitatively similar. Note that although we considered 63 places in the ICCV03 paper, only 20 occur in this particular sequence. (A sketch of loading and using these gist vectors appears after this list.)
- The file gistsICCV03.zip (14MB) contains
17 files, similar to the above, for the 17 video sequences used
in the ICCV03 paper (see here for the list of files used for
training and testing).