How do RGB SLAM systems use image data?
A simple explanation of Sparse/Dense and Direct/Indirect SLAM systems
Monocular visual Simultaneous Localisation and Mapping (SLAM) has become very popular because it relies only on a standard camera. Since cameras are now found in a huge range of consumer electronics, single-camera SLAM systems are appealing both as an area of research and as a key enabling technology for applications such as augmented reality.
Visual SLAM algorithms are designed to exploit the rich information about the world that image data provide. The way a SLAM system uses these data can be classified along two axes: sparse/dense, which describes how much of each received image frame is used, and direct/indirect, which describes the different ways in which the image data are used. Combining the two axes gives four possible types of SLAM system.
Sparse and Dense Methods
From the perspective of which areas of an acquired image are used, SLAM systems can be classified as either sparse or dense. Sparse SLAM systems use only a small, selected subset of the pixels in each image frame, while dense SLAM systems use most or all of them. Because they use such different numbers of pixels, the maps the two approaches generate are very different. Maps from sparse methods are essentially point clouds: a coarse representation of the scene, used mainly to track the camera pose (localisation). Dense maps, on the other hand, capture far more detail of the viewed scene; but because they process many more pixels, more powerful hardware is usually needed, and most current dense SLAM systems require a GPU. Figures 1-3 illustrate the difference between maps generated by sparse and dense SLAM systems.
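To make the distinction concrete, here is a minimal sketch (in Python with NumPy, not taken from any particular SLAM system) contrasting the pixel budget of a dense method with the small high-gradient subset a sparse method might keep. The toy image, the gradient criterion, and the threshold are all illustrative assumptions:

```python
import numpy as np

# Toy 100x100 "image": a bright square on a dark background.
image = np.zeros((100, 100), dtype=np.float32)
image[40:60, 40:60] = 1.0

# A dense method uses (nearly) all pixels of each frame.
dense_pixels = image.size

# A sparse method keeps only a small subset -- here, pixels with a
# strong intensity gradient (the edges of the square). The threshold
# of 0.25 is a hypothetical choice for this toy example.
gy, gx = np.gradient(image)
gradient_magnitude = np.sqrt(gx**2 + gy**2)
sparse_pixels = int((gradient_magnitude > 0.25).sum())
```

Running this, the sparse selection keeps only a few hundred edge pixels out of 10,000, which is why sparse systems can run on modest hardware while dense systems typically need a GPU.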
Direct and Indirect Methods
The way a SLAM system uses the information in a received image classifies it as either direct or indirect. Indirect SLAM systems first extract features from the image, and then use those features to locate the camera and build the map. These features can be simple geometric primitives such as corners or edges, or more sophisticated feature detectors and descriptors such as SIFT, ORB, or FAST (as in ORB-SLAM, shown in Figure 4). Direct methods, in contrast, operate on pixel intensities directly rather than extracting intermediate features: they recover the scene depth and structure and the camera pose through a joint optimisation over the map and camera parameters. Because feature extraction can be time-consuming, direct methods potentially leave more time for other computations while maintaining the same frame rate as indirect methods. On the other hand, indirect feature-based methods tolerate changing lighting conditions better, since, unlike direct methods, they do not use the pixel intensities directly.
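The direct idea can be illustrated with a deliberately simple sketch: a 1D signal in place of an image, and a single unknown shift in place of a full camera pose. We recover the shift purely by minimising the photometric error (sum of squared intensity differences) over candidate shifts, with no feature-extraction step. Real direct SLAM systems optimise over six-degree-of-freedom poses and per-pixel depths; this only shows the principle:

```python
import numpy as np

# Two 1D "images": the second is the first shifted by 3 samples.
rng = np.random.default_rng(0)
ref = rng.random(50)
true_shift = 3
cur = np.roll(ref, true_shift)

def photometric_error(shift):
    # Direct methods compare raw intensities under a candidate motion,
    # rather than matching extracted features.
    return np.sum((np.roll(ref, shift) - cur) ** 2)

# Brute-force search over candidate shifts stands in for the
# optimisation a real direct system would perform.
best = min(range(-10, 11), key=photometric_error)
# best recovers the true shift of 3
```

Note that the photometric error is only meaningful if intensities are comparable between frames, which is exactly why direct methods are more sensitive to lighting changes than feature-based indirect methods.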
There are many popular monocular SLAM systems today, and the way each one uses image data (sparse/dense and direct/indirect) can guide the choice of algorithm for a given application and hardware platform. Figure 5 gives a straightforward illustration of where a selection of popular SLAM systems fall on these two axes. Information on the systems in Figure 5 that were not described here can be found in the papers and demonstration videos listed below.
Related Papers and Demonstration Videos (in alphabetical order)
DSO: Engel, J., Koltun, V. and Cremers, D., 2016. Direct sparse odometry. arXiv preprint arXiv:1607.02565.
DTAM: Newcombe, R.A., Lovegrove, S.J. and Davison, A.J., 2011, November. DTAM: Dense tracking and mapping in real-time. In Computer Vision (ICCV), 2011 IEEE International Conference on (pp. 2320-2327). IEEE.
LSD-SLAM: Engel, J., Schöps, T. and Cremers, D., 2014, September. LSD-SLAM: Large-scale direct monocular SLAM. In European Conference on Computer Vision (pp. 834-849). Springer International Publishing.
MonoSLAM: Davison, A.J., Reid, I.D., Molton, N.D. and Stasse, O., 2007. MonoSLAM: Real-time single camera SLAM. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6), pp. 1052-1067.
ORB-SLAM: Mur-Artal, R., Montiel, J.M.M. and Tardos, J.D., 2015. ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 31(5), pp. 1147-1163.
PTAM: Klein, G. and Murray, D., 2007, November. Parallel tracking and mapping for small AR workspaces. In Mixed and Augmented Reality, 2007. ISMAR 2007. 6th IEEE and ACM International Symposium on (pp. 225-234). IEEE.
SVO: Forster, C., Pizzoli, M. and Scaramuzza, D., 2014, May. SVO: Fast semi-direct monocular visual odometry. In Robotics and Automation (ICRA), 2014 IEEE International Conference on (pp. 15-22). IEEE.