By Ryan Wang.
In our previous articles we explained some of the difficulties in SLAM (simultaneous localisation and mapping) systems from an algorithmic point of view, including camera calibration, data association, system initialisation, and algorithms to compensate for camera distortions. This article gives a brief introduction to the situations in which SLAM systems encounter limitations imposed by hardware or by particular usage conditions. The following key points will be discussed:
- The limitations of SLAM systems due to camera hardware characteristics
  - Frame rate
- The static world assumption
- The geometric limitations in monocular and stereo SLAM systems
  - Pure rotation
  - Stereo baseline
Camera Characteristics Limitations
Most existing SLAM systems use RGB cameras or RGB-D cameras to obtain visual information (i.e. a colour camera or one which can also perceive depth). As those cameras have a limited frame rate, a limited bandwidth for data transmission, and usually a rolling shutter, the performance of SLAM systems is constrained by those hardware characteristics.
As we typically have very limited bandwidth to send and receive camera frames, we can either use relatively low-resolution image data or compress frames - but both options limit the performance of SLAM systems. Specifically, lower-resolution images are unable to resolve small details within the scene; and real-time encoding/decoding requires significant computational power and introduces compression artifacts as well.
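To get a feel for the bandwidth pressure, here is a back-of-envelope sketch; the resolutions, frame rates, and bytes-per-pixel figures below are illustrative assumptions, not measurements from any particular camera:

```python
def raw_bandwidth_mbps(width, height, bytes_per_pixel, fps):
    """Raw (uncompressed) video bandwidth in megabits per second."""
    return width * height * bytes_per_pixel * fps * 8 / 1e6

# Even a modest 640x480 RGB stream at 30 fps needs ~221 Mbit/s uncompressed:
print(raw_bandwidth_mbps(640, 480, 3, 30))    # -> 221.184
# A 1080p RGB-D stream (3 bytes colour + 2 bytes depth, assumed) is ~10x worse:
print(raw_bandwidth_mbps(1920, 1080, 5, 30))  # -> 2488.32
```

Numbers like these are why real systems must either drop resolution or pay the computational and artifact cost of compression.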
Shutter Related Distortions
As the camera(s) used in SLAM systems do not remain still, movement of the camera will introduce motion blur to frames. Most real-time SLAM systems operate at around 30 frames per second, which can tolerate some motion but not sudden or intense movement. Although motion blur can be reduced by capturing images at a higher frame rate, this will of course run into the constraints of bandwidth and computing power limitations again.
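The interaction between camera motion and exposure can be sketched with a small-angle approximation; the angular velocities, exposure time, and focal length below are assumptions for illustration only:

```python
import math

def motion_blur_pixels(omega_deg_s, exposure_s, focal_px):
    """Approximate blur streak length (pixels) for a camera rotating at
    omega_deg_s during one exposure; small-angle approximation."""
    return math.radians(omega_deg_s) * exposure_s * focal_px

# Slow pan: 30 deg/s, 10 ms exposure, 500 px focal length -> ~2.6 px (tolerable)
print(motion_blur_pixels(30, 0.01, 500))
# Fast turn: 300 deg/s -> ~26 px of blur, enough to break feature tracking
print(motion_blur_pixels(300, 0.01, 500))
```

Shortening the exposure reduces the blur, but capturing usable images at short exposures generally means higher frame rates and gain, which runs into the bandwidth and compute limits above.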
Another limitation introduced by the camera is the distortion due to the use of a rolling shutter. Most low-cost cameras use a CMOS sensor with an electronic rolling shutter, which exposes the image row by row rather than all at once; this introduces the typical rolling shutter distortion, in which the whole image appears sheared when the camera is in motion. The following video demonstrates this distortion.
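The shear can be sketched as a horizontal displacement that grows with each row's readout delay; the sensor size, readout time, and image-plane speed below are assumed for illustration:

```python
def rolling_shutter_skew(row, rows_total, readout_s, image_speed_px_s):
    """Horizontal displacement (px) of a given row relative to row 0,
    for a camera panning while the sensor reads out top to bottom."""
    return image_speed_px_s * readout_s * row / rows_total

# 480-row sensor, 30 ms full-frame readout, scene moving 1000 px/s in the image:
# the bottom row lands ~30 px away from where the top row saw it -> a sheared image.
print(rolling_shutter_skew(479, 480, 0.03, 1000))
```

A global shutter sets the readout delay between rows to zero, which is exactly why it avoids this shear.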
Some better cameras use a global shutter, which exposes the whole sensor at the same time, but these are generally more expensive; most importantly, especially in the context of mobile augmented reality (AR), popular mobile phones are generally not equipped with global shutter cameras. This means that using a global shutter camera for SLAM usually requires building dedicated hardware, for example in AR headsets or automotive applications. More explanation of rolling shutters and hard/soft global shutters can be found here.
The Static World Assumption
Most existing SLAM algorithms rely on the 'static world assumption' to localise the sensor and construct a map of the environment, which simply means that (other than the camera) nothing in the scene is moving with respect to the mapped environment. If the environment changes, the SLAM system can simply become lost, as its location relative to the static map it has built becomes meaningless. The reason current SLAM systems can still work in such situations is that they often include filtering mechanisms (e.g. RANSAC, ICP) that treat small movements as noise. Nevertheless, SLAM systems will always fail if the environment changes dramatically.
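As a minimal sketch of how such a filtering mechanism can reject independently moving points, here is a toy RANSAC over a pure 2D translation model; the point coordinates and thresholds are made up for illustration, and this is not the pipeline of any particular SLAM system:

```python
import random
import numpy as np

def ransac_translation(src, dst, iters=200, thresh=2.0, seed=0):
    """Toy RANSAC: fit a 2D translation between matched point sets and
    flag matches that do not follow it (e.g. a moving object) as outliers."""
    rng = random.Random(seed)
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        i = rng.randrange(len(src))
        t = dst[i] - src[i]                       # hypothesis from one match
        err = np.linalg.norm(src + t - dst, axis=1)
        inliers = err < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return best_inliers

# Static background shifts by (5, 0); one "moving cup" point goes its own way.
src = np.array([[0., 0.], [10., 0.], [0., 10.], [50., 50.]])
dst = src + np.array([5., 0.])
dst[3] += np.array([20., -7.])                    # independent object motion
print(ransac_translation(src, dst))               # -> [ True  True  True False]
```

The camera pose is then estimated from the inliers only, which is why a single moving cup does not break tracking - but also why the cup itself is no longer tracked.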
The static world assumption also poses problems in AR applications. Imagine a camera facing a table with a cup on it. A modern SLAM system will not fail if the cup is moved, because it should filter the movement as noise; but this also means that augmented reality content (for example, a small advertisement appearing to stick out from the surface) cannot stay attached to the cup once it moves, because the cup is no longer being tracked. A more difficult situation would be trying to augment a hand with a virtual glove (a non-rigid body) in an AR application. There is ongoing research trying to work around the static world assumption and enable tracking of non-rigid scenes, but it remains at an early stage. For more information on doing SLAM in a dynamic environment, see this paper.
Pure Rotations in Monocular SLAM
One of the most serious problems for monocular SLAM systems is the pure rotation situation, which may lead to large errors in the estimated camera pose. This can be explained by considering a camera looking at a point. When the camera rotates about its centre, the trajectory of this point projected onto the image plane appears very similar to that caused by a large translational movement. Therefore, when there is a paucity of information from other points, the SLAM system cannot tell what the true 3D motion is, and may compute an incorrect camera pose after wrongly concluding that a large translational movement happened. Despite various recent work addressing the problem of pure rotation, it remains one of the most serious problems in monocular SLAM systems.
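A quick numerical sketch of the ambiguity (the focal length, depth, and motion values are assumptions for illustration): for a point near the optical axis, a rotation shifts its projection by roughly f·tan(θ) regardless of depth, while a sideways translation shifts it by f·t/Z, so the two can produce nearly identical image motion:

```python
import math

f = 500.0  # focal length in pixels (assumed)

def shift_from_rotation(theta_deg):
    """Image shift (px) of a point near the optical axis when the camera
    rotates by theta about its centre; independent of the point's depth."""
    return f * math.tan(math.radians(theta_deg))

def shift_from_translation(t_m, depth_m):
    """Image shift (px) when the camera translates sideways by t metres."""
    return f * t_m / depth_m

# A 1-degree rotation shifts the point by ~8.7 px ...
print(shift_from_rotation(1.0))
# ... nearly the same as a 17 cm sideways translation with the scene 10 m away:
print(shift_from_translation(0.17, 10.0))
```

From a single point the two hypotheses are indistinguishable; only parallax across points at different depths separates them, and pure rotation produces no parallax at all.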
Baseline in Stereo SLAM
Stereo SLAM systems have (at least) two cameras, and can therefore estimate depth by computing the disparity between the two slightly offset camera images. However, a good estimate of scene depth needs a decent baseline (the spatial distance between the two cameras in the system). Generally, the wider the baseline, the better the depth estimation. But a wider baseline needs a larger space in which to mount the cameras, which may not always be available. For example, a SLAM system mounted on a car (to enable driver-assisting augmented reality) can take advantage of the large amount of roof space available and employ a wide baseline, possibly around one metre. However, the baseline of the cameras on an AR headset will usually be only around 20cm, and far less if we want to achieve stereo vision on a mobile phone. This also explains why an iPhone 7 Plus, with its two closely spaced cameras, cannot be used as a stereo vision device, as we discussed in this article.
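The pinhole-stereo relation Z = f·b/d makes the trade-off concrete; the focal length and baselines below are assumptions for illustration, not specifications of any actual device:

```python
def depth_from_disparity(f_px, baseline_m, disparity_px):
    """Pinhole stereo model: depth Z = focal length * baseline / disparity."""
    return f_px * baseline_m / disparity_px

# With f = 500 px, the depth corresponding to a single pixel of disparity
# shrinks with the baseline, and with it the usable sensing range:
print(depth_from_disparity(500, 1.0, 1))   # car-roof rig: 500 m per px
print(depth_from_disparity(500, 0.2, 1))   # headset-sized baseline: 100 m
print(depth_from_disparity(500, 0.01, 1))  # phone-sized baseline: 5 m
```

With a centimetre-scale baseline, everything beyond a few metres collapses into sub-pixel disparities, which is why such closely spaced cameras make a poor stereo depth sensor.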
In conclusion, modern SLAM systems are still under development, and there is always a trade-off between the different limitations introduced by hardware, geometric considerations, computational power, device or scene motion, power consumption, and so on. This suggests the need for specific SLAM systems tailored towards different applications and platforms, such as indoor/outdoor, mobile phone/headset/automobile. No one general SLAM system can deal with all situations.
Newcombe, R. A., Fox, D. & Seitz, S. M. DynamicFusion: Reconstruction and Tracking of Non-rigid Scenes in Real-Time. CVPR 2015.
Pirchheim, C., Schmalstieg, D. & Reitmayr, G. Handling Pure Camera Rotation in Keyframe-Based SLAM. ISMAR 2013.