By Osian Haines.

Abstract: This article discusses some important features of practical SLAM systems: recovery from failure, closing loops, and joining maps together. It describes the utility of each of these mechanisms, shows how they can be invaluable in the construction of an effective SLAM system, and explains how all three can be considered aspects of the same graph connectivity problem.
Previous articles in this series have introduced the concept of Simultaneous Localisation and Mapping (SLAM) and described some of the key difficulties involved in building a working SLAM system. So far we have mostly focused on issues of tracking and mapping which make up the core of a visual SLAM system (in which the pose of a camera is estimated with respect to the map, and the map is expanded and optimised according to the motion of the camera, respectively).
But another very important property of a working SLAM system is to be able to cope with tracking errors and failures, and to do this reliably as the map grows larger. This article will discuss a few mechanisms to do this, namely relocalisation, loop closure and submapping, and will show how all three can be usefully thought of as aspects of the same general problem.
As in our previous articles, I will be primarily considering these in the context of SLAM systems with parallelised tracking and mapping (in the PTAM/ORB-SLAM style). This consists of a tracking process which estimates the pose of the current camera in real time with respect to the map; and a background mapping process which periodically chooses camera frames to add as new 'keyframes' to the map and optimises their configuration, along with the set of observed 3D points. However, many of the ideas discussed are applicable to more general types of SLAM.
Sooner or later, tracking in SLAM will fail - possibly due to fast motion, poor image quality (e.g. blur), moving into a featureless area, and so on. In such cases it is necessary to have a mechanism to recover the camera pose with respect to the existing map as soon as possible, and then to resume tracking. This is called relocalisation and is a crucial part of any usable SLAM system - otherwise its usefulness would end at the first tracking failure.
At its most basic, relocalisation can be described as finding a correspondence between the current camera image (about which, in the monocular case, we may have no geometric information at all) and the existing map. Once a correspondence is established, giving at least an approximate camera location, it should then be possible to restart frame-to-frame tracking as usual. However, recovering tracking is much more difficult than tracking itself, because it cannot make use of the assumptions of spatial and temporal continuity which underpin tracking; indeed, it cannot assume that the camera is anywhere close to where it was in the last tracked frame, especially if it was moving quickly. This generally means relocalisation will need to search over the entire map for a match.
Not only might it be necessary to search the whole map, but this must be done quickly. This is essential as the camera will generally not stay still while relocalisation is happening, and so if it takes too long, the camera may well have moved on since its pose was recovered, making the result useless. Building an accurate, reliable, and fast method to relocalise the camera within the map is quite a challenge, but various methods exist to do this. The use of local (e.g. image descriptors) and global (e.g. image histograms, bags of words) appearance information is one popular way to achieve relocalisation, which can avoid considering geometric information, at least to begin with. Another way is to use a robust model estimation algorithm (e.g. Random Sample Consensus) to try to find a correspondence between the image and the map which makes geometric sense. The KudanSLAM system currently uses a mixture of the two approaches, and is demonstrated in the video below recovering tracking in a variety of difficult situations, in order to illustrate exactly why a relocalisation system is needed.
An example of the KudanSLAM system performing relocalisation, to recover from a variety of problems (tracking failure due to fast motion and a lack of texture, and solving the 'kidnapped camera' problem).
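To make the robust-estimation approach concrete, here is a deliberately simplified sketch of the RANSAC idea, reduced to finding a 2D translation between putative image-to-map point matches. A real relocaliser would estimate a full 6-DoF camera pose from 2D-3D matches; all names and numbers here are illustrative only.

```python
import math
import random

def ransac_translation(matches, iters=200, thresh=1.0, seed=0):
    """Find the 2D translation supported by the most matches.

    matches: list of ((x, y), (x2, y2)) putative image-to-map pairs,
    some of which may be gross outliers.
    """
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iters):
        p, q = rng.choice(matches)                 # minimal sample: one pair
        tx, ty = q[0] - p[0], q[1] - p[1]          # hypothesised translation
        inliers = [m for m in matches
                   if math.hypot(m[0][0] + tx - m[1][0],
                                 m[0][1] + ty - m[1][1]) < thresh]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = (tx, ty), inliers
    return best_model, best_inliers

# Synthetic data: ten matches related by a translation of (5, -3),
# plus two gross outliers that would skew a naive least-squares fit.
pts = [(float(i), float(i % 4)) for i in range(10)]
matches = [(p, (p[0] + 5.0, p[1] - 3.0)) for p in pts]
matches += [((0.0, 0.0), (40.0, 40.0)), ((1.0, 1.0), (-30.0, 7.0))]

model, inliers = ransac_translation(matches)
print(model, len(inliers))        # recovers (5.0, -3.0) with 10 inliers
```

The key property is the one relied upon above: a hypothesis generated from a minimal sample is either consistent with most of the data (an inlier model) or almost none of it, so the outliers never get a vote.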
A second mechanism which can be very useful in SLAM, especially when exploring large and complex areas, is loop closure. Ideally, when the camera returns to a place it has been before, the map will reflect this, and sections which were mapped the first time around can be re-matched the second time around. This avoids needing to constantly expand the map when coming back to previously explored areas, and will also improve the accuracy of the map as more constraints are added.
However, due to the accumulation of small errors (e.g. scale drift) in the camera trajectory, the ends of the loop may not properly meet, and so this connection will be missed. These errors need not be large - only large enough for 3D points to project far enough from their true locations that the tracker misses them. Without matching to previously created points, the SLAM system has no way to recognise that it has been in some location before, and would continue expanding the map indefinitely.
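A rough back-of-the-envelope calculation shows how quickly such drift compounds; the 1% figure here is purely illustrative:

```python
# Purely illustrative numbers: suppose each new keyframe overestimates
# scale by just 1%. The error compounds multiplicatively along the path.
per_keyframe = 1.01
for n in (10, 50, 100):
    print(f"after {n:3d} keyframes: {per_keyframe ** n:.2f}x scale error")
```

After a hundred keyframes, a 1% per-keyframe error has grown to roughly a factor of 2.7 - easily enough for the two ends of a loop to disagree wildly about where the same physical point lies.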
This can be solved by adding a mechanism to explicitly detect when such loops have been closed - i.e. 'loop closure'. Rather than relying only on the geometric information stored in the map to hypothesise loops, this compares appearance information in the current camera image (or the latest keyframe) against the past map. If the current image is deemed sufficiently similar to some previously visited location, a loop constraint can be added, stitching the two regions of the map back together. This helps global optimality, since it reduces the error accumulated over time due to drift, and leads to a more compact map. The following images show what effect this can have: error accumulated over time is removed by recognising that the same location has been visited twice.
A demonstration of why loop closure is important in large maps, showing the error due to accumulation of scale/pose drift (left) and the corrected map after incorporating the loop constraint shown (right). The images are from Strasdat et al. 2010.
Finding loops accurately can be difficult, since if it is not possible to rely on geometric information, a search over the whole map may be needed. On the other hand, while geometric information alone cannot be relied upon in identifying loops, some measure of geometric plausibility may help in eliminating very unlikely loops (e.g. loops which involve the camera flipping upside down are much less likely). Moreover, there is the possibility that the environment has changed - e.g. due to moving objects or changing lighting conditions - and this adds an extra difficulty in establishing robust correspondences. Fortunately, many of the techniques used in relocalisation can also be applied to the loop closure problem, since both involve finding a correspondence between some query image and a map, albeit with different constraints.
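As a sketch of the appearance-based side of this, suppose image features have already been quantised into visual-word histograms (a bag-of-words representation); candidate loops can then be found by comparing histograms. The vocabulary and numbers below are made up for illustration:

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two visual-word histograms."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def find_loop_candidates(query, keyframes, threshold=0.8):
    """Indices and scores of stored keyframes that look like the query."""
    return [(i, s) for i, kf in enumerate(keyframes)
            if (s := cosine_similarity(query, kf)) >= threshold]

# Toy word-id histograms; keyframe 2 revisits the place seen in the query.
keyframes = [
    Counter({5: 4, 6: 1}),
    Counter({3: 5, 4: 3}),
    Counter({1: 3, 2: 2, 7: 2, 9: 1}),
]
query = Counter({1: 4, 2: 2, 7: 3})
print(find_loop_candidates(query, keyframes))   # keyframe 2 scores ~0.96
```

In a real system, keyframes close in time to the query would be excluded, and any surviving candidates would then be verified geometrically before a loop constraint is actually added.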
As SLAM maps grow, the time taken to run optimisation on them will necessarily increase. One popular strategy to avoid such a scalability disaster is to periodically start from scratch with a new map. This is most useful when moving into new areas, so that exploration can continue without being hindered by the cost of maintaining the map already built. So long as some connection between the new map and the old map is maintained, tracking will be able to switch between maps as the camera moves back. This in effect adds another level to the SLAM map - as well as each local map being composed of individual keyframes, the global map is composed of individual maps and the relationships between them. Submapping was a very popular technique when SLAM systems were predominantly filter based, where scalability in the size of the map was a very big concern (e.g. this paper); but it remains useful in more recent bundle-adjustment-based systems.
A particularly good use of submapping is when tracking fails and recovery via other means proves to be impossible. Upon failure, a relocalisation-equipped SLAM system would continually attempt to recover the camera pose with respect to the existing map, to resume tracking and mapping as normal; but if such a recovery is impossible, nothing else can happen, for there is no connection to the map. This can be seen as wasted time, because the camera is potentially traversing new regions of the environment, yet any information about them is lost. The alternative is to simply begin a new map, and to continue mapping this possibly new region for as long as possible.
This means that tracking need not actually 'fail', but it does lead to a situation where there may be multiple independent (possibly overlapping) SLAM maps with no connection between them. One would therefore need some kind of submap joining algorithm, to make sense of these multiple maps and merge them back into one coherent map, once there are enough observations to re-establish the link.
Of course, submap joining is similar to loop closure, in that it involves finding a correspondence between different map keyframes. Instead of looking for correspondences between two keyframes in the same map, the search is between different keyframes in different maps. There will also be very few constraints on the type of transformation which may exist between different maps - unlike loop closure where pose or scale may have merely drifted, the relative pose and scale of submaps may have nothing in common.
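For intuition, this alignment problem can be sketched in 2D, where a similarity transform (rotation, scale, and translation) is conveniently expressed with complex numbers; a real system would solve for the analogous 3D transform, Sim(3). The data here is synthetic:

```python
def fit_similarity_2d(src, dst):
    """Least-squares 2D similarity mapping src points onto dst.

    Points are treated as complex numbers, so the transform is
    dst = c * src + t, where c = scale * exp(i * angle).
    """
    ps = [complex(x, y) for x, y in src]
    qs = [complex(x, y) for x, y in dst]
    pc, qc = sum(ps) / len(ps), sum(qs) / len(qs)       # centroids
    num = sum((q - qc) * (p - pc).conjugate() for p, q in zip(ps, qs))
    den = sum(abs(p - pc) ** 2 for p in ps)
    c = num / den                                       # rotation + scale
    t = qc - c * pc                                     # translation
    return c, t

# Submap B is submap A scaled by 2, rotated 90 degrees, shifted by (1, 1):
c_true, t_true = 2j, 1 + 1j        # 2 * exp(i * pi/2) and (1, 1)
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
dst = [((q := c_true * complex(x, y) + t_true).real, q.imag) for x, y in src]

c, t = fit_similarity_2d(src, dst)
print(abs(c), t)                   # scale ~2.0, translation ~(1+1j)
```

With three or more matched landmarks the transform is over-determined, which is exactly what makes the joining step robust: extra correspondences either confirm the fit or expose a bad match.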
It is important to note that using submaps will not solve all problems coming from tracking failures in SLAM. There will be a possibly indefinite delay before maps can be joined together, and even then considerable optimisation might be required to reconcile the different sets of observations, introducing a further delay before the whole global map is available. This may be acceptable but depends very much on the application for which the SLAM problem is being solved. It may be a very useful strategy when the end goal is an accurate global map of a new environment; but less applicable if an accurate camera localisation is required at all times, for example in augmented reality applications, where the link between the camera pose and its initial starting location is of prime importance.
A Unified Perspective
You may have noticed that the above all involve finding correspondences between the camera and the map, or different parts of the map. In fact, they may all be viewed as aspects of the same problem. This can be best described if the SLAM map is thought of as a graph, in which the nodes are keyframes and the edges represent observations they have in common (indeed, this is often the representation used within SLAM systems for optimisation). One of the earlier papers to explicitly state the equivalence of relocalisation and loop closure was by Eade and Drummond in 2008.
In this context, relocalisation can be thought of as adding a new edge between a node representing the current camera pose (which is initially unknown) and the rest of the map, thereby re-attaching the live camera to the existing map. The difficulty of relocalisation is that this edge could potentially connect to anywhere in the map.
An illustration of how relocalisation (left), loop closure (centre) and submap joining (right) are all aspects of the same problem, namely where to add a link in or between graphs.
Loop closure, on the other hand, can be thought of as adding an edge between two existing keyframes in the graph (usually via the current camera's pose). This enforces the constraint that two disparate locations in the graph actually correspond to the same physical location, and that the cumulative transformations around the loop should add up to nothing. Loop closure could also potentially involve a search over the entire graph, although some reasoning about plausible loops could be employed, such as avoiding adding edges to the graph which violate constraints imposed by existing edges.
Submap joining is similar to the above, but involves establishing a correspondence between two different graphs, representing different maps, rather than two parts of the same graph. This would form one large graph from the two components, connected only through the one new edge; but once this initial relationship is known, further searches could find more correspondences between the graph nodes, making them better connected. This unified perspective on all three tasks is illustrated above.
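The three operations can be sketched against a single toy graph structure, to emphasise that each one amounts to adding an edge (all keyframe names here are made up):

```python
class MapGraph:
    """A toy SLAM map: keyframes are nodes, shared observations are edges."""

    def __init__(self):
        self.edges = {}                    # keyframe id -> set of neighbours

    def add_edge(self, a, b):
        self.edges.setdefault(a, set()).add(b)
        self.edges.setdefault(b, set()).add(a)

    def merge(self, other, bridge):
        """Submap joining: absorb another map, linked by one bridging edge."""
        for kf, nbrs in other.edges.items():
            self.edges.setdefault(kf, set()).update(nbrs)
        self.add_edge(*bridge)

# A map built by ordinary tracking: kf0 - kf1 - kf2 - kf3.
world = MapGraph()
for a, b in [("kf0", "kf1"), ("kf1", "kf2"), ("kf2", "kf3")]:
    world.add_edge(a, b)

world.add_edge("camera", "kf1")    # relocalisation: re-attach the live camera
world.add_edge("kf3", "kf0")       # loop closure: link two existing keyframes

submap = MapGraph()                # a separate map begun after a failure
submap.add_edge("kf4", "kf5")
world.merge(submap, bridge=("kf3", "kf4"))     # submap joining

print(sorted(world.edges))
```

The hard part in each case is not the edge insertion itself, of course, but deciding which edge to add - which is a place recognition problem.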
As this article has discussed, some of the very important features needed in a reliable SLAM system - namely the ability to recover from tracking failure, to identify loops and thereby reduce errors, and to link different map fragments together - can all be thought of as aspects of the same general problem: that of adding a new edge in a graph representing the map. Not only does this insight aid understanding of these three tasks, but it can also lead to new algorithms which efficiently do all three, or suggest how advances on one problem can be used to improve performance on the others.
Now that the three tasks are understood in a single framework, the next question is how the necessary correspondences between keyframes can actually be established, in order to add the graph edges. In general, this is the problem of place recognition, which as hinted at above involves the use of appearance information and geometric constraints to establish robust correspondences within and between maps, in potentially very complicated situations. This is a topic in its own right, and will be the subject of my next article.