By Osian Haines.
While Simultaneous Localisation and Mapping is in principle a solved problem, building a reliable SLAM system presents a number of engineering challenges which make it much more than just implementing tracking and mapping algorithms. This article describes some of the more difficult aspects of practical SLAM, and ways in which the various problems can be addressed.
In previous articles we have introduced SLAM (Simultaneous Localisation and Mapping) and discussed some of the issues relating to using SLAM systems in various situations. From reading these, one might be forgiven for thinking that SLAM is basically a solved problem, and that implementing a SLAM system is just a matter of putting together the required components. However, while the core problem of finding the location of a sensor while mapping its environment in real time can be addressed in a variety of ways, there are plenty of difficulties which make building a practical and reliable SLAM system very challenging. This article presents a selection of these problems, from data acquisition through to map refinement, to try to give a sense of the significant engineering challenges and research questions involved when doing SLAM in the real world.
For the purposes of this article, I will be focusing on modern sparse, indirect approaches to visual SLAM - where monocular or stereo cameras are the primary form of input. The most common approach, since the seminal 'Parallel Tracking and Mapping' paper (Klein and Murray 2007), is to split tracking and mapping into separate concurrent processes. The tracking process constantly estimates the location of the camera with respect to the map so far (represented as a point cloud), maintaining a real-time localisation; while the background mapping process gradually expands the map by adding selected camera frames ('keyframes'), and optimising the whole thing by minimising reprojection errors through nonlinear least-squares ('bundle adjustment') - for more details on basic SLAM, see this previous article. However, many of the topics discussed here are not restricted to visual SLAM, and problems such as data association and relocalisation are universal to SLAM, despite differences in approach relating to the specific setup.
A reminder of what SLAM is: the input image is shown on the left, overlaid with tracked points. The points are tracked through each frame to maintain a real-time estimate of the camera pose (visible in the map, shown on the right), whereas the 3D map is updated periodically, as needed.
Several of the practical difficulties in visual SLAM stem from the cameras being used for input. As described in a previous article, a pinhole camera model is usually used. The relevant parameters of the camera (focal length and principal point) should be known, as these are involved in projecting 3D points into a 2D image. Various methods exist to calibrate these parameters, but even small errors can lead to problems when tracking and mapping. A poor calibration means 3D points will not project to the correct 2D location, which can mean points fail to track as they become too far from their expected location; and map geometry can suffer if the triangulation of points is systematically wrong.
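The projection that these parameters control can be sketched in a few lines. The calibration values below are purely illustrative (not from any real camera), but they show how an error in, say, the principal point shifts every predicted point location:

```python
def project(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a 3D point (in camera coordinates, Z forward)
    to 2D pixel coordinates. fx, fy are the focal lengths in pixels and
    (cx, cy) is the principal point."""
    X, Y, Z = point_cam
    return (fx * X / Z + cx, fy * Y / Z + cy)

# Illustrative calibration values:
u, v = project((0.1, 0.0, 2.0), fx=500.0, fy=500.0, cx=320.0, cy=240.0)

# A miscalibrated principal point shifts every projection, so point
# searches centred on the predicted location may miss the true match:
u_bad, _ = project((0.1, 0.0, 2.0), fx=500.0, fy=500.0, cx=330.0, cy=240.0)
```

Here a 10-pixel error in cx moves every predicted projection by 10 pixels, which is easily enough to push a point outside a small search region.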
Another complication of real cameras is the use of a rolling shutter, which is especially common in lower-end webcams or mobile phone cameras. Instead of capturing the entire image on the camera sensor at the same instant of time, the image is built up line-by-line. This takes some time, so the top part of the image is captured slightly earlier than the bottom. This is not noticeable when the camera is stationary, but for faster motions it manifests as a shearing of the image. Left unmodelled, this can cause huge problems for tracking, as points no longer project where they are expected to, and models of rigid 3D geometry no longer apply. The easiest solution is to simply use a global shutter camera, but when this is not possible, it may be necessary to adjust point search regions to compensate. Some recent methods take the rolling shutter effect into account in the optimisation pipeline, but at the cost of a much more complicated algorithm.
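A minimal sketch of the row-time model behind this effect, assuming rows are read out top-to-bottom at a constant rate (itself a simplification of real sensors):

```python
def row_capture_time(row, frame_start, readout_time, image_height):
    """Approximate capture time of an image row under a rolling shutter,
    assuming rows are read out top-to-bottom at a constant rate over
    `readout_time` seconds."""
    return frame_start + readout_time * (row / image_height)

# With a 30 ms readout, the bottom row of a 480-row image is captured
# 30 ms later than the top row - combined with fast camera motion, this
# per-row time offset is what produces the characteristic shear:
t_top = row_capture_time(0, 0.0, 0.030, 480)
t_bottom = row_capture_time(480, 0.0, 0.030, 480)
```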
A key issue in SLAM - indeed, one of the central technical hurdles in designing a working SLAM system - is data association. This is the process by which observations of image points at different times are associated with each other, which usually means determining which existing map point a new image point refers to. If the data association problem were solved and we knew for every measurement coming from the sensor which map point it referred to, then the SLAM problem would effectively be reduced to ensuring efficient mapping. However, when there is uncertainty over what measurements mean, this can be a difficult problem to solve.
Data association is greatly helped in visual SLAM by the fact that cameras provide a very rich source of information about the environment. The regions surrounding image points are often distinctive enough to enable reliable matching based on appearance, obviating the need to consider all possible associations between image and map points.
In indirect SLAM there are two common ways to do appearance based data association. The first is to use small image patches extracted around (a reference observation of) each map point, and to compare these against regions of the incoming image using a measure such as absolute pixel differences or cross correlation. These can be fast and reliable, assuming an approximate search region is known, but the patches may need to be warped before comparison to account for rotation, scale or viewpoint changes, which requires some knowledge of the point's 3D position and orientation. The second approach is to use image descriptors. A descriptor is basically a compact representation of local image appearance about some point (usually a vector representing gradient or texture information). These can be more robust to illumination and viewpoint changes compared to patch matching, and comparison between descriptors is very efficient. On the other hand, this approach adds an additional overhead of descriptor computation for each image.
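The patch comparison measures mentioned above can be sketched briefly; the tiny 2x2 'patches' here are purely illustrative, and real systems would use larger (often warped) patches:

```python
import numpy as np

def sad(patch_a, patch_b):
    """Sum of absolute pixel differences: lower means more similar."""
    return float(np.abs(patch_a - patch_b).sum())

def zncc(patch_a, patch_b):
    """Zero-mean normalised cross-correlation: 1.0 is a perfect match,
    and subtracting the mean gives some robustness to brightness changes."""
    a = patch_a - patch_a.mean()
    b = patch_b - patch_b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b)))

patch = np.array([[10.0, 20.0], [30.0, 40.0]])
brighter = patch + 50.0              # same texture, brighter illumination
sad_score = sad(patch, brighter)     # penalises the brightness change
zncc_score = zncc(patch, brighter)   # invariant to it
```

The example shows why the choice of measure matters: a uniform illumination change leaves the normalised correlation at a perfect 1.0 while the raw pixel difference is large.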
Even with the power of patch or descriptor matching, there will never be a perfect match between the map and image regions, so there is always some uncertainty over whether the correct match is made. Tuning matching thresholds to only accept good matches can help, but risks throwing away useful correspondences; whereas incorrect associations can lead to errors in the tracked camera pose or map, if not properly taken into account by some kind of outlier removal strategy (see below).
The parallelised approach to tracking and mapping described above provides an elegant way of managing the two inherently linked tasks: tracking against an existing map, then using the tracked camera to expand the map. However, this ignores the process of map initialisation, i.e. how to get the initial map when the SLAM system is first started.
In monocular SLAM this can be a very challenging task, since a single camera frame gives no information about the depth of points. One approach is to assume a known calibration target is present at the start of mapping. This should be easy to detect (a black rectangle on a white background, say) and will immediately give a real-world scale to the map. The initial map would consist of just the target, but this is enough to start tracking and then to add more map points. However, this puts a severe limitation on the use of the SLAM system, namely that the initialisation target must always be visible to start mapping, which would add unwanted complexity when using the system for purposes such as augmented reality.
An alternative approach involves delaying the initialisation until some parallax has been observed. This approach often involves tracking a set of distinctive points through a number of frames (usually by patch matching or optical flow algorithms). Assuming good enough tracking, the corresponding matches between the first and last frames for each point can be used to recover the relative pose transformation (using the eight-point algorithm, for example), and then to triangulate the 3D position of the points to make the initial map. However, this approach can be problematic in practice: it requires that the points remain in view long enough to gain parallax, and for their appearance to change slowly enough for them to be tracked. There might never be enough parallax if points are actually far away or if the camera is rotating about its centre, and this can lead to nonsensical initialisations, which can be difficult to detect.
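The triangulation step at the end of this process can be illustrated with the standard linear (DLT) method; the camera matrices and point below are made up for the example:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation: recover a 3D point from its pixel
    projections x1, x2 in two views with 3x4 projection matrices P1, P2."""
    A = np.array([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)   # null space of A gives the homogeneous point
    X = Vt[-1]
    return X[:3] / X[3]

# Two views with identity intrinsics, the second translated along x:
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 4.0])

def proj(P, X):
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

X_est = triangulate(P1, P2, proj(P1, X_true), proj(P2, X_true))
```

Note that if the baseline between the two views shrinks towards zero (the no-parallax case described above), the rows of A become nearly dependent and the recovered depth becomes arbitrarily sensitive to noise.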
A more elegant approach is to use a map point parameterisation capable of representing uncertainty in the depth direction. This allows undelayed initialisation of points in the map, despite the fact that their depth is initially unknown. Their depth estimates should gradually converge as more observations are made, but in the meantime their bearing still provides a constraint on the rotation of the camera. Such representations are usually associated with filter-based approaches to SLAM, as opposed to optimisation-based algorithms, but it is possible to combine both approaches to make use of undelayed initialisation of tracking together with background map optimisation.
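A minimal sketch of an inverse-depth style parameterisation, a simplified version of the representation used in filter-based systems (the function and values are illustrative):

```python
import numpy as np

def inverse_depth_to_xyz(origin, bearing, rho):
    """Convert an inverse-depth point to Euclidean coordinates.
    origin: camera centre at the first observation; bearing: unit ray
    direction; rho: inverse depth. As rho -> 0 the point recedes towards
    infinity, yet its bearing still constrains the camera's rotation."""
    return origin + bearing / rho

origin = np.zeros(3)
bearing = np.array([0.0, 0.0, 1.0])
near = inverse_depth_to_xyz(origin, bearing, rho=0.5)     # depth 2
far = inverse_depth_to_xyz(origin, bearing, rho=0.001)    # depth ~1000
```

The appeal of the inverse-depth form is that enormous depth uncertainty (anything from a few metres to infinity) maps to a small, well-behaved interval of rho values near zero.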
Tracking points between frames can be relatively easy when the difference between images is small and the images are of good quality. Of course, this is not always the case. Tracking gets increasingly difficult as the time between images increases, because the image location of tracked points could have moved further between frames, necessitating a larger search area. Unfortunately, the larger the search area, the slower the search, and the higher the chance of mismatches. This is a big problem when attempting to perform SLAM on low-powered devices: if it takes too long to process a frame, it does not just result in slow tracking, but because frames are skipped it can mean much worse tracking - especially as the resources to search larger regions may not be available. In practice this acts as a limitation on the speed with which such devices can track or build a map.
Worse still are situations where an image to be tracked is heavily blurred - this can occur due to fast motion relative to the camera's exposure time. A blurred image is much harder to find matches in, since much of the high-frequency information which distinguishes local patches from each other will be missing. Moreover, this loss of visual information can make patches look more similar to each other, increasing the chance of mismatches. This is not only a problem for tracking, where it would lead to a less reliable pose (or complete tracking failure) but can impede mapping too. If blurred keyframes are added, these can make tracking difficult even when the incoming images are no longer affected.
Assuming a map has been initialised and image conditions are favourable, tracking a new camera frame is (in principle) fairly simple: match points in the map to features in the current image and use this to recover the current pose - but unless special care is taken, this can be very prone to errors which will dramatically decrease accuracy. It makes sense to use as many matches between the image and the map as are available to get as good an estimate as possible - however, if just one match is incorrect, this can cause the result to diverge significantly, as a measurement's effect on the error term being minimised will increase the further it is from the other measurements. This necessitates some kind of outlier removal strategy, in order to weed out the bad measurements while keeping the rest. A very popular approach is Random Sample Consensus (RANSAC), in which multiple hypotheses are compared according to how many samples agree with the model, allowing a good model to be chosen while bad correspondences are eliminated. As the image below suggests, this depends on choosing a sensible threshold for separating good points from bad - which is another tradeoff between rejecting bad data and the risk of omitting good data.
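The RANSAC idea can be sketched on a toy problem - estimating a 2D translation between matched point sets, where a single correspondence forms the minimal sample (in real pose estimation the minimal set is larger, but the hypothesise-and-score structure is the same):

```python
import random
import numpy as np

def ransac_translation(src, dst, iters=100, threshold=1.0, seed=0):
    """Minimal RANSAC: estimate a 2D translation between matched point
    sets containing outliers. A single correspondence is the minimal
    sample for a translation; the hypothesis with the most inliers
    (residual below `threshold`) wins, then is refined on its consensus set."""
    sampler = random.Random(seed)
    best_t, best_inliers = None, np.array([], dtype=int)
    for _ in range(iters):
        i = sampler.randrange(len(src))
        t = dst[i] - src[i]                           # hypothesis from one match
        residuals = np.linalg.norm(dst - (src + t), axis=1)
        inliers = np.nonzero(residuals < threshold)[0]
        if len(inliers) > len(best_inliers):
            best_t, best_inliers = t, inliers
    best_t = (dst[best_inliers] - src[best_inliers]).mean(axis=0)  # refine
    return best_t, best_inliers

data_rng = np.random.default_rng(1)
src = data_rng.uniform(0.0, 100.0, size=(30, 2))
dst = src + np.array([5.0, -3.0])                        # true translation
dst[::10] += data_rng.uniform(20.0, 40.0, size=(3, 2))   # corrupt 3 matches
t, inliers = ransac_translation(src, dst)
```

A least-squares fit over all 30 matches would be dragged off by the three corrupted ones; the consensus step isolates them, and the threshold plays exactly the role described above.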
Robustness to errors is crucial in map-building too. As the size of the map increases, more and more links between points and cameras become a part of the optimisation process. If even a few of these links are incorrect (due to a mismatch during tracking, for example), the constraints they impose can prevent the optimisation from reducing the overall error. This can be partly ameliorated by using more robust error measures, which seek to reduce the effect of large error terms beyond a certain value. This can significantly increase the accuracy of the resulting map in the face of uncertain data, but comes with its own difficulties, such as making it harder to add new information if it is deemed too unlikely given what is there already. Choosing the right parameters to maintain robustness to gross errors while adding uncertain measurements can be a fine balancing act.
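A common choice of robust error measure is the Huber cost, which is quadratic for small residuals but only linear beyond a threshold, so a single gross outlier cannot dominate the total error (the threshold value here is arbitrary):

```python
def huber(r, delta=1.0):
    """Huber cost: quadratic for |r| <= delta, linear beyond it, so large
    residuals contribute far less than they would under plain least squares."""
    a = abs(r)
    return 0.5 * r * r if a <= delta else delta * (a - 0.5 * delta)

small = huber(0.5)    # behaves like least squares
large = huber(100.0)  # grows linearly, not quadratically
```

Compare huber(100.0) = 99.5 with the squared-error cost 0.5 * 100^2 = 5000: the balancing act mentioned above amounts to choosing delta so that genuine but uncertain measurements stay in the quadratic regime while gross errors do not.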
One of the defining characteristics of a SLAM system (as opposed to offline map-generation) is that it should operate in real time, which generally means that everything that needs to be done should be finished by the time the next camera frame is available. However, this is not simply an issue of making the algorithms faster (or running on better hardware), because while exploring new areas, a SLAM system is constantly expanding its map. Therefore, scalability rather than speed is the crucial factor.
This puts limitations on the kind of processing which can be done during tracking. For example, if the tracking process attempts to match to all points in the map, this will get progressively slower as more points become available. It is therefore necessary to introduce strategies to limit the map points which can be potentially matched - and yet this introduces further problems, such as limiting the ability of the system to recognise places it has been before and making the tracking more prone to failure.
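One simple strategy along these lines is to project the map into the image using the predicted camera pose, and only attempt matches for points that land inside it. A sketch, with points given directly in camera coordinates for brevity (a real system would first transform them by the predicted pose):

```python
def candidate_points(points_cam, fx, fy, cx, cy, width, height):
    """Keep only the map points that project inside the image from the
    predicted pose (points already transformed into camera coordinates),
    so the per-frame matching cost stays bounded as the map grows."""
    candidates = []
    for i, (X, Y, Z) in enumerate(points_cam):
        if Z <= 0.0:                                  # behind the camera
            continue
        u, v = fx * X / Z + cx, fy * Y / Z + cy
        if 0.0 <= u < width and 0.0 <= v < height:
            candidates.append(i)
    return candidates

pts = [(0.0, 0.0, 2.0),    # straight ahead -> visible
       (5.0, 0.0, 1.0),    # projects far off-image
       (0.0, 0.0, -1.0)]   # behind the camera
idx = candidate_points(pts, 500.0, 500.0, 320.0, 240.0, 640, 480)
```

This culling is exactly the kind of shortcut that trades speed for robustness: points wrongly culled by a bad pose prediction can never be matched, which is one way tracking becomes more prone to failure.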
A constantly growing map is also a problem for the mapping process, as there are ever more measurements to be optimised, so the map update will take longer each time a keyframe is added. This means a longer and longer delay before the map can next expand, gradually diminishing the distance the camera can move before running out of map. This is a serious problem, but can be partly solved by adding a local optimisation phase as well as a global optimisation. The local optimisation updates the new keyframe and its immediate neighbourhood, which can be kept at a fixed or bounded size, enabling timely expansion even in large maps. It is still necessary to perform global optimisation when possible, to maintain accuracy non-locally (this is illustrated below); but since the time this takes can still grow without bound, further strategies to reduce the complexity of the optimisation may be necessary, such as maintaining only the relationships between camera poses and not the points (a 'pose graph').
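The local window can be as simple as the new keyframe plus its most strongly covisible neighbours; the covisibility weights below (numbers of shared map points) are made up for the example:

```python
def local_window(covisibility, new_kf, max_size=5):
    """Bounded local optimisation window: the new keyframe plus its most
    strongly covisible neighbours (weights = number of shared map points).
    Keeping max_size fixed bounds the cost of each local update."""
    weights = covisibility.get(new_kf, {})
    neighbours = sorted(weights, key=weights.get, reverse=True)
    return [new_kf] + neighbours[:max_size - 1]

# Keyframe 4 shares 120 points with keyframe 3, 80 with keyframe 2, etc.:
covis = {4: {3: 120, 2: 80, 1: 10, 0: 2}}
window = local_window(covis, new_kf=4, max_size=3)
```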
Failure and Recovery
No matter how good the tracking algorithm being used or how sophisticated the outlier-removal strategy, sooner or later - perhaps due to fast motion or other adverse conditions - tracking will fail and the link between the current camera frame and the map will be lost. In these situations it is necessary to have some mechanism for re-establishing the camera's location with respect to the map. Solving this 're-localisation' problem is crucial to any practical SLAM system, or it becomes useless at the first failure; unfortunately, reliably recovering the camera pose when there is a large map to search can be a demanding task. Relocalisation, and the related issue of loop closure, form a large topic in their own right and will be the focus of a future article.
Parallelising the tracking and mapping necessitates some way for the two processes to communicate safely. The mapping process periodically updates the map, but without a careful implementation this will cause problems for the tracking process, which is using the map to localise the camera (such as accessing invalid memory or causing deadlock). Simply blocking access to the map data during an update would be impractical, as the tracker would frequently be forced to wait, breaking the real-time requirement. Maintaining separate copies of the map for mapping and tracking can alleviate this (though there will still need to be a non-trivial synchronisation phase), at the expense of increased memory use.
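A sketch of the separate-copies idea (the class and its methods are illustrative, not from any particular system): the mapper edits a private working map and publishes deep-copied snapshots, so the lock is held only for a reference swap and the tracker always sees a complete, consistent map:

```python
import copy
import threading

class SharedMap:
    """Double-buffered map sketch: the mapper edits a private working map
    and publishes immutable snapshots; the tracker always reads a complete,
    consistent snapshot, at the cost of the extra copy in memory."""
    def __init__(self):
        self._lock = threading.Lock()
        self._snapshot = {"points": []}

    def publish(self, working_map):
        snap = copy.deepcopy(working_map)   # copy made outside the lock
        with self._lock:
            self._snapshot = snap           # lock held only for the swap

    def latest(self):
        with self._lock:
            return self._snapshot           # treated as read-only by the tracker

shared = SharedMap()
working = {"points": [(0.0, 0.0, 1.0)]}
shared.publish(working)
working["points"].append((9.0, 9.0, 9.0))   # mapper keeps editing its own copy
view = shared.latest()                      # published snapshot is unaffected
```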
The problem is compounded when multiple levels of optimisation are required (as described above). It would perhaps be desirable to run global map optimisation in the background at all times, even when expanding the map, but this means there will need to be synchronisation between the locally-optimised version and the more globally optimal but less up-to-date version. Alternatively, global optimisation may need to be preempted every time an expansion is required, and resumed with updated information afterwards, but quickly enough to not disrupt tracking, while integrating information from a half-finished optimisation. These threading and memory management issues are not necessarily intractable, but make the implementation of an efficient parallelised SLAM system that much less straightforward.
In this article I have described why, while the core ideas of SLAM are well understood, many factors relating to using it in the real world can make building a reliable and robust SLAM system an extremely challenging (but interesting!) task. From map initialisation and data association, through the quest for an optimal map, to issues of robustness and recovery, each step in the SLAM pipeline is beset with its own difficulties, necessitating complicated (and often interacting) mechanisms to deal with them effectively. Nevertheless, due to much research and development work in recent years, these difficulties are gradually being resolved, and the prospect of a reliable working SLAM system is not an impossible dream.