by Osian Haines
When solving the simultaneous localisation and mapping (SLAM) problem, one crucial difficulty is the issue of scale. Scale - the relationship between sizes in the world and the map - is always important in SLAM systems, but is especially so in monocular SLAM, because true scale cannot be directly measured from a single camera. This means that, at best, the resulting maps and trajectories are only correct relative to an unknown scale factor; worse, a poorly constrained scale can introduce inconsistency in the map over time. Fortunately, as this article discusses, these problems can be solved in a variety of ways, including detecting and correcting for loops in the map, and introducing known scale references.
Scale in SLAM
In a previous article I introduced the concept of Simultaneous Localisation and Mapping (SLAM), in the general sense, and focused on its application to monocular vision (where the only sensing device is a single camera). As well as the difficult computer vision problems involved in SLAM (recognition, tracking, etc), there is another crucial difficulty which must be addressed in monocular visual SLAM: the fact that real world scale cannot be directly observed.
Scale and depth
First, a clarification of what I mean by scale. Scale is the relationship between distances in the map (generally a cloud of 3D points) and the corresponding real-world distances in the scene. If the relationship between distances in the map and in the real world is known, then the map has absolute, or metric, scale; whereas if the distances are all correct relative to each other, but the relationship with the real world is not known, then the map is said to be correct "up to scale".
In monocular SLAM there is no way to directly measure the depth to any single point in the map, because the input is only a 2D image (as opposed to say a stereo pair or depth map). Nevertheless, by integrating measurements from multiple images over time, it is possible to jointly recover the shape of the map and the motion of the camera. However, since the depths of points are not observed directly, the estimated point and camera positions are related to the real positions by a common, unknown scale factor.
Creating a map like this is called 'dimensionless' mapping, because there is no real-world dimension or meaning attached to the distances in the map: one 'map unit' could mean anything. Of course, there must be some way to choose which arbitrary scale to use, and a common choice is to define the distance the camera moves between the first two frames as one map unit. All subsequent estimated distances are then expressed relative to this distance, whatever it is.
This means that, assuming a good mapping algorithm and favourable conditions, everything is correct up to this unobserved scale factor. Consequently, there is no way to know from within the SLAM system whether it is exploring a city-centre environment, say, or a perfect scale model of one. Of course, as we shall see later, it may be possible to make some assumptions to limit the scale, based on the fact that we are probably not in a scale model, for example.
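To make the ambiguity concrete, here is a minimal sketch, assuming an idealised pinhole camera with no rotation and no noise. The function names, the focal length and the point coordinates are all made up for illustration. It shows that scaling the whole scene and the camera positions by the same factor leaves every image measurement unchanged, which is why the camera alone cannot distinguish a city from its scale model.

```python
def project(point, camera_centre, focal_length=500.0):
    """Pinhole projection for an unrotated camera looking along +z.

    focal_length (in pixels) is an arbitrary illustrative value.
    """
    x = point[0] - camera_centre[0]
    y = point[1] - camera_centre[1]
    z = point[2] - camera_centre[2]
    return (focal_length * x / z, focal_length * y / z)


def scale_all(positions, s):
    """Multiply every coordinate of every 3D position by the factor s."""
    return [tuple(s * v for v in p) for p in positions]


def all_projections(points, cameras):
    """Project every point into every camera."""
    return [project(p, c) for c in cameras for p in points]


# A tiny hypothetical scene: three points seen from two camera positions.
points = [(0.5, 0.2, 4.0), (-1.0, 0.3, 6.0), (0.0, -0.5, 5.0)]
cameras = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]

original = all_projections(points, cameras)
scaled = all_projections(scale_all(points, 100.0), scale_all(cameras, 100.0))

# Largest difference between any image coordinate before and after scaling.
max_diff = max(
    abs(a - b) for o, s in zip(original, scaled) for a, b in zip(o, s)
)
```

Here `max_diff` is zero up to floating-point rounding: the two scenes, one a hundred times larger than the other, produce the same images.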
In principle there is no problem with having everything known up to an unknown scale factor, since everything is relative and correct within this reference frame. Unfortunately there is a more subtle, yet extremely important, problem which results from this. Because scale is never observed directly, it is initially arbitrary. Subsequent scales are also arbitrary, but are relative to the scales of previous frames (they are tied together by the fact that tracking and mapping link each frame to the next). Because the relative world scale at one frame is only known in relation to its previous frame, and so on back to the start of the map, any inaccuracy in measurements, or problems in map reconstruction, can cause the scale to gradually change over time.
This is called scale drift, and is a core problem in monocular SLAM. It means that over time, the estimated scale of the map changes, which means real world lengths are expanded or contracted compared to what they should be. Not only are distances between points in the map distorted, but it means the distance travelled by the camera will be similarly affected: this is a problem for localisation as well as mapping. The trajectory will become increasingly different from the ground truth, and loops in the true trajectory will not appear to close.
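The way small errors compound can be illustrated with a toy simulation. This is only a sketch under a strong assumption: that each frame-to-frame scale estimate is off by a small, independent, zero-mean Gaussian factor. Real drift depends on the scene, the tracker and the optimisation, and the function name and noise level here are hypothetical.

```python
import random


def simulate_scale_drift(num_frames, relative_error_std, seed=0):
    """Return the accumulated map scale after each frame.

    Each frame's scale is only known relative to the previous frame, so
    the accumulated scale is the product of all relative estimates.
    The Gaussian error model is an assumption made for illustration.
    """
    rng = random.Random(seed)
    scale = 1.0  # the first frame defines one map unit
    history = [scale]
    for _ in range(num_frames):
        relative_scale = 1.0 + rng.gauss(0.0, relative_error_std)
        scale *= relative_scale
        history.append(scale)
    return history


# With perfect relative estimates the scale never changes...
perfect = simulate_scale_drift(1000, 0.0)
# ...but even a 1% error per frame lets the scale wander away from 1.
drifting = simulate_scale_drift(1000, 0.01)
```

Because the errors multiply rather than average out, nothing in the chain of frames pulls the scale back towards its starting value.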
One common way to deal with the errors accumulated by scale drift - or indeed the accumulation of any inaccuracy - is through a process called loop closure. Loop closure is when two poses in the map are detected to correspond to the same actual location, which means that by fusing them together, the rest of the map can be adjusted to fit. A special mechanism is usually required to achieve this, precisely because after the scale has drifted, it is impossible to tell (through map geometry alone) if the camera has returned to somewhere it has been before. This usually involves some comparison of the current camera image and visual information stored while building the map. When an image is deemed sufficiently similar to somewhere seen before, the loop can be closed, and the map updated accordingly.
Correcting the scale will involve adjusting all of the camera poses (and associated point estimates) along the loop, not just the two endpoints (otherwise there will always be a gap). This can be thought of in terms of the underlying graph structure from which most SLAM systems are built. Before loop closure, the graph is one long chain of poses, with only the links between adjacent pairs of cameras constraining the pose (so any inaccuracy in the links will build up by the end). Loop closure involves adding a new link from one end of the chain to the other, and adjusting all poses along this loop to make the most sense given the new information. For example, in the following map (left), the scale has gradually drifted, meaning the recent parts of the map (red points) are far from their true locations, and the curve near the bottom right is smaller than it should be. However, because a correspondence between poses (i.e. a loop) has been detected (the blue line) the whole map and trajectory can be adjusted to correct for this, resulting in the much more consistent map seen on the right:
An example of using loop closure (blue line, left) to correct the structure and scale of a SLAM map (right). Not only is the gap removed, but the length of the trajectory has been corrected (images are from the ORB-SLAM paper by Mur-Artal et al.).
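The idea of spreading the correction along the whole loop, rather than just moving the endpoint, can be sketched for scale alone. This is a deliberately simplified illustration and not the method of any particular system: real SLAM systems typically optimise a pose graph over full camera poses, whereas here I assume each pose carries a single scale value and that the error is shared evenly (in log-scale) between the links. The function and variable names are hypothetical.

```python
import math


def distribute_scale_correction(scales, measured_end_scale):
    """Spread a loop-closure scale correction evenly along a chain of poses.

    scales: estimated scale at each pose; scales[0] is the reference pose.
    measured_end_scale: the scale the final pose should have according to
        the loop closure (1.0 if it revisits the reference pose exactly).
    The even split in log-scale is an assumption for illustration only.
    """
    num_links = len(scales) - 1
    if num_links < 1:
        raise ValueError("need at least two poses to form a chain")
    log_error = math.log(measured_end_scale) - math.log(scales[-1])
    return [
        s * math.exp(log_error * i / num_links)
        for i, s in enumerate(scales)
    ]


# The scale has drifted upwards by the time the camera returns to the start.
drifted = [1.0, 1.1, 1.2, 1.3]
corrected = distribute_scale_correction(drifted, measured_end_scale=1.0)
```

After correction the first pose is untouched, the last pose agrees with the loop-closure measurement, and the poses in between each absorb a share of the adjustment, so no gap is left at the end of the loop.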
The techniques above are concerned with maintaining a consistent, yet arbitrary, estimate of scale over the course of mapping. Alternatively, real-world scale can be introduced into the mapping process, establishing a relationship between the map units and real distance. The different ways of doing this will achieve different things, depending on whether the scale reference is used just once (in which case drift will still occur over time) or whether the scale information is always present. The following sections will discuss ways of getting the true depth either using special sensors, or by making special assumptions about the environment.
One easy way to get depth, of course, is to use a sensor which directly measures it. This includes modern 'RGBD' (i.e. a red-green-blue camera plus depth) sensors, such as the Microsoft Kinect, which use techniques including structured light and time-of-flight. These sensors produce a dense depth map associated with each camera image. Having access to a reasonably accurate depth map is obviously beneficial for a SLAM system, since it allows 3D information to be acquired directly; but the fact that each depth map will also encode the actual distance means that the scale of the map is always known.
The Microsoft Kinect RGBD camera (left), with its structured-light enhanced infrared image (centre) and resulting depth map (right). This would give a SLAM system direct access to metric 3D depth information (images from Wikipedia).
A more common example of a sensor able to sense depth is of course a stereo camera. A stereo camera consists simply of two cameras separated by a fixed distance; observations of the position of the same 3D point in both cameras allow its depth to be calculated by triangulation. As with RGBD cameras, this provides an instant depth estimate for any (sufficiently recognisable) point in the scene in real-world units.
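For the simplest case, triangulation reduces to a one-line formula. The sketch below assumes an idealised, rectified stereo pair (identical cameras, parallel optical axes, image rows aligned), where depth is the focal length times the baseline divided by the disparity. The function name and the numbers are illustrative only.

```python
def stereo_depth(focal_length_px, baseline_m, disparity_px):
    """Depth of a point seen by a rectified stereo pair, in metres.

    focal_length_px: focal length in pixels (same for both cameras).
    baseline_m: distance between the two camera centres in metres.
    disparity_px: horizontal shift of the point between the two images.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_length_px * baseline_m / disparity_px


# Hypothetical rig: 700 px focal length, 12 cm baseline, 42 px disparity.
depth = stereo_depth(700.0, 0.12, 42.0)
```

The result is in metres because the baseline is given in metres: the known baseline is what carries the real-world scale into the measurement.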
It is interesting to consider the difference between having a stereo camera - which is simply two cameras separated in space - and the situation in monocular SLAM where a single camera takes images from two locations, separated by time. One could argue that a monocular camera (observing a static scene) from two locations separated by some baseline would give exactly the same information as a stereo camera. Indeed, this is the case, assuming the distance the camera has moved is known: if the distance between the two camera positions is known this introduces a real scale reference just as in stereo. The crucial difference is that in general this is not known (and certainly not between every pair of camera positions during mapping), whereas in stereo the length of the fixed baseline is easily measured.
However, having access to such a depth sensor is not always an option (in mobile devices, for example). Fortunately, alternative methods of recovering scale are available. While there is no way to get a real scale directly from the process of monocular mapping, another possibility is to use an external scale reference to introduce real metric scale into the map. This can take the form of a pre-specified object, or set of objects, with known size, which can be recognised during mapping. Their mapped extent can then be associated with their known size, and this gives the map a sense of scale. A good example of this is work by Robert Castle which combined planar object detection with SLAM in order to recognise paintings in an art gallery, then to use the known dimensions of the paintings to set the map scale.
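Associating a mapped extent with a known size amounts to computing a single conversion factor. The following is a minimal sketch, assuming the map is already geometrically consistent and that two mapped points on the reference object (say, two corners of a painting) have a known real separation. The names and coordinates are invented for illustration and are not taken from any of the systems mentioned.

```python
import math


def metric_scale(mapped_a, mapped_b, known_length_m):
    """Metres per map unit, from two mapped points a known distance apart."""
    mapped_length = math.dist(mapped_a, mapped_b)
    if mapped_length == 0:
        raise ValueError("reference points must be distinct")
    return known_length_m / mapped_length


def rescale(points, scale):
    """Convert map-unit coordinates to metres."""
    return [tuple(scale * v for v in p) for p in points]


# Two corners of a reference object, 0.4 map units apart in the map,
# whose real separation is known to be 1 metre.
corner_a = (0.0, 0.0, 2.0)
corner_b = (0.0, 0.4, 2.0)
scale = metric_scale(corner_a, corner_b, known_length_m=1.0)

# Any other mapped distance can now be expressed in metres.
map_points = [(0.0, 0.0, 0.0), (1.2, 0.0, 0.0)]
metric_points = rescale(map_points, scale)
distance_m = math.dist(metric_points[0], metric_points[1])
```

Note that a single reference like this fixes the scale only at the moment it is observed; if the scale drifts afterwards, the conversion factor gradually becomes wrong for newer parts of the map.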
Using SLAM for Measuring
Given the ability of a SLAM system to create basic geometric maps of 3D spaces, the addition of a known scale reference can turn a SLAM system into a powerful measuring tool. Introducing just one known distance into a geometrically consistent map (assuming no further scale drift) allows measurements to be made between any points in 3D space. This would have applications in surveying, interior design, health monitoring, and potentially even crime scene investigation and archaeological research.
We have recently developed a prototype application using a SLAM system in this way, in order to quickly recover the actual measurements of 3D objects, using only a camera and a simple 2D marker as a reference. As the following video shows, the dimensions of a box can be accurately recovered, as compared to a tape measure.
An example of using the Kudan SLAM system, with a known target for scale reference, in an automatic measuring application.
Clearly, scale is an important issue in SLAM, especially in cases where it is not directly observable. Fortunately, through the techniques discussed above - and through a variety of other methods not discussed, such as fusing inertial measurements, global positioning information or recognition of familiar objects - scale can either be measured, estimated, or made consistent even in monocular SLAM. This makes monocular SLAM a potentially very stable and useful technology in a range of applications, especially in the context of mobile phones and tablets, which generally only have a single camera.