ORB-SLAM Notes

2019-07-25

ORB-SLAM is an excellent feature-basde vSLAM. Current ORB-SLAM supports map construction, map reuse(re-localization) and loop closing. Besides, ORB-SLAM2 supports different kind of vSLAM, like monocular, stereo and RGB-D cameras. ORB-SLAM can achieve real time performance on normal CPU and in the backend of ORB-SLAM, there is bundle adjustment for optimization of the pose-graph.

In mono-SLAM, it will suffer from scale drift and fail in pure rotation. Stereo and RGB-D cameras will solve these issues. Inverse depth can be used in the stereo cameras’ disparity is smaller for long distance. Direct method like LSD or DSO can work well in motion blur and poorly-textured environments. But this kind of direct methods rely heavily on the phtometric consistency of context frames. In rolling shutter and non-lambertian reflectance, the pose graph will suffer larger error.

RGB-D SLAM can be achieved by traditional ICP and NDT, ICP and NDT method introduction can be seen from my later poster.

ORB-SLAM Architecture

The whole architecture shown below:

The front-end is tracking part: feature extraction, feature matching and key-frame select. The back-end has local mapping optimization and management. With loop closing to detect loops and correct accumulated drift with pose-graph optimization. Later is global whole map optimization with BA.

Pre-processing of key-points: stereo keypoints can be represented as (uL, vL, uR). u meanse horizontal direction, v is vertical. The matching is calculated by epipolar geometry with epipolar line search. The RGB-D can also be represented as the stereo format, as the disparity can be $D = uL - uR = \frac {f_x b} {d}$. $f_x$ is the focal length in horizontal direction, b is the virtual stereo camera basedline, d is known from our RGB-D camera. The key points of stereo and RGB-D have divided into close and far category. Close means the distance is within 40 times of baseline. Far means out of 40 times of baseline. Close points can be good for scale, translation and rotation evaluation. The far points are good for rotation information. Initial process of RGB-D and stereo cameras can use the first frame.
BA in mono-SLAM and stereo constrains: In ORB-SLAM, there are three kinds of BA. Motion-BA: optimize the poses of the camera(rotation and translation only), minimize the 3D key points reprojection geometry error. Local-BA: optimize the local map, a co-visible graph with keyframes, points and poses. Full-BA: the first frame is fixed, but the later keyframes and points are optimized together.
Loop Closing and Full-BA: Loop detection is implemented by bag of word model. When Full-BA is running, the new key frames are not inerted, to handle this problem, after BA, the new key frames will be tranformed into the new transformation with co-visibility graph.
Keyframes Insertion: In monocular ORB-SLAM, there is no far and close points, so the frames insertion is oftern. But stereo ORB-SLAM will detect the close feature number lower then t, and insert the new key frames with higher close points c.
Localization mode: the local mapping and loop closing are deactivated. The localization model will match the features in the map. And at the same time, during localization mode, visual odometry is still running, which can combined with feature map matching for better results.

Monocular ORB-SLAM Details

The mono ORB-SLAM is more complex than stereo and RGB-D ORB-SLAM. As there is not intial depth of key points, and there is scale drift problem in mono SLAM. So there is 7DoF to be optimized in mono-SLAM.

Traking: perform motion-only BA. Decide when to insert one key frame.
Local mapping: local BA on covisibility graph and eliminate redundant frames.
Loop closing: detect loop when insering new frame. When loop detected, the sim3 transformation is calculated.

Data representation in ORB-SLAM: map points(3D position, orientation, ORB descriptor to key points, maximum and minimum distance); keyframes(camera pose and its feature of the frames).

Co-visibility graph, essential graph and spanning tree. There will be one edge if two frames have co-oberved map points than a threshold. Spanning tree is with the minimal number of edges. When one new key frame is inserted into the tree, if most of the point are observed, it will be erased. The essential graph contains spanning tree and add part of co-visibility graph.

Map initialization: homography for planar scene; fundamental matrix for non-planar scene

Extract feature of current Frame Fc and reference Frame Fr. Compute fundamendal matrix and homography at the same time. Calculate a score SM for each model. Choose model from above model score. After the selection, there will be 8 motion hypotheses for homography or 4 motion hypotheses for fundamental matrix(With camera intrinsics, we can get essential matrix). If there is no clear winner solution, it will come back to the first step. After choosing it, the full BA is utilized for refine of reconstruction.
Initial Pose Estimation from previous frame: the new estimated of new frame is from constant velocity motion model. Then if not matched well, the search range will be wider.
Tracking loss: when tracking is loss, current ORB will match with each key frames with RANSANC and PnP algorithms.
Track local map: project co-visibility graph and co-visibility graph key frame’s neighbors map points into current frame. Discard outliers with projection position and angles and map point distance.
New key frame decision: more than 20 frames have passes from last global relocalization; local mapping is idel or more than 20 frames have passed last keyframe insertion. Current frame tracks less than 90% points than $K_{ref}$.
Keyframe insertion: update covisibility graph and add new edge to the shared map points with other keyframes. Update bag of words and spanning tree.
Map points filter: must be observed by 3 keyframes. More then 25% localo frames can be visible.
New map points adding: unmatched points will be matched with epipolar constraints.

Loop closing

When new key frame inserts, the keyfram feature will match with all keyframes.

Reference: [1] ORB-SLAM2: an Open-Source SLAM System for Monocular, Stereo and RGB-D Cameras [2] ORB-SLAM: a Versatile and Accurate Monocular SLAM System