Monocular Visual-Inertial State Estimation for Micro Aerial Vehicles

. Autonomous micro aerial vehicles (MAVs) equipped with onboard sensors, are idea platforms for missions in complex and conﬁned environments for its low cost, small size and agile maneuver. Due to the size, power, weight and computation constraints inherent in the ﬁled of MAVs, monocular visual-inertial system that consist of one camera and an inertial measurement (IMU) are the most suitable sensor suit for MAVs. In this paper, we proposed a monocular visual-inertial algorithm for estimating the state of a MAV. Firstly, the Semi-Direct Visual Odometry (SVO) algorithm used as the vision front-end of our framework was modiﬁed so that it can be used for forward-looking camera case. Second, an Error-state Kalman Filter was designed so that it can fuse the output of the SVO and IMU data to estimate the full state of the MAVs. We evaluated the proposed method with EuRoc Dataset and compare the results to the state-of-the-art visual-inertial algorithm, VINS-Mono. Experiments show that our estimator can achieve comparable accurate results.


Introduction
MAVs are ideal platforms for missions such as exploration, inspecting, search and rescue in complex and confined environments due to its low cost, small size and agile maneuver.To achieve these goals, it is essential that the MAVs is capable of autonomous navigation in unknown environments, which include reliable state estimation, control, environment mapping, planning and obstacles avoidance.Among these, state estimation is the first and most critical components for autonomous flight.Despite GPS-based autonomous flight have been successful apply in commercial MAVs, it is unavailable when operate in indoor or tunnel-link environments due to effects like shadowing.
At present, the most common used sensors for state estimation of MAVs are monocular [1][2][3][4], stereo [5][6][7] and RGB-D [8,9] cameras as well as laser scanners [10].However, MAVs usually come with tight size, weight and power constraints, limiting their ability to carry active but heavy sensors such as radars or LiDARs.The dimensions of a small platform also limit its ability to carry stereo or multicamera systems due to insufficient baseline length.
As the platform becomes smaller, a monocular visualinertial navigation system (VINS), consisting of only a low-cost inertial measurement unit (IMU) and a camera, becomes the only viable sensor suite allowing autonomous flights with sufficient environmental awareness [4].
Recent research on VINS for MAVs have yielded a number of significant results [1][2][3][4]11].Most of these existing approach can be classified into filter-based and optimization-based systems.In [1], a parallel tracking and mapping (PTAM) algorithm from the Simultaneous Locole-mail: xiaoyao1103@mails.bjut.edu.cnization and Mapping (SLAM) community is used as frontend of the VINS system which estimate a 6DOF with arbitrarily scaled and drifting camera pose measurement, and an Extended Kalman Filter is used as the back-end which can fuse the output of the PTAM and the IMU information to get a full state metric estimation of the MAVs.In [11], Faessler et al. present a system that enables a monocularvision-based quadrotor to automatically recover from any unknown, initial attitude with significant velocity, such as after loss of visual tracking due to an aggressive maneuver.The images from the camera are processed by means SVO pipeline [12].The visual-odometry pipeline outputs an un-scaled pose that is then fused with the IMU readings in an Extended Kalman Filter framework (Multi Sensor Fusion, MSF [13]) to compute a metric state estimate.This work have been extended in [3].However, SVO is a direct-based method and designed for MAVs with downward-looking camera, which made the system cannot used for forward-looking case.In [2], Shen proposed a optimization-based VINS framework for rotorcraft MAVs.This estimator tightly couples the vision and IMU measurements and allows the MAV to execute trajectories at 2 m/s with roll and pitch angles up to 30 degrees.To the best our knowledge, this is the first tightly-coupled nonlinear optimization-based monocular VINS estimator successful apply for autonomous flight.This work have been extended in their work in [4].However, tightly-coupled nonlinear optimization-based method is more computationintensive.
In this paper, we proposed a monocular visual-inertial algorithm for estimating the state of a micro aerial vehicles (MAV).The SVO algorithm was modified so that it can be used for forward-looking camera case.And an Error-state Kalman filter was designed which can fuse the output of One contribution of this paper is that we modified the original SVO algorithm such that it can be used in the forward-looking camera case.As a second contribution, we formulate a fusion framework based on error-state Kalman filter which can fuse the output of the SVO and the IMU data to get the full state estimate of the MAVs.Compare to the work of [1,14,15], we use the SVO as the front end of the frame work which would less computational demand.And compare to the work of [3], our estimator can be run in the case of forward-looking camera.As a third contribution, we compare our propose algorithm against the state-of-the-art visual-inertial algorithm (Mono-VINVS) and demonstrate its comparable performance.
The rest of this paper is organized as follows: Section 2 presents a general overview of the software architecture.Section 3 describes what we have improve from the original SVO; Section 4 describes the details of the design the error-state Kalman filter.Section 4 consists the experiments to evaluate our proposed approach.Finally, Section 6 concludes the work.

Software Architecture
The Software Architecture of our algorithm is shown in Figure 1.The p, q, v, ω represent the position, attitude, translation velocity, angular velocity of the MAVs respectively.And Cov represents the covariance of p and q.The software consists two components.The first one is the modified SVO, which input is the image scream capture from the camera.The Error-State EKF component fuses the output of the SVO and the IMU data to estimate the full state of the MAVs.

Keyframe Selection
The original version of SVO algorithm was designed for downward-looking case.And a keyframe is selected if the Euclidean distance of the new frame relative to all keyframes exceeds 12% of the average scene depth [12].However, if the camera is facing forwards, the scene depth is very large and no new keyframes are selected.
Generally, a keyframe should be required in there cases: (1) The field of view changes with a threshold.
(2) The camera translate exceeds a threshold.
(3) The environment light condition changes or the camera exposure time changes.Compare to the feature-based method, SVO is more sensitive to the light condition change or exposure time change as it is a direct-based method.However, determining thresholds of the three cases is a problem in reality.As a second analyze of the cases above, we could find that all of the cases would cause tracking features to be lost.According to this idea, in this paper we set the keyframe selection strategy as a simple criteria, that is, a keyframe is selected if the percent of the lost tracking feature respect to the last keyframe exceeds 15%.

Tracking Failure Process
In the original version of SVO tracking pipline, the pose of a new coming frame is initialize through sparse modelbased image alignment.The camera pose relative to the last frame is found through minimizing the photometric error between pixels corresponding to the projected location of the same 3D points [12].And if one frame tracking failure, the system will try to do image alignment with the last keyframe.This would be fine if the frame rate is high enough (>50Hz).Experiments showed that one or two continue frame may be very easily failed for tracking due to focus losing or motion blur.In this case, if the SVO are run on low frame rate (∼ 20 Hz), it would be very difficult to re-localzie the new frame and continue the tracking again.This maybe for several reasons: (1) The image align process dose not wrap the patches used to align for computing speed reasons.(2) Compare to two consecutive frames, the change of field of view or translation between the last keyframe is more larger.(3) If more than two consecutive are failed for tracking, the change of field of view or translation between the last keyframe become larger and larger.As a result, it is more and more difficult to re-localize the new frame.
In our modified version of SVO, the strategy we use is that if one frame is failed for tracking, we would drop out the current frame and keep the last frame unchanged, and after the new frame arrive, we would align with the last frame.As we discuss above, the new coming frame would be more easier to re-localize with this strategy.

Error-State Kalman Filter
The unit quaternion has the lowest dimension of globally nonsigular attitude parameterization and represents the attitude matrix as a homogeneous quadratic function.Due to the quaternion norm constraint, the 4 × 4 quaternion covariance is assumed to have rank three [16].As a result, we cannot simply apply the standard Exetend Kalman filter for the system.In this paper, we use the Error-State Kalman Filter to fuse the IMU data and the output of SVO.

Notations
There are several quaternion conventions.In this paper, the Hamilton convention is used, which is defined as: q = q w + q x i + q y j + q z k, and i 2 = j 2 = k 2 = i jk = −1.The quaternion q ab describes a ratation from frame a to the frame b, and q ac = q ab ⊗ q bc .R ab is the corresponding rotation matrix of q ab .The subscript w, i, v of the position p, quaternion q and velocity v denote the world frame, IMU frame, vision frame, camera frame respectively, such as p wi means the IMU position of the MAV express in the world frame.We assume the IMU frame is coincide with the body frame, which is the general case in reality.The skew symmetric matrix of w is represented as [w] × .The hat of a variable x means its estimate value x.

System Kinematics
We assumed the IMU measurements contain a slowly varying bias b and white Gaussian noise n: Where, a, ω is the real acceleration and angular velocity respectively.The a m , ω m denotes the measurements value.Define the system state as The system kinematics is represented as following differential equations: The nominal-state kinematics is defined as: ḃw = 0 ḃa = 0 λ = 0 (11)

Error State Kinematics
Define q wi = qwi ⊗ δq wi , which means Define error state of the system as The error state kinematics is represented as following differential equations: This can be summarized to the linearized continoustime error state equation with u being the noise vector.Discretize the error state kinematics we can get F d : Where:

Update
Measurement model of the system is described as The measurement error can be define as: Which can be linearized to with Measurement model of rotation: Which yields for the error measurement: MATEC Web of Conferences 139, 00068 (2017)

Experiment Results
We evaluated the proposed algorithm with the Machine Hall 01 of EuRoc Visual-Inertial Dataset.The dataset are collected on-board a Micro Aerial Vehicle (MAV), which contain stereo images(Aptina MT9V034 global shutter, WVGA monochrome, 2 × 20 FPS), synchronized IMU measurements(ADIS16448, angular rate and acceleration, 200 Hz), and accurate ground-truth states.We only use the left camera of the stereo images set.The results do not contain the comparison of the performance difference between the original version of SVO and the our modified SVO as the the original version of SVO cannot run on this dataset.We compare the performance of the proposed algorithm with ground-truth and the VINS-Mono framework [4].
The whole trajectories are shown in Figure 2. Compare to the ground-truth, both of the proposed method and VINS-Mono exist the drift in the trajectories as the time goes on.It is hard to say which one drift more.However, benefit from the SVO, the trajectory of our propose method nearly return to the starting point at the end of flying.In contrast, the result of VINS-Mono show more drift at the end of the trajectory.
Figure 3 shows the translation error of the trajectories in x, y, z axises respectively.It shows that the VINS-Mono achieve more accurate in y axis.However, proposed algorithm shows less drift on z axis.
The attitude errors expressed in yaw, roll, pitch respectively are showed in Figure 4.It is obvious that the proposed method achieve more accuracy with the attitude estimation compare to the results ofvalidity VINS-Mono.
Figure 5 shows the velocity error.It can be seen that Both the method achieve accurate velocity estimating.It  is obvious that there is a abrupt jump in the measurements of our propose methods, this should be caused by tracking failure.However, the vision estimator have successful relocalize the subsequent frames, which shows the effectiveness our modified tracking failure process.

Conclusion
In this paper, we presented a monocular visual-inertial state estimation algorithm based on the modified SVO and error-state Kalman Filter.The proposed method can be used for state estimation of MAVs, the result of which can be used in the higher lever task such as controlling, mapping, planing and trajectory following.The performance of the proposed method was validate on the EuRoc dataset and the results was compared to the ground-truth and the state-of-the-art algorithm, VINS-Mono.However, two component of our framework are work independent at present, which means the estimate result do not feedback to the vision-pipeline for failure detecting or dirft suppression.Feature will investigate to tightly integrate the filter component in the vision-pipeline to get more accurate results.