RRVPE: A Robust and Real-Time Visual-Inertial-GNSS Pose Estimator for Aerial Robot Navigation

Abstract: Self-localization and orientation estimation are essential capabilities for mobile robot navigation. In this article, a robust and real-time visual-inertial-GNSS (Global Navigation Satellite System) tightly coupled pose estimation (RRVPE) method for aerial robot navigation is presented. The aerial robot carries a front-facing stereo camera for self-localization and an RGB-D camera to generate a 3D voxel map. In addition, a GNSS receiver is used to continuously provide pseudorange, Doppler frequency shift and universal time coordinated (UTC) pulse signals to the pose estimator. The proposed system leverages the Kanade-Lucas algorithm to track Shi-Tomasi features in each video frame, and the local factor graph solution process is bounded in a circumscribed container, which immensely reduces the computational complexity of the nonlinear optimization procedure. The proposed robot pose estimator can achieve camera-rate (30 Hz) performance on the aerial robot companion computer. We thoroughly evaluated the RRVPE system in both simulated and practical circumstances, and the results demonstrate dramatic advantages over state-of-the-art robot pose estimators.


Introduction
Aerial robots will soon play a significant role in industrial inspection, accident warning, and national defense [1][2][3]. For such operations, flight modes dependent on human remote control can no longer meet mission requirements under complex conditions. At present, autonomous navigation ability has become an important indicator of robot intelligence. Obtaining the aerial robot pose in real time is the key difficulty in solving the autonomous robot navigation problem. Fully autonomous navigation requires aerial robots to have accurate pose estimation and robust environmental awareness. Because aerial robots oscillate during flight, the state estimation results are difficult to converge during movement, which leads to the instability of existing pose estimation algorithms.
Compared with aerial robot pose estimators based on a single sensor, multi-sensor fusion pose estimation technologies [4][5][6][7][8] can make full use of different kinds of sensors to obtain more accurate and robust robot pose estimation results. A stereo camera and an inertial measurement unit (IMU) can output the robot position with centimeter-level precision in the local coordinate system [9,10], but the pose in the local frame drifts as the aerial robot moves. The global navigation satellite system (GNSS) has been widely used in various mobile robot navigation tasks to provide drift-free position information for agents [11,12]. However, GNSS-based agent navigation is not suitable indoors, and is vulnerable to noise and multipath effects, resulting in only meter-level positioning accuracy. In order to leverage the complementary characteristics of different sensors, a pose estimation algorithm based on vision-IMU-GNSS information fusion can make full use of the respective advantages of stereo cameras, accelerometers, gyroscopes and navigation satellites to obtain accurate and drift-free aerial robot pose estimation. Unfortunately, a pose estimation strategy based on vision-IMU-GNSS sensor fusion faces the following problems. First, the output frequencies of the sensors differ (the frequencies of the camera, IMU and GNSS receiver are 30, 200, and 10 Hz, respectively), so fusing these raw measurements from different sensors is an intractable problem [13,14]. Second, how can the pose estimator quickly recover to a normal state when one of the sensors suddenly fails?
To deal with the aforementioned problems, we propose a robust and real-time visual-inertial-GNSS tightly coupled pose estimation (RRVPE) method, a probabilistic factor graph optimization-based pose estimation strategy, for aerial robot navigation. The RRVPE system can achieve real-time robot state estimation after compute unified device architecture (CUDA) acceleration on an airborne computer. The main novelties of RRVPE are as follows: 1) The RRVPE system leverages the Kanade-Lucas [15] algorithm to track Shi-Tomasi [16] features in each video frame, and corner describing and matching between adjacent frames are not required in the tracking procedure. After NVIDIA CUDA acceleration, the robot pose estimator can achieve camera-rate (30 Hz) performance on a small embedded platform.
2) The local bundle adjustment (BA) and factor graph solution process are bounded in a circumscribed container, which dramatically reduces the number of variables in the nonlinear optimization procedure. Furthermore, the computational complexity of the RRVPE system is further reduced by an additional marginalization strategy.
3) By making full use of the GNSS raw measurements, the intrinsic drift of the vision-IMU odometry is eliminated, and the yaw angle residual between the odometry frame and the world frame is updated without any offline calibration. The aerial robot pose estimator is able to run rapidly in unpredictable environments and achieves local smoothness and global consistency without visual closed-loop detection.

System Overview
The aerial robot equipped with the RRVPE navigation system is shown in Fig. 1; the system tightly fuses sparse optical flow tracking and inertial measurements with GNSS raw data for precise and drift-free aerial robot pose estimation.
The structure of the RRVPE system is illustrated in Fig. 2. First, the raw sensor data from the aerial robot are preprocessed, including visual feature extraction and tracking, IMU pre-integration, and GNSS signal filtering. Then, the vision, IMU and GNSS cost functions are formulated respectively, and vision-IMU-GNSS information is jointly initialized to obtain all the initial values of the aerial robot pose estimator. Finally, the aerial robot pose solving process is constructed as a state estimation problem. In the meantime, the corresponding probability factor graph model and marginalization strategy are designed. The aerial robot pose is solved by nonlinear optimization, and subsequently an accurate, robust and drift-free robot pose can be achieved.
In the system initialization stage, the camera trajectory is solved by the structure from motion (SfM) algorithm [17][18][19], and the IMU raw measurement data are pre-integrated [20,21] to initialize the vision-IMU tightly coupled robot odometry. Then, the rough robot position in the world coordinate frame is solved by the single point positioning (SPP) algorithm. With the visual-IMU odometry as prior information, the transformation between the odometry coordinate frame and the world coordinate frame is solved by nonlinear optimization. Finally, the precise pose of the aerial robot in the global coordinate system is refined by probability factor graph optimization.
After the estimator initialization, constraints from all sensor measurements are tightly coupled to calculate the aerial robot states within a circumscribed container. If the GNSS broadcast is not available or cannot be fully initialized, the RRVPE system naturally degrades to visual-IMU odometry. In order to maintain the real-time performance of the estimation system, an additional marginalization scheme [6] is also applied after each optimization.
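To keep the sliding window bounded, old states are removed by marginalization rather than simply dropped. A minimal numpy sketch of the underlying Schur-complement step is given below; it is a simplified scalar-block illustration of the general technique, not the actual RRVPE implementation.

```python
import numpy as np

def marginalize(H, b, m):
    """Marginalize the first m variables out of the normal equations
    H x = b via the Schur complement, leaving a prior on the rest."""
    Hmm, Hmr = H[:m, :m], H[:m, m:]
    Hrm, Hrr = H[m:, :m], H[m:, m:]
    Hmm_inv = np.linalg.inv(Hmm)
    H_prior = Hrr - Hrm @ Hmm_inv @ Hmr
    b_prior = b[m:] - Hrm @ Hmm_inv @ b[:m]
    return H_prior, b_prior

# toy system with 3 scalar states; marginalize the oldest one
H = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])
H_prior, b_prior = marginalize(H, b, 1)

# the reduced system gives the same answer for the remaining states
x_full = np.linalg.solve(H, b)
x_marg = np.linalg.solve(H_prior, b_prior)
```

The key property is that the reduced system retains the information the removed states carried about the remaining ones, which is why it acts as a prior in later optimizations.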
We define (·)^r as the robot coordinate system, (·)^c as the camera coordinate system and (·)^o as the odometry frame, where the direction of gravity is aligned with the Z axis. The world coordinate system (·)^w is a semi-global frame, where the X and Y axes point to the east and north directions respectively, and the Z axis is also gravity-aligned. The earth-centered, earth-fixed (ECEF) frame (·)^e and the earth-centered inertial (ECI) frame (·)^E are global coordinate systems fixed with respect to the center of the earth. The difference between the ECEF and ECI frames is that the latter's coordinate axes do not rotate with the earth.
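The relation between an ECEF delta vector and the east-north-up world frame is the standard rotation built from a reference latitude and longitude. The numpy sketch below shows the construction; the test coordinates are hypothetical values chosen for illustration.

```python
import numpy as np

def ecef_to_enu_rotation(lat_rad, lon_rad):
    """Rotation matrix taking an ECEF delta vector into the local
    east-north-up (ENU) frame at the given geodetic latitude/longitude."""
    s_lat, c_lat = np.sin(lat_rad), np.cos(lat_rad)
    s_lon, c_lon = np.sin(lon_rad), np.cos(lon_rad)
    return np.array([
        [-s_lon,          c_lon,         0.0  ],
        [-s_lat * c_lon, -s_lat * s_lon, c_lat],
        [ c_lat * c_lon,  c_lat * s_lon, s_lat],
    ])

# at lat = 0, lon = 0 the ECEF x axis points "up" and the y axis east
R = ecef_to_enu_rotation(0.0, 0.0)
up = R @ np.array([1.0, 0.0, 0.0])     # -> [0, 0, 1]
east = R @ np.array([0.0, 1.0, 0.0])   # -> [1, 0, 0]
```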

Aerial Robot Pose Estimator

Formulation
In this section, the aerial robot pose estimation is formulated as a probabilistic factor graph optimization procedure, and the sensor measurement information constitutes a composite of multifarious factors in the graph, which constrain the aerial robot states. The factors in the probabilistic graph are composed of the visual factor, the inertial factor and the GNSS factor. All of the factors in the factor graph are formulated in detail throughout this section.
We take advantage of a sliding window-based tightly coupled visual-inertial-GNSS pose estimator for exceedingly robust and real-time aerial robot state estimation. The whole state vector χ inside the sliding window can be summarized as:

$$\chi = \left[ x_0,\, x_1,\, \cdots,\, x_n,\, \lambda_0,\, \lambda_1,\, \cdots,\, \lambda_m,\, \psi \right]$$

$$x_k = \left[ p^{o}_{r\,t_k},\; v^{o}_{r\,t_k},\; q^{o}_{r\,t_k},\; b_{\omega\,t_k},\; b_{a\,t_k},\; \delta t,\; \delta t' \right]$$

where x_k is the aerial robot state at the time t_k when the k-th image is captured, which contains the position p^o_{r t_k}, velocity v^o_{r t_k}, orientation q^o_{r t_k}, gyroscope bias b_{ω t_k} and acceleration bias b_{a t_k} of the aerial robot in the odometry frame. δt and δt′ correspond to the clock biases and the clock-bias drifting rate of the GNSS receiver, respectively. n is the window size and m is the total number of visual features in the sliding window. λ_l is the inverse depth of the l-th visual feature. ψ is the yaw bias between the odometry and the world frame.

Visual Constraint
The visual factor constraint in the probabilistic graph is constructed from a sequence of sparse corner points. Considering the unstable vibration of the aerial robot, we extract Shi-Tomasi [16] sparse feature points for Kanade-Lucas-Tomasi (KLT) optical flow tracking [15].
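At its core, one Kanade-Lucas step estimates a patch displacement by solving a small least-squares problem over image gradients. The numpy sketch below illustrates that step on a synthetic patch warped exactly according to the linearized brightness-constancy model; the onboard tracker instead runs pyramidal, CUDA-accelerated KLT over many corners.

```python
import numpy as np

def lk_translation(prev, curr):
    """One Lucas-Kanade step: solve [Ix Iy] d = -It in least squares
    for a single translation d shared by the whole patch."""
    Ix = np.gradient(prev, axis=1)   # horizontal image gradient
    Iy = np.gradient(prev, axis=0)   # vertical image gradient
    It = curr - prev                 # temporal difference
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    d, *_ = np.linalg.lstsq(A, -It.ravel(), rcond=None)
    return d  # estimated (dx, dy)

# smooth random texture, then warp it exactly by the linearized model so
# the recovered flow is known in advance: d = (0.4, -0.25) pixels
rng = np.random.default_rng(0)
prev = rng.standard_normal((32, 32))
for _ in range(3):  # light smoothing so gradients are informative
    prev = 0.25 * (np.roll(prev, 1, 0) + np.roll(prev, -1, 0)
                   + np.roll(prev, 1, 1) + np.roll(prev, -1, 1))
Ix = np.gradient(prev, axis=1)
Iy = np.gradient(prev, axis=0)
curr = prev - 0.4 * Ix + 0.25 * Iy
d = lk_translation(prev, curr)  # recovers (0.4, -0.25)
```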
For each input video frame, when the number of feature points is less than 120, new corner points are extracted to maintain a sufficient number of tracking features. Meanwhile, a uniform feature point distribution is enforced by setting a minimum pixel spacing between neighboring corners. It is worth noting that the corner extraction and KLT optical flow tracking procedures can achieve camera-rate performance on the NVIDIA Jetson Xavier NX board after being accelerated by CUDA. Assume the homogeneous coordinates of the image feature point l in the world coordinate frame are:

$$\bar{P}^{w}_{l} = \left[ x^{w}_{l},\; y^{w}_{l},\; z^{w}_{l},\; 1 \right]^{T}$$

Then the homogeneous coordinates of feature point l in the pixel plane of video frame i can be expressed as:

$$\bar{p}^{c_i}_{l} = \left[ u^{c_i}_{l},\; v^{c_i}_{l},\; 1 \right]^{T}$$

where u and v are coordinate values on the pixel plane.
The projection model of the airborne camera can be expressed as:

$$z^{c_i}_{l}\,\bar{p}^{c_i}_{l} = K\,T^{c_i}_{w}\,\bar{P}^{w}_{l} + n_c$$

where T is the transformation matrix, n_c is the camera imaging noise, and K is the camera intrinsic parameter matrix:

$$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$$

where f and c represent scaling and translation during camera projection, respectively. The elements of the intrinsic parameter matrix can be obtained by the camera calibration process, and the reprojection model of the feature point l from the video frame i to the video frame j can be formulated as:

$$\tilde{P}^{c_j}_{l} = T^{c_j}_{r_{t_j}}\, T^{r_{t_j}}_{o}\, T^{o}_{r_{t_i}}\, T^{r_{t_i}}_{c_i}\, \frac{1}{\lambda^{c_i}_{l}}\, K^{-1}\bar{p}^{c_i}_{l}$$

where λ^{c_i}_l represents the inverse depth of feature point l relative to the airborne camera c_i.
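The projection and inverse-depth reprojection chain can be traced numerically. In the sketch below, the intrinsics, pixel location, inverse depth and relative pose are hypothetical values chosen only to make the arithmetic concrete.

```python
import numpy as np

# hypothetical pinhole intrinsics
K = np.array([[400.0,   0.0, 320.0],
              [  0.0, 400.0, 240.0],
              [  0.0,   0.0,   1.0]])

def back_project(uv, inv_depth, K):
    """Lift a pixel to a 3D point in its camera frame using inverse depth."""
    ray = np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0])
    return ray / inv_depth

def project(P, K):
    """Pinhole projection of a camera-frame point to pixel coordinates."""
    p = K @ P
    return p[:2] / p[2]

# feature seen at (350, 260) in frame i with inverse depth 0.2 (depth 5 m)
P_ci = back_project((350.0, 260.0), 0.2, K)
# hypothetical relative pose of frame j: 0.5 m baseline along the x axis
R_ji = np.eye(3)
t_ji = np.array([-0.5, 0.0, 0.0])
P_cj = R_ji @ P_ci + t_ji
uv_j = project(P_cj, K)  # -> [310, 260]
```

The visual residual then compares this predicted pixel (or the predicted camera-frame point) with the tracked observation in frame j.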
Then the visual factor constraint can be expressed as the deviation between the actual position $\tilde{P}^{c_j}_{l}$ of the image feature point l in the video frame j and the measured position $\hat{\tilde{P}}^{c_j}_{l}$:

$$E_V(\chi_V) = \sum_{(l,\,j)} \left\| \tilde{P}^{c_j}_{l} - \hat{\tilde{P}}^{c_j}_{l} \right\|^2$$

where χ_V represents the sub-vector related to visual information in the aerial robot state vector.

Inertial Measurements Constraint
In the world coordinate frame, the aerial robot's pose and velocity can be obtained from the raw data of the inertial measurement unit, which is measured in the aerial robot body frame. The IMU raw data include two parts: the gyroscope measurement $\hat{\omega}^{r}_{t}$ and the accelerometer measurement $\hat{a}^{r}_{t}$, which are affected by the gyroscope bias b_ω and the acceleration bias b_a, respectively. The raw measurement values of the gyroscope and accelerometer can be modeled as:

$$\hat{\omega}^{r}_{t} = \omega^{r}_{t} + b_{\omega} + n_{\omega}$$

$$\hat{a}^{r}_{t} = R^{r}_{w}\left( a^{w}_{t} + g^{w} \right) + b_{a} + n_{a}$$

where $\hat{\omega}^{r}_{t}$ and $\hat{a}^{r}_{t}$ are the measured values of the gyroscope and accelerometer at time t with the current body coordinate system as the reference frame; b_ω and b_a are the gyroscope and accelerometer biases; n_ω and n_a are the gyroscope and accelerometer noises; g^w is the gravitational acceleration. The gyroscope and accelerometer noises are Gaussian white noise; the gyroscope and accelerometer biases follow Brownian motion, and their derivatives obey a Gaussian distribution.
Assuming that two consecutive video frames are captured at times t_k and t_{k+1}, the orientation (q), velocity (v) and position (p) of the aerial robot at time t_{k+1} in the local world coordinate system can be expressed as:

$$q^{w}_{r_{t_{k+1}}} = q^{w}_{r_{t_k}} \otimes \int_{t\in[t_k,\,t_{k+1}]} \frac{1}{2}\,\Omega\!\left( \hat{\omega}^{r}_{t} - b_{\omega} \right) q^{r_{t_k}}_{r_t}\,\mathrm{d}t$$

$$v^{w}_{r_{t_{k+1}}} = v^{w}_{r_{t_k}} + \int_{t\in[t_k,\,t_{k+1}]} \left( R^{w}_{r_t}\left( \hat{a}^{r}_{t} - b_{a} \right) - g^{w} \right) \mathrm{d}t$$

$$p^{w}_{r_{t_{k+1}}} = p^{w}_{r_{t_k}} + v^{w}_{r_{t_k}}\Delta t_k + \iint_{t\in[t_k,\,t_{k+1}]} \left( R^{w}_{r_t}\left( \hat{a}^{r}_{t} - b_{a} \right) - g^{w} \right) \mathrm{d}t^2$$

where $\hat{\omega}$ and $\hat{a}$ are the measured values from the gyroscope and accelerometer, $\Delta t_k = t_{k+1} - t_k$, and the symbol ⊗ represents quaternion multiplication.
If the reference coordinate system is converted from the local world coordinate system (w) to the robot coordinate system (r), the above formulas can be rewritten in terms of the IMU pre-integration terms:

$$\alpha^{r_{t_k}}_{r_{t_{k+1}}} = \iint_{t\in[t_k,\,t_{k+1}]} R^{r_{t_k}}_{r_t}\left( \hat{a}^{r}_{t} - b_{a} \right) \mathrm{d}t^2$$

$$\beta^{r_{t_k}}_{r_{t_{k+1}}} = \int_{t\in[t_k,\,t_{k+1}]} R^{r_{t_k}}_{r_t}\left( \hat{a}^{r}_{t} - b_{a} \right) \mathrm{d}t$$

$$\gamma^{r_{t_k}}_{r_{t_{k+1}}} = \int_{t\in[t_k,\,t_{k+1}]} \frac{1}{2}\,\Omega\!\left( \hat{\omega}^{r}_{t} - b_{\omega} \right) \gamma^{r_{t_k}}_{r_t}\,\mathrm{d}t$$

Then the first-order Jacobian approximation of the IMU pre-integration terms with respect to the bias changes δb_a and δb_ω can be expressed as:

$$\alpha^{r_{t_k}}_{r_{t_{k+1}}} \approx \hat{\alpha}^{r_{t_k}}_{r_{t_{k+1}}} + J^{\alpha}_{b_a}\delta b_a + J^{\alpha}_{b_\omega}\delta b_\omega, \qquad \beta^{r_{t_k}}_{r_{t_{k+1}}} \approx \hat{\beta}^{r_{t_k}}_{r_{t_{k+1}}} + J^{\beta}_{b_a}\delta b_a + J^{\beta}_{b_\omega}\delta b_\omega$$

where each J denotes a sub-matrix of the Jacobian matrix. When the gyroscope or accelerometer bias changes slightly, this first-order approximation can be used to correct the IMU pre-integration terms without re-integration.
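The pre-integration terms can be computed by iterating over the raw IMU samples between two frames. Below is a minimal Euler-integration sketch in numpy; the real estimator uses midpoint integration and also propagates covariances and bias Jacobians, which are omitted here.

```python
import numpy as np

def quat_mul(q, p):
    """Hamilton quaternion product, q = [w, x, y, z]."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = p
    return np.array([
        w1*w2 - x1*x2 - y1*y2 - z1*z2,
        w1*x2 + x1*w2 + y1*z2 - z1*y2,
        w1*y2 - x1*z2 + y1*w2 + z1*x2,
        w1*z2 + x1*y2 - y1*x2 + z1*w2])

def quat_to_rot(q):
    w, x, y, z = q
    return np.array([
        [1-2*(y*y+z*z), 2*(x*y-w*z),   2*(x*z+w*y)],
        [2*(x*y+w*z),   1-2*(x*x+z*z), 2*(y*z-w*x)],
        [2*(x*z-w*y),   2*(y*z+w*x),   1-2*(x*x+y*y)]])

def preintegrate(gyro, accel, dt, b_w, b_a):
    """Euler pre-integration of bias-corrected IMU samples between two
    frames, expressed in the frame of the first sample."""
    alpha = np.zeros(3)                # position pre-integration
    beta = np.zeros(3)                 # velocity pre-integration
    gamma = np.array([1.0, 0, 0, 0])   # rotation pre-integration
    for w, a in zip(gyro, accel):
        acc = quat_to_rot(gamma) @ (a - b_a)
        alpha = alpha + beta * dt + 0.5 * acc * dt * dt
        beta = beta + acc * dt
        half = 0.5 * (w - b_w) * dt
        gamma = quat_mul(gamma, np.array([1.0, half[0], half[1], half[2]]))
        gamma = gamma / np.linalg.norm(gamma)
    return alpha, beta, gamma

# 200 Hz samples over 0.1 s, constant 1 m/s^2 along x, no rotation
n, dt = 20, 0.005
gyro = [np.zeros(3)] * n
accel = [np.array([1.0, 0.0, 0.0])] * n
alpha, beta, gamma = preintegrate(gyro, accel, dt, np.zeros(3), np.zeros(3))
# beta -> [0.1, 0, 0], alpha -> [0.005, 0, 0], gamma -> identity
```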
The gyroscope factor constraint term is constructed as a rotation residual based on the quaternion product. Simultaneously, the accelerometer factor constraint terms are constructed as the velocity pre-integration residual and the translation pre-integration residual, respectively. The gyroscope bias and accelerometer bias factor terms are obtained from the bias difference between two consecutive video frames. Then the IMU factor constraint can be constructed as:

$$E_I(\chi_I) = \sum_{k} \left\| r_I\!\left( \hat{z}^{\,r_{t_k}}_{r_{t_{k+1}}},\, \chi_I \right) \right\|^2$$

where r_I stacks the translation, velocity, rotation and bias residuals between frames k and k+1, and χ_I represents the sub-vector related to the IMU in the aerial robot state vector.

GNSS Constraint
Currently, there are four complete and independently operated GNSS constellations, namely BeiDou, GPS, Galileo and GLONASS. The navigation satellites in each GNSS constellation ceaselessly broadcast modulated carrier signals, so the ground receiver can distinguish the navigation satellites and demodulate the original messages. The GNSS factor constraint in the probability factor graph model is composed of a pseudorange factor, a Doppler frequency shift factor and a receiver clock offset factor. The pseudorange measurement model between the receiver and the navigation satellite can be expressed as:

$$\hat{P}^{s}_{r} = \left\| p^{E}_{r} - p^{E}_{s} \right\| + c\left( \delta t_r - \delta t^{s} \right) + \Delta t_{tro} + \Delta t_{ion} + \Delta t_{mul} + n_{pr}$$

Here, p^E_r and p^E_s are the positions of the ground receiver and the navigation satellite in the earth-centered inertial (ECI) coordinate system, respectively, where the satellite position accounts for the earth rotation ω_earth t^s_r accumulated during signal propagation. $\hat{P}^{s}_{r}$ is the measured value of the GNSS pseudorange, c is the propagation speed of light in vacuum, δt_r and δt^s are the clock offsets of the receiver and satellite, respectively, Δt_tro and Δt_ion are the delays of the troposphere and ionosphere in the atmosphere, respectively, Δt_mul is the delay caused by the multipath effect, n_pr is the noise of the pseudorange signal, ω_earth is the earth's rotation rate, and t^s_r represents the signal propagation time from the satellite to the receiver.
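As a numerical illustration of the pseudorange model, the sketch below evaluates the residual for a hypothetical receiver-satellite geometry. All values are invented for the example, and the atmospheric delay terms are given directly in meters.

```python
import numpy as np

C = 299792458.0  # speed of light in vacuum, m/s

def pseudorange_residual(p_r, p_s, dt_r, dt_s, T_tro, T_ion, pr_meas):
    """Residual between measured and modeled pseudorange; multipath and
    noise are left to the estimator. Positions are in a common frame."""
    predicted = np.linalg.norm(p_r - p_s) + C * (dt_r - dt_s) + T_tro + T_ion
    return pr_meas - predicted

# hypothetical geometry: satellite directly overhead at ~20,189 km range
p_s = np.array([0.0, 0.0, 26_560_000.0])
p_r = np.array([0.0, 0.0, 6_371_000.0])
res = pseudorange_residual(p_r, p_s, dt_r=1e-7, dt_s=2e-8,
                           T_tro=2.4, T_ion=3.1,
                           pr_meas=20_189_000.0 + C * (1e-7 - 2e-8) + 5.5)
# a measurement consistent with the model gives a (near) zero residual
```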
Then the GNSS pseudorange factor constraint can be constructed as the residual between the modeled pseudorange and the receiver-measured pseudorange:

$$E_{pr}(\chi_{pr}) = \sum_{s} \left\| \hat{P}^{s}_{r} - P^{s}_{r}(\chi_{pr}) \right\|^2$$

where χ_pr represents the sub-vector related to the GNSS pseudorange in the aerial robot state vector. Besides pseudorange, the Doppler frequency shift is also important navigation information in the GNSS modulated signal. The Doppler frequency shift measurement between the GNSS receiver and a navigation satellite can be modeled as:

$$\delta f^{s_i}_{r_{t_k}} = -\frac{1}{\lambda}\,\hat{A}^{s_i}_{r}\cdot\left( v^{E}_{s_i} - v^{E}_{r_{t_k}} \right) + \frac{c}{\lambda}\left( \delta t'_{r_{t_k}} - \delta t'_{s_i} \right) + n_{dp}$$

where λ is the carrier wavelength, $\hat{A}^{s_i}_{r}$ is the unit direction vector from the receiver to the satellite, v^E_{r t_k} and v^E_{s_i} are the velocities of the receiver and the satellite, respectively, and δt′_{r t_k} and δt′_{s_i} are the clock offset drifting rates of the receiver and the satellite, respectively.
Then the GNSS Doppler shift factor constraint can be constructed as the residual between the modeled carrier Doppler shift and the Doppler shift measurement:

$$E_{dp}(\chi_{dp}) = \sum_{s_i} \left\| \hat{\delta f}^{s_i}_{r_{t_k}} - \delta f^{s_i}_{r_{t_k}}(\chi_{dp}) \right\|^2$$

where χ_dp represents the sub-vector related to the GNSS Doppler frequency shift in the agent state vector χ, and $\hat{\delta f}^{s_i}_{r_{t_k}}$ is the Doppler frequency shift measurement value. In addition, the receiver clock offset should evolve with its drifting rate, so the GNSS receiver clock offset factor from t_k to t_{k+1} is constructed as:

$$E_{\tau} = \left\| \delta t_{r_{t_{k+1}}} - \left( \delta t_{r_{t_k}} + \delta t'_{r_{t_k}}\,\Delta t_k \right) \right\|^2$$

By combining the pseudorange factor E_pr, the Doppler frequency shift factor E_dp and the receiver clock offset factor E_τ, the GNSS factor constraint in the aerial robot probability factor graph model is formed.
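The Doppler and receiver-clock factors can likewise be sketched numerically. The sign convention below (negative shift for a receding satellite) and the GPS L1 wavelength are illustrative assumptions; all numerical values are invented for the example.

```python
import numpy as np

C = 299792458.0                        # speed of light, m/s
L1_WAVELENGTH = C / 1_575_420_000.0    # GPS L1 carrier, ~0.19 m

def doppler_residual(v_r, v_s, unit_rs, ddt_r, ddt_s, dopp_meas,
                     lam=L1_WAVELENGTH):
    """Residual of the Doppler model: relative velocity projected on the
    receiver->satellite direction plus clock-drift terms."""
    predicted = (-unit_rs @ (v_s - v_r) + C * (ddt_r - ddt_s)) / lam
    return dopp_meas - predicted

def clock_offset_residual(dt_next, dt_k, ddt_k, delta_t):
    """The receiver clock offset should evolve with its drift rate
    between consecutive epochs t_k and t_{k+1}."""
    return dt_next - (dt_k + ddt_k * delta_t)

# satellite receding straight overhead at 800 m/s, ideal clocks
res_dp = doppler_residual(v_r=np.zeros(3),
                          v_s=np.array([0.0, 0.0, 800.0]),
                          unit_rs=np.array([0.0, 0.0, 1.0]),
                          ddt_r=0.0, ddt_s=0.0,
                          dopp_meas=-800.0 / L1_WAVELENGTH)
# clock offset advancing exactly by drift * dt over a 0.1 s epoch
res_clk = clock_offset_residual(1.0e-7 + 1.0e-10, 1.0e-7, 1.0e-9, 0.1)
```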

Tightly Coupled Pose Estimation
Considering the aerial robot pose solving process as a state estimation problem, the optimal state of the aerial robot is the maximum a posteriori estimation of the robot state vector. Assuming that the measurements of the aerial robot's airborne camera, IMU and GNSS receiver are independent of each other, and that the measurement noise conforms to a zero-mean Gaussian distribution, the maximum a posteriori estimation problem is equivalent to minimizing the sum of squared errors, so the solution process of the aerial robot state vector χ can be expressed as:

$$\chi^{*} = \arg\min_{\chi} \left\{ \left\| e_p - H_p\,\chi \right\|^2 + \sum_{k=0}^{n} E_V(\chi_V) + \sum_{k=0}^{n} E_I(\chi_I) + \sum_{k=0}^{n} E_{pr}(\chi_{pr}) + \sum_{k=0}^{n} E_{dp}(\chi_{dp}) + E_{\tau} \right\}$$

where e_p represents the prior error, the matrix H_p encodes the prior pose information, n is the number of robot state vectors in the sliding window, and E(·) represents each sensor measurement error factor.
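The minimization above is carried out by iterative nonlinear least squares. The toy Gauss-Newton loop below, solving a 2D range-only localization problem rather than the actual RRVPE cost, shows the mechanics of repeatedly solving the linearized normal equations.

```python
import numpy as np

def gauss_newton(residual_fn, jacobian_fn, x0, iters=10):
    """Minimize ||r(x)||^2 by repeatedly solving the linearized
    normal equations J^T J dx = -J^T r."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        r = residual_fn(x)
        J = jacobian_fn(x)
        dx = np.linalg.solve(J.T @ J, -J.T @ r)
        x = x + dx
    return x

# toy problem: recover a 2D position from noiseless range measurements
# to three known anchors (trilateration)
anchors = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
p_true = np.array([3.0, 4.0])
ranges = np.linalg.norm(anchors - p_true, axis=1)

def residual(p):
    return np.linalg.norm(anchors - p, axis=1) - ranges

def jacobian(p):
    diff = p - anchors
    return diff / np.linalg.norm(diff, axis=1, keepdims=True)

p_est = gauss_newton(residual, jacobian, x0=[1.0, 1.0])  # -> (3, 4)
```

In RRVPE the state is the full sliding-window vector χ and every residual block is weighted by its measurement covariance; only the iteration scheme is shown here.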
Finally, by solving the aerial robot state vector χ through probability factor graph optimization, the complete robot pose information is obtained.

Implementation for Aerial Robot Navigation
We chose the NVIDIA Jetson Xavier NX board as the companion computer for aerial robot autonomous navigation. The Intel RealSense T265 binocular camera is employed to provide visual and inertial raw measurement information for aerial robot pose estimation. Simultaneously, the Intel RealSense D435i RGB-D camera provides a 3D point cloud map. The U-Blox ZED-F9P is used as the GNSS receiver module, which continuously provides GNSS pseudorange, Doppler frequency shift and universal time coordinated (UTC) pulse signals to the aerial robot pose estimator. A carbon fiber quadrotor unmanned aerial vehicle (UAV) with a 410 mm wheelbase is used as the carrier of the companion computer and sensors. A Pixhawk4 is chosen as the flight control autopilot, and PX4 is employed as the flight control firmware. Both onboard cameras and the GNSS receiver are connected to the companion computer via USB. The ground station is connected to the Pixhawk4 autopilot and the companion computer through 2.4 GHz WiFi and Ethernet, respectively. The detailed scheme is shown in Fig. 3.

Simulation in Public Dataset
The EuRoC datasets [22] were collected with a binocular fisheye camera (Aptina MT9V034) and a synchronized inertial measurement unit (Analog Devices ADIS16448) carried by a micro aerial robot. The resolution of the binocular camera is 752×480, with a global-shutter exposure mode and a 20 Hz output frequency. The EuRoC datasets contain 11 sequences covering different lighting conditions and environments. We compare the proposed RRVPE with OKVIS [4] and VINS-Fusion [16] on the EuRoC datasets. OKVIS is another nonlinear optimization-based visual-inertial odometry, and VINS-Fusion is a state-of-the-art KLT sparse optical flow-based tightly coupled agent state estimator.
All methods are compared on an NVIDIA Jetson Xavier NX embedded device, as shown in Fig. 4. The NVIDIA Jetson series devices differ from other onboard computers in their GPU module with 384 CUDA cores, which allows the RRVPE system to execute in real time with CUDA parallel acceleration. The comparison of experimental results in terms of root-mean-square error (RMSE) is shown in Table 1, which is verified by the absolute trajectory error (ATE). Figure 5 shows the system consistency in terms of absolute pose error (APE) over time in the sequence MH01. RRVPE will inevitably accumulate some error over time, which is an inherent characteristic of all visual-inertial-based estimators.
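The ATE metric used here first rigidly aligns the estimated trajectory to the ground truth and then takes the RMSE of the remaining position errors. A numpy sketch of this evaluation (rotation-plus-translation alignment only, no scale, on synthetic trajectories) is:

```python
import numpy as np

def ate_rmse(est, gt):
    """Align `est` to `gt` with a rigid transform (Kabsch/Umeyama
    without scale), then return the RMSE of the residual errors."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    E, G = est - mu_e, gt - mu_g
    U, _, Vt = np.linalg.svd(E.T @ G)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])
    R = Vt.T @ S @ U.T               # optimal rotation est -> gt
    t = mu_g - R @ mu_e
    aligned = est @ R.T + t
    return np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1)))

# an estimate differing from ground truth only by a rigid transform
# should yield a (numerically) zero ATE
rng = np.random.default_rng(1)
gt = rng.standard_normal((50, 3))
ang = 0.5
R_true = np.array([[np.cos(ang), -np.sin(ang), 0.0],
                   [np.sin(ang),  np.cos(ang), 0.0],
                   [0.0,          0.0,         1.0]])
est = gt @ R_true.T + np.array([1.0, -2.0, 3.0])
err = ate_rmse(est, gt)
```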

Simulation for Aerial Robot Navigation
Due to the instability of the open-source flight control algorithm, a simulation test is needed before real-world flight, which can effectively avoid an aerial robot crash caused by a program error. We carried out the virtual experiment for aerial robot autonomous navigation in the Gazebo simulator, as shown in Fig. 6. After taking off, the aerial robot leverages a virtual plug-in stereo camera and GNSS raw signals to obtain its spatial position. Meanwhile, a 3D voxel map calculated from a virtual plug-in RGB-D camera is constructed to further capture the transformation between the aerial robot and neighbouring obstructions. When the flight destination is entered manually, the trajectory planner generates a path for the aerial robot motion and sends the desired speed to the flight controller; the robot then gradually approaches the destination while keeping a fixed distance from neighbouring obstructions.

Real-World Aerial Robot Navigation
In order to verify the robustness and practicability of the proposed aerial robot navigation system, we conducted a real-world physical verification similar to the Gazebo simulation test. The visual-inertial sensor used in our real-world test is an Intel RealSense T265 binocular camera. In the meantime, an Intel RealSense D435i RGB-D camera is used to capture the 3D environmental map. In addition, the U-Blox ZED-F9P, a high-precision multi-band receiver with multi-constellation support, is employed as the GNSS receiver. The real-world experiment was conducted on a campus tennis court, where the sky is open and most of the navigation satellites are well tracked. The terrain crossed by the aerial robot is an artificially placed obstruction, as shown in Fig. 7.

Conclusion
In this paper, we proposed RRVPE, a robust and real-time visual-inertial-GNSS tightly coupled pose estimator for aerial robot navigation, which combines KLT sparse optical flow, inertial measurements and GNSS raw signals to estimate the aerial robot state between consecutive images. In the nonlinear optimization phase, visual-inertial-GNSS raw measurements are formulated in a probabilistic factor graph within a small sliding container. The RRVPE system achieves real-time robot state estimation with CUDA acceleration on an airborne computer. The proposed system is evaluated in both simulated and real-world physical experiments, demonstrating clear advantages over state-of-the-art approaches.

Fig. 1 The aerial robot equipped with the RRVPE navigation system

Fig. 2 Main parallel threads of RRVPE system

Fig. 3 The aerial robot implementation scheme

Fig. 4 NVIDIA Jetson Xavier NX embedded device

During flight, the aerial robot can change its route when approaching an obstacle and always keeps a reasonable distance from the neighbouring obstruction. It is worth noting that all flight control commands are generated by the NVIDIA Jetson Xavier NX board, and the flight autopilot does not receive any control signal from the external environment.

Fig. 6 Aerial robot navigation test carried out in the Gazebo simulator