RRVPE: A Robust and Real-Time Visual-Inertial-GNSS Pose Estimator for Aerial Robot Navigation

Chi ZHANG; Zhong YANG; Hao XU; Luwei LIAO; Tang ZHU; Guotao LI; Xin YANG; Qiuyan ZHANG

doi:10.1051/wujns/2023281020

Volume 28 / No 1 (February 2023)

Wuhan Univ. J. Nat. Sci., 28 1 (2023) 20-28

Open Access

Issue		Wuhan Univ. J. Nat. Sci. Volume 28, Number 1, February 2023


Page(s)		20 - 28
DOI		https://doi.org/10.1051/wujns/2023281020
Published online		17 March 2023

Wuhan University Journal of Natural Sciences, 2023, Vol.28 No.1, 20-28

Computer Science

CLC number: TP 391

RRVPE: A Robust and Real-Time Visual-Inertial-GNSS Pose Estimator for Aerial Robot Navigation

Chi ZHANG¹, Zhong YANG¹,†, Hao XU¹,², Luwei LIAO¹, Tang ZHU¹, Guotao LI¹, Xin YANG¹ and Qiuyan ZHANG³

¹ College of Automation Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, Jiangsu, China
² School of Mathematics & Physics, Anhui University of Technology, Maanshan 243000, Anhui, China
³ Electric Power Research Institute of Guizhou Power Grid Co., Ltd., Guiyang 550002, Guizhou, China

† To whom correspondence should be addressed. E-mail: YangZhong@nuaa.edu.cn

Received: 25 July 2022

Abstract

Self-localization and orientation estimation are essential capabilities for mobile robot navigation. In this article, a robust and real-time visual-inertial-GNSS (Global Navigation Satellite System) tightly coupled pose estimation (RRVPE) method for aerial robot navigation is presented. The aerial robot carries a front-facing stereo camera for self-localization and an RGB-D camera to generate a 3D voxel map. In addition, a GNSS receiver continuously provides pseudorange, Doppler frequency shift and coordinated universal time (UTC) pulse signals to the pose estimator. The proposed system leverages the Kanade-Lucas algorithm to track Shi-Tomasi features in each video frame, and the local factor graph solution process is bounded in a circumscribed container, which greatly reduces the computational complexity of the nonlinear optimization procedure. The proposed robot pose estimator achieves camera-rate (30 Hz) performance on the aerial robot companion computer. We evaluated the RRVPE system in both simulated and real-world environments, and the results demonstrate clear advantages over state-of-the-art robot pose estimators.

Key words: computer vision / visual-inertial-GNSS (Global Navigation Satellite System) pose estimation / real-time autonomous navigation / sensor fusion / robotics

Biography: ZHANG Chi, male, Ph.D. candidate, research direction: robot navigation. E-mail: laozhang@nuaa.edu.cn

Foundation item: Supported by the Guizhou Provincial Science and Technology Projects ([2020]2Y044), the Science and Technology Projects of China Southern Power Grid Co. Ltd. (066600KK52170074), and the National Natural Science Foundation of China (61473144)

© Wuhan University 2023

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

0 Introduction

Aerial robots will soon play a significant role in industrial inspection, accident warning, and national defense^[1-3]. For such operations, flight that depends on human remote control can no longer meet mission requirements under complex conditions. Autonomous navigation ability has therefore become an important indicator of a robot's level of intelligence. Obtaining the aerial robot pose in real time, on which autonomous navigation depends, remains difficult. Fully autonomous navigation requires aerial robots to have accurate pose estimation and robust environmental awareness. Because the aerial robot swings during flight, state estimation results converge with difficulty while the robot is moving, which makes existing pose estimation algorithms unstable.

Compared with aerial robot pose estimators based on a single sensor, multi-sensor fusion pose estimation technologies^[4-8] can make full use of different kinds of sensors to obtain more accurate and robust pose estimation results. A stereo camera and an inertial measurement unit (IMU) can output the robot position with centimeter-level precision in the local coordinate system^[9,10], but the pose in the local frame drifts as the aerial robot moves. The global navigation satellite system (GNSS) has been widely used in mobile robot navigation tasks to provide drift-free position information^[11,12]. However, GNSS-based navigation is not suitable for indoor use and is vulnerable to noise and multipath effects, resulting in only meter-level positioning accuracy. A pose estimation algorithm based on vision-IMU-GNSS information fusion can exploit the complementary characteristics of stereo cameras, accelerometers, gyroscopes and navigation satellites to obtain accurate and drift-free aerial robot pose estimation. Unfortunately, a pose estimation strategy based on vision-IMU-GNSS sensor fusion faces the following problems. First, the output frequencies of the sensors differ (the camera, IMU and GNSS receiver run at 30, 200, and 10 Hz, respectively), so fusing their raw measurements is a difficult problem^[13,14]. Second, the pose estimator must recover quickly to a normal state when one of the sensors suddenly fails.

To deal with the aforementioned problems, we propose a robust and real-time visual-inertial-GNSS tightly coupled pose estimation (RRVPE) method, which is a probabilistic factor graph optimization-based pose estimation strategy, for aerial robot navigation. The RRVPE system can achieve real-time robot state estimation after compute unified device architecture (CUDA) acceleration on an airborne computer. The main novelties of RRVPE are as follows:

1) The RRVPE system leverages the Kanade-Lucas^[15] algorithm to track Shi-Tomasi^[16] features in each video frame, so corner description and matching between adjacent frames are not required in the corner tracking procedure. After NVIDIA CUDA acceleration, the robot pose estimator can achieve camera-rate (30 Hz) performance on a small embedded platform.

2) The local bundle adjustment (BA) and factor graph solution process are bounded in a circumscribed container, which greatly reduces the number of variables in the nonlinear optimization procedure. Furthermore, the computational complexity of the RRVPE system is reduced by an additional marginalization strategy.

3) By making full use of the GNSS raw measurements, the intrinsic drift of the vision-IMU odometry is eliminated, and the yaw angle residual between the odometry frame and the world frame is updated without any offline calibration. The aerial robot pose estimator can be deployed rapidly in unpredictable environments and achieves local smoothness and global consistency without visual loop closure detection.

1 System Overview

The aerial robot equipped with the RRVPE navigation system is shown in Fig. 1, which tightly fuses sparse optical flow tracking and inertial measurements with GNSS raw data for precise and driftless aerial robot pose estimation.

Fig. 1 The aerial robot equipped with the RRVPE navigation system

1. Intel RealSense D435i camera: responsible for building real-world 3D voxel map; 2. Intel RealSense T265 camera: responsible for providing binocular video stream and inertial measurement information; 3. U-Blox ZED-F9P receiver: responsible for receiving GNSS pseudorange, Doppler, ephemeris and time pulse information

The structure of the RRVPE system is illustrated in Fig. 2. First, the raw sensor data from the aerial robot are preprocessed, including visual feature extraction and tracking, IMU pre-integration, and GNSS signal filtering. Then, vision, IMU and GNSS cost functions are formulated respectively, and vision-IMU-GNSS information is jointly initialized to obtain all the initial values of the aerial robot pose estimator. Finally, the aerial robot pose solving process is constructed as a state estimation problem, for which the corresponding probabilistic factor graph model and marginalization strategy are designed. The aerial robot pose is solved by nonlinear optimization, which yields an accurate, robust and drift-free robot pose.

Fig. 2 Main parallel threads of RRVPE system

In the system initialization stage, the camera trajectory is solved by the structure from motion (SfM) algorithm^[17-19], and the IMU raw measurement data are pre-integrated^[20,21] to initialize the vision-IMU tightly coupled robot odometry. Then, the rough robot position in the world coordinate frame is solved by the single point positioning (SPP) algorithm. With the visual-IMU odometry used as prior information, the transformation matrix between the odometry coordinate frame and the world coordinate frame is solved by nonlinear optimization. Finally, the precise pose of the aerial robot in the global coordinate system is refined by probabilistic factor graph optimization.

After the estimator initialization, constraints from all sensor measurements are tightly coupled to calculate aerial robot states within a circumscribed container. If the GNSS broadcast is not available or cannot be entirely initialized, the RRVPE system will naturally degrade to visual-IMU odometry. In order to maintain the real-time performance of the estimation system, the additional marginalization scheme^[6] is also applied after each optimization.

We define (ˑ)^r as the robot coordinate system, (ˑ)^c as the camera coordinate system and (ˑ)^o as the odometry frame, in which the direction of gravity is aligned with the Z axis. The world coordinate system (ˑ)^w is a semi-global frame whose X and Y axes point east and north, respectively, and whose Z axis is also gravity-aligned. The earth-centered, earth-fixed (ECEF) frame (ˑ)^e and the earth-centered inertial (ECI) frame (ˑ)^E are global coordinate systems whose origins are fixed at the center of the earth. The difference between them is that the axes of the ECI frame do not rotate with the earth.
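As an illustration of how the ECEF frame relates to an east/north-aligned world frame, the following minimal sketch rotates an ECEF displacement into east-north-up (ENU) axes. It assumes the world frame is a standard ENU frame anchored at a reference latitude and longitude; the paper only states the axis directions, so the anchoring and the function names here are hypothetical.

```python
import math

def ecef_to_enu_rotation(lat_rad, lon_rad):
    """Rotation matrix that maps ECEF displacement vectors to ENU axes.

    Assumption: the world frame is a standard ENU frame anchored at the
    given geodetic latitude/longitude (not specified in the paper).
    """
    sin_lat, cos_lat = math.sin(lat_rad), math.cos(lat_rad)
    sin_lon, cos_lon = math.sin(lon_rad), math.cos(lon_rad)
    return [
        [-sin_lon, cos_lon, 0.0],                              # east
        [-sin_lat * cos_lon, -sin_lat * sin_lon, cos_lat],     # north
        [cos_lat * cos_lon, cos_lat * sin_lon, sin_lat],       # up
    ]

def ecef_delta_to_enu(delta_ecef, lat_rad, lon_rad):
    """Express an ECEF displacement (metres) in ENU axes."""
    rot = ecef_to_enu_rotation(lat_rad, lon_rad)
    return [sum(rot[i][j] * delta_ecef[j] for j in range(3)) for i in range(3)]
```

At latitude 0 and longitude 0, for example, the ECEF Y axis maps to east and the ECEF Z axis maps to north.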

2 Aerial Robot Pose Estimator

2.1 Formulation

In this section, the aerial robot pose estimation is formulated as a probabilistic factor graph optimization procedure, and sensor measurement information constitutes a composite of multifarious factors in the graph, which constrains the aerial robot states. The factors in the probabilistic graph are composed of visual factor, inertia factor and GNSS factor. All of the factors in the factor graph will be formulated in detail through this section.

We can take advantage of a sliding window-based tightly coupled visual-inertial-GNSS pose estimator for exceedingly robust and real-time aerial robot state estimation. The whole states $χ$ inside the sliding window can be summarized as:

$\begin{array}{l} χ = {[x_{0}, x_{1}, \ldots, x_{n}, λ_{1}, λ_{2}, \ldots, λ_{m}, ψ]}^{T} \\ x_{k} = {[o_{r_{t_{k}}}^{w}, v_{r_{t_{k}}}^{w}, p_{r_{t_{k}}}^{w}, b_{ω_{t_{k}}}, b_{a_{t_{k}}}, δ t, δ t^{'}]}^{T}, k \in [0, n] \end{array}$ (1)

where x_k is the aerial robot state at the time t_k that the k-th video frame is captured. It contains orientation $o_{r_{t_{k}}}^{w}$ , velocity $v_{r_{t_{k}}}^{w}$ , position $p_{r_{t_{k}}}^{w}$ , gyroscope bias $b_{ω_{t_{k}}}$ and acceleration bias $b_{a_{t_{k}}}$ of the aerial robot in the odometry frame. $δ t$ and $δ t^{'}$ correspond to the clock biases and bias drifting rate of the GNSS receiver, respectively. n is the window size and m is the total number of visual features in the sliding window. $λ_{l}$ is the inverse depth of the l-th visual feature. $ψ$ is the yaw bias between the odometry and the world frame.

2.2 Visual Constraint

The visual factor constraint in the probabilistic graph is constructed from a sequence of sparse corner points. Considering the unstable vibration of the aerial robot, we extract Shi-Tomasi^[16] sparse feature points and track them with Kanade-Lucas-Tomasi (KLT) optical flow^[15].

For each input video frame, when the number of tracked feature points falls below 120, new corner points are extracted to maintain a sufficient number of tracking features. Meanwhile, a uniform feature point distribution is enforced by setting a minimum pixel spacing between neighboring corners. It is worth noting that the corner extraction and KLT optical flow tracking procedures achieve camera-rate performance on the NVIDIA Jetson Xavier NX board after CUDA acceleration. Assume the homogeneous coordinates of the image feature point l in the world coordinate frame are:

${\tilde{p}}_{l}^{w} = {[\frac{X_{l}}{Z_{l}}, \frac{Y_{l}}{Z_{l}}, 1]}^{T}$ (2)

Then the homogeneous coordinates of feature point l in the pixel plane of video frame i can be expressed as:

${\tilde{P}}_{l}^{c_{i}} = {[u_{l}^{i}, v_{l}^{i}, 1]}^{T}$ (3)

where u and v are coordinate values on the pixel plane. The projection model of the airborne camera can be expressed as:

${\tilde{P}}_{l}^{c_{i}} = K T_{r}^{c} T_{w}^{r_{i}} {\tilde{p}}_{l}^{w} + n_{c}$ (4)

where T is the transformation matrix, n_c is the camera imaging noise, and K is the camera internal parameter matrix:

$K = [\begin{matrix} f_{x} & 0 & c_{x} \\ 0 & f_{y} & c_{y} \\ 0 & 0 & 1 \end{matrix}]$ (5)

where f and c represent scaling and translation during camera projection, respectively. The elements of the internal parameter matrix can be obtained by the camera calibration process, and the reprojection model of the feature point l from the video frame i to the video frame j can be formulated as:

${\hat{\tilde{P}}}_{l}^{c_{j}} = K T_{r}^{c} T_{w}^{r_{j}} [T_{r_{i}}^{w} T_{c}^{r} K^{- 1} (Z_{l}^{c_{i}} {\tilde{P}}_{l}^{c_{i}})]$ (6)

with

$Z_{l}^{c_{i}} = λ_{l}^{c_{i}} \frac{f_{x} f_{y}}{\sqrt{{f_{x}}^{2} {f_{y}}^{2} + {(u_{l}^{i} - c_{x})}^{2} {f_{y}}^{2} + {(v_{l}^{i} - c_{y})}^{2} {f_{x}}^{2}}}$ (7)

where $λ_{l}^{c_{i}}$ represents the inverse depth of feature point l relative to the airborne camera c_i.

Then the visual factor constraint can be expressed as the deviation between the observed position ${\tilde{P}}_{l}^{c_{j}}$ of the image feature point l in the video frame j and its reprojected position ${\hat{\tilde{P}}}_{l}^{c_{j}}$ from Eq. (6):

$E_{V} ({\hat{Z}}_{l}^{c_{j}}, χ_{V}) = {\tilde{P}}_{l}^{c_{j}} - {\hat{\tilde{P}}}_{l}^{c_{j}}$ (8)

where $χ_{V}$ represents the sub-vector related to visual information in the aerial robot state vector.
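The reprojection chain of Eqs. (4), (6) and (8) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the camera-frame depth of the feature in frame i is taken as a given input rather than computed through Eq. (7), the camera-to-robot extrinsic is folded into a single camera pose, imaging noise is ignored, and all function names are hypothetical.

```python
def mat_vec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def transpose(m):
    return [list(row) for row in zip(*m)]

def back_project(u, v, depth, intrinsics):
    """Pixel (u, v) with known camera-frame depth -> 3D point in the camera frame."""
    fx, fy, cx, cy = intrinsics
    return [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]

def project(p_cam, intrinsics):
    """Pinhole projection with the intrinsic matrix K of Eq. (5)."""
    fx, fy, cx, cy = intrinsics
    return [fx * p_cam[0] / p_cam[2] + cx, fy * p_cam[1] / p_cam[2] + cy]

def reprojection_residual(obs_i, obs_j, depth_i, intrinsics, pose_i, pose_j):
    """Observed pixel in frame j minus the pixel predicted from frame i.

    Each pose is (R, t): camera-to-world rotation and camera position in
    the world frame (assumption: extrinsics already composed in).
    """
    rot_i, t_i = pose_i
    rot_j, t_j = pose_j
    p_ci = back_project(obs_i[0], obs_i[1], depth_i, intrinsics)
    p_w = [a + b for a, b in zip(mat_vec(rot_i, p_ci), t_i)]
    p_cj = mat_vec(transpose(rot_j), [a - b for a, b in zip(p_w, t_j)])
    u_hat, v_hat = project(p_cj, intrinsics)
    return [obs_j[0] - u_hat, obs_j[1] - v_hat]
```

For a feature at the principal point of frame i with depth 2 m, moving the camera 1 m along its X axis shifts the predicted pixel by $-f_{x}/2$, so a matching observation gives a zero residual.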

2.3 Inertial Measurements Constraint

In the world coordinate frame, the aerial robot's pose and velocity can be obtained from the raw data of the inertial measurement unit, which are measured in the aerial robot body frame. The IMU raw data comprise two parts: the gyroscope measurement ${\hat{ω}}^{r_{t}}$ and the accelerometer measurement ${\hat{a}}^{r_{t}}$ , which are affected by the gyroscope bias b_ω and the acceleration bias b_a, respectively. The raw gyroscope and accelerometer measurements can be modeled as:

$\begin{array}{l} {\hat{ω}}^{r_{t}} = ω^{r_{t}} + b_{ω^{r_{t}}} + n_{ω^{r_{t}}} \\ {\hat{a}}^{r_{t}} = a^{r_{t}} + b_{a^{r_{t}}} + n_{a^{r_{t}}} + R_{w}^{r_{t}} g^{w} \end{array}$ (9)

where, symbols ${\hat{ω}}^{r_{t}}$ and ${\hat{a}}^{r_{t}}$ represent the measured values of the gyroscope and accelerometer at time t with the current body coordinate system as the reference system respectively; b_ω and b_a are the gyroscope bias and accelerometer bias; n_ω and n_a are gyroscope noise and accelerometer noise; g^w is the gravitational acceleration. The gyroscope and accelerometer noises are Gaussian white noise; the gyroscope biases and the accelerometer biases obey Brownian motion, and their derivatives obey Gaussian distribution.

Let t_k and t_k+₁ be the capture times of two consecutive video frames. Then the orientation (o), velocity (v) and position (p) of the aerial robot at time t_k+₁ in the local world coordinate system can be expressed by the following formula:

$\begin{array}{l} o^{w_{t_{k + 1}}} = o^{w_{t_{k}}} \otimes \int_{t \in [t_{k}, t_{k + 1}]} \frac{1}{2} Φ ({\hat{ω}}^{r_{t}} - b_{ω^{r_{t}}} - n_{ω^{r_{t}}}) o_{r_{t}}^{r_{t_{k}}} d t \\ v^{w_{t_{k + 1}}} = v^{w_{t_{k}}} + \int_{t \in [t_{k}, t_{k + 1}]} [R_{r_{t}}^{w_{t}} ({\hat{a}}^{r_{t}} - b_{a^{r_{t}}} - n_{a^{r_{t}}}) - g^{w_{t}}] d t \\ p^{w_{t_{k + 1}}} = p^{w_{t_{k}}} + (t_{k + 1} - t_{k}) v^{w_{t_{k}}} + \iint_{t \in [t_{k}, t_{k + 1}]} [R_{r_{t}}^{w_{t}} ({\hat{a}}^{r_{t}} - b_{a^{r_{t}}} - n_{a^{r_{t}}}) - g^{w_{t}}] d t^{2} \end{array}$ (10)

where

$Φ (ω) = [\begin{matrix} - {⌊ ω ⌋}_{\times} & ω \\ - ω^{T} & 0 \end{matrix}], {⌊ ω ⌋}_{\times} = [\begin{matrix} 0 & - ω_{z} & ω_{y} \\ ω_{z} & 0 & - ω_{x} \\ - ω_{y} & ω_{x} & 0 \end{matrix}]$ (11)

In the above formulas, $\hat{ω}$ and $\hat{a}$ are the measured values from the gyroscope and accelerometer, and the symbol $\otimes$ denotes quaternion multiplication.
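To make the role of $Φ(ω)$ concrete, the sketch below builds the matrix of Eq. (11) and uses it for one forward-Euler step of the quaternion kinematics $\dot{q} = \frac{1}{2} Φ(ω) q$. It assumes a vector-first quaternion ordering [x, y, z, w], which is what the block layout of Eq. (11) implies, a bias-corrected and noise-free angular rate, and renormalization after each step; the integration scheme is an illustrative choice rather than the one used in RRVPE.

```python
import math

def phi(omega):
    """4x4 matrix of Eq. (11) for a quaternion stored as [x, y, z, w]."""
    wx, wy, wz = omega
    return [
        [0.0, wz, -wy, wx],
        [-wz, 0.0, wx, wy],
        [wy, -wx, 0.0, wz],
        [-wx, -wy, -wz, 0.0],
    ]

def integrate_orientation(q, omega, dt):
    """One forward-Euler step of q_dot = 0.5 * Phi(omega) * q, then renormalize."""
    p = phi(omega)
    q_dot = [0.5 * sum(p[i][j] * q[j] for j in range(4)) for i in range(4)]
    q_new = [q[i] + q_dot[i] * dt for i in range(4)]
    norm = math.sqrt(sum(c * c for c in q_new))
    return [c / norm for c in q_new]

def integrate_constant_rate(omega, duration, steps):
    """Integrate a constant body rate starting from the identity quaternion."""
    q = [0.0, 0.0, 0.0, 1.0]
    dt = duration / steps
    for _ in range(steps):
        q = integrate_orientation(q, omega, dt)
    return q
```

Integrating a constant rate of π/2 rad/s about the Z axis for one second approaches the quaternion of a 90° yaw rotation as the step size shrinks.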

If the reference coordinate system is converted from the local world coordinate system (w) to the robot coordinate system (r), the above formula can be rewritten as:

$\begin{array}{l} o_{w}^{r_{t_{k}}} \otimes o^{w_{t_{k + 1}}} = α_{r_{t_{k + 1}}}^{r_{t_{k}}} \\ R_{w}^{r_{t_{k}}} v^{w_{t_{k + 1}}} = R_{w}^{r_{t_{k}}} [v^{w_{t_{k}}} - (t_{k + 1} - t_{k}) g^{w}] + β_{r_{t_{k + 1}}}^{r_{t_{k}}} \\ R_{w}^{r_{t_{k}}} p^{w_{t_{k + 1}}} = R_{w}^{r_{t_{k}}} [p^{w_{t_{k}}} + (t_{k + 1} - t_{k}) v^{w_{t_{k}}} - \frac{1}{2} g^{w} (t_{k + 1} - t_{k})^{2}] + γ_{r_{t_{k + 1}}}^{r_{t_{k}}} \end{array}$ (12)

where the IMU pre-integration term can be expressed as:

$\begin{array}{l} α_{r_{t_{k + 1}}}^{r_{t_{k}}} = \int_{t \in [t_{k}, t_{k + 1}]} \frac{1}{2} Φ ({\hat{ω}}^{r_{t}} - b_{ω^{r_{t}}} - n_{ω^{r_{t}}}) α_{r_{t}}^{r_{t_{k}}} d t \\ β_{r_{t_{k + 1}}}^{r_{t_{k}}} = \int_{t \in [t_{k}, t_{k + 1}]} R_{r_{t}}^{r_{t_{k}}} ({\hat{a}}^{r_{t}} - b_{a^{r_{t}}} - n_{a^{r_{t}}}) d t \\ γ_{r_{t_{k + 1}}}^{r_{t_{k}}} = \iint_{t \in [t_{k}, t_{k + 1}]} R_{r_{t}}^{r_{t_{k}}} ({\hat{a}}^{r_{t}} - b_{a^{r_{t}}} - n_{a^{r_{t}}}) d t^{2} \end{array}$ (13)
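In practice the integrals of Eq. (13) are evaluated numerically over the IMU samples that fall between two video frames. The following minimal sketch uses a simple forward-Euler scheme, which is an assumption made for illustration (the paper does not state its discretization). Noise terms are dropped, biases are held constant over the interval, quaternions are stored as [x, y, z, w], and the function names are hypothetical.

```python
import math

def quat_mul(a, b):
    """Hamilton product of two quaternions stored as [x, y, z, w]."""
    ax, ay, az, aw = a
    bx, by, bz, bw = b
    return [
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
        aw * bw - ax * bx - ay * by - az * bz,
    ]

def rotate(q, v):
    """Rotate the 3-vector v by the unit quaternion q."""
    conj = [-q[0], -q[1], -q[2], q[3]]
    return quat_mul(quat_mul(q, [v[0], v[1], v[2], 0.0]), conj)[:3]

def preintegrate(samples, b_gyro, b_acc):
    """Accumulate (alpha, beta, gamma) in the frame of the first sample.

    samples: iterable of (dt, gyro_xyz, accel_xyz) raw IMU readings.
    """
    alpha = [0.0, 0.0, 0.0, 1.0]   # rotation pre-integration (quaternion)
    beta = [0.0, 0.0, 0.0]         # velocity pre-integration
    gamma = [0.0, 0.0, 0.0]        # position pre-integration
    for dt, gyro, acc in samples:
        # Bias-corrected specific force expressed in the frame at t_k.
        a = rotate(alpha, [acc[i] - b_acc[i] for i in range(3)])
        gamma = [gamma[i] + beta[i] * dt + 0.5 * a[i] * dt * dt for i in range(3)]
        beta = [beta[i] + a[i] * dt for i in range(3)]
        # Small-angle quaternion increment from the bias-corrected rate.
        w = [gyro[i] - b_gyro[i] for i in range(3)]
        alpha = quat_mul(alpha, [0.5 * w[0] * dt, 0.5 * w[1] * dt, 0.5 * w[2] * dt, 1.0])
        norm = math.sqrt(sum(c * c for c in alpha))
        alpha = [c / norm for c in alpha]
    return alpha, beta, gamma
```

Because the accumulated terms depend only on the IMU samples and the bias estimates, they do not need to be recomputed when the optimizer changes the pose at t_k.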

Then the first-order Jacobian approximation of the IMU pre-integration term can be expressed by the following formula:

$\begin{matrix} α_{r_{t_{k + 1}}}^{r_{t_{k}}} \approx {\hat{α}}_{r_{t_{k + 1}}}^{r_{t_{k}}} \otimes [\begin{matrix} 1 \\ \frac{1}{2} J_{b_{ω}}^{α} Δ b_{ω_{t_{k}}} \end{matrix}] \\ β_{r_{t_{k + 1}}}^{r_{t_{k}}} \approx {\hat{β}}_{r_{t_{k + 1}}}^{r_{t_{k}}} + J_{b_{a}}^{β} Δ b_{a_{t_{k}}} + J_{b_{ω}}^{β} Δ b_{ω_{t_{k}}} \\ γ_{r_{t_{k + 1}}}^{r_{t_{k}}} \approx {\hat{γ}}_{r_{t_{k + 1}}}^{r_{t_{k}}} + J_{b_{a}}^{γ} Δ b_{a_{t_{k}}} + J_{b_{ω}}^{γ} Δ b_{ω_{t_{k}}} \end{matrix}$ (14)

Here each J term is a sub-matrix of the pre-integration Jacobian matrix. When the gyroscope or accelerometer bias changes, the above first-order approximation can be used to update the IMU pre-integration terms without reintegration.

The gyroscope factor constraint term is constructed as a rotation residual based on quaternion multiplication. The accelerometer factor constraint terms are constructed as a velocity pre-integration residual and a translation pre-integration residual, respectively. The gyroscope bias and accelerometer bias factor terms are obtained from the bias difference between two consecutive video frames. Then the IMU factor constraint can be constructed as follows:

$\begin{array}{l} E_{I} ({\hat{Z}}_{r_{t_{k + 1}}}^{r_{t_{k}}}, χ_{I}) \\ = {[α_{r_{t_{k + 1}}}^{r_{t_{k}}} \otimes {({\hat{α}}_{r_{t_{k + 1}}}^{r_{t_{k}}})}^{- 1}, β_{r_{t_{k + 1}}}^{r_{t_{k}}} - {\hat{β}}_{r_{t_{k + 1}}}^{r_{t_{k}}}, γ_{r_{t_{k + 1}}}^{r_{t_{k}}} - {\hat{γ}}_{r_{t_{k + 1}}}^{r_{t_{k}}}, δ b_{ω}, δ b_{a}]}^{T} \\ = [\begin{matrix} 2 {[o_{w}^{r_{t_{k}}} \otimes o^{w_{t_{k + 1}}} \otimes {({\hat{α}}_{r_{t_{k + 1}}}^{r_{t_{k}}})}^{- 1}]}_{imag} \\ R_{w}^{r_{t_{k}}} [v^{w_{t_{k + 1}}} - v^{w_{t_{k}}} + (t_{k + 1} - t_{k}) g^{w}] - {\hat{β}}_{r_{t_{k + 1}}}^{r_{t_{k}}} \\ R_{w}^{r_{t_{k}}} [p^{w_{t_{k + 1}}} - p^{w_{t_{k}}} - (t_{k + 1} - t_{k}) v^{w_{t_{k}}} + \frac{1}{2} g^{w} (t_{k + 1} - t_{k})^{2}] - {\hat{γ}}_{r_{t_{k + 1}}}^{r_{t_{k}}} \\ b_{ω^{r_{t_{k + 1}}}} - b_{ω^{r_{t_{k}}}} \\ b_{a^{r_{t_{k + 1}}}} - b_{a^{r_{t_{k}}}} \end{matrix}] \end{array}$ (15)

where $χ_{I}$ represents the sub-vector related to IMU in the aerial robot state vector.
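The five rows of Eq. (15) can be evaluated directly from two consecutive states and the pre-integrated measurements. The sketch below is an illustrative reading of the equation under stated assumptions: quaternions are stored as [x, y, z, w] and map robot coordinates to world coordinates, covariance weighting is omitted, and the dictionary keys and function names are hypothetical.

```python
def quat_mul(a, b):
    """Hamilton product of two quaternions stored as [x, y, z, w]."""
    ax, ay, az, aw = a
    bx, by, bz, bw = b
    return [
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
        aw * bw - ax * bx - ay * by - az * bz,
    ]

def quat_conj(q):
    return [-q[0], -q[1], -q[2], q[3]]

def rotate(q, v):
    """Rotate the 3-vector v by the unit quaternion q."""
    return quat_mul(quat_mul(q, [v[0], v[1], v[2], 0.0]), quat_conj(q))[:3]

def imu_residual(s_k, s_k1, preint, dt, gravity):
    """15-dimensional residual of Eq. (15), without covariance weighting.

    s_k, s_k1: dicts with keys o (quaternion), v, p, bg, ba.
    preint: (alpha_hat, beta_hat, gamma_hat) measured between the frames.
    """
    alpha_hat, beta_hat, gamma_hat = preint
    o_wr = quat_conj(s_k["o"])  # world -> robot rotation at t_k
    dq = quat_mul(quat_mul(o_wr, s_k1["o"]), quat_conj(alpha_hat))
    r_rot = [2.0 * c for c in dq[:3]]
    dv = [s_k1["v"][i] - s_k["v"][i] + dt * gravity[i] for i in range(3)]
    r_vel = [a - b for a, b in zip(rotate(o_wr, dv), beta_hat)]
    dp = [s_k1["p"][i] - s_k["p"][i] - dt * s_k["v"][i]
          + 0.5 * gravity[i] * dt * dt for i in range(3)]
    r_pos = [a - b for a, b in zip(rotate(o_wr, dp), gamma_hat)]
    r_bg = [s_k1["bg"][i] - s_k["bg"][i] for i in range(3)]
    r_ba = [s_k1["ba"][i] - s_k["ba"][i] for i in range(3)]
    return r_rot + r_vel + r_pos + r_bg + r_ba
```

When the two states agree exactly with the pre-integrated measurements, every entry of the residual is zero; any disagreement in velocity, for instance, shows up only in the second block of three entries.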

2.4 GNSS Constraint

Currently, there are four complete and independently operated GNSS constellations, namely BeiDou, GPS, Galileo and GLONASS. The navigation satellites in each GNSS constellation continuously broadcast modulated carrier signals, from which the ground receiver can distinguish the navigation satellites and demodulate the original messages. The GNSS factor constraint in the probabilistic factor graph model is composed of a pseudorange factor, a Doppler frequency shift factor and a receiver clock offset factor. The pseudorange measurement model between the receiver and the navigation satellite can be expressed as:

${\hat{P}}_{r}^{s} = ‖ p_{r}^{E} - p_{s}^{E} ‖ + c (δ t_{r} + δ t_{s} + Δ t_{tro} + Δ t_{ion} + Δ t_{mul}) + n_{pr}$ (16)

with

$\begin{matrix} p_{r_{t_{k}}}^{E} = R (ω_{earth} t_{r}^{s}) p_{r_{t_{k}}}^{e} \\ p_{s_{i}}^{E} = R (ω_{earth} t_{r}^{s}) p_{s_{i}}^{e} \end{matrix}$ (17)

Here, $p_{r}^{E}$ and $p_{s}^{E}$ are the positions of the ground receiver and the navigation satellite in the earth-centered inertial (ECI) coordinate system, respectively. ${\hat{P}}_{r}^{s}$ is the measured GNSS pseudorange, c is the speed of light in vacuum, $δ t_{r}$ and $δ t_{s}$ are the clock offsets of the receiver and the satellite, respectively, $Δ t_{tro}$ and $Δ t_{ion}$ are the tropospheric and ionospheric delays, respectively, $Δ t_{mul}$ is the delay caused by the multipath effect, $n_{pr}$ is the noise of the pseudorange signal, $ω_{earth}$ is the earth's rotation rate, and $t_{r}^{s}$ is the signal propagation time from the satellite to the receiver.

Then the GNSS pseudorange factor constraint can be constructed as the residual between the pseudorange predicted by the measurement model and the pseudorange measured by the receiver:

$\begin{array}{l} E_{pr} ({\hat{Z}}_{r_{t_{k}}}^{s_{i}}, χ_{pr}) \\ = ‖ p_{r_{t_{k}}}^{E} - p_{s_{i}}^{E} ‖ + c (δ t_{r_{t_{k}}} + δ t_{s_{i}} + Δ t_{tro} + Δ t_{ion} + Δ t_{mul}) - {\hat{P}}_{r_{t_{k}}}^{s_{i}} \end{array}$ (18)

where $χ_{pr}$ represents the sub-vector related to the GNSS pseudorange in the aerial robot state vector.
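A direct evaluation of the pseudorange residual in Eq. (18) might look like the sketch below. It takes the receiver and satellite positions already expressed in the ECI frame, so the rotation of Eq. (17) is assumed to have been applied beforehand, and it keeps the signs exactly as printed in Eq. (18). The function and parameter names are hypothetical.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s, exact by definition

def pseudorange_residual(p_rcv_eci, p_sat_eci, dt_rcv, dt_sat, measured,
                         d_tro=0.0, d_ion=0.0, d_mul=0.0):
    """Modeled pseudorange minus measured pseudorange, in metres.

    Positions are in metres (ECI); clock offsets and delays are in seconds.
    Signs follow Eq. (18) as printed in the text.
    """
    geometric_range = math.dist(p_rcv_eci, p_sat_eci)
    modeled = geometric_range + SPEED_OF_LIGHT * (
        dt_rcv + dt_sat + d_tro + d_ion + d_mul)
    return modeled - measured
```

A receiver clock offset of one microsecond alone changes the modeled pseudorange by roughly 300 m, which is why the clock terms are estimated in the state vector rather than ignored.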

Besides the pseudorange, the Doppler frequency shift is another important piece of navigation information in the GNSS modulated signal. The Doppler frequency shift measurement between the GNSS receiver and the navigation satellite can be modeled as:

$δ {\hat{f}}_{r}^{s} = - \frac{1}{λ} [I_{r}^{s} (v_{r}^{E} - v_{s}^{E}) + c (δ {t_{r}}^{'} + δ {t_{s}}^{'})] + n_{dp}$ (19)

with

$\begin{matrix} v_{r_{t_{k}}}^{E} = R (ω_{earth} t_{r}^{s}) v_{r_{t_{k}}}^{e} \\ v_{s_{i}}^{E} = R (ω_{earth} t_{r}^{s}) v_{s_{i}}^{e} \end{matrix}$ (20)

where $λ$ is the carrier wavelength, $I_{r}^{s}$ is the direction vector between the satellite and the receiver, $v_{r_{t_{k}}}^{E}$ and $v_{s_{i}}^{E}$ are the velocities of the receiver and the satellite, respectively, and $δ {t_{r_{t_{k}}}}^{'}$ and $δ {t_{s_{i}}}^{'}$ are the clock offset drift rates of the receiver and the satellite, respectively.

Then the GNSS Doppler shift factor constraint can be constructed as the residual between the carrier Doppler shift predicted by the measurement model and the measured Doppler shift:

$E_{dp} ({\hat{Z}}_{r_{t_{k}}}^{s_{i}}, χ_{dp}) = - \frac{1}{λ} [I_{r_{t_{k}}}^{s_{i}} (v_{r_{t_{k}}}^{E} - v_{s_{i}}^{E}) + c (δ {t_{r_{t_{k}}}}^{'} + δ {t_{s_{i}}}^{'})] - δ {\hat{f}}_{r_{t_{k}}}^{s_{i}}$ (21)

where $χ_{dp}$ represents the sub-vector related to the GNSS Doppler frequency shift in the agent state vector $χ$ , and $δ {\hat{f}}_{r_{t_{k}}}^{s_{i}}$ is the measured Doppler frequency shift.

Now, the GNSS receiver clock offset error from t_k-₁ to t_k is constructed as follows:
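The Doppler residual of Eq. (21) is simple to evaluate once the line-of-sight unit vector is known. The sketch below takes that vector as an input because the text does not fix its direction convention, treats all velocities as already expressed in the ECI frame, and keeps the signs as printed in Eq. (21). The function and parameter names are hypothetical.

```python
SPEED_OF_LIGHT = 299_792_458.0  # m/s, exact by definition

def doppler_residual(los_unit, v_rcv, v_sat, drift_rcv, drift_sat,
                     wavelength, measured_shift):
    """Modeled Doppler shift minus measured Doppler shift, in Hz.

    los_unit: unit direction vector between satellite and receiver.
    v_rcv, v_sat: velocities in m/s (ECI); drifts are dimensionless (s/s).
    wavelength: carrier wavelength in metres.
    """
    radial_speed = sum(
        l * (vr - vs) for l, vr, vs in zip(los_unit, v_rcv, v_sat))
    modeled = -(radial_speed
                + SPEED_OF_LIGHT * (drift_rcv + drift_sat)) / wavelength
    return modeled - measured_shift
```

With a 0.2 m wavelength, for example, a relative speed of 10 m/s along the line of sight corresponds to a modeled shift of magnitude 50 Hz.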

$E_{τ} ({\hat{Z}}_{k - 1}^{k}, χ_{τ}) = δ t_{r_{t_{k}}} - δ t_{r_{t_{k - 1}}} - (t_{k} - t_{k - 1}) δ {t_{r_{t_{k - 1}}}}^{'}$ (22)

By combining the pseudorange factor $E_{pr}$ , the Doppler frequency shift factor $E_{dp}$ and the receiver clock offset factor $E_{τ}$ , the GNSS factor constraint term in the aerial robot probabilistic factor graph model can be formed.
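The clock offset factor of Eq. (22) ties consecutive receiver clock offsets together through the drift rate. A minimal sketch, with hypothetical function and parameter names, is:

```python
def clock_residual(offset_prev, offset_curr, drift_prev, t_prev, t_curr):
    """Residual of Eq. (22): deviation from a constant-drift clock model.

    Offsets are in seconds, the drift rate is in s/s, times are in seconds.
    """
    return offset_curr - offset_prev - (t_curr - t_prev) * drift_prev
```

The residual is zero when the receiver clock offset evolves linearly with the previously estimated drift rate between the two epochs, and grows with any unexplained jump in the offset.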

2.5 Tightly Coupled Pose Estimation

Considering the aerial robot pose solving process as a state estimation problem, the optimal state of the aerial robot is the maximum a posteriori estimate of the robot state vector. Assuming that the measurements of the airborne camera, IMU, and GNSS receiver are independent of each other and that the measurement noise follows a zero-mean Gaussian distribution, the maximum a posteriori estimation problem is equivalent to minimizing the sum of squared errors. The solution process for the aerial robot's state vector $χ$ can then be expressed as:

$\begin{array}{l} χ = \underset{χ}{\arg\max} P (χ | z) \\ = \underset{χ}{\arg\min} ({‖ e_{p} - H_{p} χ ‖}^{2} + \sum_{k = 1}^{n} {‖ E (z_{k}, χ) ‖}^{2}) \end{array}$ (23)

where z is the aerial robot linear observation model, $e_{p}$ represents the prior error, the matrix $H_{p}$ is the prior pose information obtained by the airborne camera, n is the number of robot state vectors in the sliding window, and E(·) represents the sum of all sensor measurement error factors.

Finally, by solving the aerial robot state vector $χ$ by means of probability factor graph optimization, the complete robot pose information can be obtained.
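The structure of the cost in Eq. (23), a prior term plus a sum of squared measurement residuals, can be illustrated with a deliberately tiny toy problem. The sketch below assumes a scalar state and linear residuals of the form $E(z_{k}, x) = z_{k} - x$, neither of which reflects the real estimator, so that the minimizer has a closed form. All names are hypothetical.

```python
def map_cost(x, e_p, h_p, measurements):
    """Prior term plus sum of squared residuals for a scalar state x."""
    prior = (e_p - h_p * x) ** 2
    return prior + sum((z - x) ** 2 for z in measurements)

def solve_scalar_map(e_p, h_p, measurements):
    """Closed-form minimizer, obtained by setting d(cost)/dx = 0."""
    return (h_p * e_p + sum(measurements)) / (h_p * h_p + len(measurements))
```

With a prior pulling toward 0 and three measurements of 3.0, the estimate lands between the two, closer to the measurements because there are more of them. The real problem is nonlinear and high-dimensional, so it is solved iteratively rather than in closed form.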

3 Experiments

3.1 Implementation for Aerial Robot Navigation

We chose the NVIDIA Jetson Xavier NX board as the companion computer for aerial robot autonomous navigation. The Intel RealSense T265 binocular camera provides visual and inertial raw measurements for the aerial robot pose estimation, while the Intel RealSense D435i RGB-D camera provides the 3D point cloud map. The U-Blox ZED-F9P is used as the GNSS receiver module, which continuously provides GNSS pseudorange, Doppler frequency shift and coordinated universal time (UTC) pulse signals to the aerial robot pose estimator. A carbon fiber quadrotor unmanned aerial vehicle (UAV) with a 410 mm wheelbase serves as the carrier of the companion computer and sensors. Pixhawk4 is chosen as the flight controller, with PX4 as the flight control firmware. Both onboard cameras and the GNSS receiver are connected to the companion computer via USB. The ground station is connected to the Pixhawk4 flight controller and the companion computer through 2.4 GHz Wi-Fi and Ethernet, respectively. The detailed description is shown in Fig. 3.

Fig. 3 The aerial robot implementation scheme

3.2 Simulation in Public Dataset

The EuRoC datasets^[22] were collected with a binocular fisheye camera (Aptina MT9V034) and a synchronized inertial measurement unit (Analog Devices ADIS 16448) carried by a micro aerial robot. The resolution of the binocular camera is 752×480, and the camera uses a global shutter with a 20 Hz output frequency. The EuRoC datasets contain 11 sequences, which cover different lighting conditions and environments. We compare the proposed RRVPE with OKVIS^[4] and VINS-Fusion^[16] on the EuRoC datasets. OKVIS is a nonlinear optimization-based visual-inertial odometry, and VINS-Fusion is a state-of-the-art tightly coupled state estimator based on KLT sparse optical flow tracking.

All methods are compared on an NVIDIA Jetson Xavier NX embedded device, as shown in Fig. 4. The NVIDIA Jetson series devices differ from other onboard computers in that they include a GPU module with 384 CUDA cores, which allows the RRVPE system to execute in real time with CUDA parallel acceleration. Table 1 compares the root-mean-square error (RMSE) of the absolute trajectory error (ATE) for each method. Figure 5 shows how the absolute pose error (APE) evolves over time in the sequence MH01. RRVPE inevitably accumulates some error over time, which is an inherent characteristic of all vision-based robot state estimators. Fortunately, due to the local bundle adjustment, the accumulated error of the RRVPE system always stays within a reasonable range. The experimental results show that, on the NVIDIA Jetson Xavier NX embedded companion computer, the RRVPE system achieves favorable accuracy compared with other state-of-the-art state estimators while running in real time.
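For reference, the RMSE of the absolute trajectory error is commonly computed from time-associated position pairs. The sketch below assumes the estimated trajectory has already been time-synchronized with and aligned to the ground truth, a step that evaluation tools normally perform first and that is not shown here. The function name is hypothetical.

```python
import math

def ate_rmse(estimated, ground_truth):
    """RMSE of position errors between two aligned trajectories (metres).

    Both arguments are equal-length sequences of (x, y, z) positions that
    are assumed to be already time-associated and spatially aligned.
    """
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared = [math.dist(e, g) ** 2 for e, g in zip(estimated, ground_truth)]
    return math.sqrt(sum(squared) / len(squared))
```

Because the errors are squared before averaging, a few large deviations raise the RMSE more than many small ones.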

Fig. 4 NVIDIA Jetson Xavier NX

Fig. 5 The change of APE as time goes on in sequence MH01

Table 1 Performance comparison in the EuRoC datasets on RMSE (unit: m)

3.3 Simulation for Aerial Robot Navigation

Due to the instability of the open-source flight control algorithm, a simulation test is needed before real-world flight, which can effectively avoid an aerial robot crash caused by a program error. We carried out the virtual experiment for aerial robot autonomous navigation in the Gazebo simulator, as shown in Fig. 6. After taking off, the aerial robot uses a virtual plug-in stereo camera and GNSS raw signals to obtain its spatial position. Meanwhile, a 3D voxel map built from a virtual plug-in RGB-D camera is used to capture the transformation between the aerial robot and nearby obstacles. When the flight destination is entered manually, the trajectory planner generates a path for the aerial robot and sends the desired speed to the flight controller. The robot then gradually approaches the destination while keeping a fixed distance from nearby obstacles.

Fig. 6 Aerial robot navigation test carried out in the Gazebo simulator

3.4 Real-World Aerial Robot Navigation

In order to verify the robustness and practicability of the proposed aerial robot navigation system, we conducted a real-world flight test similar to the Gazebo test. The visual-inertial sensor used in our real-world test is an Intel RealSense T265 binocular camera, and an Intel RealSense D435i RGB-D camera is used to capture the 3D environmental map. In addition, the U-Blox ZED-F9P, a high-precision multiband receiver with multi-constellation support, is employed as the GNSS receiver. The real-world experiment was conducted on a campus tennis court, where the sky is open and most of the navigation satellites are well tracked. The aerial robot flies through a set of arbitrarily placed artificial obstacles, as shown in Fig. 7.

Fig. 7 The real-world navigation environment on a campus tennis court

During flight, the aerial robot can change its route when approaching an obstacle and always keeps a reasonable distance from nearby obstacles. It is worth noting that all flight control commands are generated by the NVIDIA Jetson Xavier NX board, and the flight autopilot does not receive any control signal from an external source.

4 Conclusion

In this paper, we proposed RRVPE, a robust and real-time visual-inertial-GNSS tightly coupled pose estimator for aerial robot navigation, which combines KLT sparse optical flow, inertial measurements and GNSS raw signals to estimate the aerial robot state between consecutive images. In the nonlinear optimization phase, visual-inertial-GNSS raw measurements are formulated as a probabilistic factor graph in a small sliding window. The RRVPE system can achieve real-time robot state estimation with CUDA acceleration on an airborne computer. The proposed system is evaluated using both simulated and real-world physical experiments, demonstrating clear advantages over state-of-the-art approaches.

References

[1] Lei L, Li Z H, Yang H, et al. Extraction of the leaf area density of maize using UAV-LiDAR data [J]. Geomatics and Information Science of Wuhan University, 2021, 46(11): 1737-1745(Ch).
[2] Chen J J, Li S, Liu D H, et al. AiRobSim: Simulating a multisensor aerial robot for urban search and rescue operation and training [J]. Sensors (Basel, Switzerland), 2020, 20(18): 5223.
[3] Tabib W, Goel K, Yao J, et al. Autonomous cave surveying with an aerial robot [EB/OL]. [2022-06-25]. https://arxiv.org/abs/2003.13883.
[4] Geneva P, Eckenhoff K, Lee W, et al. OpenVINS: A research platform for visual-inertial estimation [C]// 2020 IEEE International Conference on Robotics and Automation (ICRA). New York: IEEE, 2020: 4666-4672.
[5] Paul M K, Roumeliotis S I. Alternating-stereo VINS: Observability analysis and performance evaluation [C]// 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2018: 4729-4737.
[6] Qin T, Li P L, Shen S J. VINS-mono: A robust and versatile monocular visual-inertial state estimator [J]. IEEE Transactions on Robotics, 2018, 34(4): 1004-1020.
[7] Rosinol A, Abate M, Chang Y, et al. Kimera: An open-source library for real-time metric-semantic localization and mapping [C]// 2020 IEEE International Conference on Robotics and Automation (ICRA). New York: IEEE, 2020: 1689-1696.
[8] Campos C, Elvira R, Rodríguez J J G, et al. ORB-SLAM3: An accurate open-source library for visual, visual-inertial, and multimap SLAM [J]. IEEE Transactions on Robotics, 2021, 37(6): 1874-1890.
[9] Mur-Artal R, Montiel J M M, Tardós J D. ORB-SLAM: A versatile and accurate monocular SLAM system [J]. IEEE Transactions on Robotics, 2015, 31(5): 1147-1163.
[10] Mur-Artal R, Tardós J D. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras [J]. IEEE Transactions on Robotics, 2017, 33(5): 1255-1262.
[11] Li T, Zhang H P, Gao Z Z, et al. Tight fusion of a monocular camera, MEMS-IMU, and single-frequency multi-GNSS RTK for precise navigation in GNSS-challenged environments [J]. Remote Sensing, 2019, 11(6): 610.
[12] Cao S Z, Lu X Y, Shen S J. GVINS: Tightly coupled GNSS-visual-inertial fusion for smooth and consistent state estimation [J]. IEEE Transactions on Robotics, 2022, 38(4): 2004-2021.
[13] Zhang C, Yang Z, Fang Q H, et al. FRL-SLAM: A fast, robust and lightweight SLAM system for quadruped robot navigation [C]// 2021 IEEE International Conference on Robotics and Biomimetics (ROBIO). New York: IEEE, 2022: 1165-1170.
[14] Zhang C, Yang Z, Liao L W, et al. RPEOD: A real-time pose estimation and object detection system for aerial robot target tracking [J]. Machines, 2022, 10(3): 181.
[15] Lucas B D, Kanade T. An iterative image registration technique with an application to stereo vision [C]// Proceedings of the 7th International Joint Conference on Artificial Intelligence. New York: ACM, 1981: 674-679.
[16] Shi J B, Tomasi C. Good features to track [C]// 1994 Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2002: 593-600.
[17] Leutenegger S, Lynen S, Bosse M, et al. Keyframe-based visual-inertial odometry using nonlinear optimization [J]. The International Journal of Robotics Research, 2015, 34(3): 314-334.
[18] Forster C, Pizzoli M, Scaramuzza D. SVO: Fast semi-direct monocular visual odometry [C]// 2014 IEEE International Conference on Robotics and Automation (ICRA). New York: IEEE, 2014: 15-22.
[19] Engel J, Schöps T, Cremers D. LSD-SLAM: Large-scale direct monocular SLAM [C]// Proceedings of the European Conference on Computer Vision (ECCV). Berlin: Springer, 2014: 834-849.
[20] Qin T, Li P L, Shen S J. Relocalization, global optimization and map merging for monocular visual-inertial SLAM [C]// 2018 IEEE International Conference on Robotics and Automation (ICRA). New York: IEEE, 2018: 1197-1204.
[21] Qin T, Shen S J. Robust initialization of monocular visual-inertial estimation on aerial robots [C]// 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). New York: IEEE, 2017: 4225-4232.
[22] Burri M, Nikolic J, Gohl P, et al. The EuRoC micro aerial vehicle datasets [J]. The International Journal of Robotics Research, 2016, 35(10): 1157-1163.


[1] Lei L, Li Z H, Yang H, et al. Extraction of the leaf area density of maize using UAV-LiDAR data [J]. Geomatics and Information Science of Wuhan University, 2021, 46(11): 1737-1745(Ch). [Google Scholar]

[2] Chen J J, Li S, Liu D H, et al. AiRobSim: Simulating a multisensor aerial robot for urban search and rescue operation and training [J]. Sensors (Basel, Switzerland), 2020, 20(18): 5223. [NASA ADS] [CrossRef] [PubMed] [Google Scholar]

[3] Tabib W, Goel K, Yao J, et al. Autonomous cave surveying with an aerial robot [EB/OL].[2022-06-25]. https://arxiv.org/abs/2003.13883. [Google Scholar]

[4] Geneva P, Eckenhoff K, Lee W, et al. OpenVINS: A research platform for visual-inertial estimation [C]// 2020 IEEE International Conference on Robotics and Automation (ICRA). New York: IEEE, 2020: 4666-4672. [Google Scholar]

[5] Paul M K, Roumeliotis S I. Alternating-stereo VINS: Observability analysis and performance evaluation [C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2018: 4729-4737. [Google Scholar]

[6] Qin T, Li P L, Shen S J. VINS-mono: A robust and versatile monocular visual-inertial state estimator [J]. IEEE Transactions on Robotics, 2018, 34(4): 1004-1020. [CrossRef] [Google Scholar]

[7] Rosinol A, Abate M, Chang Y, et al. Kimera: An open-source library for real-time metric-semantic localization and mapping [C]// 2020 IEEE International Conference on Robotics and Automation (ICRA). New York: IEEE, 2020: 1689-1696. [Google Scholar]

[8] Campos C, Elvira R, Rodríguez J J G, et al. ORB-SLAM3: An accurate open-source library for visual, visual-inertial, and multimap SLAM [J]. IEEE Transactions on Robotics, 2021, 37(6): 1874-1890. [CrossRef] [Google Scholar]

[9] Mur-Artal R, Montiel J M M, Tardós J D. ORB-SLAM: A versatile and accurate monocular SLAM system [J]. IEEE Transactions on Robotics, 2015, 31(5): 1147-1163. [CrossRef] [Google Scholar]

[10] Mur-Artal R, Tardós J D. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras [J]. IEEE Transactions on Robotics, 2017, 33(5): 1255-1262. [CrossRef] [Google Scholar]

[11] Li T, Zhang H P, Gao Z Z, et al. Tight fusion of a monocular camera, MEMS-IMU, and single-frequency multi-GNSS RTK for precise navigation in GNSS-challenged environments [J]. Remote Sensing, 2019, 11(6): 610. [NASA ADS] [CrossRef] [Google Scholar]

[12] Cao S Z, Lu X Y, Shen S J. GVINS: Tightly coupled GNSS-visual-inertial fusion for smooth and consistent state estimation [J]. IEEE Transactions on Robotics, 2022, 38(4): 2004-2021. [CrossRef] [Google Scholar]

[13] Zhang C, Yang Z, Fang Q H, et al. FRL-SLAM: A fast, robust and lightweight SLAM system for quadruped robot navigation [C]//2021 IEEE International Conference on Robotics and Biomimetics (ROBIO). New York: IEEE, 2022: 1165-1170. [Google Scholar]

[14] Zhang C, Yang Z, Liao L W, et al. RPEOD: A real-time pose estimation and object detection system for aerial robot target tracking [J]. Machines, 2022, 10(3): 181. [CrossRef] [MathSciNet] [Google Scholar]

[15] Lucas B D, Kanade T. An iterative image registration technique with an application to stereo vision [C]// Proceedings of the 7th International Joint Conference on Artificial Intelligence. New York: ACM, 1981: 674-679. [Google Scholar]

[16] Shi J B, Tomasi. Good features to track [C]//1994 Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2002: 593-600. [Google Scholar]

[17] Leutenegger S, Lynen S, Bosse M, et al. Keyframe-based visual-inertial odometry using nonlinear optimization [J]. The International Journal of Robotics Research, 2015, 34(3): 314-334. [CrossRef] [Google Scholar]

[18] Forster C, Pizzoli M, Scaramuzza D. SVO: Fast semi-direct monocular visual odometry [C]// 2014 IEEE International Conference on Robotics and Automation (ICRA). New York: IEEE, 2014: 15-22. [Google Scholar]

[19] Engel J, Schöps T, Cremers D. LSD-SLAM: Large-scale direct monocular SLAM [C]// Proceedings of the European Conference on Computer Vision (ECCV). Berlin: Springer, 2014: 834-849. [Google Scholar]

[20] Qin T, Li P L, Shen S J. Relocalization, global optimization and map merging for monocular visual-inertial SLAM [C]// 2018 IEEE International Conference on Robotics and Automation (ICRA). New York: IEEE, 2018: 1197-1204. [Google Scholar]

[21] Qin T, Shen S J. Robust initialization of monocular visual-inertial estimation on aerial robots [C]//2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). New York: IEEE, 2017: 4225-4232. [Google Scholar]

[22] Burri M, Nikolic J, Gohl P, et al. The EuRoC micro aerial vehicle datasets [J]. The International Journal of RoboticsResearch, 2016, 35(10): 1157-1163. □ [CrossRef] [Google Scholar]