Wuhan Univ. J. Nat. Sci.
Volume 29, Number 5, October 2024
Page(s) 439 - 452
DOI https://doi.org/10.1051/wujns/2024295439
Published online 20 November 2024

© Wuhan University 2024

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

0 Introduction

In recent years, with the development of space remote sensing technology, the acquisition of high-resolution remote sensing images[1] has become more convenient, and object detection in remote sensing images has become one of the current research hotspots. Research on object detection technology in remote sensing images can provide important information for disaster prediction, national defense security and smart city construction. In past research, many excellent methods, both traditional and deep learning-based, have been proposed to deal with this task. These studies have greatly promoted the development of remote sensing object detection technology and brought considerable economic value in many fields[2,3].

Traditional remote sensing image object detection methods are usually image-processing based. In brief, threshold segmentation is first used for image preprocessing, then the Scale Invariant Feature Transform (SIFT) method is used to extract image features. Finally, template matching and methods based on shallow learning are used to judge the target. Over the past decades, researchers have developed many traditional methods to deal with this problem, including template matching with sliding windows, Histogram of Oriented Gradients (HOG) and Bag of Words (BOW)[4], etc. However, these traditional methods require manual design of feature priors, leading to extensive time spent on feature engineering for specific tasks. Consequently, they suffer from long processing times, low detection accuracy and poor generalization ability. In contrast, deep learning-based object detection algorithms automatically extract object features and directly predict object category and location information on the feature map, significantly enhancing detection performance and efficiency. Deep learning-based object detection algorithms can be divided into two categories according to whether they include a region proposal generation stage. One category is two-stage object detection algorithms[5], such as Region-based Convolutional Neural Network (R-CNN), Spatial Pyramid Pooling Network (SPP-Net)[6], Fast R-CNN[7] and Faster R-CNN[8]. The other category is single-stage object detection algorithms[9], such as Single Shot MultiBox Detector (SSD)[10], RetinaNet[11], and You Only Look Once (YOLO)[12]. Although deep learning-based object detection algorithms exhibit good performance, remote sensing images are captured from a bird's-eye view at different altitudes; their backgrounds are complex and the objects are generally small, which introduces interference into the detection and recognition of the objects.

To solve the above issues, scholars have conducted a series of studies. Cheng et al[13] proposed the Remote Sensing Image Convolutional Neural Network (RICNN) model, which effectively dealt with the problem of target rotation changes in remote sensing images. Yang et al[14], focusing on small targets densely arranged in arbitrary directions in remote sensing images, improved the detection ability of the model for targets in complex backgrounds by introducing a supervised multi-dimensional attention network. Wang et al[15] used the Adaptive Spatial Feature Pyramid (ASFP) module to fuse multi-scale feature information and enhance the information interaction between feature layers of different scales. Nayan et al[16] utilized upsampling and skip connections to extract multi-scale features at different depths during the training process to improve the accuracy of small object detection. Xi et al[17] proposed a dual-stream representation learning generative adversarial network that recovers the missing information of the high-frequency and low-frequency components in high-resolution images, which in turn improved the detection accuracy of small objects in low-resolution aerial images. Nie et al[18] proposed an improved ship detection method based on Mask R-CNN, which enhanced the accurate detection of ships in remote sensing images.

Although the above methods exhibit satisfactory detection performance on remote sensing images, the detection of small objects in complex backgrounds remains unsatisfactory. To address this issue, this research proposes a lightweight remote sensing image object detection network model that uses YOLOX-S as the basic network and optimizes it for the different subproblems identified above. The main contributions of this research are summarized as follows:

1) We employ the YOLOX model, which has demonstrated commendable results in various scene detection tasks, for target object detection in remote sensing images. On the basis of the YOLOX model, we integrate the Efficient Channel Attention (ECA) mechanism to improve the network's ability to extract important features, reducing the influence of redundant information such as background and ultimately improving the detection accuracy of small objects in remote sensing images.

2) We optimize the loss function of the YOLOX-S model. Specifically, for the model's boundary regression loss, we employ the Alpha Intersection over Union (α-IoU) loss function as a replacement for the IoU loss function to improve robustness to small sample datasets and noise.

3) Experimental evaluations conducted on the NWPU VHR-10 remote sensing image dataset demonstrate that our method achieves better detection performance than most mainstream methods. Additionally, we conduct generalization experiments on aircraft images in the RSOD dataset and achieve good detection results. These results show that our proposed model has good generalization ability.

1 Related Work

1.1 Remote Sensing Image Detector Based on YOLO Series Algorithm

The YOLO object detection algorithm was first proposed by Joseph Redmon. It is an end-to-end network model that directly predicts target object bounding boxes and categories. YOLO redefines object detection as a regression problem, employing a single Convolutional Neural Network (CNN) applied to the whole image. This network divides the image into grids and predicts the class probability and bounding box coordinates for each grid. YOLOv3[20], proposed in 2018, uses the global region of the image for training, which better distinguishes the target from the background. Based on YOLOv3, Cao et al[21] integrated a 104×104 detection scale into the improved model to enhance the sensitivity and detection ability of the network towards small objects such as planes and ships in remote sensing images. In April 2020, Bochkovskiy et al[22] proposed YOLOv4 based on YOLOv3, incorporating the Cross Stage Partial Network (CSP) and Path Aggregation Network (PAN) structures and adopting several practical techniques to achieve a balance between detection speed and accuracy. Yu et al[23] introduced feature layer scaling to YOLOv4 to improve the detection accuracy of bridge objects in remote sensing images. In June 2020, Ultralytics introduced a new object detection network framework, YOLOv5, in which several detail improvements were implemented based on YOLOv4. In order to solve the problem of category label imbalance caused by the sparse distribution of objects in remote sensing images, Zhao et al[24] integrated the Aggregated-Mosaic method into the YOLOv5 model, thereby enhancing the stability of training and inference.

1.2 Attention Mechanism

In computer vision tasks such as image recognition, semantic segmentation and object detection, attention mechanisms can enhance the essential features of images while suppressing irrelevant features to improve model accuracy. In 2018, Woo et al[25] proposed inferring attention maps along two independent dimensions, channel and space, to strengthen target features and improve object detection accuracy. Wang et al[26] designed an encoder-decoder module and constructed a residual attention network based on it; their experimental results show that better outputs can be achieved by continuously refining the feature maps. Inspired by these findings, we incorporate an attention mechanism into our network. The experimental results indicate that such attention mechanisms can significantly enhance detection accuracy.

2 Method

2.1 Analysis of YOLOX Network

YOLOX[27] inherits the design ideas of the previous YOLO series networks and effectively balances detection accuracy and processing speed, which makes it highly suitable for remote sensing image object detection. The YOLOX network family has four models: YOLOX-S, YOLOX-M, YOLOX-L and YOLOX-X. Considering inference speed and model size, we choose YOLOX-S as the basic model.

The structure of the YOLOX network is divided into three key components: Backbone, Neck and Output. The Backbone network is Cross Stage Partial Darknet (CSPDarknet), which is known for its excellent feature extraction performance. Through multiple convolution and pooling operations within the Backbone network, feature maps of different sizes are extracted and transferred to the Neck. The Neck consists of the Path Aggregation Feature Pyramid Network (PAFPN)[28], which first uses the Feature Pyramid Network (FPN) architecture to fuse low-level and high-level features, enhancing both position and semantic information across feature maps of different scales. Subsequently, a bottom-up Path Aggregation Network (PANet) is used to establish connections between high-level features and those of the lower layers, which enriches the obtained feature information and further enhances feature fusion. For the Output, a decoupled head is used, incorporating a generalized intersection over union loss function and the Simplified Optimal Transport Assignment (SimOTA) dynamic positive sample matching method. An anchor-free design is adopted to remove the constraints imposed by prior bounding boxes. In addition, the Focus layer from YOLOv5 is incorporated, along with Mosaic data augmentation, enabling more comprehensive end-to-end predictions. Although YOLOX exhibits commendable detection performance, this research identifies several areas for improvement to address the following challenges:

1) The objects in remote sensing images are dense and the background is complex, which brings great difficulties to the object detection task. YOLOX employs the CNN-based CSPDarknet Backbone network to capture local feature information through convolution kernels. However, this approach often ignores the relationship between global feature information, resulting in indistinguishable target information and background information. Consequently, this approach affects the effectiveness of detecting targets in remote sensing images.

2) Remote sensing images[29] often exhibit a large aspect ratio, with objects appearing relatively small in size, thus complicating accurate object localization. YOLOX utilizes the IoU loss function for bounding box regression. However, this choice may cause gradient disappearance during the calculation process, resulting in slow convergence speed and low bounding box regression accuracy. Consequently, this approach affects the precision of object localization in remote sensing images.

2.2 Improved YOLOX Network

Aiming to address the shortcomings of YOLOX in remote sensing image object detection, this research proposes the YOLOX-NR model (shown in Fig. 1).

Fig. 1 YOLOX-NR model structure diagram

(a) The Backbone of YOLOX-NR is based on CSPDarknet, with three ECA attention blocks added. C3, C4 and C5 are feature maps of different scales extracted by the Backbone network. (b) The Neck uses the PANet structure and is optimized with ECA attention blocks. (c) The decoupled head is the detection head of YOLOX-NR, with the α-IoU loss function added. H3, H4 and H5 are the three output branches of different scales

YOLOX-NR is an improved model based on YOLOX, and in our experiments its detection results are better than those of the original YOLOX. We introduce three ECA modules following the outputs of the CSPDarknet Backbone to improve the efficiency of feature extraction. ECA blocks are also incorporated after the PANet in the Neck. This placement avoids dimensionality reduction and enables the model to focus on information useful for detection while suppressing irrelevant information such as background noise, which in turn improves the detection accuracy of small objects in remote sensing images. Additionally, we refine the loss function by replacing the IoU loss function with the α-IoU loss function. This modification provides more flexible bounding box regression accuracy by modulating the parameter α, thereby improving the model's robustness to small dataset sizes and noise.

2.2.1 ECA attention module

The Squeeze-and-Excitation Network (SENet) is the most common channel-domain attention mechanism and mainly comprises three steps: Squeeze, Excitation and Attention[30]. As shown in Fig. 2(a), global average pooling is first carried out for each channel separately. Then two nonlinear Fully Connected (FC) layers are applied, followed by a Sigmoid function to generate channel weight values. The two FC layers are designed to capture nonlinear channel interactions while performing dimensionality reduction. The dimensionality reduction operation maps channel features from a high-dimensional space to a low-dimensional space and then maps them back. This operation reduces the complexity of the model and decreases the number of model parameters, but it obstructs the direct correspondence between weights and channels.
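For concreteness, the following PyTorch sketch shows the SE block described above. The reduction ratio of 16 is the common default from the SENet paper, not a value specified in this work, so it should be read as an illustrative assumption.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global average pooling, two FC layers with
    channel reduction, and a Sigmoid producing per-channel weights."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # Squeeze: one value per channel
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # dimensionality reduction
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # map back to C channels
            nn.Sigmoid(),                                # Excitation: weights in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # Attention: reweight the input channels
```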

Fig. 2 Structure comparison diagram of SE module (a) and ECA module (b)

The ECA Network (ECANet) is an extension and improvement of SENet that avoids dimensionality reduction and effectively realizes cross-channel interaction, and the few parameters and little computation it introduces bring significant gains. After global average pooling of the channels without dimensionality reduction, ECANet captures cross-channel interaction information by considering each channel and its K neighbors.

Considering the requirements of model performance, inserting the efficient channel attention mechanism ECANet[31] into the network can effectively improve model performance, and because of its lightweight design the number of model parameters does not greatly increase. As shown in Fig. 2(b), ECA obtains a 1×1×C feature map after global average pooling and calculates the adaptive convolution kernel size K, where K refers to the coverage of local cross-channel interaction, that is, how many neighbors participate in the attention prediction of a single channel. The interaction coverage (the size of the convolution kernel) is proportional to the channel dimension. The weight for each channel is obtained by a fast 1D convolution of size K, as shown in Eq. (1).

$W = \sigma(\mathrm{C1D}_K(y))$  (1)

where σ represents the Sigmoid activation function, C1D indicates 1D convolution, y indicates the aggregated features without dimensionality reduction, and C indicates the number of channels. The normalized weights and the original input feature maps are then multiplied channel by channel to generate the weighted feature maps.

For the determination of the K value, the optimal coverage can be manually tuned for different network structures and different numbers of convolution modules, but manual tuning wastes considerable computing resources. Since the convolution kernel size K is proportional to the channel dimension C, it can be inferred that there is a mapping relationship between K and C, as shown in Eq. (2).

$C = \phi(K) = 2^{(\gamma \times K - b)}$  (2)

In order to enable layers with a larger number of channels to carry out more cross-channel interactions, the convolution kernel size of one-dimensional convolution is self-adaptive according to a function. The formula for calculating the convolution kernel size K is:

$K = \psi(C) = \left| \frac{\log_2(C)}{\gamma} + \frac{b}{\gamma} \right|_{\mathrm{odd}}$  (3)

where C represents the number of channels, $|\cdot|_{\mathrm{odd}}$ indicates the nearest odd number, and b and γ are hyperparameters, set to 1 and 2, respectively.
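A minimal PyTorch sketch of an ECA block following Eqs. (1)-(3) is given below; it assumes the standard ECANet formulation with γ = 2 and b = 1 and is illustrative rather than the exact implementation used in our experiments.

```python
import math
import torch
import torch.nn as nn

class ECABlock(nn.Module):
    """ECA: no dimensionality reduction, a fast 1D convolution of adaptive
    size K over the channel dimension, and a Sigmoid producing channel weights."""
    def __init__(self, channels: int, gamma: int = 2, b: int = 1):
        super().__init__()
        # Eq. (3): K = |log2(C)/gamma + b/gamma|_odd (rounded to the nearest odd number)
        k = int(abs(math.log2(channels) / gamma + b / gamma))
        k = k if k % 2 == 1 else k + 1
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.pool(x)                                   # (B, C, 1, 1): aggregated features y
        y = y.squeeze(-1).transpose(-1, -2)                # (B, 1, C): channels as a 1D sequence
        w = self.conv(y).transpose(-1, -2).unsqueeze(-1)   # Eq. (1): W = sigma(C1D_K(y))
        return x * self.sigmoid(w)                         # channel-by-channel reweighting

# Example: a 256-channel feature map gives K = 5
eca = ECABlock(256)
out = eca(torch.randn(1, 256, 80, 80))   # same shape, channels reweighted
```

In YOLOX-NR, blocks of this kind are attached to the three Backbone outputs (C3, C4, C5) and again after the PANet branches in the Neck, as described in Section 2.2.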

In this research, the ECA module is added, which can avoid dimension reduction and enable the model to focus on the object in the image while ignoring the influence and interference of the background. This improves the detection accuracy of the model.

2.2.2 Optimization of loss function

The loss function of the YOLOX-S model is mainly composed of three parts: the bounding box regression loss ($Loss_{IoU}$)[32], the classification loss ($Loss_{cls}$) and the confidence loss ($Loss_{conf}$). The bounding box regression loss of the YOLOX-S model adopts the IoU loss function[33]. IoU refers to the intersection over union, an important index for evaluating detector performance in the field of object detection, which reflects the agreement between the predicted box and the ground truth box. The calculation formula is shown in Eq. (4).

$IoU = \frac{|A \cap C|}{|A \cup C|}$  (4)

where A represents the predicted box and C indicates the ground truth box. The greater the overlap between the prediction box and the ground truth box, the higher the IoU value. The calculation formula of IoU loss function is shown in Eq. (5).

$Loss_{IoU} = 1 - IoU$  (5)

However, during the calculation process using the IoU loss function, there might be scenarios where the prediction box and the ground truth box do not intersect. This can result in the gradient disappearance problem, which will reduce both the convergence speed and detection accuracy.

To address these deficiencies of the IoU loss function, this paper adopts the α-IoU loss function[34]. The calculation formula is shown in Eq. (6).

$Loss_{\alpha\text{-}IoU} = 1 - IoU^{\alpha} + \rho^{2\alpha}(B, B^{gt})$  (6)

That is, the Box-Cox transformation is first applied to the IoU loss, which is then generalized to a Power IoU loss incorporating an additional power regularization term with the power parameter α. B represents the predicted bounding box and $B^{gt}$ represents the ground truth bounding box. $\rho^{2\alpha}(B, B^{gt})$ denotes a penalty term computed from B and $B^{gt}$, which helps prevent the gradient disappearance problem when calculating the IoU loss. Moreover, the α-IoU loss can significantly outperform existing IoU-based losses. By adjusting α, the detector gains more flexibility in achieving different levels of bounding box regression accuracy. When α > 1, the loss and gradient of high-IoU objects are increased, thus improving bounding box regression accuracy. In most situations, the detector performs well when α = 3. The final improved objective loss function is shown in Eq. (7).

$Loss = Loss_{\alpha\text{-}IoU} + Loss_{conf} + Loss_{cls}$  (7)

When performing object detection in remote sensing images, using the IoU loss function for localization may lead to a slower convergence rate, especially for small objects. Because the objects are small, the predicted box and the ground truth box may overlap only slightly or not at all, which makes it difficult to accurately determine the direction in which to move the bounding box and results in poor localization performance. In this research, the α-IoU loss function is adopted to address this issue: it accelerates the convergence of the localization loss by increasing the loss and gradient of high-IoU objects, thereby improving the regression accuracy of the bounding boxes.
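To make the behavior of the α-IoU loss concrete, the sketch below computes the basic power form $1 - IoU^{\alpha}$ for axis-aligned boxes. The penalty term $\rho^{2\alpha}(B, B^{gt})$ of Eq. (6) is omitted for brevity, so this is an illustrative approximation rather than the full loss used in the model.

```python
import torch

def iou_xyxy(pred: torch.Tensor, gt: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """Eq. (4): intersection over union of (x1, y1, x2, y2) boxes, row by row."""
    x1 = torch.max(pred[:, 0], gt[:, 0])
    y1 = torch.max(pred[:, 1], gt[:, 1])
    x2 = torch.min(pred[:, 2], gt[:, 2])
    y2 = torch.min(pred[:, 3], gt[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    return inter / (area_p + area_g - inter + eps)

def alpha_iou_loss(pred: torch.Tensor, gt: torch.Tensor, alpha: float = 3.0) -> torch.Tensor:
    """Power form of the alpha-IoU loss, 1 - IoU^alpha (Eq. (6) without the
    rho^(2*alpha) penalty term). The gradient with respect to IoU is
    alpha * IoU^(alpha - 1), which exceeds 1 for high-IoU boxes when alpha > 1,
    so well-overlapping boxes are regressed more aggressively."""
    iou = iou_xyxy(pred, gt)
    return (1.0 - iou.pow(alpha)).mean()

# Example: at IoU = 0.9 the plain IoU loss is 0.1, while the alpha-IoU loss with
# alpha = 3 is 1 - 0.9**3 = 0.271, with a steeper gradient of 3 * 0.81 = 2.43.
```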

3 Experiment

3.1 Datasets

The datasets used in this research are the NWPU VHR-10 remote sensing image dataset[35] and RSOD dataset[36]. Firstly, the NWPU VHR-10 dataset is used to evaluate and train the entire model, and then the aircraft images of the RSOD dataset are used to evaluate the generalization ability and robustness of the improved model[37].

The NWPU VHR-10 dataset, annotated by the Northwestern Polytechnical University team, is used for training and testing in this experiment. The images in the dataset are cropped from Google Earth and the Vaihingen dataset. The NWPU VHR-10 small target dataset consists of 800 images, divided into a positive and a negative image set. The positive image set consists of 650 images, each containing at least one target, while the negative image set comprises 150 images with no targets. Negative image sets are typically used only in semi-supervised or weakly supervised learning; since this research uses supervised learning, only the positive sample set is used. The NWPU VHR-10 positive sample set covers 10 classes of objects: aircraft, ship, storage tank, baseball diamond, tennis court, basketball court, ground track field, harbor, bridge and vehicle. Before the experiment, the positive sample set of the NWPU VHR-10 dataset was randomly divided into a training set and a test set at a ratio of 7:3, giving 455 training images and 195 test images. The numbers of different targets included in the training set and test set are shown in Table 1.
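The 7:3 random split can be reproduced with a few lines of Python; the helper below is hypothetical (the original split script is not published with the paper) but yields the same 455/195 partition of the 650 positive images.

```python
import random

def split_dataset(image_ids, train_ratio: float = 0.7, seed: int = 0):
    """Randomly split a list of image identifiers into training and test subsets.
    Hypothetical helper used only to illustrate the 7:3 partition."""
    rng = random.Random(seed)
    ids = list(image_ids)
    rng.shuffle(ids)
    cut = int(len(ids) * train_ratio)
    return ids[:cut], ids[cut:]

# 650 positive images -> 455 for training, 195 for testing
train_ids, test_ids = split_dataset(range(650))
assert len(train_ids) == 455 and len(test_ids) == 195
```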

The RSOD dataset was annotated by Wuhan University and used to evaluate the generalization ability and robustness of the improved model in this experiment. The aircraft images from the RSOD dataset are selected for this generalization evaluation. The RSOD dataset[38] contains 446 aircraft remote sensing images and 4 993 aircraft targets. The brightness and contrast in the images are diverse, and there are interferences such as occlusion, shadow and distortion. During the experiment, all 446 aircraft remote sensing images were used as the test set to verify the generalization ability of the model.

Table 1

Numbers of targets contained in the dataset

3.2 Experimental Details

3.2.1 Experimental platform

The experiments in this paper place certain requirements on hardware, and a GPU needs to be used for accelerated computation. The experimental environment is built on a server running Ubuntu 18.04 with a GTX 1080 Ti GPU (11 GB of memory), and the PyTorch deep learning framework is used. The specific environment required for the experiment is shown in Table 2.

Table 2

Experimental environment configuration

3.2.2 Training details

During the training process, the official pre-trained weights yolox-s.pt of YOLOX-S are downloaded for transfer learning. The experiment sets the total number of training epochs to 300, enables Mosaic data augmentation with a scale range of 0.1 to 2, and turns off data augmentation for the last 15 epochs.

The optimizers used for training are Stochastic Gradient Descent (SGD) and Adamax, with a weight decay coefficient of 0.000 5, an initial learning rate of 0.01, a minimum learning rate of 5% of the initial learning rate, and a cosine annealing learning rate decay strategy. Additionally, the momentum and Nesterov parameters of the SGD optimizer are set to 0.9 and True, respectively. The batch size is set to 8 and the image input size is 640×640. Each group of experiments is trained 5 times.
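As an illustration of these settings, the following PyTorch sketch configures the SGD optimizer and cosine-annealing schedule with the hyperparameters listed above; the model object is a stand-in for the YOLOX-NR network, and the Adamax variant would be configured analogously.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Conv2d(3, 16, 3)   # stand-in for the YOLOX-NR network

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.01,               # initial learning rate
    momentum=0.9,
    nesterov=True,
    weight_decay=5e-4,     # weight decay coefficient 0.000 5
)

epochs = 300
scheduler = CosineAnnealingLR(
    optimizer,
    T_max=epochs,
    eta_min=0.01 * 0.05,   # minimum learning rate: 5% of the initial rate
)

for epoch in range(epochs):
    # ... one training epoch over 640x640 inputs with batch size 8 ...
    scheduler.step()
```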

3.3 Evaluation Indicator

In terms of the accuracy of the network model, this experiment uses Average Precision (AP), Mean Average Precision (mAP) and Precision-Recall (P-R) curve, along with other common object detection evaluation indicators to evaluate the performance of the algorithm in this research[39].

Precision is used to measure the correctness of model testing, and recall is used to evaluate the comprehensiveness of model testing. The calculation formulas are as follows:

$Precision = \frac{TP}{TP + FP}, \quad Recall = \frac{TP}{TP + FN}$  (8)

The meanings of TP, FP and FN in the formula are shown in Table 3.

The P-R curve can be drawn according to the calculated precision and recall rate, and the area of the graph formed by the P-R curve is the AP of a single class, as shown in Eq. (9).

$AP = \int_{0}^{1} P(r)\,\mathrm{d}r$  (9)

where P represents Precision and r represents Recall. For mAP, the PASCAL VOC evaluation metric is adopted in this paper, which calculates the average AP for each class at the IoU threshold of 0.5. The calculation formula is shown in Eq. (10).

$mAP = \frac{\sum_{i=1}^{C} AP_i}{C}$  (10)

where C represents the number of classes and $AP_i$ represents the AP value of each class.
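The metrics of Eqs. (8)-(10) can be sketched as follows; the AP helper uses a simple all-point interpolation of the P-R curve and is an approximation of, not identical to, the official PASCAL VOC evaluation code.

```python
import numpy as np

def precision(tp: int, fp: int) -> float:
    """Eq. (8): fraction of predictions that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    """Eq. (8): fraction of ground-truth objects that are detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def average_precision(recalls: np.ndarray, precisions: np.ndarray) -> float:
    """Eq. (9): area under the P-R curve, approximated numerically."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([1.0], precisions, [0.0]))
    # enforce a monotonically decreasing precision envelope before integrating
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))

def mean_average_precision(ap_per_class) -> float:
    """Eq. (10): mean of the per-class AP values."""
    return float(np.mean(ap_per_class))
```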

In terms of model complexity, the evaluation metrics used in this experiment consist of the number of parameters (weights of the model) and Floating-point Operations (FLOPs):

$Parameters = [i \times (f \times f) \times o] + o, \quad FLOPs = H \times W \times Parameters$  (11)

where i represents the number of input channels, f represents the size of the convolution kernel, o represents the number of output channels, and H×W represents the size of the output feature map.
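For a single convolutional layer, Eq. (11) can be evaluated directly, as in the small example below; the 3×3, 64-to-128-channel layer and the 160×160 output map are illustrative values, not taken from the model.

```python
def conv_parameters(in_channels: int, kernel_size: int, out_channels: int) -> int:
    """Eq. (11): weights of one convolutional layer plus its bias terms."""
    return in_channels * kernel_size * kernel_size * out_channels + out_channels

def conv_flops(out_h: int, out_w: int, parameters: int) -> int:
    """Eq. (11): rough FLOPs estimate, parameters applied at each output position."""
    return out_h * out_w * parameters

# Example: a 3x3 conv from 64 to 128 channels with a 160x160 output map
p = conv_parameters(64, 3, 128)   # 73 856 parameters
f = conv_flops(160, 160, p)       # about 1.89 x 10^9 FLOPs
```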

Table 3

Description of evaluation index formula

3.4 Analysis of Experimental Results

3.4.1 Experiments on attention mechanisms

In order to verify the role of the ECA module, this research uses three different attention mechanisms, namely Convolutional Block Attention Module (CBAM)[25], SE and ECA for comparative experiments. Each experiment uses the same set of parameters and employs the official pretrained weights of YOLOX-S for transfer learning. Additionally, to unify the control variables, the three attention mechanisms are only added between the Backbone and Neck of YOLOX-S. The loss function and optimizer of the experiment are both set to IoU and SGD, and the training rounds are unified to 300 epochs. The final results are shown in Table 4.

As can be seen from Table 4, when ECA attention module is added to YOLOX-S model, its mAP value is higher than that when the CBAM module and SE module are added. Additionally, the number of parameters and FLOPs after adding ECA module are also less than the latter two, which reduces the complexity of the model and the requirements of hardware equipment to a certain extent. Therefore, the attention mechanism selected in this study is ECA module.

Table 4

Comparative experiment of different attention mechanisms

3.4.2 Results on the NWPU VHR-10 dataset

Figure 3(a) and 3(b) show the changes of mAP curve and P-R curve of the two algorithms during the training process, respectively. It can be seen from the figure that the mAP value of our method is higher than that of the YOLOX-S model, and the area formed by its P-R curve is also larger than that of YOLOX-S. This indicates that our method is better than the YOLOX-S model in the accuracy of remote sensing image detection. Figure 3(c) and 3(d) show the changes of the P-R curves of the YOLOX-S model and our method at the IoU thresholds of 0.5, 0.6, 0.7, 0.8 and 0.9, respectively. When IoU = 0.5, the area formed by P-R curve is the largest, indicating that the image object detection at this time is the most accurate. The sum of the area of five P-R curves of our method is larger than that of the YOLOX-S model, which further verifies that the detection performance of our method is better than that of the YOLOX-S model.

Fig. 3 Performance comparison between our method and the original YOLOX-S model

Figure 4 shows the test results of the proposed algorithm on part of the NWPU VHR-10 dataset. As shown in Fig. 4, although the detection accuracy for the harbor class is slightly lower, the detection accuracy for the other classes is high, so the overall detection effect is excellent, which also demonstrates the reliability of the proposed algorithm for remote sensing image object detection.

Fig. 4 Detection results of different targets in the NWPU VHR-10 dataset using our method

3.4.3 Model generalization verification experiment

In order to verify the generalization of the improved method based on the YOLOX-S model proposed in this research, we tested it on the aircraft images of the RSOD dataset. Briefly, the weights trained by the improved model on the NWPU VHR-10 dataset are used directly for inference on the aircraft images of the RSOD dataset, without any additional training. The resulting mAP and mAP50:95 reach 86.70% and 48.47%, respectively. This indicates that the YOLOX-NR model has relatively good generalization performance and can be extended to the detection of other remote sensing image objects.

Figure 5 shows the performance of the YOLOX-NR model on aircraft images from the RSOD dataset. As observed in the figure, the YOLOX-NR model consistently achieves good performance and accurate object detection under varying conditions of brightness and contrast, occlusion, shadow, and small object interference. This indicates that the YOLOX-NR model exhibits relatively good generalization properties, suggesting its potential applicability to the detection of other objects in remote sensing images.

Fig. 5 Detection performance of YOLOX-NR on RSOD dataset aircraft images

3.4.4 Ablation experiment

In order to further verify the effectiveness of each module in the YOLOX-NR model, we conducted an ablation experiment on each module using the NWPU VHR-10 dataset in this research. The results of the ablation experiments are shown in Table 5.

As observed from Table 5, the first group employed YOLOX-S as the benchmark model without incorporating any improved modules, and its detection accuracy reached 94.2%. In the second group, the loss function and the optimizer were modified based on the first group: the IoU loss function was replaced with the α-IoU loss function and the SGD optimizer was substituted with the Adamax optimizer, resulting in an mAP of 94.5% without introducing additional parameters. In the third group, the channel attention module ECA was added between the Backbone and Neck of the YOLOX-S model based on the second group, leading to an accuracy of 95.0% with a slight increase in computational overhead. On the basis of the third group, the fourth experiment also incorporated the ECA module between the Neck and Head of the YOLOX-S model. The number of parameters and floating-point operations remained unchanged, and the mAP reached 95.5%, further improving the detection accuracy.

Table 5

Comparison of ablation experiments of each module of the improved algorithm

3.4.5 Comparison with other models

In order to verify the effectiveness of our method, we compared it with several mainstream algorithms on NWPU VHR-10. Among these, BOW[40] represents a machine learning method, while the others are based on deep learning, including the R-CNN-based RICNN, the single-stage algorithm SSD, the deformable Region-based Fully Convolutional Network (R-FCN), the traditional Faster R-CNN, and the Multi-Scale Convolutional Neural Network (MSCNN), etc. The mAP values of each class are shown in Table 6. In terms of the overall average precision, the method proposed in this paper (95.5%) is more accurate than the other methods. The traditional BOW object detection method has the lowest mAP value (24.57%), and the detection performance of deep learning methods is significantly better than that of traditional object detection algorithms. In terms of per-class detection accuracy, the method in this research demonstrates superior performance in identifying airplanes, ships, storage tanks, baseball diamonds, basketball courts, ground track fields, harbors and vehicles. In particular, the detection accuracy of small objects such as ships and storage tanks, as well as objects like baseball diamonds, ground track fields and basketball courts that are difficult to distinguish from the background and surrounding objects, has been significantly improved. Furthermore, while the YOLOX-S model exhibits substantial improvements in detection effectiveness compared to other models, it suffers from false and missed detections of harbors, which lowers its mAP. The method proposed in this research effectively addresses these problems, improving the mAP for harbors by 7.18% and enhancing the overall detection performance.

Table 6

Performance comparison between the proposed model and other methods on NWPU VHR-10 dataset (unit:%)

4 Discussion

In order to solve the problems of complex background and dense small target detection in remote sensing images, the YOLOX-NR proposed in this research integrates the efficient channel attention and optimizes the boundary regression loss function. This integration significantly enhances the detection accuracy of small and medium-sized objects in remote sensing images.

In this research, we demonstrated the effectiveness of the ECA attention module in detecting small objects in remote sensing images. As shown in Table 4 and Table 5, when the ECA attention module is added to the YOLOX-S model, its mAP value is higher than that of the original model as well as of the models augmented with the CBAM or SE modules. This superiority is attributed to the stronger feature extraction capability of the ECA channel attention module compared with the CBAM and SE modules. Additionally, ECA channel attention effectively focuses on information useful for detection while suppressing irrelevant background information, thereby improving the detection accuracy of small objects in remote sensing images. Furthermore, we demonstrated the feasibility of the α-IoU loss function. As shown in Table 5, adding the α-IoU loss function improves object detection accuracy, because the α-IoU loss function accelerates the convergence of the localization loss by modulating α and increasing the loss and gradient of high-IoU targets, thereby improving the regression accuracy of bounding boxes.

In summary, the method proposed in this research effectively addresses the problem of dense small object detection in remote sensing images and exhibits excellent detection performance in actual remote sensing image object detection tasks. Nevertheless, since this study employs horizontal bounding boxes for object detection, it ignores directional information such as the rotation angle of objects. Therefore, in future work, we will take into account the orientation information of objects and explore optimization methods to further advance research in the domain of remote sensing image tasks.

5 Conclusion

To solve the problems of complex backgrounds and poor detection of dense small objects in remote sensing images, this study proposes a lightweight remote sensing image object detection algorithm based on an improved YOLOX-S model. The algorithm integrates the ECA mechanism to improve the network's ability to extract important features from images. Additionally, we use the α-IoU loss function as a replacement for the IoU loss function in bounding box regression, thereby improving object localization accuracy. Experimental results show that the mAP of the improved method exceeds that of the original YOLOX model by 1.3% on the NWPU VHR-10 dataset. Moreover, in the generalization experiment on aircraft images from the RSOD remote sensing dataset, our method also exhibits commendable object detection performance, indicating good generalization ability.

References

  1. Dai L J, Liu C. Research on remote sensing image of land cover classification based on multiple classifier combination[J]. Wuhan University Journal of Natural Sciences, 2011, 16(4): 363-368. [CrossRef] [Google Scholar]
  2. Yu D W, Ji S P. A new spatial-oriented object detection framework for remote sensing images[J]. IEEE Transactions on Geoscience and Remote Sensing, 2022, 60: 1-16. [CrossRef] [Google Scholar]
  3. Shivappriya S N, Priyadarsini M J P, Stateczny A, et al. Cascade object detection and remote sensing object detection method based on trainable activation function[J]. Remote Sensing, 2021, 13(2): 200. [NASA ADS] [CrossRef] [Google Scholar]
  4. Zhang Y, Jin R, Zhou Z H. Understanding bag-of-words model: A statistical framework[J]. International Journal of Machine Learning and Cybernetics, 2010, 1(1): 43-52. [CrossRef] [Google Scholar]
  5. Sandoval C, Pirogova E, Lech M. Two-stage deep learning approach to the classification of fine-art paintings[J]. IEEE Access, 2019, 7: 41770-41781. [NASA ADS] [CrossRef] [Google Scholar]
  6. He K M, Zhang X Y, Ren S Q, et al. Spatial pyramid pooling in deep convolutional networks for visual recognition[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(9): 1904-1916. [CrossRef] [Google Scholar]
  7. Girshick R. Fast R-CNN[C]//2015 IEEE International Conference on Computer Vision (ICCV). New York: IEEE, 2015: 1440-1448. [Google Scholar]
  8. Ren S Q, He K M, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[C]//Proceedings of the 28th International Conference on Neural Information Processing Systems. New York: ACM, 2015: 91-99. [Google Scholar]
  9. de Vos B D, Berendsen F F, Viergever M A, et al. A deep learning framework for unsupervised affine and deformable image registration[J]. Medical Image Analysis, 2019, 52: 128-143. [CrossRef] [PubMed] [Google Scholar]
  10. Liu W, Anguelov D, Erhan D, et al. SSD: Single shot multibox detector[C]//Computer Vision–ECCV 2016. Cham: Springer, 2016: 21-37. [Google Scholar]
  11. Lin T Y, Goyal P, Girshick R, et al. Focal loss for dense object detection[C]//2017 IEEE International Conference on Computer Vision (ICCV). New York: IEEE, 2017: 2999-3007. [Google Scholar]
  12. Redmon J, Divvala S, Girshick R, et al. You only look once: Unified, real-time object detection[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2016: 779-788. [Google Scholar]
  13. Cheng G, Zhou P C, Han J W. Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images[J]. IEEE Transactions on Geoscience and Remote Sensing, 2016, 54(12): 7405-7415. [NASA ADS] [CrossRef] [Google Scholar]
  14. Yang X, Yang J R, Yan J C, et al. SCRDet: Towards more robust detection for small, cluttered and rotated objects[C]//2019 IEEE/CVF International Conference on Computer Vision (ICCV). New York: IEEE, 2019: 8231-8240. [Google Scholar]
  15. Wang P J, Sun X, Diao W H, et al. FMSSD: Feature-merged single-shot detection for multiscale objects in large-scale remote sensing imagery[J]. IEEE Transactions on Geoscience and Remote Sensing, 2020, 58(5): 3377-3390. [NASA ADS] [CrossRef] [Google Scholar]
  16. Nayan A A, Saha J, Nokib Mozumder A, et al. Real time multi-class object detection and recognition using vision augmentation algorithm[J]. International Journal of Advanced Science and Technology, 2020, 29(5): 14070-14083. [Google Scholar]
  17. Xi Y, Jia W, Zheng J, et al. DRL-GAN: Dual-stream representation learning GAN for low-resolution image classification in UAV applications[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2021, 14: 1705-1716. [NASA ADS] [CrossRef] [Google Scholar]
  18. Nie X, Duan M Y, Ding H X, et al. Attention mask R-CNN for ship detection and segmentation from remote sensing images[J]. IEEE Access, 2020, 8: 9325-9334. [NASA ADS] [CrossRef] [Google Scholar]
  19. Redmon J, Divvala S, Girshick R, et al. You only look once: Unified, real-Time object detection[C]// 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2016: 779-788. [Google Scholar]
  20. Redmon J, Farhadi A. YOLOv3: An incremental improvement[EB/OL]. [2018-04-08]. http://arxiv.org/abs/1804.02767. [Google Scholar]
  21. Cao C Q, Wu J, Zeng X D, et al. Research on airplane and ship detection of aerial remote sensing images based on convolutional neural network[J]. Sensors, 2020, 20(17): 4696. [Google Scholar]
  22. Bochkovskiy A, Wang C Y, Liao H M. YOLOv4: Optimal speed and accuracy of object detection[EB/OL]. [2020-04-23]. https://arxiv.org/abs/2004.10934. [Google Scholar]
  23. Yu P D, Wang X, Liu J H, et al. Bridge target detection in remote sensing image based on improved YOLOv4 algorithm[C]//2020 4th International Conference on Computer Science and Artificial Intelligence. New York: ACM, 2020: 139-145. [Google Scholar]
  24. Zhao B Y, Wu Y F, Guan X R, et al. An improved aggregated-mosaic method for the sparse object detection of remote sensing imagery[J]. Remote Sensing, 2021, 13(13): 2602. [NASA ADS] [CrossRef] [Google Scholar]
  25. Woo S, Park J, Lee J Y, et al. CBAM: Convolutional block attention module[C]//Proceedings of the European Conference on Computer Vision (ECCV). Cham: Springer, 2018: 3-19. [Google Scholar]
  26. Wang F, Jiang M Q, Qian C, et al. Residual attention network for image classification[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2017: 6450-6458. [Google Scholar]
  27. Ge Z, Liu S T, Wang F, et al. YOLOX: Exceeding YOLO series in 2021[EB/OL]. [2021-08-06]. http://arxiv.org/abs/2107.08430. [Google Scholar]
  28. Wang W H, Xie E Z, Song X G, et al. Efficient and accurate arbitrary-shaped text detection with pixel aggregation network[C]//2019 IEEE/CVF International Conference on Computer Vision (ICCV). New York: IEEE, 2019: 8439-8448. [Google Scholar]
  29. Shen Y C, Jin H, Du B. An improved method to detect remote sensing image targets captured by sensor network[J]. Wuhan University Journal of Natural Sciences, 2011, 16(4): 301-307. [CrossRef] [Google Scholar]
  30. Hu J, Shen L, Sun G. Squeeze-and-excitation networks[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2018: 7132-7141. [Google Scholar]
  31. Wang Q L, Wu B G, Zhu P F, et al. ECA-net: Efficient channel attention for deep convolutional neural networks[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2020: 11531-11539. [Google Scholar]
  32. He Y H, Zhu C C, Wang J R, et al. Bounding box regression with uncertainty for accurate object detection[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2019: 2883-2892. [Google Scholar]
  33. Yu J H, Jiang Y N, Wang Z Y, et al. UnitBox: An advanced object detection network[C]//Proceedings of the 24th ACM international conference on Multimedia. New York: ACM, 2016: 516-520. [Google Scholar]
  34. He J B, Erfani S, Ma X J, et al. Alpha-IoU: A family of power intersection over union losses for bounding box regression[EB/OL]. [2022-01-22]. http://arxiv.org/abs/2110.13675. [Google Scholar]
  35. Cheng G, Han J W. A survey on object detection in optical remote sensing images[J]. ISPRS Journal of Photogrammetry and Remote Sensing, 2016, 117: 11-28. [NASA ADS] [CrossRef] [Google Scholar]
  36. Xiao Z F, Liu Q, Tang G F, et al. Elliptic Fourier transformation-based histograms of oriented gradients for rotationally invariant object detection in remote-sensing images[J]. International Journal of Remote Sensing, 2015, 36(2): 618-644. [NASA ADS] [CrossRef] [Google Scholar]
  37. Chen X, Liu J H, Xu F, et al. A novel method of aircraft detection under complex background based on circular intensity filter and rotation invariant feature[J]. Sensors, 2022, 22(1): 319. [Google Scholar]
  38. Long Y, Gong Y P, Xiao Z F, et al. Accurate object localization in remote sensing images based on convolutional neural networks[J]. IEEE Transactions on Geoscience and Remote Sensing, 2017, 55(5): 2486-2498. [CrossRef] [Google Scholar]
  39. Yang C C, Ma J Y, Zhang M F, et al. Multiscale facet model for infrared small target detection[J]. Infrared Physics and Technology, 2014, 67: 202-209. [NASA ADS] [CrossRef] [Google Scholar]
  40. Xu S, Fang T, Li D R, et al. Object classification of aerial images with bag-of-visual words[J]. IEEE Geoscience and Remote Sensing Letters, 2010, 7(2): 366-370. [NASA ADS] [CrossRef] [Google Scholar]
  41. Xu Z Z, Xu X, Wang L, et al. Deformable ConvNet with aspect ratio constrained NMS for object detection in remote sensing imagery[J]. Remote Sensing, 2017, 9(12): 1312. [NASA ADS] [CrossRef] [Google Scholar]
  42. Guo W, Yang W, Zhang H J, et al. Geospatial object detection in high resolution satellite images based on multi-scale convolutional neural network[J]. Remote Sensing, 2018, 10(1): 131. [NASA ADS] [CrossRef] [Google Scholar]
