Open Access
Wuhan Univ. J. Nat. Sci.
Volume 31, Number 1, February 2026
Page(s) 1 - 9
DOI https://doi.org/10.1051/wujns/2026311001
Published online 06 March 2026

© Wuhan University 2026

Licence: Creative Commons. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

0 Introduction

Precise and efficient medical image segmentation is paramount in medical image analysis [1-2]. However, conventional manual segmentation is labor-intensive, time-consuming, and prone to inter-expert variability [3-4]. This necessitates automated approaches, particularly deep learning algorithms, to accurately delineate organs or pathological regions. Such methods are crucial for facilitating accurate, rapid, and consistent diagnoses for clinicians and researchers [5]. In recent years, the proliferation of advanced deep learning architectures, including Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and Mamba, has led to substantial breakthroughs in medical image segmentation. Nonetheless, despite their notable benefits, each architecture faces inherent performance constraints dictated by its intrinsic design attributes [6]. CNNs, relying on local convolutional kernels, inherently struggle to capture long-range dependencies, which can lead to suboptimal feature extraction and segmentation outcomes. Conversely, ViTs, while adept at global modeling, are hampered by quadratic computational complexity, limiting their efficiency in dense prediction tasks. Furthermore, Mamba-based models struggle to achieve an optimal effective receptive field [7] when converting 2D image data into 1D sequential formats.

The Receptance Weighted Key Value (RWKV) model [8] introduces the WKV attention mechanism alongside token shift layers, achieving linear computational complexity in global attention while effectively capturing local dependencies. Despite innovative efforts to extend RWKV to visual domains, challenges persist in directly adapting RWKV for dense image prediction tasks, including medical image segmentation. This primarily stems from the inherent incompatibility between the causal sequential modeling of the RWKV architecture and the 2D spatial structure of images. Tailored for 1D sequence modeling, RWKV is not inherently suited to modeling 2D image tokens directly. Prior research has addressed this issue by flattening 2D tokens into 1D sequences via a fixed scan order [9], such as row- or column-major ordering, where each row's end is immediately followed by the next row's start. However, this approach ignores the preservation of spatial continuity [10], thereby compromising the integrity of the intrinsic structural information of the image.

This paper introduces an optimized medical image segmentation framework, termed Multi-head Scan RWKV (MS-RWKV), which adapts the RWKV architecture [8] to 2D visual tasks. The proposed model preserves the fundamental architecture and inherent benefits of RWKV while incorporating essential modifications tailored to the segmentation of 2D medical images. 1) Our multi-head scanning module reduces the unidirectional causal bias among image patches, enabling more balanced global receptive field computation. We strategically insert padding tokens between scan-sequence elements that are spatially disconnected, preserving 2D structural continuity. 2) Building on the GhostNetV2 [11] architecture, we design a feature aggregation block that captures local spatial context while strengthening correlations within the 1D sequences generated along specific scan directions, using asymmetric convolutions. 3) We propose a panoramic token shift (P-Shift) mechanism that broadens token semantics by aggregating multi-scale features from wide receptive fields, effectively addressing orientation sensitivity in 2D images. During training, P-Shift employs structural reparameterization to learn adaptive token shifting across diverse contexts.

We rigorously evaluated the proposed MS-RWKV model through comprehensive experiments on skin lesion segmentation and multi-organ segmentation tasks, showcasing its superior performance and high efficiency in medical image segmentation. Additionally, we conducted ablation studies to validate the effectiveness of our design. The extensive experimental results underscore the robust potential of our model for image segmentation applications.

The main contributions of this study are as follows:

We introduced the MS-RWKV framework, adapting the RWKV model for application in medical image segmentation tasks. This adaptation has demonstrated promise as an enhanced solution for more precise and effective image segmentation.

We integrated a novel multi-head scan mechanism, augmented with padding strategies, into the RWKV architecture. This innovation effectively bridges the divide between 1D sequence processing and 2D image traversal.

During the conversion of 2D images to 1D sequences, we integrate the Feature Aggregation Attention (FAA) module. The asymmetric convolution within this module extracts features that are particularly advantageous for subsequent processing of 1D sequences.

1 Methodology

The RWKV model [12], originating in natural language processing, combines the parallel training efficiency of transformers [13] with the sequential inference capabilities of Recurrent Neural Networks (RNNs) [14]. The architecture comprises a series of stacked blocks, each integrating a time-mixing block and a channel-mixing block, both of which feature recurrent structures.

Time-Mixing Block. This block is engineered to augment the modeling capacity for dependencies and patterns within sequential data. Given an input sequence $x = (x_1, x_2, \ldots, x_T)$, where $T$ represents the length of the input features after convolutional subsampling, the output sequence $o = (o_1, o_2, \ldots, o_T)$ of the time-mixing block is computed as follows:

$$r_t = (\mu_r x_t + (1-\mu_r) x_{t-1}) W_r, \tag{1}$$

$$k_t = (\mu_k x_t + (1-\mu_k) x_{t-1}) W_k, \tag{2}$$

$$v_t = (\mu_v x_t + (1-\mu_v) x_{t-1}) W_v, \tag{3}$$

$$o_t = (\sigma(r_t) \odot wkv_t) W_o, \tag{4}$$

where $W_o \in \mathbb{R}^{d_{io} \times d_{att}}$ is the output projection matrix, $d_{io}$ is the input/output size, and $d_{att}$ is the RWKV time-mixing block size. $W_r \in \mathbb{R}^{d_{att} \times d_{io}}$, $W_k \in \mathbb{R}^{d_{att} \times d_{io}}$, and $W_v \in \mathbb{R}^{d_{att} \times d_{io}}$ are the projection matrices for the receptance, key, and value, respectively. $\mu_r$, $\mu_k$, and $\mu_v$ are time-mixing factors for the receptance, key, and value, respectively. The values of $r_t$, $k_t$, and $v_t$ are calculated through linear interpolation between the current input and the input from the previous time step. This block applies a non-linear activation function $\sigma$ to the receptance vector $r_t$, and then combines the resulting values with the hidden state $wkv_t$ through element-wise multiplication.

$$wkv_t = \frac{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} v_i + e^{u + k_t} v_t}{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} + e^{u + k_t}}, \tag{5}$$

where $w$ is the channel-wise time decay vector for the previous input, $u$ is a special weighting factor applied to the current input, and $wkv_t$ is the weighted summation of the input over the interval $[1, t]$. The hidden state in Eq. (5) can be computed recursively as follows:

$$wkv_t = \frac{a_{t-1} + e^{u + k_t} v_t}{b_{t-1} + e^{u + k_t}}, \tag{6}$$

where $a_t = e^{-w} a_{t-1} + e^{k_t} v_t$, $b_t = e^{-w} b_{t-1} + e^{k_t}$, and $a_0$, $b_0$ are zero-initialized.
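As a sanity check on the derivation, the direct summation in Eq. (5) and the recurrence in Eq. (6) can be compared numerically. The sketch below is an illustrative single-channel (scalar) implementation with 0-based indexing, not the authors' code:

```python
import math

def wkv_direct(k, v, w, u):
    """Eq. (5): direct weighted summation over the interval [1, t] (0-based here)."""
    out = []
    for t in range(len(k)):
        num = sum(math.exp(-(t - 1 - i) * w + k[i]) * v[i] for i in range(t))
        den = sum(math.exp(-(t - 1 - i) * w + k[i]) for i in range(t))
        num += math.exp(u + k[t]) * v[t]  # special weighting u for the current token
        den += math.exp(u + k[t])
        out.append(num / den)
    return out

def wkv_recurrent(k, v, w, u):
    """Eq. (6): equivalent O(T) recurrence with zero-initialized states a, b."""
    a, b, out = 0.0, 0.0, []
    for kt, vt in zip(k, v):
        e_u = math.exp(u + kt)
        out.append((a + e_u * vt) / (b + e_u))
        a = math.exp(-w) * a + math.exp(kt) * vt  # a_t = e^{-w} a_{t-1} + e^{k_t} v_t
        b = math.exp(-w) * b + math.exp(kt)       # b_t = e^{-w} b_{t-1} + e^{k_t}
    return out
```

Both forms agree to floating-point precision, and the first output reduces to $v_1$, since at the first step only the $e^{u+k_t} v_t$ term survives.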

Channel-Mixing Block. This block is engineered to enhance the feature representations propagated from the time-mixing block through a series of non-linear transformations. Given the input sequence $x' = (x'_1, x'_2, \ldots, x'_T)$, the block computes:

$$r'_t = (\mu'_r x'_t + (1-\mu'_r) x'_{t-1}) W'_r, \tag{7}$$

$$k'_t = (\mu'_k x'_t + (1-\mu'_k) x'_{t-1}) W'_k, \tag{8}$$

$$o'_t = \sigma(r'_t) \odot (\max(k'_t, 0)^2 W'_v), \tag{9}$$

where $W'_r \in \mathbb{R}^{d_{linear} \times d_{io}}$ and $W'_k \in \mathbb{R}^{d_{linear} \times d_{io}}$ are the projection matrices for the receptance and key, respectively, $W'_v \in \mathbb{R}^{d_{linear} \times d_{io}}$ is the channel-mixing matrix, and $d_{linear}$ is the RWKV channel-mixing block size. $\mu'_r$ and $\mu'_k$ are time-mixing factors for the receptance and key, respectively. The channel-mixing block operates causally, as the computation of $o'_t$ depends solely on $x'_t$ and $x'_{t-1}$. Intuitively, this amplification process enhances the representations of historical information.
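The causality noted above can be illustrated with a minimal scalar sketch of Eqs. (7)-(9) (the parameter values in the usage below are arbitrary, and $x'_0$ is assumed zero): changing a later input leaves earlier outputs untouched.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_mix(xs, mu_r, mu_k, w_r, w_k, w_v):
    """Scalar, single-channel sketch of the channel-mixing block:
    token-shift interpolation (Eqs. 7-8), squared-ReLU key activation,
    and sigmoid-gated output (Eq. 9)."""
    out, prev = [], 0.0  # prev holds x'_{t-1}; x'_0 taken as zero
    for x in xs:
        r = (mu_r * x + (1 - mu_r) * prev) * w_r
        k = (mu_k * x + (1 - mu_k) * prev) * w_k
        out.append(sigmoid(r) * (max(k, 0.0) ** 2 * w_v))
        prev = x
    return out
```

Because each $o'_t$ reads only $x'_t$ and $x'_{t-1}$, modifying the last input changes only the last output.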

2 Architecture of MS-RWKV

The RWKV [15, 12] architecture, originally conceived for processing 1D sequences, encounters some limitations when tasked with learning from 2D data structures. To address these challenges, we introduce novel modules that enhance RWKV's capability to effectively process 2D image data.

Overall architecture. The architecture of MS-RWKV is depicted in Fig. 1(a). MS-RWKV incorporates a four-stage hierarchical backbone with skip connections. Given an input image $I$, we first partition the feature map $X \in \mathbb{R}^{H \times W \times 3}$ into 2D patches via a non-overlapping patch embedding layer, projecting the channel dimension into $c$ dimensions. As illustrated in Fig. 1(b), the MS-RWKV module in each stage extracts feature representations at a different level from the input image. Within these modules, four distinct multi-head scanning trajectories linearize the patch tokens into sequences $X = [S_1, S_2, S_3, S_4]$, where $S_n$ is the sequence produced by the $n$-th scan path. The MS-RWKV module achieves competitive performance by computing global and local attention with linear complexity over the input sequences. Following the decoding phase, the final projection layer restores the image to its original resolution, enabling pixel-accurate segmentation. The following sections detail our architectural design principles and their implementation.

Fig. 1 (a) The overall architecture of MS-RWKV; (b) the core of the MS-RWKV block

Feature Aggregation Module. The RWKV [12] model, initially tailored for 1D input sequences, encounters challenges in preserving local dependency relationships when applied to 2D image data, which impedes its ability to capture local fine-grained details [16]. Building upon GhostNetV2's [11] approach to capturing local details, we develop a novel feature aggregation module, Feature Aggregation Attention (FAA), to enhance feature aggregation in dense prediction tasks. This module is engineered to enhance local feature extraction by expanding effective receptive fields while preserving parameter sparsity through grouped convolutions. FAA consists of two Multi-scale Convolutional Modules (MCM) and an Activation Module (AM). It can be described as follows:

$$\hat{x}_i = \mathrm{MCM}(\mathrm{DWC}(\mathrm{MCM}(x_i) \odot \mathrm{AM}(x_i))) + x_i, \tag{10}$$

where $x_i$ and $\hat{x}_i$ represent the input and output tensors of the module in the $i$-th stage, and DWC is a depth-wise convolution with a kernel size of 3 × 3.

As illustrated in Fig. 2, the input $x$ is partitioned into four subcomponents $x = (X_1, X_2, X_3, X_4)$, which are processed through parallel convolutional pathways with asymmetric kernel dimensions. The resulting features are concatenated to form the final output, achieving multi-granularity feature fusion that captures both local details and global context. Crucially, the asymmetric convolutions (kernel sizes $1 \times K_H$ and $K_W \times 1$) explicitly model directional spatial correlations, thereby significantly facilitating sequential scanning in the subsequent RWKV module. The AM branch consists of two linear layers and an activation function. FAA leverages the features expanded by the first MCM module, which are then modulated by the AM branch, augmenting the model's expressive power. The enhanced features are fed into the second MCM module to restore the original feature dimensions for output, effectively aggregating surrounding information. This design mitigates the inherent limitations of the flattening approach.

Fig. 2 (a) The diagrams of blocks in the feature aggregation module; (b) the multi-scale convolutional module

Scan patterns. The multi-head scan mechanism, which involves the parallel extraction and integration of features along divergent scanning trajectories [17-18], enables the capture of a global receptive field and the modeling of long-range dependencies. This module also draws inspiration from the multi-head scan research in UltraLight VM-UNet [19] and MHS-VM [20], and we conducted experiments to explore the implications and applications of these methods. The process can be formulated as follows:

$$S_1, S_2, S_3, S_4 = \mathrm{Sp}[\mathrm{LN}(X_{\mathrm{in}})], \tag{11}$$

$$RW\text{-}S_i = \mathrm{RWKV}_p(\mathrm{Proj}(S_i)), \quad i = 1, 2, 3, 4, \tag{12}$$

$$X_o = \mathrm{Cat}(RW\text{-}S_1, RW\text{-}S_2, RW\text{-}S_3, RW\text{-}S_4), \tag{13}$$

$$\mathrm{Out} = \mathrm{Proj}[\mathrm{LN}(X_o)], \tag{14}$$

where LN is LayerNorm, Sp is the split operation, $\mathrm{RWKV}_p$ is the RWKV operation with padding, Cat is the concatenation operation, and Proj is the projection operation. The information gathered from the four branches is merged and passed through the subsequent model module. Within each layer, consecutive modules integrate various scanning approaches, thereby enhancing the model's generalization capabilities [21]. Ultimately, we opted for a parallel 4-head scan approach that uniformly decomposes the deep feature $X_{\mathrm{in}}$ into four sequences $[S_1, S_2, S_3, S_4]$. In Eqs. (12) and (13), each branch $\mathrm{RWKV}_p$ with padding $p$ independently extracts pertinent information, which is then aggregated through a concatenation strategy. To address the spatial discontinuity introduced by flattening the image into 1D sequences, we insert padding tokens at the end of each row and column to signal the end of a line. We execute both row-major and column-major scans in both forward and reverse directions, as delineated in Fig. 3. To preserve the resolution of the original image, we remove the padding tokens prior to concatenation, restoring the sequence to its input size. Finally, the four features are concatenated to form the composite feature $X_o$, which is processed by the LayerNorm and projection operations to yield the output.
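The four scan heads with end-of-row/column padding can be sketched as follows. The token grid indexed $0 \ldots hw-1$ and the sentinel value for the padding token are both illustrative assumptions, not the paper's implementation:

```python
def multi_head_scans(h, w, pad=-1):
    """Flatten an h x w token grid into four 1D scan sequences (row-major and
    column-major, each forward and reverse), inserting a padding token after
    every row / column so spatially disconnected tokens never become adjacent."""
    grid = [[r * w + c for c in range(w)] for r in range(h)]
    row_fwd = [tok for row in grid for tok in row + [pad]]
    col_fwd = [tok for c in range(w)
               for tok in [grid[r][c] for r in range(h)] + [pad]]
    # Reverse scans simply traverse the padded sequences backwards.
    return [row_fwd, row_fwd[::-1], col_fwd, col_fwd[::-1]]

def strip_padding(seq, pad=-1):
    """Remove padding tokens before concatenation, restoring h*w tokens."""
    return [tok for tok in seq if tok != pad]
```

For a 2 × 3 grid, the row-major forward scan yields `[0, 1, 2, pad, 3, 4, 5, pad]`: tokens 2 and 3, which are not spatial neighbours, are separated by the sentinel.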

Fig. 3 Illustrations of the scan head block

Note:   After padding the image, one of the four scanning methods is selected to flatten the image into 1D sequences.

Token shift. The token shift mechanism in RWKV [8] was initially introduced to mitigate the misalignment between the 1D decay of attention and the 2D adjacency in images. However, current token shift methods, including the unidirectional token shift (Uni-Shift) in RWKV [8] and the quadridirectional token shift (Quad-Shift) in Vision-RWKV [22], collect information from limited directions and fail to account for the comprehensive spatial continuity inherent in 2D imagery. To address this, we employ a panoramic token shift (P-Shift) mechanism, integrating it into the time-mixing and channel-mixing blocks of the RWKV module. In our approach, we harness a suite of depthwise convolutional layers with kernels of diverse sizes to facilitate the extraction and fusion of features. This strategy effectively aggregates information from multiple spatial orientations, enhancing the model's capacity to capture richer spatial features [23]. The mechanism of P-Shift can be formulated as follows:

$$\hat{z}_i = \sum_{ks \in KS} \mathrm{DWConv}_{ks}(z_i) + z_i, \tag{15}$$

where $z_i$ and $\hat{z}_i$ represent the input and output tensors of the module in the $i$-th stage, $\mathrm{DWConv}_{ks}$ denotes a depthwise convolution with a kernel size of $ks$, and $KS$ defines a set of parallel convolution kernels with sizes {1 × 1, 3 × 3, 5 × 5}. This mechanism utilizes a multi-branch architecture during training to capture local contextual information with an expanded visual receptive field. For testing, the multi-branch structure [24] is consolidated into a single branch with a 5 × 5 convolution kernel, thereby enhancing inference efficiency and reducing the parameter count.
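The inference-time consolidation can be sketched as follows. Because the parallel depthwise branches and the residual path in Eq. (15) are all linear in the input, their kernels can be zero-padded to 5 × 5 and summed into one equivalent kernel per channel (assuming matching padding so the branch outputs align); the kernel values in the usage below are arbitrary examples:

```python
def embed(kernel, size=5):
    """Zero-pad a small square kernel into the centre of a size x size kernel."""
    n, off = len(kernel), (size - len(kernel)) // 2
    out = [[0.0] * size for _ in range(size)]
    for i in range(n):
        for j in range(n):
            out[off + i][off + j] = kernel[i][j]
    return out

def fuse_pshift(k1, k3, k5):
    """Structural reparameterization for one channel of P-Shift: merge the
    parallel 1x1 / 3x3 / 5x5 depthwise branches and the residual path into a
    single equivalent 5x5 kernel by centre-padding and summing."""
    identity = [[0.0] * 5 for _ in range(5)]
    identity[2][2] = 1.0  # the residual "+ z_i" term acts as a centred identity kernel
    branches = [embed(k1), embed(k3), k5, identity]
    return [[sum(b[i][j] for b in branches) for j in range(5)]
            for i in range(5)]
```

After fusion, inference runs a single 5 × 5 depthwise convolution per channel, which is the source of the reduced parameter count and improved inference efficiency.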

3 Experiments and Results

3.1 Datasets and Parameter

We conducted an extensive performance evaluation of our framework across three distinct open-source medical image segmentation datasets: ISIC17 [25], ISIC18 [26], and ACDC [27]. These benchmark datasets are widely recognized for their pivotal role in advancing medical image segmentation research. In our experimental setup, we implemented the MS-RWKV model using PyTorch 2.0 and performed all training on an NVIDIA GeForce RTX 3090Ti GPU. To enhance the robustness of our model, we incorporated data augmentation techniques, including random flipping and rotation. Training was conducted with a batch size of 32 over 300 epochs. We opted for the AdamW optimizer, complemented by a cosine annealing schedule for learning rate decay. Given the heterogeneity in difficulty across datasets, we fine-tuned the hyperparameters accordingly. For the ACDC dataset, we initialized the learning rate at $5 \times 10^{-4}$ and set the weight decay to $1 \times 10^{-4}$. For the remaining datasets, we used a learning rate of $1 \times 10^{-3}$ and a weight decay of $1 \times 10^{-5}$. To ensure a rigorous and comprehensive evaluation, we adopted a standardized set of quantitative metrics: Mean Intersection over Union (mIoU), Dice Similarity Coefficient (DSC), and Accuracy (Acc).

3.2 Main Results

To verify the effectiveness of our proposed method, we performed a comparative analysis of the MS-RWKV model against several state-of-the-art models.

Table 1 presents our method's performance compared with various approaches on ISIC17 and ISIC18 datasets, where our proposed MS-RWKV achieved the best average mIoU of 83.27% and 81.52%. Specifically, compared with Mamba-based methods (such as H-vmunet[28]), our method improved the mIoU by 1.22% and 0.92%, respectively, and by 0.56% and 0.47% compared with RWKV-based methods (such as RWKV-UNet[29]). As shown in Table 2, compared with other methods, our proposed method achieved the best average DSC of 91.85% on the ACDC dataset.

The quantitative results presented in the tables demonstrate that our method achieves superior performance compared to state-of-the-art approaches for 2D medical image segmentation tasks.

Table 1

Comparative experimental results on the ISIC17 and ISIC18 dataset (unit:%)

Table 2

Performance comparison with state-of-the-art methods on the ACDC dataset for right ventricle (RV), Myocardium (Myo), and left ventricle (LV) segmentation (unit:%)

3.3 Ablation Studies

We conducted comprehensive ablation studies on the ISIC18 dataset to validate the effectiveness of the multi-head scan, feature aggregation attention, and P-shift components. The results are shown below.

Multi-Head Scan with Padding. In our ablation studies, we adopt four parallel scan heads for hierarchical feature extraction from 2D image data. To systematically evaluate the impact of architectural components, we conducted comparative experiments with two distinct configurations: 1) four scan heads without padding mechanisms, and 2) three scan heads with strategic padding implementation. The quantitative results, as detailed in Table 3, demonstrate that increasing the number of scan heads leads to enhanced model capacity and improved feature representation.

Token Shift. To explore the effectiveness of the proposed P-Shift, we compared its performance with Uni-Shift in RWKV[8] and Quad-Shift in Vision-RWKV[22]. The ablation studies presented in Table 4 demonstrate that our proposed P-Shift mechanism coupled with reparameterization significantly enhances the local feature extraction capability of the token shift operation. Our novel token shift architecture effectively exploits the inherent spatial correlations within 2D visual feature maps, facilitating multi-directional feature propagation and adaptive feature aggregation across diverse spatial orientations.

Feature Aggregation Attention. To systematically investigate the impact of various architectural components preceding the MS-RWKV Block, we conducted a series of ablation experiments. Our assessment focused on comparing the performance of two distinct baseline modules (FFN and GhostNetV2) against our proposed FAA module. As presented in Table 5, the proposed FAA module achieves superior performance, surpassing FFN by 1.32% and GhostNetV2 by 0.54% in mIoU.

We conducted comprehensive ablation studies on ISIC18 to evaluate individual components. Table 6 summarizes ablation results, where the last row shows the full model's performance and the preceding rows quantify the impact of module removal. A comparative analysis between the full model and the configurations lacking the FAA reveals that FAA contributes most significantly to the overall performance.

Table 3

Ablation study on number of scan heads and padding

Table 4

Ablation study on P-Shift (unit:%)

Table 5

Ablation study on FAA (unit:%)

Table 6

Ablation study on P-shift and FAA (unit:%)

3.4 Visualization

Qualitative analysis of the ISIC 2018 results (Fig. 4) demonstrates that our MS-RWKV architecture integrates high-level semantic information with finer-grained features. The proposed framework achieves significant gains over the VM-UNet baseline, particularly in capturing subtle texture variations, as confirmed by the quantitative metrics. These visualizations further validate that MS-RWKV, as an RWKV-based model, holds significant potential in the field of medical image segmentation.

Fig. 4 Visual comparison of the segmentation results of our model and other segmentation methods against the ground truth on the ISIC2018 dataset

4 Conclusion

In this paper, we propose MS-RWKV, which extends the foundational architecture of the RWKV model. MS-RWKV enhances RWKV through three innovations: multi-head scanning with adaptive padding for 2D coherence, P-Shift for spatial dependencies, and feature aggregation attention for multi-scale fusion. Comprehensive empirical evaluations across diverse medical image segmentation benchmarks demonstrate that our MS-RWKV architecture achieves superior performance compared to state-of-the-art approaches. We further plan to extend MS-RWKV to tasks beyond segmentation, such as cross-modal registration and high-fidelity reconstruction.

References

  1. Bai W J, Suzuki H, Huang J, et al. A population-based phenome-wide association study of cardiac and aortic structure and function[J]. Nature Medicine, 2020, 26(10): 1654-1662. [Google Scholar]
  2. Fatma K, Benaissa I, Zitouni A, et al. Assessing the performance of U-Net in 3D medical image segmentation[C]//2024 8th International Conference on Image and Signal Processing and Their Applications (ISPA). New York: IEEE, 2024: 1-6. [Google Scholar]
  3. Jungo A, Meier R, Ermis E, et al. On the effect of inter-observer variability for a reliable estimation of uncertainty of medical image segmentation[C]//Medical Image Computing and Computer Assisted Intervention – MICCAI 2018. Cham: Springer-Verlag, 2018: 682-690. [Google Scholar]
  4. Joskowicz L, Cohen D, Caplan N, et al. Inter-observer variability of manual contour delineation of structures in CT[J]. European Radiology, 2019, 29(3): 1391-1399. [Google Scholar]
  5. Tang H, Chen X M, Liu Y, et al. Clinically applicable deep learning framework for organs at risk delineation in CT images[J]. Nature Machine Intelligence, 2019, 1(10): 480-491. [Google Scholar]
  6. Chen J N, Mei J R, Li X H, et al. TransUNet: Rethinking the U-Net architecture design for medical image segmentation through the lens of transformers[J]. Medical Image Analysis, 2024, 97: 103280. [Google Scholar]
  7. Yang Z W, Li J Y, Zhang H, et al. Restore-RWKV: Efficient and effective medical image restoration with RWKV[J]. IEEE Journal of Biomedical and Health Informatics, 2025, 28(3): 1484-1493. [Google Scholar]
  8. Peng B, Alcaide E, Anthony Q, et al. RWKV: Reinventing RNNs for the transformer era[C]//Findings of the Association for Computational Linguistics: EMNLP 2023. Stroudsburg: ACL, 2023: 14048-14077. [Google Scholar]
  9. Tsai T Y, Lin L, Hu S, et al. UU-mamba: Uncertainty-aware U-mamba for cardiac image segmentation[C]//2024 IEEE 7th International Conference on Multimedia Information Processing and Retrieval (MIPR). New York: IEEE, 2024: 267-273. [Google Scholar]
  10. Duan X J, Shi M C, Wang J M, et al. Segmentation of the aortic dissection from CT images based on spatial continuity prior model[C]//2016 8th International Conference on Information Technology in Medicine and Education (ITME). New York: IEEE, 2016: 275-280. [Google Scholar]
  11. Tang Y H, Han K, Guo J Y, et al. GhostNetV2: Enhance cheap operation with long-range attention[C]//Advances in Neural Information Processing Systems 35 (NeurIPS 2022). 2022: 1-12. [Google Scholar]
  12. Li Z Y, Xia T Y, Chang Y, et al. A survey of RWKV[EB/OL]. [2024-01-12]. arXiv:2412.14847. [Google Scholar]
  13. Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[C]//International Conference on Learning Representations (ICLR), 2021: 1-22. [Google Scholar]
  14. Graves A. Long short-term memory[M]//Supervised Sequence Labelling with Recurrent Neural Networks. Berlin: Springer-Verlag, 2012: 37-45. [Google Scholar]
  15. Zhou L, Xiao Z L, Ning Z P. RWKV-based encoder-decoder model for code completion[C]//2023 3rd International Conference on Electronic Information Engineering and Computer (EIECT). New York: IEEE, 2023: 425-428. [Google Scholar]
  16. Huang T, Pei X H, You S, et al. LocalMamba: Visual state space model with windowed selective scan[C]//Computer Vision–ECCV 2024 Workshops. LNCS15633. Cham: Springer-Verlag, 2025: 13-32. [Google Scholar]
  17. Cai Z F, Fan Y L, Zhu M W, et al. Ultra-lightweight network for medical image segmentation inspired by bio-visual interaction[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2025, 35(4): 3486-3497. [Google Scholar]
  18. Peng B, Chen K, Xu Y, et al. RSMamba: Remote sensing image classification with state space model[J]. IEEE Geoscience and Remote Sensing Letters, 2024, 21: 1-5. [Google Scholar]
  19. Wu R K, Liu Y H, Liang P C, et al. UltraLight VM-UNet: Parallel vision mamba significantly reduces parameters for skin lesion segmentation[J]. Patterns, 2025, 6(7): 101298. [Google Scholar]
  20. Ji Z P. MHS-VM: Multi-head scanning in parallel subspaces for vision Mamba[EB/OL]. [2024-01-12]. arXiv:2406.05992. [Google Scholar]
  21. Chen S Q, Zhong X, Dorn S, et al. Improving generalization capability of multiorgan segmentation models using dual-energy CT[J]. IEEE Transactions on Radiation and Plasma Medical Sciences, 2022, 6(1): 79-86. [Google Scholar]
  22. Duan Y C, Wang W Y, Chen Z, et al. Vision-RWKV: Efficient and scalable visual perception with RWKV-like architectures[C]//Proceedings of the International Conference on Learning Representations (ICLR), 2025: 1-23. [Google Scholar]
  23. Kaleybar J M, Saadat H, Khaloo H. Capturing local and global features in medical images by using ensemble CNN-Transformer[C]//Proceedings of the 13th International Conference on Computer and Knowledge Engineering (ICCKE). New York: IEEE, 2023: 1-6. [Google Scholar]
  24. Liu Y S, Zhao Y J, Wang M H, et al. MBD-net: Multi-branch dilated convolutional network with cyst discriminator for renal multi-structure segmentation[C]//45th Annual International Conference of the IEEE Engineering in Medicine & Biology Society. New York: IEEE, 2023: 1-4. [Google Scholar]
  25. Codella N C F, Gutman D, Celebi M E, et al. Skin lesion analysis toward melanoma detection: A challenge at the 2017 International symposium on biomedical imaging (ISBI)[C]//IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018). New York: IEEE, 2018: 168-172. [Google Scholar]
  26. Codella N, Rotemberg V, Tschandl P, et al. Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (ISIC)[C]//Proceedings of the Medical Image Computing and Computer Assisted Intervention. 2019: 168-172. [Google Scholar]
  27. Bernard O, Lalande A, Zotti C, et al. Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: Is the problem solved?[J]. IEEE Transactions on Medical Imaging, 2018, 37(11): 2514-2525. [Google Scholar]
  28. Wu R K, Liu Y H, Liang P C, et al. H-VMUnet: High-order vision mamba unet for medical image segmentation[J]. Neurocomputing, 2025:129447. [Google Scholar]
  29. Jiang J T, Zhang J N, Liu W X, et al. RWKV-UNet: Improving UNet with Long-Range Cooperation for Effective Medical Image Segmentation[EB/OL]. [2024-01-12]. arXiv:2501.08458. [Google Scholar]
  30. Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation[C]//Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. LNCS9351. Cham: Springer-Verlag, 2015: 234-241. [Google Scholar]
  31. Ruan J C, Xiang S H, Xie M Q, et al. MALUNet: A multi-attention and light-weight UNet for skin lesion segmentation[C]//2022 IEEE International Conference on Bioinformatics and Biomedicine. New York: IEEE, 2022: 1150-1156. [Google Scholar]
  32. Zhang Y D, Liu H Y, Hu Q. TransFuse: Fusing transformers and CNNs for medical image segmentation[C]//Proceedings of the 24th International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer-Verlag, 2021: 1150-1156. [Google Scholar]
  33. Ruan J C, Li J C, Xiang S C. VM-UNet: Vision Mamba UNet for Medical Image Segmentation[EB/OL]. [2024-01-12]. arXiv:2402.02491. [Google Scholar]
  34. Oktay O, Schlemper J, Le Folgoc L, et al. Attention U-Net: Learning where to look for the pancreas[J]. Proceedings of the Medical Imaging with Deep Learning (MIDL), 2019, 53: 197-207. [Google Scholar]
  35. Cao H, Wang Y Y, Chen J, et al. Swin-Unet: Unet-like pure transformer for medical image segmentation[C]//Computer Vision – ECCV 2022 Workshops. Cham: Springer-Verlag, 2023: 205-218. [Google Scholar]
  36. Huang X H, Deng Z F, Li D D, et al. MISSFormer: An effective transformer for 2D medical image segmentation[J]. IEEE Transactions on Medical Imaging, 2023, 42(5): 1484-1494. [Google Scholar]
  37. Hatamizadeh A, Tang Y C, Nath V, et al. UNETR: Transformers for 3D medical image segmentation[C]//2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). New York: IEEE, 2022: 1748-1758. [Google Scholar]
