Open Access
Issue
Wuhan Univ. J. Nat. Sci.
Volume 28, Number 6, December 2023
Page(s) 474 - 482
DOI https://doi.org/10.1051/wujns/2023286474
Published online 15 January 2024

© Wuhan University 2023

Licence: Creative Commons. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

0 Introduction

Code comments play a crucial role in software development and maintenance, helping developers comprehend and reuse code without meticulous examination [1-6]. For example, in the process of transforming program specifications into code, code summarization can significantly enhance the readability and comprehensibility of the code [2-5]. However, manually composing code comments for extensive programming projects is labor-intensive and time-consuming. Furthermore, as code evolves to meet the developer's requirements, maintaining consistency between the code and comments demands substantial effort. Consequently, the automated generation of high-quality code comments, called "code summarization", holds significant importance in software engineering. Code summarization produces a natural language description for a given code snippet.

Recently, many studies have concentrated on utilizing neural-based methods to generate code summarization. For example, Iyer et al. [7] trained a Recurrent Neural Network (RNN) with an attention mechanism to generate code summarization. Liang et al. [8] and Hu et al. [9] adopted the traditional RNN-based sequence-to-sequence network [10] with an attention mechanism [11] on different abstractions of code. However, the RNN-based approach fails to capture long-term dependencies between code tokens. To address this problem, CODE-NN [7] uses the Long Short-Term Memory (LSTM) [12] network combined with global attention [11], and HYBRID-DRL [13] applies Reinforcement Learning (RL) [14] to incorporate the abstract syntax tree (AST) structure and sequential content of a snippet by using an actor-critic network. Furthermore, since information retrieval methods perform well in code summarization, Refs. [15,16] combined them with the neural-based method to augment the generated code summarization. Considering the success of the Transformer [17] in natural language generation tasks, Refs. [18,19] enhanced its code summarization performance by employing copy attention [20] and relative position encoding [21]. However, these approaches merely treat code as text and capture knowledge only from the sequence of code tokens. To address this limitation, SIT [22] introduced the structure-induced attention mechanism to capture information from syntax structure, data flow, and data dependency.

With the development of deep learning (DL), pre-trained high-capacity code language models (e.g., CodeBERT [23] with 125 million parameters and CodeT5 [24] with 220 million parameters) have emerged. Compared with previous approaches employing traditional models for code summarization, pre-trained language models (PLMs) offer a wealth of prior knowledge acquired during the extensive pre-training phase and embedded within their parameters. In this study, we employ CodeT5 for code summarization, as it has an encoder-decoder architecture and excels in generation tasks.

To apply a pre-trained language model to downstream tasks, previous research [1,6,7,9-13] adheres to the "pre-train, fine-tune" paradigm for model tuning. However, this form of fine-tuning is heterogeneous with respect to pre-training, because the downstream task is presented in a format that differs from the pre-training task, which often leads to suboptimal results in downstream tasks. Our paper embraces the "pre-train, prompt, and predict" paradigm defined in Ref. [25], introducing a prompt to fine-tune CodeT5 homogeneously. Prompt tuning proves to be an effective approach for facilitating knowledge transfer from pre-trained language models into the downstream task. Moreover, numerous studies [26-37] have demonstrated the applicability of pre-trained language models to low-resource downstream tasks using prompt learning methods. Specifically, prompt learning requires transforming the downstream task, through prompts, into a format consistent with the pre-training task before fine-tuning the model. This approach contrasts with simply fine-tuning the PLM on the raw code form, as illustrated in Fig. 1. The prompt incorporates contextual information, enabling CodeT5 to discern the nature of the downstream task. In this paper, we leverage the pre-trained large code language model, CodeT5, and employ the prompt-learning method to enhance its code summarization capability.

Fig. 1 Fine-tuning and prompt-learning (Note: X is the code slot and Z is the comment slot)

1 Approach

This section outlines the specifics of our code summarization approach, as shown in Fig. 2. In the offline learning phase, we select CodeT5, which is pre-trained on a substantial dataset of code-comment pairs. Subsequently, we devise a prompt template to transform each input code-comment pair into a code-comment prompt for fine-tuning CodeT5, resulting in a prompt-tuned CodeT5. During the online phase, we employ the prompt-tuned CodeT5 to generate comments based on the provided code prompt.

Fig. 2 Approach overview

1.1 CodeT5 Pre-Training

In this section, we introduce the framework and training mechanism of CodeT5.

1.1.1 CodeT5 framework

CodeT5 is an encoder-decoder framework sharing the same architecture as T5[37]. It acquires general representations for a programming language (PL) and natural language (NL) by employing identifier tagging and prediction tasks to capture token-type information from PL. Additionally, CodeT5 utilizes a bimodal dual learning objective to align NL and PL. The total count of pre-trained CodeT5 parameters stands at 220 million.

1.1.2 Training mechanism

The composition of the training data depends on whether an NL description accompanies the code snippet, resulting in two possible forms: PL-only unimodal data or NL-PL bimodal data. The NL-PL bimodal data is concatenated into a sequence and subsequently tokenized. During training, the two token segments are joined by the delimiter token [SEP], and the input is represented as X = ([CLS], w1, …, wn, [SEP], c1, …, cm, [SEP]), where n and m are the number of NL and PL tokens, respectively. When n=0, the input takes the PL-only unimodal form. Additionally, CodeT5 learns the identifier types to capture more code-specific features. Figure 3 shows the pre-training tasks of CodeT5. In the following, we delve into the specifics of CodeT5's training mechanism.
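The input layout described above can be sketched in a few lines of Python. This is only an illustration: the helper name `build_input` is hypothetical, and the tokens are plain strings, whereas CodeT5 itself operates on subword tokens.

```python
from typing import List

CLS, SEP = "[CLS]", "[SEP]"


def build_input(nl_tokens: List[str], pl_tokens: List[str]) -> List[str]:
    """Lay out X = ([CLS], w1..wn, [SEP], c1..cm, [SEP]).

    Hypothetical helper for illustration; an empty nl_tokens list gives
    the PL-only unimodal form when the layout is taken literally.
    """
    return [CLS, *nl_tokens, SEP, *pl_tokens, SEP]


code = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]
bimodal = build_input(["add", "two", "numbers"], code)
unimodal = build_input([], code)
print(bimodal)
print(unimodal)
```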

Fig. 3 Pre-training tasks of CodeT5

Identifier-Aware Denoising Pre-Training. Given X, spans of tokens with random lengths are masked, and the decoder is asked to predict the masked tokens. The mask rate is 15%, and each masked span contains 1 to 5 tokens. This task is referred to as Masked Span Prediction (MSP), as shown in Fig. 3 (a). The masked span prediction loss is as follows:

$$\mathcal{L}_{MSP}(\theta) = \sum_{t=1}^{k} -\log P_{\theta}\left(x_{t}^{mask} \mid x^{\backslash mask}, x_{<t}^{mask}\right) \tag{1}$$

where $\theta$ is the model parameter, $x^{\backslash mask}$ is the masked input, $t$ is the index of the token currently predicted by the decoder, $k$ is the total number of tokens that need to be predicted, and $x_{<t}^{mask}$ is the span sequence generated so far.
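A minimal sketch of the span masking step may make the task more concrete. It assumes a 15% masking budget, span lengths drawn uniformly from 1 to 5, and sentinel tokens written as `[MASK0]`, `[MASK1]`, and so on; the function name `mask_spans` and the sampling details are illustrative assumptions, not the CodeT5 implementation.

```python
import random
from typing import List, Tuple


def mask_spans(tokens: List[str], mask_rate: float = 0.15,
               max_span: int = 5, seed: int = 0) -> Tuple[List[str], List[str]]:
    """Replace random spans with sentinels; return (masked input, decoder target)."""
    if not tokens:
        return [], []
    rng = random.Random(seed)
    budget = max(1, round(len(tokens) * mask_rate))  # total tokens to mask
    masked = [False] * len(tokens)
    attempts = 0
    while budget > 0 and attempts < 1000:
        attempts += 1
        length = min(rng.randint(1, max_span), budget)
        start = rng.randrange(0, len(tokens) - length + 1)
        # Keep spans disjoint and non-adjacent so each span gets its own sentinel.
        lo, hi = max(0, start - 1), min(len(tokens), start + length + 1)
        if any(masked[lo:hi]):
            continue
        for i in range(start, start + length):
            masked[i] = True
        budget -= length

    source: List[str] = []
    target: List[str] = []
    sentinel = 0
    i = 0
    while i < len(tokens):
        if masked[i]:
            tag = f"[MASK{sentinel}]"
            source.append(tag)
            target.append(tag)
            while i < len(tokens) and masked[i]:
                target.append(tokens[i])  # the decoder must reproduce these
                i += 1
            sentinel += 1
        else:
            source.append(tokens[i])
            i += 1
    return source, target


tokens = "def add ( a , b ) : return a + b".split()
src, tgt = mask_spans(tokens, seed=1)
print(src)
print(tgt)
```

The decoder target lists each sentinel followed by the tokens it hides, so the original sequence can be recovered by substituting the spans back into the masked input.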

Furthermore, CodeT5 uses two additional tasks: Identifier Tagging (IT) and Masked Identifier Prediction (MIP), to learn more code-specific syntactic and semantic information.

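Identifier Tagging (IT). As shown in Fig. 3 (b), the encoder maps the PL segment of X into a sequence of probabilities $p = (p_1, \ldots, p_m)$, and a binary cross-entropy loss is computed for sequence labeling, where $y_i$ is the binary label indicating whether the i-th code token is an identifier and $\theta_e$ is the encoder parameter:

$$\mathcal{L}_{IT}(\theta_e) = \sum_{i=1}^{m} -\left[y_i \log p_i + (1 - y_i) \log (1 - p_i)\right] \tag{2}$$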
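The binary cross-entropy used for identifier tagging can be illustrated with a small pure-Python function. The name `identifier_tagging_loss`, the clamping constant, and the example probabilities are assumptions made for this sketch; the loss is summed over the code tokens.

```python
import math
from typing import List


def identifier_tagging_loss(probs: List[float], labels: List[int]) -> float:
    """Summed binary cross-entropy over code tokens.

    probs[i]  : predicted probability that token i is an identifier
    labels[i] : 1 if token i is an identifier, else 0
    """
    eps = 1e-12  # clamp to avoid log(0); an implementation detail of this sketch
    loss = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        loss += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return loss


# Tokens: def add ( a , b )  with identifiers add, a, b (illustrative values).
labels = [0, 1, 0, 1, 0, 1, 0]
confident = [0.05, 0.9, 0.05, 0.9, 0.05, 0.9, 0.05]
uncertain = [0.5] * 7
print(identifier_tagging_loss(confident, labels))
print(identifier_tagging_loss(uncertain, labels))
```

A tagger that is confident and correct yields a lower loss than one that outputs 0.5 for every token.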

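Masked Identifier Prediction (MIP). As shown in Fig. 3 (c), unlike the random span masking in MSP, MIP masks only the identifier tokens in the PL segment and then predicts them in an auto-regressive manner, where $I$ is the target sequence and $x^{\backslash I}$ is the masked input:

$$\mathcal{L}_{MIP}(\theta) = \sum_{j=1}^{|I|} -\log P_{\theta}\left(I_j \mid x^{\backslash I}, I_{<j}\right) \tag{3}$$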
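As a rough illustration of identifier masking, the sketch below replaces every identifier with a sentinel and builds the sequence the decoder would have to predict. Reusing one sentinel for all occurrences of the same identifier follows the description in the CodeT5 paper [24] and is an assumption here, as are the function name `mask_identifiers` and the `[MASKk]` sentinel spelling.

```python
from typing import Dict, List, Tuple


def mask_identifiers(tokens: List[str],
                     is_identifier: List[int]) -> Tuple[List[str], List[str]]:
    """Mask identifier tokens; return (masked input, target sequence)."""
    sentinel_of: Dict[str, str] = {}
    source: List[str] = []
    for tok, flag in zip(tokens, is_identifier):
        if flag:
            # Assumption: one sentinel per unique identifier name.
            if tok not in sentinel_of:
                sentinel_of[tok] = f"[MASK{len(sentinel_of)}]"
            source.append(sentinel_of[tok])
        else:
            source.append(tok)
    target: List[str] = []
    for name, tag in sentinel_of.items():
        target.extend([tag, name])
    return source, target


tokens = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]
flags = [0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1]
masked_input, target_seq = mask_identifiers(tokens, flags)
print(masked_input)
print(target_seq)
```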

Together, the MSP, IT, and MIP tasks constitute the identifier-aware denoising pre-training of CodeT5, and the three loss functions are optimized with equal probability.

Bimodal Dual Generation. The PL-NL bimodal data is employed for pre-training CodeT5, facilitating bidirectional conversion as depicted in Fig. 3 (d). This bidirectional conversion encompasses NL to PL generation and PL to NL generation, commonly called dual tasks. It is worth noting that the dual task can be regarded as a specialized span masking approach, where the NL or PL segment is masked from the input. CodeT5 leverages the dual task to enhance the alignment between the NL and PL components.

1.2 Prompt-Based Learning

To fully explore CodeT5's adaptability to the downstream task, code summarization in this case, we employ the prompt learning method to fine-tune CodeT5. Specifically, for a given tuning dataset $D$, each piece of data comprises a code snippet and its corresponding comment. Formally, it can be expressed as $D = \{(x_i, y_i)\}_{i=1}^{N}$, where $x_i$ represents the i-th code snippet out of the total N samples, and $y_i$ represents the code comment for $x_i$. Rather than directly utilizing $D$ to fine-tune CodeT5, we create a prompt template to transform the input into a prompt, which serves as a natural language instruction. The aim is to extract task-specific knowledge acquired during pre-training for the downstream task. The defined prompt template is as follows:

Within the prompt template, [X] represents the input code snippet slot, [Z] signifies the comment slot to be filled by CodeT5, and [LAN] denotes the programming language slot (e.g., Java, Python). The natural language present in the prompt instructs CodeT5 regarding the specific task it needs to complete, and the tags (e.g., <code></code>) delimit the specific content that needs to be generated.
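To illustrate how such a template with code, comment, and language slots could be filled, the sketch below uses a made-up template string. The wording of `TEMPLATE`, the `<comment>` tag, and the helper `fill_template` are hypothetical and are not the template defined by the authors; only the slot roles and the `<code></code>` tag come from the text.

```python
# Hypothetical template for illustration only; not the paper's exact wording.
TEMPLATE = "Generate a comment for the {lan} code: <code>{x}</code> <comment>{z}</comment>"


def fill_template(code: str, lan: str, comment: str = "[Z]") -> str:
    """Fill the code slot, the language slot, and optionally the comment slot.

    During tuning the ground-truth comment fills the comment slot; at
    inference time the slot is left as the placeholder for the model to fill.
    """
    return TEMPLATE.format(lan=lan, x=code, z=comment)


snippet = "def add(a, b): return a + b"
train_prompt = fill_template(snippet, "Python", "Add two numbers.")
infer_prompt = fill_template(snippet, "Python")
print(train_prompt)
print(infer_prompt)
```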

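For prompt learning, with the defined template $f_{prompt}$, we convert each input code snippet $x_i$ into the prompt $f_{prompt}(x_i)$. The learning objective is to minimize the prediction loss:

$$\theta' = \arg\min_{\theta} \sum_{i=1}^{N} \mathcal{L}\left(y_i \mid f_{prompt}(x_i); \theta\right) \tag{4}$$

where $\theta$ is the CodeT5 parameter, $\theta'$ is the tuned CodeT5 parameter, and $y_i$ is the ground-truth comment paired with $x_i$.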

1.3 Comment Generation

When a code snippet $x$ is provided, we employ the prompt-tuned CodeT5 to generate the associated comment, i.e., its code summarization. To accomplish this, we convert $x$ into the code prompt $f_{prompt}(x)$ using the prompt template, and then feed it into the prompt-tuned CodeT5 to generate the comment, thereby filling the slot Z within the prompt.

2 Evaluation

This section evaluates our code summarization approach from two perspectives: effectiveness and usefulness. To assess the effectiveness (RQ1), we scrutinize the impact of prompt-tuning data size on our approach and compare its performance to fine-tuning. To evaluate the usefulness (RQ2), we examine how well our approach generates code summarization for code snippets sourced from Stack Overflow, compared with the transformer-based approach [18].

All experiments are executed using Python 3.7 on an NVIDIA GeForce RTX 3090. The operating system is Ubuntu 20.04.3 LTS. The hyperparameters employed for prompt-tuning CodeT5 are detailed in Table 1. It is worth noting that the source length represents the maximum length of the code tokens, while the target length signifies the maximum length of the comment tokens.

Table 1

Hyperparameter settings

2.1 Evaluation Metrics

We use the BLEU-4 [38] and METEOR [39] scores to measure the performance of the code summarization approach.

2.1.1 BLEU-4

BLEU (Bilingual Evaluation Understudy) is a metric for assessing the similarity between the reference and generated text through N-gram analysis. This metric takes into account the coherence and fluency of the generated text. However, the BLEU score may register as exceptionally high when the generated text comprises only a portion of the reference text and exhibits a high level of overlap. To mitigate this issue, a brevity penalty (BP) is introduced to counteract the potential bias stemming from the length of the generated text, thereby enhancing the rigor of the measurement. The calculation formula for BP is as follows:

$$BP = \begin{cases} 1, & c > r \\ e^{1 - r/c}, & c \le r \end{cases} \tag{5}$$

where $r$ is the reference text length and $c$ is the generated (translated) text length. When the generated text is longer than the reference text, the penalty coefficient is 1, meaning no penalty.
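As a worked example with illustrative numbers (not taken from the experiments), take a reference of length r = 10 and a generated comment of length c = 8. The generated text is shorter than the reference, so the penalty applies:

```latex
BP = e^{\,1 - r/c} = e^{\,1 - 10/8} = e^{-0.25} \approx 0.7788
```

If the generated comment instead had 12 tokens (c > r), the penalty coefficient would be 1.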

As the accuracy of each N-gram statistic diminishes exponentially with the order's increase, a balanced approach is needed to account for each order statistic's effect. To achieve this balance, the geometric average form is employed for averaging, followed by weighting and multiplication by the length penalty factor. The ultimate BLEU calculation formula is as follows:

$$BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) \tag{6}$$

where $p_n$ is the n-gram precision, $w_n$ is the weight of the n-th order (uniform weights $w_n = 1/N$ give the geometric average), and $N$ is the maximum n-gram order, i.e., the largest number of consecutive tokens matched. Here, we use BLEU-4 (i.e., N is 4) to evaluate the quality of the generated code comment.
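The BLEU-4 computation described above can be sketched for a single reference and a single generated comment. This is a simplified, unsmoothed sentence-level version with uniform weights, written for illustration; the function names are hypothetical, and the evaluation script used in the experiments may differ (for example, by applying smoothing).

```python
import math
from collections import Counter
from typing import List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(reference: List[str], candidate: List[str], max_n: int = 4) -> float:
    """Unsmoothed sentence-level BLEU with uniform weights 1/max_n."""
    if not candidate:
        return 0.0
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand = ngram_counts(candidate, n)
        ref = ngram_counts(reference, n)
        # Clipped matches: a candidate n-gram counts at most as often
        # as it appears in the reference.
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        total = sum(cand.values())
        if total == 0 or overlap == 0:
            return 0.0  # without smoothing, one zero precision zeroes the score
        log_sum += math.log(overlap / total) / max_n
    r, c = len(reference), len(candidate)
    bp = 1.0 if c > r else math.exp(1.0 - r / c)  # brevity penalty
    return bp * math.exp(log_sum)


ref = "wraps an observable source into an observable".split()
hyp = "wraps an observable source into observable".split()
print(round(bleu(ref, hyp), 4))
```

A generated comment that is an exact prefix of the reference has perfect n-gram precision, so its score is determined by the brevity penalty alone.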

2.1.2 METEOR

The METEOR score is a metric for computing the similarity between reference text and generated text. METEOR incorporates various factors, including word choice, stem matching, and phrasal alignment. Compared with BLEU-4, METEOR is more sensitive to language nuances and better captures the overall meaning and quality of the generated text. The METEOR calculation formula is as follows:

$$METEOR = \left(1 - \gamma \cdot frag^{\beta}\right) \cdot \frac{P \cdot R}{\alpha \cdot P + (1 - \alpha) \cdot R} \tag{7}$$

where $\alpha$, $\beta$, and $\gamma$ are the coefficients. $frag$ is calculated as $frag = ch / m$, where $ch$ is the number of chunks and $m$ is the number of mapped unigrams found between the two strings. $P$ and $R$ are the precision and recall, respectively: $P = m / c$ and $R = m / r$, where $c$ is the generated text length and $r$ is the reference text length.
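Given the unigram alignment statistics, the METEOR score follows from a few arithmetic steps, sketched below. The function name `meteor_from_counts` is hypothetical, and the default coefficients (alpha = 0.9, beta = 3, gamma = 0.5) are the values of the original METEOR formulation, assumed here because the coefficients used in the experiments are not stated. Computing the alignment itself (exact, stem, and synonym matching) is outside the scope of this sketch.

```python
def meteor_from_counts(m: int, ch: int, c: int, r: int,
                       alpha: float = 0.9, beta: float = 3.0,
                       gamma: float = 0.5) -> float:
    """METEOR from alignment statistics.

    m  : number of mapped unigrams between the two strings
    ch : number of chunks (contiguous aligned runs)
    c  : generated text length
    r  : reference text length
    """
    if m == 0:
        return 0.0
    precision = m / c
    recall = m / r
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    frag = ch / m                   # fragmentation of the alignment
    penalty = gamma * frag ** beta  # fewer, longer chunks give a smaller penalty
    return (1 - penalty) * f_mean


# Six of seven reference words matched in two chunks by a six-word comment.
print(round(meteor_from_counts(m=6, ch=2, c=6, r=7), 4))
```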

In our experiment, we calculate BLEU-4 and METEOR to measure the quality of the generated code comments, which correspond to code summarization. These metrics fall within a range of [0, 1], where a higher value indicates a more favorable outcome.

2.2 Effectiveness Evaluation (RQ1)

In this section, we introduce the details of effectiveness evaluation.

2.2.1 Motivation

We adopt a novel paradigm, "pre-train, prompt, and predict", to fine-tune CodeT5 for code summarization. In the process of model tuning, the size of the tuning dataset can significantly impact the performance of the fine-tuned model. In this context, we initially employ varying data sizes to assess the effectiveness of our approach, specifically tuning CodeT5 for the code summarization task using different data sizes. Additionally, we evaluate the effectiveness of our approach by making comparisons with the "pre-training fine-tuning" paradigm.

2.2.2 Dataset

To ensure the reliability of our results, we utilize a substantial code corpus for fine-tuning and testing CodeT5, specifically CodeSearchNet[29]. This resource comprises thousands of code-comment pairs for six programming languages: Python, Java, JavaScript, Ruby, Go, and PHP. It is important to note that the comments are considered as the ground truth. Initially, we transform all code-comment pair data into the Code-Prompt format, based on the template defined in Section 1.2. The specifics of the converted data are presented in Table 2.

We divide the tuning and validation sets into five equal portions, each accounting for 20% of the total. We use x-pl (x=1, 2, 3, or 4) to represent the number of partitions used for fine-tuning and validating CodeT5. For example, 2-pl indicates that 40% of each language data in the tuning set is used to fine-tune CodeT5, while the remaining 60% is designated for validation. Furthermore, it is important to note that the tuning, validation, and test sets contain distinct data.
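The partitioning scheme can be sketched as follows. The sketch assumes that x-pl takes the first x of the five portions of every language and that any remainder after the equal split is dropped; both choices, along with the helper names `split_portions` and `x_pl`, are illustrative assumptions.

```python
from typing import Dict, List, Sequence


def split_portions(items: Sequence, k: int = 5) -> List[list]:
    """Split items into k equal portions (any remainder is dropped)."""
    size = len(items) // k
    return [list(items[i * size:(i + 1) * size]) for i in range(k)]


def x_pl(portions_by_lang: Dict[str, List[list]], x: int) -> list:
    """Collect the first x portions (x * 20%) of every language's data."""
    subset = []
    for portions in portions_by_lang.values():
        for portion in portions[:x]:
            subset.extend(portion)
    return subset


# Toy data standing in for per-language code-comment pairs.
data = {"java": list(range(10)), "python": list(range(100, 110))}
portions = {lang: split_portions(pairs) for lang, pairs in data.items()}
print(x_pl(portions, 1))
print(len(x_pl(portions, 2)))
```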

Table 2

Data detail

2.2.3 Experiment setting

In this experimental study, we aim to assess how our approach's performance is influenced by two key factors: the size of the tuning dataset and the tuning method.

For the different tuning data sizes, we use four sizes of tuning datasets (1-pl, 2-pl, 3-pl, 4-pl) to tune CodeT5. Note that x-pl contains tuning data from all six programming languages, and each portion accounts for 20% of the dataset. We refer to the CodeT5 prompt-tuned on the x-pl portion as CodeT5_x. CodeT5_x is tested separately on the test sets for the six programming languages. Furthermore, we use a zero-shot setting as a contrast to show the effectiveness of prompt learning. In the zero-shot setting, the vanilla CodeT5 is used directly on the test set to generate code comments for code snippets.

For the different tuning methods, we explore the impact of prompt learning and fine-tuning on our approach. Specifically, we use the 5-pl portion (i.e., the entire tuning set) to tune CodeT5 with prompt learning, obtaining the prompt-tuned CodeT5_5. In addition, we use the entire code-comment tuning set in raw form (i.e., the code is not converted into the code prompt format) to tune CodeT5 with fine-tuning, obtaining the fine-tuned CodeT5_FT. CodeT5_5 and CodeT5_FT are both evaluated on the same test set.

2.2.4 Experiment result

1) Various data size

Table 3 presents the experimental results. The vanilla CodeT5 (without prompt learning using the tuning set) only achieves an average BLEU-4 score of 0.0006 and a METEOR score of 0.0130 in the zero-shot setting. This indicates that the task-agnostic CodeT5 does not recognize the specific task it needs to accomplish, resulting in generated code comments that exhibit little correlation with the ground truth. Moreover, the vanilla CodeT5 often generates repetitive text, such as function names. For example, the generated comment "observable.(); observable.(); } observable.(); observable.(); }" repeats the "observable" function in the code snippet, whereas the ground truth comment is "Wraps an ObservableSource into an Observable if not already an Observable." There is no semantic connection between the generated and ground truth comments.

Nevertheless, when we apply 1-pl for prompt learning, denoted as CodeT5_1, the BLEU-4 score rises from 0.0006 to 0.5063 (an increase of 0.5057), and the METEOR score rises from 0.0130 to 0.7295 (an increase of 0.7165). This demonstrates that prompt learning effectively enhances the performance of CodeT5 in code summarization, resulting in more fluent generated comments. Prompt learning accomplishes this by enabling CodeT5 to capture the semantic associations between the code and comments.

As the prompt learning data size gradually increases, the average BLEU-4 and METEOR scores grow only incrementally compared with the significant improvement observed from the zero-shot setting to CodeT5_1. Furthermore, the model's performance stabilizes at CodeT5_3, with an average BLEU-4 score of 0.5319 and an average METEOR score of 0.7518, showing only marginal improvements at CodeT5_4, with an average BLEU-4 score of 0.5433 and an average METEOR score of 0.7603. This suggests that prompt learning, rather than tuning data size, plays the more fundamental role in enhancing CodeT5. Additionally, CodeT5 performs consistently well across all six languages, indicating that the prompt-tuned CodeT5 effectively identifies code written in different programming languages. This ability benefits from its pre-training tasks.

Table 3

The effects of different data sizes on Prompt Learning

2) Prompt learning versus Fine-tuning

Table 4 displays the experimental results. CodeT5_5 achieves an average BLEU-4 score of 0.5467 and a METEOR score of 0.7856, which are 0.1625 and 0.1860 higher than CodeT5_FT's BLEU-4 and METEOR scores, respectively. It is worth noting that CodeT5_1 also attains significantly higher average BLEU-4 and METEOR scores than CodeT5_FT (i.e., 0.5063 vs. 0.3842 and 0.7295 vs. 0.5996). This underscores the effectiveness of tuning CodeT5 with prompt learning for code summarization. The natural language in the defined template helps CodeT5 recognize what content to generate in each slot during prompt learning and testing. For instance, the LAN slot signals CodeT5 to generate code comments for code in a particular programming language. This enables the prompt-tuned CodeT5 to be used for code summarization across various programming languages and to yield significant performance improvements. In contrast, with plain fine-tuning, CodeT5 receives no such prompt information from the tuning dataset during tuning or testing.

Table 4

The performance between CodeT5 and CodeT5FT

2.3 Practicality Evaluation (RQ2)

2.3.1 Motivation

In this practicality evaluation, our objective is to assess the performance of our prompt-tuned CodeT5 for code summarization on real partial code, compared with an existing transformer-based approach[18].

2.3.2 Dataset

In this study, we have compiled a test dataset from Stack Overflow, a widely used Q&A website. Due to the manual effort required and the transformer-based approach [18] being limited to Java and Python datasets, we have collected data for two programming languages, namely Java and Python, from Stack Overflow. We have gathered 120 code-comment pairs, with 60 pairs for each programming language. To maintain the quality of the collected code-comment pairs, we enlisted the assistance of three graduate students, each with experience in both Java and Python development. Each student was tasked with collecting 15 code-comment pairs for Java and 15 for Python. They were required to meticulously verify the paired comments and code, considering the code's logic and the comment's semantics. Additionally, they cross-examined the collected code-comment pairs to further ensure the reliability of the 120 pairs. It is important to note that the collected comments serve as the ground truth.

2.3.3 Experiment setting

In this phase, we execute the prompt-tuned CodeT5_4 on the 120 collected code snippets. As a baseline comparison, we also employ the transformer-based model trained by Ahmad et al. [18] for testing on the same set of 120 code snippets.

2.3.4 Experiment result

The experiment results are detailed in Table 5, which shows that our prompt-tuned CodeT5_4 outperforms the transformer-based approach significantly. Specifically, our CodeT5_4 achieves an average BLEU-4 score of 0.6472 and an average METEOR score of 0.8573, surpassing the transformer-based BLEU-4 and METEOR scores by 0.2915 and 0.4312, respectively. This underscores the superior practicality of our prompt-tuned CodeT5_4 in code summarization.

This advantage can be attributed to two key factors. Firstly, Ahmad et al. [18] treated code as text when training their transformer-based model, resulting in a model that struggles to comprehend code semantics and generate high-quality comments. In contrast, CodeT5 leverages the vast amount of code semantics knowledge acquired during its pre-training tasks. Secondly, and most importantly, CodeT5 is a high-capacity model featuring 220 million parameters, whereas Ahmad et al. [18] trained their model with limited data, comprising only 120 000 code-comment pairs.

Table 5

Performance on Stack Overflow dataset

3 Conclusion

This paper adheres to a novel paradigm, "Pre-train, prompt, and predict", for fine-tuning CodeT5 to generate high-quality code summarizations. To achieve this, we have introduced a prompt template that facilitates the conversion of input code into a code prompt, subsequently fine-tuning CodeT5 using this code prompt, a process we refer to as prompt tuning. The experiments demonstrate the effectiveness and practicality of our code summarization approach. Prompt learning yields substantial benefits even with just a 20% tuning set. Furthermore, the performance of the prompt-tuned CodeT5 surpasses that of the transformer-based model in the context of code summarization.

References

  1. Le T H M, Chen H, Babar M A. Deep learning for source code modeling and generation: Models, applications, and challenges[J]. ACM Computing Surveys (CSUR), 2020, 53(3): 1-38.
  2. Wang C J, Cao Z X, Yu C L, et al. Nonlinear program construction and verification method based on partition recursion and Morgan's refinement rules[J]. Wuhan University Journal of Natural Sciences, 2023, 28(3): 246-255.
  3. Zuo Z K, Hu Y, Huang Q, et al. Automatic algorithm programming model based on the improved Morgan's refinement calculus[J]. Wuhan University Journal of Natural Sciences, 2022, 27(5): 405-414.
  4. Zuo Z K, Huang Z P, Fang Y, et al. A unified strategy for formal derivation and proof of binary tree nonrecursive algorithms[J]. Wuhan University Journal of Natural Sciences, 2022, 27(5): 415-423.
  5. You Z, Hu H W, Wang Y T, et al. Improved hybrid collaborative filtering algorithm based on spark platform[J]. Wuhan University Journal of Natural Sciences, 2023, 28(5): 451-460.
  6. Xia X, Bao L F, Lo D, et al. Measuring program comprehension: A large-scale field study with professionals[J]. IEEE Transactions on Software Engineering, 2018, 44(10): 951-976.
  7. Iyer S, Konstas I, Cheung A, et al. Summarizing source code using a neural attention model[C]//Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg: Association for Computational Linguistics, 2016: 2073-2083.
  8. Liang Y D, Zhu K. Automatic generation of text descriptive comments for code blocks[EB/OL]. [2018-08-21]. https://arxiv.org/abs/1808.06880.pdf.
  9. Hu X, Li G, Xia X, et al. Deep code comment generation[C]//Proceedings of the 26th Conference on Program Comprehension. New York: ACM, 2018: 200-210.
  10. Sutskever I, Vinyals O, Le Q V. Sequence to sequence learning with neural networks[C]//Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2. New York: ACM, 2014: 3104-3112.
  11. Luong M T, Pham H, Manning C D. Effective approaches to attention-based neural machine translation[EB/OL]. [2015-06-25]. https://arxiv.org/abs/1508.04025.pdf.
  12. Hochreiter S, Schmidhuber J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
  13. Wan Y, Zhao Z, Yang M, et al. Improving automatic source code summarization via deep reinforcement learning[C]//Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. New York: ACM, 2018: 397-407.
  14. Kaelbling L P, Littman M L, Moore A W. Reinforcement learning: A survey[J]. Journal of Artificial Intelligence Research, 1996, 4: 237-285.
  15. Zhang J, Wang X, Zhang H Y, et al. Retrieval-based neural source code summarization[C]//Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering. New York: ACM, 2020: 1385-1397.
  16. Liu S Q, Chen Y, Xie X F, et al. Retrieval-augmented generation for code summarization via hybrid GNN[EB/OL]. [2020-11-15]. https://arxiv.org/abs/2006.05405.pdf.
  17. Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 6000-6010.
  18. Ahmad W U, Chakraborty S, Ray B, et al. A transformer-based approach for source code summarization[EB/OL]. [2020-10-23]. https://arxiv.org/abs/2005.00653.pdf.
  19. Yang Z, Keung J, Yu X, et al. A multi-modal transformer-based code summarization approach for smart contracts[C]//2021 IEEE/ACM 29th International Conference on Program Comprehension (ICPC). New York: IEEE, 2021: 1-12.
  20. See A, Liu P J, Manning C D. Get to the point: Summarization with pointer-generator networks[EB/OL]. [2017-11-15]. https://arxiv.org/abs/1704.04368.pdf.
  21. Shaw P, Uszkoreit J, Vaswani A. Self-attention with relative position representations[EB/OL]. [2018-02-24]. https://arxiv.org/abs/1803.02155.pdf.
  22. Wu H Q, Zhao H, Zhang M. Code summarization with structure-induced transformer[EB/OL]. [2020-11-28]. https://arxiv.org/abs/2012.14710.pdf.
  23. Feng Z Y, Guo D Y, Tang D Y, et al. CodeBERT: A pre-trained model for programming and natural languages[EB/OL]. [2021-12-05]. https://arxiv.org/abs/2002.08155.pdf.
  24. Wang Y, Wang W S, Joty S, et al. CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation[EB/OL]. [2021-12-18]. https://arxiv.org/abs/2109.00859.pdf.
  25. Liu P F, Yuan W Z, Fu J L, et al. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing[J]. ACM Computing Surveys, 2023, 55(9): 1-35.
  26. Radford A, Narasimhan K. Improving language understanding by generative pre-training[EB/OL]. [2021-12-18]. http://www.wikicfp.com/cfp/servlet/event.showcfp?eventid=39345.
  27. Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[EB/OL]. [2018-11-06]. https://arxiv.org/abs/1810.04805.pdf.
  28. Liu Y H, Ott M, Goyal N, et al. RoBERTa: A robustly optimized BERT pretraining approach[EB/OL]. [2019-12-06]. https://arxiv.org/abs/1907.11692.pdf.
  29. Husain H, Wu H H, Gazit T, et al. CodeSearchNet challenge: Evaluating the state of semantic code search[EB/OL]. [2019-12-06]. https://arxiv.org/abs/1909.09436.pdf.
  30. Wan Y, Zhao W, Zhang H Y, et al. What do they capture?: A structural analysis of pre-trained language models for source code[C]//Proceedings of the 44th International Conference on Software Engineering. New York: ACM, 2022: 2377-2388.
  31. Yuan X E, Lin G J, Tai Y H, et al. Deep neural embedding for software vulnerability discovery: Comparison and optimization[J]. Security and Communication Networks, 2022, 2022: 1-12.
  32. Karmakar A, Robbes R. What do pre-trained code models know about code?[C]//2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE). New York: IEEE, 2022: 1332-1336.
  33. Wang C Z, Yang Y H, Gao C Y, et al. No more fine-tuning? An experimental evaluation of prompt tuning in code intelligence[C]//Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. New York: ACM, 2022: 382-394.
  34. Han X, Zhang Z Y, Ding N, et al. Pre-trained models: Past, present and future[J]. AI Open, 2021, 2: 225-250.
  35. Bisht M, Gupta R. Fine-tuned pre-trained model for script recognition[J]. International Journal of Mathematical, Engineering and Management Sciences, 2021, 6(5): 1297-1314.
  36. Liu B Y, Cai Y F, Guo Y, et al. TransTailor: Pruning the pre-trained model for improved transfer learning[J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(10): 8627-8634.
  37. Raffel C, Shazeer N, Roberts A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer[J]. Journal of Machine Learning Research, 2020, 21(1): 5485-5551.
  38. Papineni K, Roukos S, Ward T, et al. BLEU: A method for automatic evaluation of machine translation[C]//Proceedings of the 40th Annual Meeting on Association for Computational Linguistics - ACL '02. Stroudsburg: Association for Computational Linguistics, 2002: 311-318.
  39. Banerjee S, Lavie A. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments[C]//Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization. Stroudsburg: Association for Computational Linguistics, 2005: 65-72.
