EPT: Data Augmentation with Embedded Prompt Tuning for Low-Resource Named Entity Recognition

Abstract: Data augmentation methods are often used to address data scarcity in natural language processing (NLP). However, token-label misalignment, in which tokens in the augmented sentences are matched with incorrect entity labels, prevents data augmentation methods from achieving high scores in token-level tasks such as named entity recognition (NER). In this paper, we propose embedded prompt tuning (EPT) as a novel data augmentation approach to low-resource NER. To address token-label misalignment, we implicitly embed NER labels as prompts into the hidden layers of a pre-trained language model, so that masked entity tokens can be predicted by the finetuned EPT. EPT can therefore generate high-quality and highly diverse data with various entities, which improves NER performance. As cross-domain NER datasets are available, we also explore NER domain adaptation with EPT. The experimental results show that EPT achieves substantial improvement over the baseline methods on low-resource NER tasks.


Introduction
Named entity recognition (NER) is a fundamental natural language processing (NLP) task that involves labeling named entities with predefined categories. Its accuracy has a significant impact on the performance of downstream tasks such as information extraction [1], question answering [2], text summarization [3], and machine translation [4]. However, the high cost of obtaining labeled data has limited the amount of available data for most languages and domains. To address this issue, low-resource NER methods [5][6][7][8] have been proposed in recent years. Data augmentation is an effective method for expanding the training set by generating new data with preserved labels, which is especially useful in low-resource scenarios. Common methods of data augmentation for NLP include word-level modification [9][10][11] and back-translation [12,13].
While word-level modification and back-translation have shown satisfactory performance in sentence-level tasks, they can fail in token-level tasks such as NER due to token-label misalignment. For example, word-level modification methods, such as synonym replacement [10], auto-regressive Pretrained Language Models (PLM) [11], Long Short-Term Memory Language Models (LSTM-LM) [14], and Masked Language Models (MLM) [15], may replace an entity with alternatives that are coherent in the sentence but entirely mismatched with the original label. Similarly, back-translation may rely on external word alignment tools [16,17], which are error-prone. Some attempts have been made to address token-label misalignment, but they have not been entirely successful. One approach [7] randomly interchanges entities of the same category, which does not contribute to entity diversity and may damage the coherence of the context. Another attempt [18] modifies only non-entity context tokens with the Masked Sequence to Sequence (MASS) [19] method, keeping the entities unchanged. However, this method did not achieve notable improvement in pretrained LM-based NER tasks [20]. Building on prior research, the Masked Entity Language Model (MELM) [21] generates augmented data by finetuning a language model on linearized labeled sequences [22,23]: it injects NER labels into the sentence context and finetunes the model to predict masked entities. This approach achieved remarkable performance in NER evaluations.
Inspired by previous work, we introduce Embedded Prompt Tuning (EPT) as a novel approach to low-resource NER, which embeds NER labels as prompts to generate data with more diverse entities that perfectly align with their labels. Similar to MELM, EPT is based on a pretrained masked language model, which is then finetuned to augment data by randomly masking entity tokens in a training set. In addition, we add a few embedding layers to the MLM for prompt embedding, which is a slight departure from the standard MLM approach. We note that it is insufficient to simply mask and replace entity tokens using the finetuned MLM, as has been observed in previous work [21]. Figure 1 demonstrates that the finetuned MLM can make predictions with token-label misalignment, such as "Washington has". In contrast, MELM can alleviate the misalignment and give more accurate predictions. However, the labels injected into the sequence can limit the number of original tokens, especially when there are multiple consecutive entities in the sentence, which can negatively impact the coherence of the context and the diversity of the entities. To address this issue, EPT introduces additional embedding layers that separately embed the label and the position of each token as inputs to the hidden layer of the MLM. The masked tokens are therefore predicted based on both the context and the prompt formed by the tokens' labels and positions.
Our EPT approach has the potential to expand the scope of entity recognition across different domains with specialized entity categories. We applied EPT to the CrossNER dataset [24], which comprises fully labeled NER data from five domains with specialized and shared entity types. Most existing NER methods focus on a specific domain, resulting in suboptimal performance in domain adaptation [25]. By contrast, EPT's ability to extend entity categories allows us to leverage knowledge from other domains to increase entity diversity, even with limited labeled data. We finetuned EPT on the source domain and then on the training sentences of the target domain for data augmentation. Our results show that finetuning on the source domain leads to higher F1 scores.
Overall, this paper makes several contributions. First, we propose a novel approach, EPT, to augment data for low-resource scenarios, which results in significant improvements in both monolingual and cross-domain NER. Second, we introduce implicit label embedding and a novel position embedding, which effectively increase entity diversity while mitigating token-label misalignment. Third, we demonstrate the effectiveness of using training sentences from source domains to improve data augmentation for domain adaptation. These contributions provide new insights into the field of NER and offer potential solutions to common challenges faced in low-resource scenarios and cross-domain NER.

Method
The process of our data augmentation method is illustrated in Fig. 2. We embed labels and positions of entity tokens into the hidden layers of our model (as described in Section 1.1). Then, the EPT model is finetuned (as outlined in Section 1.2) to generate augmented data from the NER training sentences that contain masked entities (as explained in Section 1.3). The generated data is used for training the NER model along with the original data after several loops of augmentation (as detailed in Section 1.4). Algorithm 1 shows the detailed process and steps of our method.

Label embedding
To address the issue of token-label misalignment, we introduce additional embedding layers to the MLM, which embed the labels and positions of entity tokens into hidden vectors. As shown in Fig. 2, these embeddings are added to the token embeddings with a coefficient α during the finetuning process. In this way, the label prompt is taken into consideration when a masked token is predicted. Additionally, we initialize the weights of the embedding layers with special words, such as "location" for "B-LOC" and "I-LOC". For comparison, we also test a variant in which "B-label" and "I-label" are assigned random values.
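The addition of scaled label embeddings to token embeddings can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names, the toy two-dimensional vectors, and the choice to add nothing for labels without an entry (such as "O") are assumptions made only for illustration.

```python
from typing import Dict, List

Vector = List[float]


def init_label_table(word_vectors: Dict[str, Vector],
                     label_words: Dict[str, str]) -> Dict[str, Vector]:
    """Initialize each label embedding from the vector of a descriptive word."""
    return {label: list(word_vectors[word]) for label, word in label_words.items()}


def add_label_prompt(token_vecs: List[Vector], labels: List[str],
                     label_table: Dict[str, Vector], alpha: float) -> List[Vector]:
    """Return token embeddings plus alpha-scaled label embeddings."""
    zero = [0.0] * len(token_vecs[0])
    out = []
    for vec, label in zip(token_vecs, labels):
        # Assumption: labels without an entry (e.g. "O") contribute nothing.
        lab = label_table.get(label, zero)
        out.append([t + alpha * l for t, l in zip(vec, lab)])
    return out


# Toy usage: "B-LOC" and "I-LOC" both start from the vector of "location".
table = init_label_table({"location": [1.0, 2.0]},
                         {"B-LOC": "location", "I-LOC": "location"})
prompted = add_label_prompt([[0.5, 0.5], [0.0, 1.0]], ["B-LOC", "O"], table, alpha=0.01)
```

With a small α, the label signal nudges the token representation without overwriting it, which matches the role of the coefficient described above.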

Position embedding
The Relative Position Embedding (RPE) is designed to mark the positions of tokens within an entity word. MELM does not require position embedding for the label tokens before and after each entity token. Our EPT, however, uses an implicit label prompt, which can lead to consecutive entity tokens with the same label (e.g., multiple "I-ORG" labels in succession, as shown in Fig. 2), making it difficult for the model to predict masked tokens accurately. To address this issue, we introduce a trainable RPE that encodes position information for each sub-word of the entity. Since the number of sub-words in an entity varies, our RPE is designed to be trainable and flexible as follows, which enables it to handle entities of different lengths:

where position $x \in \mathbb{N}$ is represented as the pair $(s, l-1-s)$ and $X$ denotes the whole position set; $P_x$ is the $x$-th relative position vector; $s$ is the relative position to the start of the entity; $l$ is the length of the entity in sub-words; $L$ is the maximum length of all entities in sub-words; and $D$ is the number of attention heads of EPT. Finally, to reduce the number of parameters to be trained, we broadcast the position vector multiplied by β to $H$ dimensions, where $H$ is the depth of the hidden layers. The relative position information of each token is then used to predict masked entity tokens.

Fig. 2 Embedded prompt tuning model
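The position pair $(s, l-1-s)$ can be illustrated with a short sketch that lists, for each sub-word of an entity, its distance from the start and from the end. The mapping from a pair to a row of an embedding table (`pair_to_index`) is a hypothetical flattening chosen for this example; the text does not specify how the pairs are indexed.

```python
from typing import List, Tuple


def relative_position_pairs(length: int) -> List[Tuple[int, int]]:
    """For an entity of `length` sub-words, return (s, l - 1 - s) per sub-word."""
    if length < 1:
        raise ValueError("an entity has at least one sub-word")
    return [(s, length - 1 - s) for s in range(length)]


def pair_to_index(pair: Tuple[int, int], max_len: int) -> int:
    """Hypothetical flattening of a pair into one table index (an assumption)."""
    s, e = pair
    if not (0 <= s < max_len and 0 <= e < max_len):
        raise ValueError("pair outside the supported entity length")
    return s * max_len + e


# A three-sub-word entity: first, middle and last sub-words get distinct pairs.
pairs = relative_position_pairs(3)
indices = [pair_to_index(p, max_len=4) for p in pairs]
```

Because the pair encodes the distance to both ends of the entity, the first and last sub-words stay distinguishable regardless of entity length, which is the property the RPE relies on.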

Fine-Tuning EPT
Because masking context tokens is ineffective, as mentioned in the introduction, only entity tokens are masked when finetuning EPT. The entity tokens are randomly masked during tokenization and embedding, and each sentence has several masked versions, one of which is picked at the beginning of each epoch for diversity. The masking ratio of the tokens is identical to that of MELM. EPT is trained to maximize the probability of the original sentence $X$ given the masked sentence $\tilde{X}$ with the prompt:

where $E_l$ and $E_p$ are the embeddings of labels and positions; θ represents the parameters of EPT, and $p_\theta$ is the probability of the original sentence given the masked sentence under the EPT model with parameters θ; $x_i$ is the $i$-th entity token, $m_i$ is a Boolean value indicating whether $x_i$ is masked, and the upper bound of $i$ is the number of entity tokens $n$. The label and position prompts make the prediction consistent with the original token in many respects.
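One way to write this objective, consistent with the symbols defined above and assuming the standard masked-language-modeling log-likelihood restricted to masked entity tokens, is the following. It is a reconstruction under that assumption, not necessarily the exact form used by the authors.

```latex
\max_{\theta} \; \log p_{\theta}\left(X \mid \tilde{X}, E_l, E_p\right)
\approx \sum_{i=1}^{n} m_i \, \log p_{\theta}\left(x_i \mid \tilde{X}, E_l, E_p\right)
```

Here $m_i$ acts as a selector, so only masked entity tokens contribute to the sum.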

Data Augmentation
When generating augmented sentences, directly selecting the most probable token leads to repeated data and thus poor diversity. EPT employs the same augmentation method as MELM, which masks tokens with a rate drawn from a Gaussian distribution and randomly selects replacements from the top k predicted items. As context masking has marginal effects in NER tasks [20], we mask only entity tokens in sentences. This process is repeated R times for each sentence in the original dataset, resulting in R masking results. Importantly, EPT can generate more unique and semantically reasonable entities, because the RPE prevents misplaced token recombination. By contrast, MELM tends to suffer from semantic fragmentation due to explicit label tokens and produces fewer novel entities.
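The masking and top-k replacement steps can be sketched as follows. This is a simplified illustration: `predict` is a hypothetical callable standing in for the finetuned EPT, and the mean, standard deviation and k used as defaults are placeholders, not values reported in this paper.

```python
import random
from typing import Callable, Dict, List

MASK = "<mask>"


def sample_mask_rate(rng: random.Random, mean: float, std: float) -> float:
    """Draw a masking rate from a Gaussian and clip it to [0, 1]."""
    return min(1.0, max(0.0, rng.gauss(mean, std)))


def pick_from_top_k(scores: Dict[str, float], k: int, rng: random.Random) -> str:
    """Choose uniformly at random among the k highest-scoring candidates."""
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return rng.choice(top_k)


def augment_once(tokens: List[str], labels: List[str],
                 predict: Callable[[List[str], List[str], int], Dict[str, float]],
                 rng: random.Random, mean: float = 0.5, std: float = 0.1,
                 k: int = 5) -> List[str]:
    """Mask some entity tokens and replace each with a top-k prediction."""
    rate = sample_mask_rate(rng, mean, std)
    # Only entity tokens (label other than "O") are candidates for masking.
    masked_idx = {i for i, lab in enumerate(labels)
                  if lab != "O" and rng.random() < rate}
    masked = [MASK if i in masked_idx else t for i, t in enumerate(tokens)]
    out = list(tokens)
    for i in sorted(masked_idx):
        out[i] = pick_from_top_k(predict(masked, labels, i), k, rng)
    return out
```

Calling `augment_once` R times per sentence with the same labels yields R augmented sentences; the label sequence is never modified, so every generated token keeps the label of the position it fills.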

Training NER Model
To evaluate the quality of the generated data directly, we do not filter it manually. The augmented training set $D_{aug}$ is combined with the training set $D_{train}$ to form the final training set for the NER task.
Algorithm 1 shows the detailed process and steps of our method.
Algorithm 1 Embedded Prompt Tuning (EPT)
Given $D_{train}$, $M$ ▷ Given gold training set $D_{train}$ and pretrained MLM $M$
$\tilde{M} \leftarrow M + \alpha E_{label} + \beta E_{pos}$ ▷ Add label embedding layer $E_{label}$ and position embedding layer $E_{pos}$ to $M$; α and β are coefficients
… labels $Y$ and positions $P$ from tokenization
$\tilde{X} \leftarrow \text{FINETUNEMASK}(\tilde{X}, \eta)$ ▷ Randomly mask entities for fine-tuning; η is the probability of masking
… EPT on masked tokens and prompt
… and positions from tokenization for generation; μ is the probability of masking

Most of the experiments are conducted on the CoNLL NER datasets [26,27] of two languages, where $L = \{\text{English (En)}, \text{Spanish (Es)}\}$. Initially, $N$ sentences are sampled as $D^{l,N}_{train}$ from each language $l \in L$, where $N \in \{200, 400, 600, 800\}$ for various low-resource scenarios. We then obtain the development set $D^{l,N}_{dev}$ with the same procedure. Finally, $D^{l}_{test}$ is the full test set for each language in the evaluation process. For monolingual experiments, we use $D^{l,N}_{train}$ as the original data, $D^{l,N}_{dev}$ as the development set and $D^{l}_{test}$ as the test set. We also conduct monodomain, cross-domain and multi-domain experiments on the CrossNER dataset [24] of five domains, where $M = \{\text{AI}, \text{Literature}, \text{Music}, \text{Politics}, \text{Science}\}$.

Experimental Setting
1) EPT Finetuning. EPT parameters are initialized from XLM-RoBERTa-base [28] with a masked language modeling head. EPT is finetuned for 20 epochs with the Adam optimizer [29], using a learning rate of 1E-5 and a batch size of 16 with gradient accumulation every 2 batches.
2) NER Model. We use XLM-RoBERTa-base [28] with a head consisting of a dropout layer and a linear layer as the NER model for our experiments. The same optimizer and learning rate are adopted as in EPT finetuning. The model is generally trained for 10 epochs (20 epochs for the Spanish corpus), and the best model is picked according to its performance on the development set. The Macro-F1 and the Macro-F1 without context (i.e., without the non-entity class), averaged over 3 runs, are reported when evaluating on the test sets.
3) Hyperparameter Tuning. The coefficient α of label embedding, the coefficient β of position embedding and the number of EPT augmentation rounds R are set to 1E-2, 5E-3 and 3, respectively. We also tune the embedding hyperparameters on the development set with grid search.
4) Computing Infrastructure. Our experiments are conducted on an NVIDIA 3090 GPU.
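The two scores reported for the NER model, macro-F1 with and without the non-entity class, can be illustrated with a small sketch. It assumes token-level, per-class F1 averaged with equal weight; the text does not state whether scoring is token-level or entity-level, so this is an assumption, and the function name is illustrative.

```python
from typing import List, Optional


def macro_f1(gold: List[str], pred: List[str], exclude: Optional[str] = None) -> float:
    """Unweighted mean of per-class F1, optionally skipping one class (e.g. "O")."""
    classes = sorted(set(gold) | set(pred))
    if exclude is not None:
        classes = [c for c in classes if c != exclude]
    f1_scores = []
    for c in classes:
        tp = sum(1 for g, p in zip(gold, pred) if g == c and p == c)
        fp = sum(1 for g, p in zip(gold, pred) if g != c and p == c)
        fn = sum(1 for g, p in zip(gold, pred) if g == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class never seen or predicted scores 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores) if f1_scores else 0.0


gold = ["O", "LOC", "PER", "O"]
pred = ["O", "LOC", "O", "O"]
with_non_entity = macro_f1(gold, pred)
without_non_entity = macro_f1(gold, pred, exclude="O")
```

Excluding the non-entity class prevents the many easy "O" tokens from inflating the average, which is why the second score is the stricter of the two.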

Baseline Methods
We compare our EPT with the following two methods to demonstrate its effectiveness:
1) Gold-Only [21]. The NER model is trained only on the original gold training set.
2) MELM [21]. We first linearize the sequences with labels and fine-tune MELM on them. MELM then randomly masks entity tokens and predicts each masked entity token based on both the label information and the context words. MELM suffices as the baseline because it substantially outperforms other methods, such as Label-wise Substitution [7], Data Augmentation with a Generation Approach (DAGA) [23] and Multilingual Data Augmentation (MulDA) [13].

Monolingual NER
We use the average F1 score over the different entity types to measure the performance of the methods. Table 1 shows the results of monolingual low-resource NER. Our proposed EPT achieves the highest average macro-F1 scores across the various low-resource levels, highlighting its effectiveness on monolingual tasks. Compared with the best-performing baseline, MELM, EPT improves the average macro-F1 score by 1.6%, 5.9%, 1.2%, and 3.1% at the 200, 400, 600, and 800 levels, respectively, where the level denotes the number of original sentences. It is worth noting that, given only 200 sentences, EPT achieves a monolingual macro-F1 of 56.8%, while the Gold-Only method fails to predict any entities. Although MELM outperforms methods without data augmentation by a considerable margin, EPT and its variants initialized with different weights consistently achieve the highest F1 scores on the English corpus at the different low-resource levels. Furthermore, the varying performance of EPT initialized with different weights is attributed to the values used for initialization, which we discuss in more detail in Section 2.4.3.
As the neural network failed to converge after 10 epochs, an additional 10 epochs of training were conducted on the Spanish corpus. The proposed EPT outperformed the baseline methods and its variant in terms of macro-F1 scores. Notably, EPT achieved even more significant gains than on the English corpus at each low-resource level. We hypothesize that this could be attributed to the fact that entities in Spanish are easier to truncate into sub-words, which can be reorganized randomly into novel entities and provide substantial improvements to NER models. To support this assumption, we present statistics on recombined words in Section 3.1.
Specifically, at the low-resource level of 400 English training samples, EPT outperforms the baseline method MELM by 8.3% in terms of F1 score. We attribute this improvement to the special tokens that MELM injects into the sentences, which separate entities from their contexts and from other entities and thereby reduce the diversity of the generated entities. Rather than inserting label tokens into sentences, EPT implicitly embeds label and position information as prompts, which ensures the diversity and quality of augmented entities across different low-resource levels and languages.

Cross-domain NER
In our experiments on cross-domain low-resource NER, we first applied EPT directly to the training set of each domain and to their concatenation. We then used the resulting augmented data to finetune NER models together with the training and development sets of the target domains. Finally, we evaluated the NER models on the test sets of the target domains. Since extending MELM to multiple domains was costly, we compare EPT only with the Gold-Only baseline. As illustrated in Fig. 3(a), EPT achieved substantial improvement over the Gold-Only baseline. Compared with the baseline, EPT achieved absolute gains of 19.6, 22.4, 28.9, 8.6, and 24.9 on the AI, literature, music, politics, and science domains, respectively, demonstrating its effectiveness on monodomain NER (Fig. 3(b)).
In addition to the results of monodomain NER, the results of cross-domain NER are depicted in Fig. 3, where it is evident that EPT leads to significant improvement on cross-domain NER tasks with different source and target domains.This highlights the importance of data augmentation techniques like EPT for cross-domain NER, even when the augmented data is obtained from a source domain that is quite distinct from the target domain.
Besides, the results of multidomain NER indicate that the NER models fine-tuned on mixed training sets achieve considerably higher F1 scores.For instance, the politics corpus achieves an F1 score of 68.0 with the domain-mixed training set, while it achieves only 49.7 with the politics training set.However, it is also observed that having more training data does not always lead to better evaluation scores.In fact, in most cases, the F1 scores decrease when English augmented data is mixed into the training set, which could be attributed to the noise in the roughly labeled corpus interfering with the prediction of specific fields.

Ablation study
The results presented in Table 2 demonstrate the efficacy of the position and label embedding used in EPT. Specifically, when fine-tuning EPT without label embedding (EPT without LE) or without position embedding (EPT without PE), the F1 scores drop considerably across different low-resource levels. This indicates that the position and label embedding layers play a crucial role in improving the performance of EPT.
Furthermore, the comparison of EPT without LE and EPT without PE with the Gold-Only baseline shows that the label and position information embedded as prompts in EPT indeed help generate diverse and adequate entities from the original data, leading to significantly better NER performance.
It is important to note that the initialization of the label embedding layer can have a significant impact on the performance of the EPT model. The experiments in Table 3 show that the choice of initialization weights can affect the model's stability and performance. In this case, the EPT model initialized with the embedding weights of special tokens is the most stable of all the models tested in Table 1.

Hyperparameter tuning
To determine the optimal values of the label embedding coefficient α and the position embedding coefficient β, we conduct a grid search over {5E-3, 1E-2, 2E-2} and {2.5E-3, 5E-3, 1E-2}, respectively. EPT is finetuned to generate English augmented data on the CoNLL dataset. An NER tagger is then trained on the data and evaluated on the English development set. As shown in Table 4, the best development-set F1 is achieved when α=1E-2, and we adopt it for the rest of this work.

Fig.3 Results of cross-domain low-resource NER
The horizontal axis denotes the domain of the training set, and the vertical axis denotes the domain of the development and test sets, where "mix" denotes the mixed data of the five domains and "mix&en" denotes the mixed data of the five domains and the English dataset [26] mentioned above. Thus, the diagonal shows the performance on the monodomain NER task, and the remaining entries show the performance on the cross-domain and multi-domain NER tasks.

Analysis and Discussion
The progress made by EPT on the NER task is clearly related to the label and position embeddings we proposed. We suggest that label embedding produces more valuable samples and that our position embedding reduces the number of samples containing position errors; both contribute to the success of EPT. We support these claims with the following analyses.

Number of Recombined Entities
As illustrated in Fig. 4, our proposed EPT consistently generates more recombined entities than the baseline method MELM across various low-resource levels and languages. These recombined entities are formed by meaningfully combining parts of different entity words during the prediction of masked entity tokens. The red lines in the figure represent the ratio of recombined entities generated by the augmentation methods to the original samples, providing further evidence of the superior diversity of augmented entities achieved by EPT. Interestingly, we observe that more recombined entities are generated on the Spanish corpus than on the English corpus, regardless of the augmentation method used.

Role of Position Embedding
Although Section 2.4.3 demonstrates the effectiveness of position embedding (PE), the underlying reasons for its success are still not clear. To investigate this, we computed the cosine similarity between the PE weights of different PE methods and present the results in Fig. 5. Our analysis revealed that the absolute PE (APE), shown in Fig. 5(a), encodes positional information with translation invariance, monotonicity, and symmetry [30], whereas our relative PE (RPE), shown in Fig. 5(b), learns position embeddings with orthogonality. This implies that masked tokens will be predicted with entity tokens of the same position, leading to better entity recognition. As can be seen from Table 5, our learnable APE achieves a remarkable F1 score among these PE methods.
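The pairwise cosine similarity used in this analysis can be computed as in the following sketch. The toy vectors are illustrative only and are not the learned PE weights; the function names are likewise chosen for this example.

```python
import math
from typing import List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two vectors; defined as 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors: List[List[float]]) -> List[List[float]]:
    """Pairwise cosine similarities between all position vectors."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


# Mutually orthogonal position vectors give an identity-like matrix.
sim = similarity_matrix([[1.0, 0.0], [0.0, 2.0]])
```

Orthogonal embeddings produce off-diagonal similarities near zero, which is the pattern described for the RPE, whereas smoothly varying embeddings produce similarities that decay with distance.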

Conclusion
In this work, we have introduced EPT as a novel approach for data augmentation in low-resource NER. By leveraging label and position embeddings, EPT can predict masked entity tokens based on contextual information, thereby generating augmented data with diverse and novel entities while avoiding the issue of token-label misalignment. Furthermore, we have demonstrated the effectiveness of EPT on various NER datasets, including monolingual, monodomain, cross-domain, and multidomain scenarios. However, one limitation of our approach is that we have only evaluated it on English and Spanish corpora. Further research could investigate its performance across a wider range of languages and on extremely low-resource NER tasks. Additionally, while our experiments demonstrate significant improvements in performance, further optimization of hyperparameters may lead to even better results.

Fig.4 Numbers of recombined entities
Bars represent the counts of the recombined entities generated by MELM and our EPT. Lines represent the ratios of the recombined entities to the numbers of the original samples at each low-resource level.

Fig.1 Comparison of different data augmentation methods with EPT; k represents the number of randomly selected tokens at each position

Algorithm 1 (continued)
$D_{aug} \leftarrow D_{aug} \cup \{X_{aug}\}$
end for
$N_{NER} \leftarrow \text{FINETUNE}(N, D_{train} \cup D_{aug})$ ▷ Train NER model on training and generated dataset

$D^{l}_{train}$, $D^{l}_{dev}$ and $D^{l}_{test}$ are constructed in the same way as above, where $l \in L$. For cross-domain experiments, we generate augmented data $D^{s}_{aug}$ from the source training set $D^{s}_{train}$ and the source development set $D^{s}_{dev}$ of the source domain $s \in M$. $D^{s}_{aug}$ is then combined with the target training set $D^{t}_{train}$ for the final NER training, where the target domain $t \in M$. For multi-domain experiments, all training sets, development sets and test sets are merged into the multi-domain training set $D^{mix}_{train} = \bigcup_{m \in M} D^{m}_{train}$, the multi-domain development set $D^{mix}_{dev} = \bigcup_{m \in M} D^{m}_{dev}$ and the multi-domain test set $D^{mix}_{test} = \bigcup_{m \in M} D^{m}_{test}$.

Fig.5 Comparison of different position embedding methods

Table 1 Macro-F1 of monolingual low-resource NER (%)
Note: En and Es denote English data and Spanish data. The numbers separated by slashes are macro-F1 scores with and without the non-entity class, respectively. EPT initializes the weights of the embedding layers with special words, such as "location" for "B-LOC" and "I-LOC", while EPT* initializes them with random values.

Table 2 Macro-F1 of the ablation experiments (%)
Note: Macro-F1 without others means the macro-F1 value without the non-entity class.

Table 3 Macro-F1 of EPT with different initial weights (%)