Issue |
Wuhan Univ. J. Nat. Sci.
Volume 30, Number 2, April 2025
|
|
---|---|---|
Page(s) | 184 - 194 | |
DOI | https://doi.org/10.1051/wujns/2025302184 | |
Published online | 16 May 2025 |
Computer Science
CLC number: TP311
HashTrie Functional Framework and Its Application in Chinese-English Pattern Matching
HashTrie函数式框架及其在中英文环境模式匹配算法的应用
1
College of Computer Information Engineering, Jiangxi Normal University, Nanchang 330022, Jiangxi, China
2
Jiangxi Provincial Key Laboratory for High Performance Computing, Jiangxi Normal University, Nanchang 330022, Jiangxi, China
3
State International Science and Technology Cooperation Base of Networked Supporting Software, Jiangxi Normal University, Nanchang 330022, Jiangxi, China
4
Jiujiang Housing Provident Fund Management Center, Jiujiang 332000, Jiangxi, China
† Corresponding author. E-mail: wcj771006@163.com
Received:
17
October
2024
Most existing multi-pattern matching algorithms are designed for single English texts leading to issues such as missed matches and space expansion when applied to Chinese-English mixed-text environments. The HashTrie-based matching machine demonstrates strong compatibility with both Chinese and English, ensuring high accuracy in text processing and subtree positioning. In this study, a novel functional framework based on the HashTrie structure is proposed and mechanically verified using Isabelle/HOL. This framework is applied to design Functional Multi-Pattern Matching (FMPM), the first functional multi-pattern matching algorithm for Chinese-English mixed texts. FMPM constructs the HashTrie matching machine using character codes and threads the machine according to the associations between pattern strings. The experimental results show that as the stored string information increases, the proposed algorithm demonstrates more significant optimization in retrieval efficiency. FMPM simplifies the implementation of the Threaded Hash Trie (THT) for Chinese-English mixed texts, effectively reducing the uncertainties in the transition from the algorithm description to code implementation. FMPM addresses the problem of space explosion Chinese-English mixed texts and avoids issues such as bound variable iteration errors. The functional framework of the HashTrie structure serves as a reference for the formal verification of future HashTrie-based algorithms.
摘要
现有多模式匹配算法大多应用于单一英文环境,当应用于中英文混合环境时存在漏匹配、误匹配和空间膨胀等问题。HashTrie具有良好的中英文兼容性、结点内部查找和处理文本准确性,但并未被机械化验证。本文提出了一种基于HashTrie结构的函数式框架,并首次在Isabelle/HOL中进行了机械验证。应用此框架设计了适用于中英文混合环境的函数式多模式匹配算法Functional Multi Pattern Matching (FMPM),该算法以字符内码为键,构造HashTrie匹配机并根据模式串之间的关联进行线索化。实验结果显示,随着所存储字符串信息的增加,算法展现出更为显著的检索效率优化。FMPM简化了面向中英文混合环境下的多模式匹配算法Threaded Hash Trie (THT)的实现,有效降低从算法描述到代码实现过程中的不可知因素,解决了中英文混合环境下空间膨胀以及边界变量迭代错误等问题。HashTrie结构的函数式框架为今后更多HashTrie结构算法的形式验证提供了借鉴。
Key words: multi-pattern matching / Chinese-English mixed / HashTrie / functional / mechanized verification
关键字 : 多模式匹配算法 / 中英文混合 / HashTrie / 函数式 / 机械验证
Cite this article: ZUO Zhengkang, ZHOU Chao, ZENG Zhicheng, et al. HashTrie Functional Framework and Its Application in Chinese-English Pattern Matching[J]. Wuhan Univ J of Nat Sci, 2025, 30(2): 184-194.
Biography: ZUO Zhengkang, male, Ph.D., Professor, Senior Member of CCF, research direction: software formal methods and generic programming. E-mail: zhengkang2005@iscas.ac.cn
Foundation item: Supported by the National Natural Science Foundation of China (62462036, 62462037), Jiangxi Provincial Natural Science Foundation (20242BAB26017, 20232BAB202010), Cultivation Project for Academic and Technical Leader in Major Disciplines in Jiangxi Province(20232BCJ22013), and the Jiangxi Province Graduate Innovation Found Project (YC2024-S214)
© Wuhan University 2025
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Current usage metrics show cumulative count of Article Views (full-text article views including HTML views, PDF and ePub downloads, according to the available data) and Abstracts Views on Vision4Press platform.
Data correspond to usage on the plateform after 2015. The current usage metrics is available 48-96 hours after online publication and is updated daily on week days.
Initial download of the metrics may take a while.