Issue | MATEC Web Conf., Volume 267, 2019: 2018 2nd AASRI International Conference on Intelligent Systems and Control (ISC 2018) | |
---|---|---|
Article Number | 04001 | |
Number of page(s) | 4 | |
Section | Management Science and Engineering | |
DOI | https://doi.org/10.1051/matecconf/201926704001 | |
Published online | 11 February 2019 |
An Algorithm Rapidly Segmenting Chinese Sentences into Individual Words
College of Creative Arts Hainan Tropical University, Sanya 572022, Hainan, China
This paper proposes an improved Trie structure in which each node records the position of its character within the words it helps form, and child nodes are located through a hash lookup. On this basis, the forward maximum matching algorithm for Chinese word segmentation is optimized. During segmentation, an automaton mechanism determines whether the characters read so far form the longest possible word, which removes the need for the forward maximum matching algorithm to repeatedly adjust the candidate string according to word length. The reported time complexity of the algorithm is 1.33, and comparison tests show that it segments text quickly. The forward maximum matching algorithm based on the improved Trie structure increases Chinese word segmentation speed, especially when the dictionary structure needs to be updated in real time.
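The general idea of forward maximum matching over a Trie with hash-indexed children can be illustrated with a minimal sketch. This is not the authors' implementation: the names `TrieNode`, `Trie`, `longest_match`, and `segment` are hypothetical, a plain `is_word` flag stands in for the position information the paper stores in each node, and falling back to a single character when no dictionary word matches is an assumption.

```python
class TrieNode:
    """One character position in the dictionary trie (illustrative only)."""
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}    # hash map: character -> child TrieNode
        self.is_word = False  # True if the path from the root spells a word


class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        """Add a word; can be called at any time to update the dictionary."""
        node = self.root
        for ch in word:
            child = node.children.get(ch)
            if child is None:
                child = TrieNode()
                node.children[ch] = child
            node = child
        node.is_word = True

    def longest_match(self, text, start):
        """Return the end index of the longest dictionary word at `start`.

        The trie is walked like an automaton: one transition per character,
        remembering the last accepting state. Returns `start` if no word
        in the dictionary begins at that position.
        """
        node = self.root
        end = start
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.is_word:
                end = i + 1
        return end


def segment(text, trie):
    """Forward maximum matching: greedily take the longest word at each step."""
    tokens = []
    i = 0
    while i < len(text):
        end = trie.longest_match(text, i)
        if end == i:
            # Assumption: unmatched characters are emitted one at a time.
            end = i + 1
        tokens.append(text[i:end])
        i = end
    return tokens
```

Because the walk stops as soon as no child matches, the candidate string never has to be shortened and re-checked against the dictionary, which is the adjustment step that a list-based forward maximum matching would need. With a dictionary containing 研究, 研究生, 生命, 命, 的, and 起源, `segment("研究生命的起源", trie)` yields 研究生 / 命 / 的 / 起源, the greedy left-to-right result.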
Key words: Natural language processing / Chinese word segmentation / Forward maximum matching algorithm
© The Authors, published by EDP Sciences, 2019
This is an open access article distributed under the terms of the Creative Commons Attribution License 4.0 (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.