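Qin Jun, Lin Yechuan, Yi Yunfei. Research on Zhuangwen Word Segmentation Algorithm Based on Mutual Information Improved Algorithm and t-test Difference [J]. Journal of South-Central University for Nationalities (Natural Science Edition), 2017, (4): 100-105
Research on Zhuangwen Word Segmentation Algorithm Based on Mutual Information Improved Algorithm and t-test Difference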
  
Keywords: Zhuang word segmentation; improved MI algorithm; t-test difference; hybrid algorithm; semantic word
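Funding: Sub-project of the National Key Technology R&D Program of China (2015BAD29B01); Graduate Academic Innovation Fund of South-Central University for Nationalities (2017sycxjj051)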
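Authors and affiliations:
Qin Jun 1, Lin Yechuan 1, Yi Yunfei 2,*
1 College of Computer Science, South-Central University for Nationalities, Wuhan 430074
2 School of Computer and Information Engineering, Hechi University, Yizhou 546300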
Abstract:
      Traditional Zhuang (Zhuangwen) word segmentation treats the spaces between words as delimiters. In most cases this breaks up semantic words, which are formed by several associated words and express a complete, independent meaning. Building on earlier work that used mutual information (MI) to measure the degree of association between adjacent words, this paper applies the improved mutual information algorithm MIk and the t-test difference to Zhuang text segmentation for the first time. Combining their respective strengths in evaluating the static and dynamic binding ability of adjacent words, it proposes TD-MIk, a hybrid algorithm that combines MIk with the t-test difference, and compares the segmentation results of the three methods: MIk, the t-test difference, and TD-MIk. Experiments used text from the Zhuang-language edition of People's Daily Online as the training and test corpus. The results show that all three methods extract semantic words from text fairly accurately and effectively, and that the TD-MIk hybrid algorithm achieves the highest segmentation accuracy.
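The idea of scoring adjacent words by mutual information and joining strongly associated neighbours into one unit can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not give the MIk or TD-MIk formulas, so the code below assumes the pointwise form log2(p(x,y)^k / (p(x) p(y))) that appears elsewhere in the literature, with k = 1 reducing to plain MI. The function names `adjacent_mi` and `segment`, the threshold value, and the placeholder tokens are all hypothetical, and the t-test difference is not modelled here.

```python
import math
from collections import Counter


def adjacent_mi(tokens, k=1):
    """Score each adjacent word pair (x, y) as log2(p(x,y)**k / (p(x) * p(y))).

    Assumption: this generic form stands in for the paper's MIk, whose exact
    definition is not given in the abstract. k=1 is plain pointwise MI.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy ** k / (p_x * p_y))
    return scores


def segment(tokens, scores, threshold):
    """Join adjacent words whose score exceeds the (hypothetical) threshold."""
    if not tokens:
        return []
    units = []
    current = [tokens[0]]
    for prev, word in zip(tokens, tokens[1:]):
        if scores.get((prev, word), float("-inf")) > threshold:
            current.append(word)  # strong association: extend the current unit
        else:
            units.append(" ".join(current))  # weak association: cut here
            current = [word]
    units.append(" ".join(current))
    return units


# Placeholder tokens, not real Zhuang text.
tokens = ["x", "y", "x", "y", "z"]
scores = adjacent_mi(tokens)
units = segment(tokens, scores, threshold=1.0)
```

In practice the statistics would be estimated on a training corpus and the threshold tuned on held-out text; both choices are left open here because the abstract does not specify them.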