Automatic Indexing of Scientific Literature Based on Support Vector Machines and Core Feature Words

Source: 情报理论与实践 (Information Studies: Theory & Application)
Scientific papers typically report a study's purpose, methods, results, and conclusions. Indexing papers with these information types matters because it helps researchers quickly locate the content they need within a very large body of literature. Based on an analysis of how scientific papers are written, this article proposes a strategy for extracting core feature words of three kinds: words, English terms (proper nouns and abbreviations), and numbers. It then recasts the indexing task as a sentence classification problem and, using the proposed core feature words, applies a support vector machine (SVM) classifier to index papers semantically at the sentence level. Experiments on 1,168 medical papers on diabetes show that the method can effectively learn and label sentences in scientific papers, and thereby automatically index the key information points of those papers.
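The pipeline described in the abstract (core feature words, then sentence classification with an SVM) can be illustrated with a minimal sketch. Everything specific here is an assumption rather than the authors' implementation: the cue-word list, the feature names, the four toy sentences, the restriction to two classes (method vs. result), and the plain subgradient-descent trainer standing in for whatever SVM package the paper used.

```python
import re
from collections import Counter

# Hypothetical patterns for the "English" and "number" feature types.
ENGLISH = re.compile(r"[A-Za-z][A-Za-z0-9-]*")
NUMBER = re.compile(r"\d+(?:\.\d+)?%?")

# Assumed cue words; the paper's actual core feature word list is not given here.
CUES = ["采用", "结果"]

def core_features(sentence, cue_words):
    """Map one sentence to binary core features: cue words, English, numbers."""
    feats = Counter()
    for cue in cue_words:
        if cue in sentence:
            feats["word:" + cue] = 1
    if ENGLISH.search(sentence):
        feats["has_english"] = 1
    # Blank out English tokens first so digits inside e.g. "HbA1c" are not counted.
    rest = ENGLISH.sub(" ", sentence)
    if NUMBER.search(rest):
        feats["has_number"] = 1
    return feats

def score(w, x):
    """Linear decision value w . x for a sparse feature dict x."""
    return sum(w[k] * v for k, v in x.items())

def train_linear_svm(samples, labels, epochs=50, lr=0.1, lam=0.01):
    """Binary linear SVM (hinge loss + L2 penalty) via subgradient descent.

    labels must be +1 or -1. No bias term is used in this sketch.
    """
    w = Counter()
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * score(w, x)
            for k in list(w):              # L2 shrinkage step
                w[k] *= 1 - lr * lam
            if margin < 1:                 # hinge loss is active for this sample
                for k, v in x.items():
                    w[k] += lr * y * v
    return w

def predict(w, x):
    return 1 if score(w, x) > 0 else -1

# Made-up toy sentences for illustration only: -1 = method, +1 = result.
SENTENCES = [
    ("采用ELISA法检测血清胰岛素水平", -1),
    ("结果显示治疗组HbA1c下降1.2%", 1),
    ("采用随机对照方法分组", -1),
    ("结果表明两组差异有统计学意义", 1),
]
X = [core_features(s, CUES) for s, _ in SENTENCES]
Y = [y for _, y in SENTENCES]
W = train_linear_svm(X, Y)
```

Covering all four information types named in the abstract (purpose, method, result, conclusion) would need a multi-class scheme, for example one binary classifier per label; that extension is not shown here.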