The main aim of this work is to supply corpora for research on Uyghur, Kazakh, and Kyrgyz in fields such as natural language processing, speech recognition, speech synthesis, machine translation, information retrieval, Uyghur intelligent information monitoring, and Uyghur public opinion analysis. The design and implementation of the software draw on the grammar rules and linguistic features of Uyghur, Kazakh, and Kyrgyz, and incorporate the international character encodings of the three languages. In addition, the structure of each web page is analyzed according to the page's characteristics in order to identify its text. On this basis, a data collector was developed that crawls multilingual Uyghur, Kazakh, and Kyrgyz plain text from the web. The result is large-scale text prepared for building corpora for minority-language natural language processing research.
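As an illustration only, the following minimal Python sketch shows one way a collector could combine character encoding information with page structure to keep only target-language text. It is not the method of the paper. It assumes the target text is written in Arabic-based script and therefore falls in the Unicode Arabic block (U+0600 to U+06FF). The names `arabic_script_ratio`, `TextBlockParser`, and `extract_target_text`, as well as the thresholds `min_ratio` and `min_length`, are hypothetical choices made for this example.

```python
from html.parser import HTMLParser

# Assumption: target-language text is written in Arabic-based script,
# whose letters lie in the Unicode Arabic block U+0600..U+06FF.
ARABIC_START, ARABIC_END = 0x0600, 0x06FF


def arabic_script_ratio(text):
    """Return the fraction of alphabetic characters in the Arabic block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    hits = sum(1 for ch in letters if ARABIC_START <= ord(ch) <= ARABIC_END)
    return hits / len(letters)


class TextBlockParser(HTMLParser):
    """Collect visible text blocks, skipping script and style content."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.blocks.append(text)


def extract_target_text(html, min_ratio=0.8, min_length=10):
    """Keep text blocks that are long enough and mostly Arabic-script.

    min_ratio and min_length are illustrative thresholds, not values
    taken from the paper.
    """
    parser = TextBlockParser()
    parser.feed(html)
    parser.close()
    return [
        block
        for block in parser.blocks
        if len(block) >= min_length and arabic_script_ratio(block) >= min_ratio
    ]
```

A script-range check of this kind cannot separate Uyghur, Kazakh, and Kyrgyz from one another or from other Arabic-script languages. Telling them apart would need language-specific letters or rules of the sort the abstract mentions.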