A topic-focused web crawler searches the World Wide Web for pages that belong to a specific topic, then saves and indexes their keywords, paragraphs, and images. As the presentation forms, number, and content of web pages grow explosively, traditional topic discovery methods based on keyword matching can no longer give crawlers accurate topic identification. This poses a major challenge for key applications such as building effective web content databases for search engines and analyzing the topics of web content. This paper proposes a design for a topic-focused web crawler based on deep learning. A deep belief network builds a conceptual representation of web page content that is represented as text vectors, from which multi-level conceptual feature vectors for page topics are constructed. A support vector machine (SVM) model then identifies page topics quickly under the new feature representation, which effectively improves the crawler's topic recognition accuracy.
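The pipeline summarized in the abstract (text vectors, then deep belief network features, then an SVM) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation. It assumes scikit-learn is available and substitutes two stacked `BernoulliRBM` layers, trained greedily layer by layer, for a full deep belief network, with no supervised fine-tuning. The toy corpus, the layer sizes, and the helper name `build_topic_classifier` are all hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus: page text paired with a topic label.
pages = [
    "football match score goal league",
    "team wins league title after final match",
    "stock market shares fall as bank rates rise",
    "bank raises interest rates and market shares drop",
]
topics = ["sports", "sports", "finance", "finance"]


def build_topic_classifier(seed=0):
    """Assumed stand-in for the paper's model: stacked RBMs feeding an SVM."""
    return Pipeline([
        # Binary bag-of-words vector for each page (values are 0 or 1).
        ("bow", CountVectorizer(binary=True)),
        # Each RBM is fit on the output of the step before it, which gives
        # greedy layer-wise unsupervised training of the feature layers.
        ("rbm1", BernoulliRBM(n_components=16, learning_rate=0.05,
                              n_iter=30, random_state=seed)),
        ("rbm2", BernoulliRBM(n_components=8, learning_rate=0.05,
                              n_iter=30, random_state=seed)),
        # The SVM classifies pages in the learned feature space.
        ("svm", SVC(kernel="linear")),
    ])


model = build_topic_classifier()
model.fit(pages, topics)
predicted = model.predict(["league final match goal"])
print(predicted[0])
```

On a corpus this small the learned features carry little signal, so the predicted label is not meaningful. The sketch only shows how the three stages connect. A crawler would call `model.predict` on the extracted text of each fetched page and keep the pages whose label matches the target topic.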