By refining and reclassifying the feature factors that Web focused (topic-specific) crawlers use to predict link priority, this paper introduces two new features, harvest rate and media type, as a basis for relevance judgment, and proposes an improved best-first search algorithm. The algorithm uses a "fine-grained" strategy to filter out irrelevant web pages and selects representative feature factors from multiple perspectives to construct the link priority formula, so as to reveal and predict the topic of a link comprehensively. A small-scale experimental comparison with three other kinds of topic search algorithms shows that the improved algorithm performs comparatively well in terms of harvest rate and average number of submitted links.
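The abstract does not state the priority formula itself, so the following Python sketch only illustrates the general shape of a best-first frontier driven by a multi-feature link score. The feature names, the weights in `WEIGHTS`, the weighted-sum form, and the `Frontier` class are all assumptions made for illustration, not the authors' formula. `harvest_rate` uses the usual focused-crawling definition (relevant pages divided by pages fetched).

```python
import heapq

# Hypothetical feature weights (assumption): the source does not give
# the actual coefficients or the exact set of feature factors.
WEIGHTS = {
    "anchor_text": 0.4,       # relevance of the link's anchor text
    "parent_relevance": 0.3,  # relevance of the page containing the link
    "harvest_rate": 0.2,      # harvest rate observed so far for the link's source
    "media_type": 0.1,        # 1.0 if the target looks like a wanted media type
}


def link_priority(features):
    """Weighted sum of feature scores, each assumed to lie in [0, 1]."""
    return sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())


def harvest_rate(relevant_pages, fetched_pages):
    """Fraction of fetched pages judged relevant; 0.0 before any fetch."""
    return relevant_pages / fetched_pages if fetched_pages else 0.0


class Frontier:
    """Best-first frontier: always pops the highest-priority unseen URL."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker that keeps insertion order stable

    def push(self, url, features):
        if url in self._seen:
            return  # skip URLs that were already queued
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the best first.
        heapq.heappush(self._heap, (-link_priority(features), self._counter, url))
        self._counter += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


frontier = Frontier()
frontier.push("http://example.com/a", {"anchor_text": 0.9, "parent_relevance": 0.5})
frontier.push("http://example.com/b", {"anchor_text": 0.2, "media_type": 1.0})
best = frontier.pop()  # "http://example.com/a" has the higher score
```

A "fine-grained" filter in the sense of the abstract could be approximated in this sketch by dropping links whose score falls below a threshold before they are pushed; the threshold value would again be an assumption.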