Mining of Web Chemistry and Chemical Engineering Resources, and Cheminformatics

Zhaojie Xia, Chunyan Liang, Li Guo, Xiaoxia Li, Zhangyuan Yang, Design and Implementation of a Chemistry-Topic Search Engine, 40th IUPAC Congress, Beijing, China, August 14-19, 2005

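Title: Design and Implementation of a Chemistry-Topic Search Engine
Authors: Zhaojie Xia, Chunyan Liang, Li Guo, Xiaoxia Li, Zhangyuan Yang; State Key Laboratory of Multiphase Complex Systems, Institute of Process Engineering, Chinese Academy of Sciences: High Performance Computing and Cheminformatics Group
Keywords: chemistry search engine; chemistry-topic search engine; web crawling; system design; system implementation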
Abstract: In this paper, we present ChemEngine, a prototype chemistry-topic search engine. Such engines should grow in popularity because they can offer greater accuracy and extra functionality not possible with general-purpose Internet search engines.
Like a general-purpose search engine [1], ChemEngine consists of three major components: a crawler, an indexer, and a searcher. Each component, however, is focused on Internet chemistry resources.
The chemistry-focused crawler is designed to gather only pages on chemistry topics, whereas a general-purpose crawler gathers as many pages as it can. The focused crawler performs best-first search. It starts from a seed URL bank of more than 10,000 chemistry links drawn from ChIN [2], the chemistry portal of the Chinese National Science Digital Library. The crawler fetches a page, determines its relevance to the chemistry topic, and adds all links from the page to the frontier with a score equal to the relevance of the parent page. In the next iteration, the crawler picks the URLs with the highest scores in the frontier to crawl. Several types of classifiers are used to determine page relevance, such as k-NN, naive Bayes, and support vector machines.
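The best-first loop described above can be sketched roughly as follows. This is a minimal illustration, not the ChemEngine code: the names `best_first_crawl`, `fetch`, and `relevance` are hypothetical, the toy in-memory "web" stands in for real HTTP fetching, and the keyword test stands in for the k-NN, naive Bayes, or SVM classifiers mentioned in the abstract.

```python
import heapq
from typing import Callable, Iterable, List, Tuple


def best_first_crawl(
    seeds: Iterable[str],
    fetch: Callable[[str], Tuple[str, List[str]]],
    relevance: Callable[[str], float],
    max_pages: int,
    seed_score: float = 1.0,
) -> List[Tuple[str, float]]:
    """Crawl up to max_pages, always expanding the highest-scored frontier URL."""
    # heapq is a min-heap, so scores are negated; the counter keeps
    # insertion order among URLs that share the same score.
    frontier: List[Tuple[float, int, str]] = []
    seen = set()
    counter = 0
    for url in seeds:
        if url not in seen:
            seen.add(url)
            heapq.heappush(frontier, (-seed_score, counter, url))
            counter += 1

    crawled: List[Tuple[str, float]] = []
    while frontier and len(crawled) < max_pages:
        _, _, url = heapq.heappop(frontier)
        text, links = fetch(url)
        score = relevance(text)  # relevance of the page just fetched
        crawled.append((url, score))
        for link in links:
            if link not in seen:
                seen.add(link)
                # Each outgoing link inherits the relevance of its parent page.
                heapq.heappush(frontier, (-score, counter, link))
                counter += 1
    return crawled


# Hypothetical toy web: url -> (page text, outgoing links).
TOY_WEB = {
    "seed": ("chemistry portal", ["sports", "chem"]),
    "sports": ("football scores", ["sports2"]),
    "chem": ("organic chemistry reactions", ["chem2"]),
    "chem2": ("chemistry of polymers", []),
    "sports2": ("more football", []),
}


def toy_fetch(url: str) -> Tuple[str, List[str]]:
    return TOY_WEB[url]


def toy_relevance(text: str) -> float:
    # Stand-in for a trained classifier (assumption for illustration only).
    return 1.0 if "chemistry" in text else 0.0
```

Calling `best_first_crawl(["seed"], toy_fetch, toy_relevance, max_pages=5)` on this toy web visits `chem2` before `sports2`, even though `sports2` was discovered earlier, because `chem2` inherits a higher score from its relevant parent.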
The indexer parses pages and stores information about them in Unicode, which makes it possible to implement a multi-language search engine. Pages in English and Chinese can therefore be searched in a single database. In addition, a chemistry dictionary can be used for tokenization in this step.
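The abstract does not say how the chemistry dictionary is applied during tokenization. One plausible approach, assumed here purely for illustration, is forward maximum matching: at each position the longest dictionary term is taken as a token, with a simple fallback for text the dictionary does not cover. The function name `tokenize` and the sample dictionary are hypothetical.

```python
from typing import List, Set


def tokenize(text: str, dictionary: Set[str]) -> List[str]:
    """Forward maximum matching over a Unicode string."""
    max_len = max((len(term) for term in dictionary), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1
            continue
        # Try the longest dictionary term that starts at position i.
        match = None
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                match = text[i:j]
                break
        if match is not None:
            tokens.append(match)
            i += len(match)
        elif ch.isascii() and ch.isalnum():
            # Fallback: a run of ASCII letters and digits forms one token.
            j = i
            while j < len(text) and text[j].isascii() and text[j].isalnum():
                j += 1
            tokens.append(text[i:j])
            i = j
        else:
            # Fallback: any other character (e.g. an unknown CJK
            # character or punctuation) becomes a single-character token.
            tokens.append(ch)
            i += 1
    return tokens


# Hypothetical mixed English/Chinese chemistry dictionary.
CHEM_TERMS = {"sodium chloride", "氯化钠", "溶液"}
```

With this dictionary, `tokenize("sodium chloride 氯化钠溶液", CHEM_TERMS)` keeps the multi-word English term and both Chinese terms intact. A production tokenizer would also need to respect word boundaries so that a dictionary term is not matched inside a longer Latin-script word.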
The searcher of ChemEngine aims to provide high-quality search results efficiently. Hits are sorted according to PageRank, and a categorizing interface groups the search results according to a taxonomy of chemistry subjects. This addresses the information overload problem that occurs when users face thousands of results returned by a web search.
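As a rough sketch of the result presentation described above, hits can be grouped by subject category and ordered by PageRank within each group. The `Hit` fields and the `categorize` function are hypothetical names. How ChemEngine assigns pages to taxonomy categories is not stated in the abstract, so the category is assumed to be already attached to each hit.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Hit:
    url: str
    category: str    # taxonomy label, assumed to be assigned upstream
    pagerank: float  # precomputed PageRank score


def categorize(hits: Iterable[Hit]) -> Dict[str, List[Hit]]:
    """Group hits by category, highest PageRank first within each group."""
    groups: Dict[str, List[Hit]] = defaultdict(list)
    for hit in hits:
        groups[hit.category].append(hit)
    for group in groups.values():
        group.sort(key=lambda h: h.pagerank, reverse=True)
    return dict(groups)
```

A user facing thousands of hits could then open only the category of interest, for example organic chemistry, instead of scanning one long ranked list.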
ChemEngine currently searches several million pages and supports Boolean searching.
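The abstract does not describe how Boolean searching is implemented. A common approach, assumed here, is to evaluate AND, OR, and NOT as set operations over an inverted index. The function names and the whitespace-based term splitting below are illustrative simplifications.

```python
from collections import defaultdict
from typing import Dict, Set


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each lower-cased term to the set of document ids containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def search_and(index: Dict[str, Set[int]], a: str, b: str) -> Set[int]:
    # Documents containing both terms.
    return index.get(a, set()) & index.get(b, set())


def search_or(index: Dict[str, Set[int]], a: str, b: str) -> Set[int]:
    # Documents containing at least one of the terms.
    return index.get(a, set()) | index.get(b, set())


def search_and_not(index: Dict[str, Set[int]], a: str, b: str) -> Set[int]:
    # Documents containing term a but not term b.
    return index.get(a, set()) - index.get(b, set())


# Hypothetical three-document collection.
DOCS = {1: "benzene synthesis", 2: "benzene toxicity", 3: "ethanol synthesis"}
```

For `index = build_index(DOCS)`, the query `search_and(index, "benzene", "synthesis")` returns only document 1, while `search_and_not(index, "benzene", "synthesis")` returns only document 2.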