Mining Web Chemistry and Chemical Engineering Resources, and Cheminformatics: High Performance Computing and Cheminformatics Group


Xiaoxia Li, Chunmei Chu, Liuyi Zhuo, Li Guo, “Searching the Chemistry Deep Web”, 40th IUPAC Congress, Beijing, China, August 14-19, 2005 (invited session talk)


Citation: Xiaoxia Li, Chunmei Chu, Liuyi Zhuo, Li Guo, “Searching the Chemistry Deep Web”, 40th IUPAC Congress, Beijing, China, August 14-19, 2005

Title: Searching the Chemistry Deep Web

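Authors: Xiaoxia Li, Chunmei Chu, Liuyi Zhuo, Li Guo; High Performance Computing and Cheminformatics Group, State Key Laboratory of Multiphase Complex Systems, Institute of Process Engineering, Chinese Academy of Sciences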

Keywords: chemistry Deep Web search; data extraction; chemical databases; search engine

Abstract: With the rapid spread of the Internet over the last decade, the Web has become the largest collection of chemical information, and it is still growing. A few general-purpose search engines such as Google are now everyday tools for finding chemistry resources on the Internet. A search engine works by automatically crawling the Web along hyperlinks and indexing as many web pages as possible based on hyperlink analysis. Publicly accessible chemical databases are valuable resources on the Internet. However, because chemicals and their physical and chemical properties are so diverse, the databases differ in the data items and in the compounds or systems they cover. Finding and combining targeted data by manually searching every candidate database site is tedious, and it is frustrating when no data turn up after all that searching. Unlike static web pages, data in Internet chemistry databases can be retrieved only through a query, so search engines cannot index or search them directly. Web databases are therefore called the Deep Web or Invisible Web, in contrast to the Surface Web that search engine crawlers index by following hyperlinks. Accordingly, the data in distributed chemical databases on the Internet may be regarded as the chemistry Deep Web. Searching the chemistry Deep Web automatically is a new challenge: the data are accessible only through a query, and the result pages generated by chemical databases are mostly HTML documents, which are not well suited to automated processing. Moreover, the search interface of a database may change over time. This presentation introduces an approach for searching public chemical databases on the Internet simultaneously by chemical name, formula, and CAS number. The basic idea is that a database can be added to, or updated in, the chemistry Deep Web search engine by configuration alone, so that maintenance costs less than modifying program code.
The approach is based on a three-tier model comprising a client agent that responds to user queries, a middle-tier server agent that extracts and integrates data from the targeted sites, and local databases that act as data managers providing data retrieval. It is implemented with a combination of HTTP, Java, and XML technologies. ChemDB Portal, the implementation of this approach, currently makes it possible to search five databases simultaneously, which demonstrates the potential of the approach for building a future search engine dedicated to chemistry data in the Deep Web.
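The configuration-driven idea can be illustrated with a minimal Java sketch. It is not the ChemDB Portal code: the class `DeepWebSketch`, the `SourceConfig` fields, the `{q}` placeholder, the `db-a.example` URL, and the sample HTML are all hypothetical, and a plain Java object stands in for what the abstract suggests would be an XML configuration entry. The sketch shows only the middle-tier step of turning a query into a database-specific URL and extracting property/value pairs from a result page; the HTTP fetch is replaced by a canned page so the example runs offline.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DeepWebSketch {

    /** Hypothetical per-database configuration (an XML entry in a real deployment). */
    static final class SourceConfig {
        final String name;
        final String urlTemplate; // "{q}" marks where the encoded query is inserted
        final Pattern rowPattern; // group 1 = property name, group 2 = value

        SourceConfig(String name, String urlTemplate, String rowRegex) {
            this.name = name;
            this.urlTemplate = urlTemplate;
            this.rowPattern = Pattern.compile(rowRegex);
        }
    }

    /** Builds the query URL for one database from its configured template. */
    static String buildQueryUrl(SourceConfig cfg, String query) {
        String encoded = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return cfg.urlTemplate.replace("{q}", encoded);
    }

    /** Extracts property/value pairs from an HTML result page using the configured pattern. */
    static Map<String, String> extract(SourceConfig cfg, String html) {
        Map<String, String> data = new LinkedHashMap<>();
        Matcher m = cfg.rowPattern.matcher(html);
        while (m.find()) {
            data.put(m.group(1).trim(), m.group(2).trim());
        }
        return data;
    }

    public static void main(String[] args) {
        // Supporting another database means adding another SourceConfig, not new code.
        SourceConfig dbA = new SourceConfig(
                "DB-A",
                "http://db-a.example/search?cas={q}",
                "<tr><td>([^<]+)</td><td>([^<]+)</td></tr>");

        // Canned result page standing in for the HTTP response of the remote database.
        String page = "<table>"
                + "<tr><td>Boiling point</td><td>100 C</td></tr>"
                + "<tr><td>Molar mass</td><td>18.015 g/mol</td></tr>"
                + "</table>";

        System.out.println(buildQueryUrl(dbA, "7732-18-5"));
        System.out.println(dbA.name + ": " + extract(dbA, page));
    }
}
```

Under these assumptions, a change in a database's search interface would be handled by editing its URL template or extraction pattern, which is the maintenance saving the abstract describes.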