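Mining of Web Chemistry and Chemical Engineering Resources and Cheminformatics ---- High-Performance Computing and Cheminformatics Group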


Xiaoxia Li, et al., Towards Automated Searching of data in Internet Chemical Databases, The 8th International Conference on Chemical Structures, June 1-5, 2008, Noordwijkerhout, Netherlands.



Title: Towards Automated Searching of data in Internet Chemical Databases

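Authors: Xiaoxia Li, Xiaolong Yuan, Zengcai Liu, Li Guo; State Key Laboratory of Multiphase Complex Systems, Institute of Process Engineering, Chinese Academy of Sciences: High-Performance Computing and Cheminformatics Group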

Keywords: Deep Web search; data extraction; search engine; Web crawling

Abstract: Publicly accessible chemical databases are valuable resources on the Internet. To find data on the Web, one must first know whether such data is available and where to get it. Possible data sources can be located with general search engines such as Google, or with comprehensive Web directories of chemistry resources that list chemical databases, such as ChIN, which indexes more than 200 freely accessible chemical databases.

Because chemicals and their properties are so diverse, the coverage of compounds and property items varies from one chemical database to another. To find data for a chemical, one very often has to search all candidate database Web sites manually, one by one. The data in chemical databases cannot be indexed and searched by traditional search engines based on hyperlink analysis, because the Web pages containing the target data are generated dynamically by the database servers in response to a query. Such pages do not exist before the query and are not kept on the server afterwards, so search-engine crawlers that follow hyperlinks cannot reach them. Web databases of this kind are collectively called the Deep Web [2]; accordingly, the data held in the various chemical databases as a whole is called the Chemistry Deep Web here. A search tool for the Chemistry Deep Web would not only overcome the limitation of current search engines in finding data for chemicals on the Internet, but also make it possible to integrate data from different sources for further use in computational applications. To our knowledge, ChemFinder from CambridgeSoft [3] is probably the only useful tool that helps search the Chemistry Deep Web by automatically submitting a data query to different chemical databases.

This presentation reports an approach to developing ChemDB Portal, which aims to search the data in various Web-based chemistry databases with a single query. ChemDB Portal is implemented by combining HTTP, Java and XML technologies [4,5]. In ChemDB Portal, a query is created and submitted to different Web-based chemical databases on the Internet. The HTML documents returned from these sites, which contain the target data, are first transformed into XHTML by Tidy. The target data is then extracted into an XML document by a data extraction template written as an XSLT document, and the XML can be further mapped into a database for XML-based retrieval.
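
The extraction step described above can be sketched in a few lines. The example is an illustration only, not the ChemDB Portal implementation: it uses Python's standard library rather than Java, assumes the page has already been normalized to well-formed XHTML (the Tidy step is skipped), and replaces the XSLT template with a plain mapping from output field names to ElementTree path expressions. The sample page, the TEMPLATE mapping and the compound element name are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, already normalized to well-formed XHTML
# (the kind of output a Tidy-like step is assumed to produce).
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <h1>Benzene</h1>
    <table id="props">
      <tr><th>CAS</th><td>71-43-2</td></tr>
      <tr><th>Formula</th><td>C6H6</td></tr>
    </table>
  </body>
</html>"""

NS = {"x": "http://www.w3.org/1999/xhtml"}

# Hypothetical extraction template: output field -> path into the XHTML tree.
# This stands in for the XSLT template named in the text.
TEMPLATE = {
    "name": ".//x:h1",
    "cas": ".//x:table[@id='props']/x:tr[1]/x:td",
    "formula": ".//x:table[@id='props']/x:tr[2]/x:td",
}


def extract(xhtml: str, template: dict) -> ET.Element:
    """Apply the template to one XHTML page and return an XML record."""
    root = ET.fromstring(xhtml)
    record = ET.Element("compound")
    for field, path in template.items():
        node = root.find(path, NS)
        # Fields whose path matches nothing are simply left out of the record.
        if node is not None and node.text:
            ET.SubElement(record, field).text = node.text.strip()
    return record


record = extract(XHTML, TEMPLATE)
print(ET.tostring(record, encoding="unicode"))
```

One template per source site keeps the site-specific knowledge in data rather than in code, which is the role the XSLT documents play in the approach described here.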

Creating a data extraction template for the target data is the key to this XML-based approach, and doing so manually is both tedious and challenging. A semi-automated tool called XE_ChemD has been created to help build data extraction templates for the Chemistry Deep Web. XE_ChemD retrieves an HTML document from a given URL and normalizes it to XHTML, which is parsed into an XML tree at the same time. Once the target data in the source tree has been selected, the candidate XPath expressions that form the template can be generated automatically from the context of the target data, in terms of its dependence on content, structural, or formatting features.
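
To make the three kinds of context concrete, here is a minimal sketch of candidate path generation for one selected node. It is not XE_ChemD's algorithm, which the abstract does not detail; it assumes a simple label/value table layout, omits the XHTML namespace for brevity, and emits paths in the XPath subset that Python's ElementTree can evaluate. The function name candidate_paths and the sample page are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical normalized page (XHTML namespace omitted to keep paths short).
PAGE = """<html><body>
  <table>
    <tr><th>CAS</th><td class="val">71-43-2</td></tr>
    <tr><th>Formula</th><td class="val">C6H6</td></tr>
  </table>
</body></html>"""


def candidate_paths(root, target):
    """Return candidate paths for a non-root target node, keyed by feature type."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    candidates = {}

    # Structural: positional steps from the root down to the target.
    steps, node = [], target
    while node is not root:
        parent = parent_of[node]
        same_tag = [c for c in parent if c.tag == node.tag]
        steps.append(f"{node.tag}[{same_tag.index(node) + 1}]")
        node = parent
    candidates["structural"] = "./" + "/".join(reversed(steps))

    # Content: anchor on the label cell in the same row, when one exists.
    # (Assumes the label text contains no quote characters.)
    row = parent_of[target]
    label = row.find("th")
    if label is not None and label.text:
        candidates["content"] = f".//{row.tag}[th='{label.text}']/{target.tag}"

    # Formatting: anchor on a presentational attribute such as class.
    cls = target.get("class")
    if cls:
        candidates["formatting"] = f".//{target.tag}[@class='{cls}']"
    return candidates


root = ET.fromstring(PAGE)
target = root.find(".//tr[2]/td")  # the user-selected cell holding "C6H6"
paths = candidate_paths(root, target)
for kind, path in paths.items():
    print(kind, path, [e.text for e in root.findall(path)])
```

In this sample the structural and content candidates each match only the selected cell, while the formatting candidate also matches the CAS cell. A template builder would therefore need to check each candidate against the page and prefer the ones that select exactly the target.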

The data in the chemical databases indexed in ChemDB Portal can be searched by a query with an identifier of a compound, such as its CAS Registry Number, formula, name or structure (see Figure 2). A single query to ChemDB Portal can now search 8 databases simultaneously, which demonstrates its potential to become a search engine dedicated to chemistry data in the Deep Web in the future.
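
Sending one query to several databases at once can be sketched as a simple fan-out. This is a rough illustration under stated assumptions, not the portal's code: the source names, URLs and parameter names in SOURCES are placeholders rather than the databases actually indexed, and fake_fetch stands in for the real HTTP request and extraction step so that the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlencode

# Hypothetical source registry: each site has its own URL and query parameter.
SOURCES = {
    "db_a": {"url": "https://db-a.example/search", "param": "cas"},
    "db_b": {"url": "https://db-b.example/query", "param": "q"},
}


def build_request_url(source, cas_number):
    """Translate the single user query into one site-specific request URL."""
    return f"{source['url']}?{urlencode({source['param']: cas_number})}"


def search_all(cas_number, fetch, sources=SOURCES):
    """Send the same query to every source in parallel; collect results by source."""
    urls = {name: build_request_url(src, cas_number) for name, src in sources.items()}
    results = {}
    with ThreadPoolExecutor(max_workers=len(urls)) as pool:
        futures = {name: pool.submit(fetch, url) for name, url in urls.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result(timeout=10)
            except Exception as exc:
                # One failing site should not sink the whole query.
                results[name] = {"error": str(exc)}
    return results


def fake_fetch(url):
    """Stand-in for HTTP retrieval plus extraction; returns canned data."""
    if "db-b" in url:
        raise TimeoutError("no response")
    return {"url": url, "formula": "C6H6"}


results = search_all("71-43-2", fake_fetch)
for name, result in results.items():
    print(name, result)
```

Recording a per-source error instead of raising keeps partial results usable, which matters when the sources are independent public Web sites with varying availability.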