Institut für Informatik, Universität Freiburg

Semistructured Data and XML

The Web provides access to large data sources that are not explicitly organized as databases; instead, the information is presented as semistructured data. In contrast to integrating classical distributed databases, handling such data raises several new problems, such as schema discovery, wrapping and reorganizing the data sources, and coping with changes in autonomous sources. XML solves the problem of wrapping data from HTML into another data model.

Information Integration from the Web

Continuing the FLORID project, the FLORID system has been extended with Web access capabilities (Versions 2.x). A methodology for wrapping and integrating HTML pages has been developed: the information is mapped into an integrated F-Logic data model that represents both the structure of the data sources and an application-level model of the information. HTML pages are wrapped using generic rules for the usual structuring means (i.e., lists, tables, comma-lists, and emphasized keywords). The MONDIAL case study documents the practicability of the approach.
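
To give the flavor of such a generic rule (FLORID itself expresses them in F-Logic; the following is only a minimal Java/DOM sketch over a hypothetical, well-formed XHTML fragment), a table wrapper can map every table row to a flat tuple, independently of the page's application semantics:

    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.*;
    import java.io.ByteArrayInputStream;

    public class TableWrapper {
        public static void main(String[] args) throws Exception {
            // Hypothetical well-formed XHTML fragment; real pages would be
            // fetched from the Web and cleaned up first.
            String page = "<html><body><table>"
                    + "<tr><td>Germany</td><td>Berlin</td></tr>"
                    + "<tr><td>France</td><td>Paris</td></tr>"
                    + "</table></body></html>";
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(page.getBytes("UTF-8")));
            // Generic rule: every <tr> yields a tuple of its <td> contents.
            NodeList rows = doc.getElementsByTagName("tr");
            for (int i = 0; i < rows.getLength(); i++) {
                NodeList cells = ((Element) rows.item(i))
                        .getElementsByTagName("td");
                StringBuilder tuple = new StringBuilder("tuple(");
                for (int j = 0; j < cells.getLength(); j++) {
                    if (j > 0) tuple.append(", ");
                    tuple.append(cells.item(j).getTextContent());
                }
                System.out.println(tuple.append(")"));
            }
        }
    }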

The experiences with F-Logic and FLORID are now continued in the LoPiX (Logic Programming in XML) project, which deals with the integration of XML data.

Rewriting Queries with Caches

In the World Wide Web environment, the heterogeneity of online data sources (e.g., databases on the Internet) poses new challenges for data models and database theory in data integration: an expressive language has to be defined; some data sources are not always available; access to remote databases is expensive; data sources are sometimes incomplete; etc. Proxy cache servers provide a promising solution for data integration: parts of the answers to earlier queries are stored in the caches, and later queries can be answered directly from the caches instead of going to the data sources.

In this part, we tackle the problem of semantic cache answering, or query rewriting using caches, from the theoretical side: how to decide whether the caches can answer a query, either partially or equivalently; how to compute the answer; and, if the query can only be answered partially, how to obtain the remainder query. Due to the flexibility of Internet data sources, we are confronted with a more complex data model involving inequality, negation, and schema information.
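
As a minimal illustration of the containment reasoning involved (a sketch with hypothetical names, covering only a single numeric range predicate of the form lo <= x < hi; the full problem additionally involves conjunctions, inequality, negation, and schema information), consider deciding whether a cached range of answers subsumes, overlaps, or misses a queried range:

    public class SemanticCache {
        static final class Range {
            final double lo, hi;  // half-open interval [lo, hi)
            Range(double lo, double hi) { this.lo = lo; this.hi = hi; }
            boolean contains(Range q) { return lo <= q.lo && q.hi <= hi; }
            boolean overlaps(Range q) { return lo < q.hi && q.lo < hi; }
            public String toString() { return "[" + lo + ", " + hi + ")"; }
        }

        // Decide how a cached range can contribute to answering a query.
        static void answer(Range cache, Range query) {
            if (cache.contains(query)) {
                // Equivalent rewriting: filter the cached answers,
                // no access to the data source is needed.
                System.out.println("answer " + query + " entirely from cache");
            } else if (cache.overlaps(query)) {
                // Partial rewriting: the overlap is answered from the cache,
                // the remainder queries go to the data source.
                System.out.println("probe cache; remainder queries:");
                if (query.lo < cache.lo)
                    System.out.println("  fetch " + new Range(query.lo, cache.lo));
                if (cache.hi < query.hi)
                    System.out.println("  fetch " + new Range(cache.hi, query.hi));
            } else {
                System.out.println("cache useless; send " + query + " to source");
            }
        }

        public static void main(String[] args) {
            Range cached = new Range(20, 50);   // cached: 20 <= x < 50
            answer(cached, new Range(30, 40));  // equivalent: cache suffices
            answer(cached, new Range(10, 40));  // partial: remainder [10, 20)
            answer(cached, new Range(60, 70));  // disjoint: go to the source
        }
    }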

It has been recognised that XML query processing with XPATH can be done efficiently. The theoretical basis is that each XPATH expression can be effectively translated into CTL, a branching-time logic from model checking. Since CTL model checking takes linear time, XML processing with XPATH achieves this complexity as well. In fact, one such implementation already exists. There are two possible implementation methods:
  1. Transform each XPATH expression into CTL and the XML document into a Kripke structure, then apply an existing model checking algorithm.
  2. Implement the algorithm natively, "borrowing" the labeling method used in CTL model checking (recommended; see the sketch below).
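
As a minimal sketch of the labeling idea behind method 2 (hypothetical element names; only the standard Java DOM API is assumed), consider the query //section[descendant::figure]. In CTL terms, a section node qualifies if one of its children satisfies EF figure, and the EF labels can be computed bottom-up in a single linear pass over the tree:

    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.*;
    import java.io.ByteArrayInputStream;
    import java.util.*;

    public class XPathLabeling {
        // Bottom-up labeling: a node satisfies "EF figure" iff it is a
        // <figure> element or one of its children already carries the label.
        static void label(Node n, Set<Node> efFigure) {
            for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling())
                label(c, efFigure);
            boolean self = n.getNodeType() == Node.ELEMENT_NODE
                    && n.getNodeName().equals("figure");
            boolean below = false;
            for (Node c = n.getFirstChild(); c != null; c = c.getNextSibling())
                if (efFigure.contains(c)) { below = true; break; }
            if (self || below) efFigure.add(n);
        }

        public static void main(String[] args) throws Exception {
            String xml = "<doc><section><p/><figure/></section>"
                       + "<section><p/></section></doc>";
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
            Set<Node> efFigure = new HashSet<Node>();
            label(doc.getDocumentElement(), efFigure);
            // Select sections with a figure strictly below them (EX EF figure).
            NodeList sections = doc.getElementsByTagName("section");
            for (int i = 0; i < sections.getLength(); i++)
                for (Node c = sections.item(i).getFirstChild(); c != null;
                        c = c.getNextSibling())
                    if (efFigure.contains(c)) {
                        System.out.println("match: section #" + (i + 1));
                        break;
                    }
        }
    }

Each node is visited a constant number of times per subformula, which is exactly the linear-time behaviour of the CTL labeling algorithm.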
Required knowledge: XPATH, XML, Java programming
Tools: XML/DOM parser, XPATH parser in Java (both are standard), JDK.

Supervision: Fang Wei
