Dagstuhl-Seminar "Rule Markup Techniques"
Feb. 3-8, 2002, Schloss Dagstuhl, Germany.

Data Manipulation and Integration in XML

Wolfgang May

Abstract: XPathLog is a Logic-Programming style language for XML querying, data manipulation, and integration. It has been designed as a crossbreed between F-Logic (which has been successfully applied to semistructured data in pre-XML times) and XPath. Its main features are the extension of XPath with variable bindings and the definition of a constructive semantics for XPath atoms in rule heads. For updates and data integration, we favor a graph-based data model instead of the XML tree model: The XTreeGraph is an extension of the XML data model that allows multiple overlapping trees in a graph-like database. Result views are then defined as XML tree views over this internal database. XPathLog and the XTreeGraph have been implemented in the LoPiX system.

Post-Workshop Summary of the Talk: Because of the many questions at the workshop, the talk was given a second title, "From F-Logic to XPathLog". In addition to presenting XPathLog, it describes the language's development as a reconciliation between F-Logic, a "proprietary" data model and language for knowledge representation and data integration, and the standards of the XML world.

The XPathLog language is a Datalog-like extension of XPath for querying, manipulating, and integrating XML data. Based on navigation and filtering, its basic, internal semantics is closely related to F-Logic. The querying part extends XPath by binding variables to the XML nodes that are "traversed" when an XPath expression is evaluated. Variables can be bound to literals, nodes, and even names, which allows for metadata reasoning. The variable bindings can be output as answers, or they can be communicated to the rule head to specify updates to the database. In contrast to other approaches, the XPath syntax and semantics are also used for a declarative specification of how the database should be updated: when used in rule heads, XPath filters are interpreted as specifications of elements and properties that should be added to the database.
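The idea of binding variables to the nodes traversed along a path can be illustrated with a small Python sketch. This is not the XPathLog or LoPiX implementation: the sample document, the function name bind_path, and the (tag, variable) step encoding are assumptions made for illustration only, and only child-axis steps are handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document (not taken from the talk).
DOC = """
<world>
  <country code="D"><name>Germany</name><city>Berlin</city><city>Bonn</city></country>
  <country code="F"><name>France</name><city>Paris</city></country>
</world>
"""

def bind_path(node, steps, bindings=None):
    """Yield one dict of variable bindings per way of traversing `steps` from `node`.

    Each step is a pair (child_tag, variable_name_or_None). A step with a
    variable name records the node reached by that step under that name.
    """
    bindings = bindings or {}
    if not steps:
        yield dict(bindings)
        return
    (tag, var), rest = steps[0], steps[1:]
    for child in node.findall(tag):      # direct children with this tag
        extended = dict(bindings)
        if var is not None:
            extended[var] = child        # bind the traversed node
        yield from bind_path(child, rest, extended)

root = ET.fromstring(DOC)

# Roughly: "for every country C and every city T below it".
answers = [
    (b["C"].findtext("name"), b["T"].text)
    for b in bind_path(root, [("country", "C"), ("city", "T")])
]
print(answers)
```

Each answer is one consistent set of bindings, much as a Datalog query returns one tuple per satisfying substitution. In a rule-based setting, the same bindings would be handed to the rule head instead of being printed.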

The restriction that XML uses a tree data model directly affects the semantics of updates: if an update specifies that some subtree of a document should also be inserted at another place, the subtree must be copied. Thus, for references into this tree, it must be decided whether they point into the original or into the copy; "sharing" subtrees - as in graph data models such as OEM or F-Logic - is not possible. On the data integration level, when restructuring trees, fusing nodes, and introducing synonyms, this problem becomes even more pressing.
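The difference between copying and sharing a subtree can be made concrete with a minimal Python sketch. The Node class and the sample names are hypothetical and are not part of XPathLog; the sketch only contrasts the two behaviors described above.

```python
import copy

class Node:
    """A minimal element node: a name, optional text, and a list of children."""
    def __init__(self, name, text=None):
        self.name = name
        self.text = text
        self.children = []

city = Node("city", "Bonn")
country = Node("country")
country.children.append(city)

# Tree model: the subtree has one parent only, so a second occurrence is a copy.
org_tree = Node("organization")
org_tree.children.append(copy.deepcopy(city))

# Graph model: the second parent refers to the very same node.
org_graph = Node("organization")
org_graph.children.append(city)

# A later update to the original node ...
city.text = "Berlin"

tree_view = org_tree.children[0].text    # the copy is unaffected
graph_view = org_graph.children[0].text  # the shared node reflects the update
print(tree_view, graph_view)
```

With the copy, any reference to the city has to choose between two now-diverging nodes; with sharing, there is only one node to refer to.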

Thus, as a data manipulation and integration language, XPathLog - and the LoPiX implementation - internally use a graph-based, edge-labeled model, called XTreeGraph. The XTreeGraph extends the basic XML data model by modeling multiple overlapping trees, and thus allows for restructuring existing XML trees into a densely connected graph database. XML result trees are then defined as XML tree views by projections from this database.
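As a rough illustration of tree views over an edge-labeled graph, the following Python sketch stores labeled edges in a dictionary and projects a tree from a chosen root by following only selected edge labels. It is not the XTreeGraph as implemented in LoPiX: the EDGES layout, the project function, and all node and label names are assumptions, and the sketch assumes that the kept edges contain no cycles.

```python
# Hypothetical edge-labeled graph: node id -> list of (edge label, target).
# Targets without outgoing edges act as literals.
EDGES = {
    "world":  [("country", "de")],
    "de":     [("name", "Germany"), ("capital", "berlin")],
    "orgs":   [("organization", "org1")],
    "org1":   [("name", "Example Org"), ("seat", "berlin")],
    "berlin": [("name", "Berlin")],
}

def project(node, keep):
    """Return (node, [(label, subtree), ...]), following only labels in `keep`.

    Assumes the kept edges are acyclic; a shared node is simply visited
    once per tree view it appears in.
    """
    children = [
        (label, project(target, keep))
        for label, target in EDGES.get(node, [])
        if label in keep
    ]
    return (node, children)

# Two overlapping tree views that both contain the single "berlin" node.
geo_view = project("world", {"country", "name", "capital"})
org_view = project("orgs", {"organization", "name", "seat"})
print(geo_view)
print(org_view)
```

The node "berlin" is stored once in the graph but appears in both result trees, which is the effect that a tree-only model can achieve only by copying.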

Thus, the "pure" XPathLog language provides an intuitive Datalog-style extension of XPath for querying, data manipulation, and data integration which is very close to the standard syntax and semantics of XPath. Extended features of the language provide a class hierarchy (including nonmonotonic inheritance), a lightweight signature formalism (which is also used for defining tree views from the XTreeGraph), and data-driven Web access to extend the database with further XML documents. The expressiveness and flexibility of the full language make it a candidate for the combined handling of data, schema metadata, and semantic metadata such as ontologies.

Slides: [postscript] [pdf]