Example - The GCX Engine in Action
We demonstrate the two key principles of the GCX query engine, namely document projection and active garbage collection by an example. The following query selects, for each department in which at least one employee has a salary specified, the average salary of the department.
Figure 1.1. XQuery input query
In the following, we discuss the processing strategy of GCX when evaluation this query against the XML document in Figure 1.2 below.
Figure 1.2. XML (input) document and XML (input) document tree
Static Analysis
The first phase of the GCX query evaluation strategy is a static analysis, which consists of several steps. First, the query is normalized. For instance, the where-clause in our example query is rewritten into an equivalent if-then-else expression and the XPath expression in the aggregate function fn:avg is rewritten into an equivalent for-clause. Rewriting the XPath expression in aggregate functions to an equivalent for-clause is important insofar as (active) garbage collection will be triggered more often, thus freeing occupied main memory resources early on. We depict the rewritten query in Figure 1.3 below.
Figure 1.3. XQuery rewritten input query
In the next step, from the rewritten (input) query a variable and a projection tree are derived.
Generally spoken, a variable tree of a query summarizes the parent-child relationships between variables.
Every node in a variable tree reflects a variable that has been defined through a for-clause and
an edge between two nodes of a variable tree is drawn if one variable is "a parent" of the other one, which means that the XPath expression (or the "iteration base") of the variable has a previously defined variable (i.e., its parent variable) at the beginning of its XPath expression. The variable tree for our running example query is shown in Figure 1.4 at the left. Note that we introduce a virtual $root variable, which captures absolute paths in the query and is assumed to be implicitly bound to the (virtual) XML document root.
In the next step, we derive a so-called projection tree from the variable tree.
A projection tree defines those parts of the input XML document that are relevant to query evaluation and therefore must be copied into the buffer during query evaluation.
To obtain a projection tree from a variable tree, the latter is extended according to the XPath expressions occurring in the query.
More precisely, for each XPath expression starting with a variable, say $x, we create a child node for the variable tree node of $x, labeled
with a corresponding XPath expression.
In our example we add the two nodes n4 and n5 to the projection tree (see Figure 1.4 at the right).
Node n5 is appended due to the existence check for $dep/employee
/salary
/text() in the input query: the node encodes that,
in order to check if the existence condition holds,
it is necessary to keep the first employee with salary in the buffer (a second employee will not affect the result of the existence check,
hence it suffices to store the nodes at position()=1).
Concerning node n4, we keep all descendant nodes
(using the XPath expression /dos::node(), where dos stands as a shortcut for descendant-or-self()); this is necessary to guarantee the correct evaluation of the
innermost for-loop (i.e., the one encapsulated in the fn:avg function).
Having added these additional nodes, the root node "$root" of the variable tree is labeled "/" (which denotes that all following XPath expressions are absolute) and all remaining nodes of the variable tree are labeled with the corresponding XPath expression of their introducing for-clause.
In a final step of projection tree construction, GCX performs a left-to-right traversal of the projection tree and assigns unique roles (here, roles r0-r4) to the projection tree nodes.
Intuitively, these roles associate each node in the projection tree with a unique subexpression in the query.
Hence, the XPath expressions in the projection tree will help us to identify those parts of the input XML document that are relevant to query evaluation and the roles capture the relevance of input document nodes for a specific XQuery subexpression.
All XML input document nodes that are matched by an XPath expression of the projection tree (or a prefix thereof), such as /company
or /company
/department
, are relevant to query evaluation. In particular, nodes matching the whole expression /company
/department
will be assigned role "r1" (i.e. the role associated with the respective node in the projection tree) and will be stored in the main memory buffer.
It is important to note that not their complete subtrees are relevant, but only nodes matching /employee
/salary
,
/employee
/salary
/dos::node(), and
/employee
/salary
/text() need to be buffered (with roles "r2", "r3", and "r4", respectively).
Figure 1.4. Variable tree (left) and Projection tree (right)
While the idea of XML document projection has been proposed before (e.g. in [1]), our novel concept of assigning roles to buffered nodes allows us to dynamically purge nodes from the main memory buffer. At runtime, roles are removed from the buffered nodes. We say that roles are "signed off". The query evaluator signals this loss of roles to the buffer by issuing so-called "signOff" statements. To this aim, the query is statically rewritten and signOff statements are inserted for all roles in the projection tree. The resulting query is shown below.
Figure 1.5. Rewritten input query with signOff statements
Dynamic Analysis
When static analysis has finished, query evaluation starts. As we will see soon, our "signOff"-mechanism implements a dynamic analysis, which allows us to successively identify buffered parts of the input XML document that have become irrelevant to (further) query evaluation and thus can be purged from the buffer early on. Let us now discuss a step-by-step evaluation of the input query with signOff statements (Figure 1.5) against the XML document (Figure 1.2). Initially, the buffer contains only the (virtual) XML document root. Note that this buffer node matches the top-level node of the projection tree and therefore gets assigned role r0.
Input Stream:
Output Stream: <r>
Buffer Content:
Figure 2.1. Buffer Content
According to the semantics of the input query, the query evaluator first outputs the opening tag <r>
and tries to evaluate the for-clause introducing variable $dep (we denote this for-loop as for$dep in the following).
There are no nodes in the buffer, so the query evaluator request new input from the document projector.
The document projector starts reading the input XML document and,
whenever a token matches a (prefix) path in the projection tree, appends it to the buffer, together with associated role(s).
Input Stream: <company>
Output Stream:
Buffer Content:
Figure 2.2. Buffer Content
The document projector first reads the XML opening tag <company>
, which matches the first path step of projection tree node n2 (see Figure 1.4)
and therefore is appended to the buffer.
Input Stream: <department>
Output Stream:
Buffer Content:
Figure 2.3. Buffer Content
Next, the document projector reads the opening tag <department>
.
This token now matches the full XPath expression /company
/department
of projection
tree node n2, and hence is stored, together with the corresponding role r1.
Now the query evaluator binds variable $dep to this node, since data for this evaluation step has become available.
The query evaluator then tries to evaluate the if-condition, which is not yet possible, because XML data is missing.
Consequently, the query evaluator requests further input from the document projector.
Input Stream: <employee>
Output Stream:
Buffer Content:
Figure 2.4. Buffer Content
The document projector reads the opening tag <employee>
and stores it (the token matches prefix paths of both projection tree node n3 and n5).
Input Stream: <name>
John Doe
</name>
and <job>
Software Analyst
</job>
Output Stream:
Buffer Content:
Figure 2.5. Buffer Content
The document projector reads the opening tag <name>
, the element text "John Doe" and the closing tag </name>
(we consider this reading operation as a whole because nothing important happens) and discards them, because they are not relevant to query evaluation according to the projection tree (i.e., they are not matched by a path or prefix path of any projection tree node).
The same happens for the token sequence <job>
Software Analyst
</job>
.
Input Stream: <salary>
2000
</salary>
Output Stream: <salary_avg>
Buffer Content:
Figure 2.6. Buffer Content
The document projector reads the opening tag <salary>
, the element text "2000" and the closing tag </salary>
(we consider this reading operation as a whole because the if-condition cannot be evaluated earlier) and stores it assigning role r2 and r3 for the element node and r3 and r4 for the text node (they match the corresponding projection tree nodes, respectively).
The query evaluator evaluates the if-condition (to "true"), enters the then-part, and outputs the opening tag <salary_avg>
.
The next step is the evaluation of the for-loop for$_1 inside aggregate function fn:avg.
This for-loop extracts the text value '2000' (and internally passes it to the aggregate function, which stores it and waits for further input).
Input Stream: <employee>
Output Stream:
Buffer Content:
Figure 2.7. Buffer Content
Next, the query evaluator evaluates the two signOff statements that are encapsulated in function fn:avg.
The effect of this evaluation step is that nodes salary
and the corresponding text node
'2000' lose roles r3 (according to the first signOff statement) and node salary
loses
role r2 (according to the second signOff statement).
As the evaluation of the for$_1-clause is not yet finished, the query evaluator requests further input from the document projector.
In response, the document projector reads the opening tag <employee>
and stores it.
Input Stream: <name>
Jane Fletcher
</name>
and <job>
Designer
</job>
Output Stream:
Buffer Content:
Figure 2.8. Buffer Content
The document projector reads the opening tag <name>
, the element text "Jane Fletcher" and the closing tag </name>
(we consider this reading operation as a whole because nothing important happens) and discards it because it is not relevant to query evaluation.
The same happens for the token sequence <job>
Designer
</job>
.
Input Stream: <salary>
2500
</salary>
Output Stream:
Buffer Content:
Figure 2.9. Buffer Content
The document projector reads the opening tag <salary>
and stores it, assigning roles r2 and r3.
Now the query evaluator binds the variable $_1 a second time in the for$_1-loop, i.e. to the second salary
node in the buffer.
The data for computing the aggregate function fn:avg is not yet available in the buffer, so the query evaluator requests new data from the document projector.
The document projector then reads the text element '2500' and the closing tag </salary>
and stores them with the corresponding roles.
The text value '2500' is then passed to the aggregate function
fn:avg.
Input Stream: </employee>
</department>
Output Stream:
2250</salary_avg>
Buffer Content:
Figure 2.10. Buffer Content
Again, the query evaluator evaluates the two signOff statements that are encapsulated in the aggregate function fn:avg.
Now, the right salary
node and the corresponding text node
'2500' lose role r3 (according to the first signOff statement) and the right salary
node loses
role r2 (according to the second signOff statement). After this step, the query evaluator requests new tokens and receives the closing
tags </employee>
and </department>
.
This indicates that the evaluation of the for$_1-loop has finished for the current "department", so the query evaluator computes the result of the aggregate function and
outputs 2250</salary_avg>
.
Input Stream: <department>
Output Stream: </r>
Buffer Content:
Figure 2.11. Buffer Content
Subsequently, the query evaluator evaluates the coming signOff statements, which removes role r4 from the text nodes '2000' and '2500'
and role r1 from the department
node.
Now (active) garbage collection is triggered. It removes all nodes in the buffer that (1) do not carry a role and (2) do not contain any descendant nodes
carrying a role and (3) whose closing tag has been read.
This condition is true for all buffered nodes except the company
node, which violates condition (3) and therefore remains in the buffer.
Once the buffer has been cleaned, the document projector reads the opening tag <department>
(the second department in the XML data set) and stores it assigning role r1.
As the "employee" node of this "department" node does not contain any information, the if-condition will evaluate to "false" and no output will be created for this second department.
We skip the details.
After evaluating the for$_1-loop a second time, all buffered nodes will have lost their roles (and will have been removed) except for the root node.
Finally, after outputing the closing tag </r>
, the last signOff statements in the query is executed, which removes role r0 from the (virtual) root node of the XML document tree and fully clears the buffer.
After this step, the buffer is empty and the query has been (succesfully) evaluated.
By an interleaved interplay between XML document projection, query evaluation, and garbage collection,
GCX achieves a streaming XQuery evaluation, where buffers are continuously filled and purged based on the current state of query evaluation.
Our benchmark results demonstrate both the efficiency and effectiveness of this approach.
References
- [1] A. Marian, J. Siméon: Projecting XML Documents, In Proc. VLDB ‘03, pages 213-224
- [2] Christoph Koch: On the complexity of nonrecursive XQuery and functional query languages on complex values, ACM Transactions on Database Systems, 31(4), 2006
- [3] M. Schmidt, S. Scherzinger, C. Koch: Combined Static and Dynamic Analysis for Effective Buffer Minimization in Streaming XQuery Evaluation, In Proc. ICDE 2007 [.pdf]
- [4] XQuery 1.0: An XML Query Language (W3C)
Last updated: 2009-11-11