We propose the x-scan operator, which efficiently processes non-materialized XML data as it is being received by the data integration system. X-scan matches regular path expression patterns from the query, returning results in pipelined fashion as the data streams across the network. Building from x-scan, we have constructed a new version of the Tukwila Data Integration System that efficiently integrates streaming XML and relational data, combining the results seamlessly into new XML content.
We query XML data by selecting nodes from the graph through regular path expressions, and then manipulate those nodes. The query below, written in the XML-QL language, finds and returns all lab names and cities.
WHERE <db>
<lab>
<name>$n</>
<_*><city>$c</></>
</> ELEMENT_AS $l
</> IN "fig1.xml"
...
which can be expressed equivalently as a combination of 3 regular path expressions (with the latter two dependent on the first):
and which selects the following nodes in the graph:
In many applications involving XML, however, we must be able to process queries over streams of incoming XML data, without having the luxury of first loading the data into a local repository. In particular, data integration applications often involve processing data over sources on a wide-area network whose contents change continuously, and hence storing the data locally is not a viable approach. Furthermore, it is imperative that we produce results incrementally as the data streams into the system, since queries are usually ad-hoc and interactive.
X-scan is motivated by the observation that IDREF links are limited to the scope of the current document, so in principle, the entire XML query graph for a document could be constructed in a single pass. X-scan achieves this by simultaneously parsing the XML data, indexing nodes by their IDs, resolving IDREFs, and returning the nodes that match the path expressions of the query. The path expression matching algorithm is based on finite state machines, as shown below.
The machines are run during XML document parsing.
In addition to the path expression evaluation routines, x-scan includes the following functionality:
Parsing the XML document:
X-scan uses the IBM XML4C parser
Node ID recording and reference resolving:
- XML elements may have ID attributes
- Other elements can reference these
- X-scan resolves all references during parsing
Creating a graph-structured index of the file:
- XML maps to a graph, but is stored as a tree
- X-scan builds an index of the links in the graph
Returning tuples of node locations:
- The output of x-scan is in the form of tuples
- In our example, we have (l, n, c) tuples for each applicable combination of
lab, name, and city
- These tuples are used to construct query results