X-Scan: a Foundation for XML Data Integration

The adoption of XML promises to accelerate construction of systems that integrate distributed, heterogeneous data. Query languages for XML are typically based on regular path expressions that traverse the logical XML graph structure; the efficient evaluation of such path expressions is central to good query processing performance. Most existing XML query processing systems convert XML documents to an internal representation, generally a set of tables or objects; path expressions are evaluated using either index structures or join operations across the tables or objects. Unfortunately, the required index creation or join operations are often costly even with locally stored data, and they are especially expensive in the data integration domain, where the system reads data streamed from remote sources across a network, and seldom reuses results for subsequent queries.

We propose the x-scan operator, which efficiently processes non-materialized XML data as it is being received by the data integration system. X-scan matches regular path expression patterns from the query, returning results in pipelined fashion as the data streams across the network. Building from x-scan, we have constructed a new version of the Tukwila Data Integration System that efficiently integrates streaming XML and relational data, combining the results seamlessly into new XML content.

Background

The eXtensible Markup Language provides a common format for data across the network, and is being supported by nearly every data management tool. XML is based on nested open- and close-tags, or elements, that may have attributes. XML can be represented as a graph, as is shown below with a sample XML document and its equivalent data graph.



Sample XML data graph

We query XML data by selecting nodes from the graph through regular path expressions, and then manipulate those nodes. The query below, written in the XML-QL language, finds and returns all lab names and cities.

WHERE <db>
       <lab>
        <name>$n</>
        <_*><city>$c</></>
       </> ELEMENT_AS $l
      </> IN "fig1.xml"

...

which can be expressed equivalently as a combination of 3 regular path expressions (with the latter two dependent on the first):

Regular path expressions

and which selects the following nodes in the graph:

Graph with query nodes highlighted

Challenges

To date, most efforts to build XML query processors, and to evaluate regular path expressions, have been based on first loading the data into a local repository, building indexes on the repository, and then processing the query. The approaches differ on whether the repository is a relational database, an object-oriented database or a repository for semi-structured data.

In many applications involving XML, however, we must be able to process queries over streams of incoming XML data, without having the luxury of first loading the data into a local repository. In particular, data integration applications often involve processing data over sources on a wide-area network whose contents change continuously, and hence storing the data locally is not a viable approach. Furthermore, it is imperative that we produce results incrementally as the data streams into the system, since queries are usually ad-hoc and interactive.

The x-scan solution

X-scan is a new method for evaluating path expressions as data is streaming into the system. The input to x-scan is an XML data stream and a set of regular path expressions occurring in a query; x-scan's output is a stream of bindings for the variables occuring in the expressions. A key feature of x-scan is that it produces these bindings incrementally, as the XML data is streaming in; hence, x-scan fits naturally as the source operator to a complex pipeline, and it is highly suited for data integration applications.

X-scan is motivated by the observation that IDREF links are limited to the scope of the current document, so in principle, the entire XML query graph for a document could be constructed in a single pass. X-scan achieves this by simultaneously parsing the XML data, indexing nodes by their IDs, resolving IDREFs, and returning the nodes that match the path expressions of the query. The path expression matching algorithm is based on finite state machines, as shown below.

The machines are run during XML document parsing.

In addition to the path expression evaluation routines, x-scan includes the following functionality:

Parsing the XML document:
X-scan uses the IBM XML4C parser

Node ID recording and reference resolving:
- XML elements may have ID attributes
- Other elements can reference these
- X-scan resolves all references during parsing

Creating a graph-structured index of the file:
- XML maps to a graph, but is stored as a tree
- X-scan builds an index of the links in the graph

Returning tuples of node locations:
- The output of x-scan is in the form of tuples
- In our example, we have (l, n, c) tuples for each applicable combination of lab, name, and city
- These tuples are used to construct query results

Project Members

Presentations

Publications