The Tukwila Data Integration System

A data integration system is an automated means of querying multiple heterogeneous databases in a uniform way.  In essence, a mediated schema is created to represent a particular application domain, and data sources are mapped as views over that mediated schema.  The user poses a query over the mediated schema, and the system reformulates it into a query over the data sources and then executes it intelligently, reading data across the network and adapting to data source sizes, network conditions, and other factors.
[Figure: Example data integration application]

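For illustration, here is a minimal Python sketch of the views-over-a-mediated-schema idea, using a hypothetical movie domain; the mediated relation, source names, and attribute coverage are invented for the example, and real view definitions and reformulation algorithms are considerably richer than this.

    # Mediated schema: a single logical relation that users query.
    MEDIATED_RELATION = "Movie(title, year, director)"

    # Each data source is described as a view over the mediated schema;
    # here a view is reduced to the set of attributes the source exports.
    SOURCES = {
        "imdb_wrapper":    {"covers": {"title", "year", "director"}},
        "tvguide_wrapper": {"covers": {"title", "year"}},
    }

    def reformulate(requested_attrs):
        # Return the sources whose views can answer a query over the
        # mediated schema that asks for `requested_attrs`.
        return [name for name, view in SOURCES.items()
                if requested_attrs <= view["covers"]]

    # A user query such as "SELECT title, director FROM Movie" over the
    # mediated schema is reformulated into queries on matching sources.
    print(reformulate({"title", "director"}))   # -> ['imdb_wrapper']
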
Query Processing Challenges

Query processing in data integration occurs over network-bound, autonomous data sources ranging from conventional databases on the LAN or intranet to web-based sources across the Internet.  The Tukwila data integration system is designed to scale up to the amounts of data transmissible across intranets and the Internet (tens to hundreds of MBs), with large numbers of data sources.  This requires extensions to traditional optimization and execution techniques for three reasons: there is an absence of quality statistics about the data, data transfer rates are unpredictable and bursty, and slow or unavailable data sources can often be replaced by overlapping or mirrored sources.  Additionally, the natural data format for data integration is XML -- which creates new challenges in query processing and optimization.  (See the discussion of the x-scan operator for more information.)
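As a rough illustration of the streaming idea behind an operator like x-scan (this is not its actual algorithm), the following Python sketch emits tuples as soon as matching elements have been parsed, rather than materializing the whole XML document first; the sample document and field names are invented for the example.

    import io
    import xml.etree.ElementTree as ET

    SAMPLE = """<catalog>
      <movie><title>Casablanca</title><year>1942</year></movie>
      <movie><title>Metropolis</title><year>1927</year></movie>
    </catalog>"""

    def stream_tuples(xml_file, element="movie", fields=("title", "year")):
        # Emit a tuple as soon as each matching element has been parsed,
        # instead of loading and materializing the whole document first.
        for event, elem in ET.iterparse(xml_file, events=("end",)):
            if elem.tag == element:
                yield tuple(elem.findtext(f) for f in fields)
                elem.clear()   # discard the parsed subtree to keep memory bounded

    for row in stream_tuples(io.StringIO(SAMPLE)):
        print(row)   # ('Casablanca', '1942'), then ('Metropolis', '1927')
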
The Tukwila Solution

[Figure: Tukwila system diagram]

Tukwila answers a query over the mediated schema by first applying a highly efficient query reformulation algorithm, MiniCon, which maps the input query from the mediated schema to the data sources.  Next, the query processing system takes a three-pronged approach to efficient query answering:  (1) operators with flexible scheduling policies, which mask latencies; (2) overlapping of query operations using pipelining, even for XML data, via the x-scan operator; and (3) convergent query processing, which allows the system to choose a query plan, execute it for some amount of time, then revise its statistics and cost estimates and generate an improved plan -- all in mid-stream.  Experiments have shown that this approach outperforms traditional techniques.
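As a concrete illustration of point (1), one classic operator with a flexible scheduling policy is a symmetric (pipelined) hash join, which builds a hash table on each input and produces output as tuples arrive from either side.  The Python sketch below shows the idea only; it is not Tukwila's implementation, and a real scheduler would consume from whichever source currently has data ready rather than simply alternating.

    from collections import defaultdict

    def symmetric_hash_join(left, right, key_left, key_right):
        # Pipelined join sketch: hash tables are built on *both* inputs,
        # and each arriving tuple immediately probes the opposite table,
        # so join results stream out before either input is finished.
        left_table, right_table = defaultdict(list), defaultdict(list)
        pending = [("L", iter(left)), ("R", iter(right))]
        while pending:
            # Round-robin between inputs; an adaptive scheduler would
            # instead pick whichever input currently has data available.
            side, source = pending.pop(0)
            try:
                tup = next(source)
            except StopIteration:
                continue                     # this input is exhausted
            pending.append((side, source))   # keep the input in rotation
            if side == "L":
                k = key_left(tup)
                left_table[k].append(tup)
                for match in right_table[k]:
                    yield tup + match
            else:
                k = key_right(tup)
                right_table[k].append(tup)
                for match in left_table[k]:
                    yield match + tup

    # Example: join movies with (hypothetical) review tuples on title.
    movies  = [("Casablanca", 1942), ("Metropolis", 1927)]
    reviews = [("Casablanca", 5), ("Metropolis", 4)]
    for row in symmetric_hash_join(movies, reviews, lambda t: t[0], lambda t: t[0]):
        print(row)   # ('Casablanca', 1942, 'Casablanca', 5), ...
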
Our architecture extends previous innovations in adaptive execution (such as query scrambling, mid-execution re-optimization, eddies, and choose nodes), and we have experimentally demonstrated the benefits of our system (see the papers below for more information).

Recent Work

The recent focus of the Tukwila project has been in four areas:

  • Fully integrated support for efficient processing of XML data, based on the x-scan operator.  XML provides a common encoding for data from many different sources; combined with standardization of schemas (DTDs) across certain domains, it greatly reduces the need for wrappers and even for query reformulation.  The latest versions of Tukwila are built around an adaptive query processing architecture for XML, and can seamlessly combine XML and relational data into new XML content.
  • Support for incremental output of results in a meaningful fashion.  For web-based applications, we would like results to appear incrementally, and as early as possible.  (This is in contrast to the typical batch-oriented paradigm, in which overall query completion time is what matters.)
  • Providing "preview" answers to queries.  In many cases, it may be useful to quickly return some sort of approximate or partial answers during a long-running query, so the user gets early feedback.  We have developed a general framework for expressing the semantics of preview results, as well as optimization and execution strategies for "preview queries." 
  • Development of techniques and algorithms for "convergent query processing."  This framework allows the system to re-optimize a query at any point during execution as it obtains improved information, even for fairly complex classes of queries (including grouping, sorting, etc.); a sketch of the basic loop appears after this list.
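
The basic convergent processing loop can be sketched as follows in Python; `optimize` and `run_quantum` are hypothetical stand-ins for the optimizer and execution engine, not Tukwila's actual API, and a real implementation must also carry pipeline state and partial results across plan switches.

    def convergent_execute(query, optimize, run_quantum, initial_stats):
        # Sketch of convergent query processing: pick a plan, run it for
        # one quantum, fold the observed statistics back into the cost
        # model, and re-plan -- without discarding answers already produced.
        answers, stats, state = [], initial_stats, None
        while True:
            plan = optimize(query, stats, state)         # (re)plan with current estimates
            batch, state, observed, done = run_quantum(plan)
            answers.extend(batch)                         # keep results produced so far
            if done:
                return answers
            stats = observed                              # better estimates for the next plan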

Project Members

Project Alumni

Web Pages

Presentations

Publications

This research has been funded in part by NSF grant 9978567.