The Tukwila Data Integration System

A data integration system is an automated method for querying across multiple heterogeneous databases in a uniform way. In essence, a mediated schema is created to represent a particular application domain and data sources are mapped as views over the mediated schema. The user asks a query over the mediated schema and the data integration system reformulates this into a query over the data sources and executes it. The system then intelligently processes the query, reading data across the network and responding to data source sizes, network conditions, and other factors.

Query Processing Challenges: Query processing in data integration occurs over network-bound, autonomous data sources ranging from conventional databases on the LAN or intranet to web-based sources across the Internet. The Tukwila data integration system is designed to scale up to the amounts of data transmissible across intranets and the Internet (tens to hundreds of MBs), with large numbers of data sources. This requires extensions to traditional optimization and execution techniques for three reasons: there is an absence of quality statistics about the data, data transfer rates are unpredictable and bursty, and slow or unavailable data sources can often be replaced by overlapping or mirrored sources. Additionally, the natural data format for data integration is XML -- which creates new challenges in query processing and optimization. (See the discussion of the x-scan operator for more information.)

The Tukwila Solution: The Tukwila data integration system can make use of a mediated schema by taking the original query and applying a highly efficient query reformulation algorithm, MiniCon, which maps the input query from the mediated schema to the data sources. Next, the query processing system takes a three-pronged approach to providing efficient query answering: (1) use of operators with flexible scheduling policies, to mask latencies, (2) overlapping of query operations using pipelining, even for XML data, with the x-scan operator, and (3) convergent query processing, which allows the system to choose a query plan, execute it for some amount of time, then revise statistics and cost estimates and generate an improved plan -- all in mid-stream. Experiments have shown that the Tukwila approach results in superior performance versus traditional techniques.

Our architecture extends previous innovations in adaptive execution (such as query scrambling, mid-execution re-optimization, eddies, and choose nodes), and we have experimentally demonstrated the benefits of our system (see the papers below for more information).

Recent Work

The recent focus of the Tukwila project has been in three areas:

Fully integrated support for efficient processing of XML data, based on the x-scan operator. XML provides a common encoding for data from many different sources; combined with standardization of schemas (DTDs) across certain domains, it greatly reduces the needs for wrappers and even query reformulation. The latest versions of Tukwila are built around an adaptive query processing architecture for XML, and can seamlessly combine XML and relational data into new XML content.

Support for output of incremental results in a meaningful fashion. For web-based applications, we would like to see output incrementally, and as early as possible. (This is in contrast to the typical batch-oriented paradigm, in which overall query completion time is key.)

Providing "preview" answers to queries. In many cases, it may be useful to quickly return some sort of approximate or partial answers during a long-running query, so the user gets early feedback. We have developed a general framework for expressing the semantics of preview results, as well as optimization and execution strategies for "preview queries."

Development of techniques and algorithms for "convergent query processing." Our convergent query processing framework allows the system to continuously re-optimize a query at any point as it gets improved information, for fairly complex classes of queries (including grouping, sorting, etc.).

Project Members

Professor Alon Halevy
Professor Dan Weld
Zack Ives is responsible for the Tukwila execution engine and its adaptive operation, as well as the optimizer for previewing query results. His work largely relates to adaptive query processing, processing of XML data, XML and zero-knowledge query optimization, and XML query languages.

Project Alumni

Hakim Weatherspoon worked on a web wrapper toolkit. Hakim is now a graduate student at U.C. Berkeley.
Marc Friedman constructed the original optimizer for the Tukwila system. He recently graduated and joined the Viathan Corporation.
Daniela Florescu, INRIA (France) advised during the original Tukwila design stages. She is now at Propel Software..
Rachel Pottinger has developed the Tukwila query reformulator, which uses a new query reformulation algorithm, MiniCon, that is significantly more efficient than previous approaches. Rachel's current work involves the construction of systems for managing models (e.g. schemas).

Web Pages

The x-scan operator, a new primitive for XML data integration
The MiniCon query reformulation algorithm
High-level system overview
Tukwila query execution engine/optimizer interface specifications (UPDATED for XML extensions)
Spring 1999 CSE 544, using Tukwila as the core of the class project

Presentations

Zack Ives' 1998 Qualifying Exam presentation on Adaptive Query Execution (PowerPoint slides)
1999 UW CS&E Industrial Affiliates presentation (PowerPoint slides)
1999 SIGMOD presentation

Publications

Zachary G. Ives, Alon Y. Halevy, Daniel S. Weld. Convergent Query Processing. Submitted for publication, 2001.
Zachary G. Ives, Alon Y. Halevy, Daniel S. Weld. An XML Query Engine for Network-Bound Data. Submitted for publication, 2001.
Rachel Pottinger, Alon Levy. A Scalable Algorithm for Answering Queries Using Views. VLDB 2000, Cairo, Egypt.
Zachary G. Ives, Alon Y. Levy, Daniel S. Weld, Daniela Florescu, Marc Friedman. Adaptive Query Processing for Internet Applications. IEEE Data Engineering Bulletin, Vol. 23 No. 2, June 2000.
Zachary G. Ives, Alon Y. Levy, Daniel S. Weld. Efficient Evaluation of Regular Path Expressions on Streaming XML Data. Technical Report UW-CSE-2000-05-02, University of Washington.
Zachary G. Ives, Daniela Florescu, Marc Friedman, Alon Levy, Daniel S. Weld. An Adaptive Query Execution System for Data Integration. ACM SIGMOD Conference on Management of Data, Philadelphia, PA, June 1-3, 1999.

This research has been funded in part by NSF grant 9978567.