The Tukwila Data Integration System
- A data integration system is an automated method for querying across
multiple heterogeneous databases in a uniform way. In essence, a
mediated schema is created to represent a particular application domain,
and data sources are mapped as views over the mediated schema. The user
poses a query over the mediated schema, and the data integration system
reformulates this into a query over the data sources and executes it. The
system then intelligently processes the query, reading data across the
network and responding to data source sizes, network conditions, and other
runtime factors.
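The mediated-schema idea can be sketched in a few lines. This is an illustrative toy, not Tukwila code: the relation names, sources, and data below are invented, and real source mappings (and the MiniCon algorithm discussed later) are far more general than this one-relation-per-source rewriting.

```python
# Toy mediated schema: two application-domain relations.
MEDIATED_SCHEMA = {"Book": ("title", "author", "year"),
                   "Forecast": ("city", "temp")}

# Each source is registered as a (trivial) view over one mediated relation.
SOURCE_VIEWS = {"amazon": "Book", "preprints": "Book", "noaa": "Forecast"}

# Toy extensions for each source; in reality these arrive over the network.
DATA = {
    "amazon": [{"title": "A", "author": "X", "year": 1995},
               {"title": "B", "author": "Y", "year": 2001}],
    "preprints": [{"title": "C", "author": "Z", "year": 2003}],
    "noaa": [{"city": "Seattle", "temp": 55}],
}

def reformulate(relation):
    """Rewrite a mediated-schema query into the set of sources whose
    views cover the queried relation."""
    return [name for name, rel in SOURCE_VIEWS.items() if rel == relation]

def execute(relation, predicate):
    """Run the reformulated query: scan each relevant source, apply the
    user's selection, and union the answers."""
    return [t for name in reformulate(relation)
            for t in DATA[name] if predicate(t)]

# The user asks over the mediated schema, never naming a source directly.
recent = execute("Book", lambda t: t["year"] >= 2001)
```

The user's query mentions only the mediated relation `Book`; the system decides which sources are relevant and unions their answers.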
- Query Processing Challenges
Query processing in data integration occurs over network-bound, autonomous
data sources ranging from conventional databases on the LAN or intranet to
web-based sources across the Internet. The Tukwila data integration system
is designed to scale up to the amounts of data transmissible across
intranets and the Internet (tens to hundreds of MBs), with large numbers of
data sources. This requires extensions to traditional optimization and
execution techniques for three reasons: there is an absence of quality
statistics about the data, data transfer rates are unpredictable and bursty,
and slow or unavailable data sources can often be replaced by overlapping or
mirrored sources. Additionally, the natural data format for data
integration is XML -- which creates new challenges in query processing and
optimization. (See the discussion of the
x-scan operator for more information.)
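The idea behind x-scan can be sketched as follows. This is a simplified illustration under assumed semantics, not the real operator: it matches only a fixed child path (the actual operator handles regular path expressions via finite automata), and all element names and input below are invented.

```python
# Sketch of x-scan's core idea: evaluate a path expression over *streaming*
# XML by running a matcher against parser events, emitting bindings as soon
# as matching nodes arrive, rather than building the whole document first.
import xml.etree.ElementTree as ET

def x_scan(xml_chunks, path):
    """Stream XML text chunks through a pull parser and yield the text of
    every element reached by `path` (a '/'-separated child path starting
    at the root), as soon as its end tag is seen."""
    steps = path.split("/")
    parser = ET.XMLPullParser(events=("start", "end"))
    depth = 0    # current element depth (root = 1)
    matched = 0  # how many leading path steps the current branch matches
    for chunk in xml_chunks:
        parser.feed(chunk)
        for event, elem in parser.read_events():
            if event == "start":
                depth += 1
                if (depth == matched + 1 and matched < len(steps)
                        and elem.tag == steps[matched]):
                    matched += 1
            else:  # "end"
                if depth == matched and matched == len(steps):
                    yield elem.text  # full path matched: emit a binding
                if depth == matched:
                    matched -= 1
                depth -= 1

# Data arrives in arbitrary network-sized chunks; results stream out early.
doc = ['<bib><book><title>Tukwila', ' paper</title></book>',
       '<book><title>MiniCon paper</title></book></bib>']
titles = list(x_scan(doc, "bib/book/title"))
```

Note that the first title is produced before the second `book` element has even arrived, which is the property that makes pipelined XML processing possible.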
- The Tukwila Solution
The Tukwila data integration system can make use of a mediated schema by
taking the original query and applying a highly efficient query
reformulation algorithm, MiniCon, which maps the input query from the mediated schema to the data
sources. Next, the query processing system takes a three-pronged approach
to providing efficient query answering: (1) use of operators with
flexible scheduling policies, to mask latencies, (2) overlapping of
query operations using pipelining, even for XML data, with the
x-scan operator, and (3) convergent query processing, which
allows the system to choose a query plan, execute it for some amount of
time, then revise statistics and cost estimates and generate an improved
plan -- all in mid-stream. Experiments have shown that the Tukwila approach
results in superior performance versus traditional techniques.
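Points (1) and (2) can be illustrated with a pipelined symmetric hash join, a sketch of the kind of operator involved rather than Tukwila's actual engine code (the relation names and round-robin scheduling below are illustrative assumptions):

```python
# Pipelined symmetric hash join: both inputs are hashed as tuples arrive,
# each new tuple probes the other side's table, and join results stream
# out immediately -- no input has to finish before answers appear.
from collections import defaultdict
from itertools import zip_longest

def symmetric_hash_join(left, right, key):
    """Join two iterables of dicts on attribute `key`. Reads alternate
    between inputs; a real engine would read whichever source has data
    ready, masking network latency on the other."""
    tables = (defaultdict(list), defaultdict(list))
    for l_tup, r_tup in zip_longest(left, right):
        for side, tup in ((0, l_tup), (1, r_tup)):
            if tup is None:          # this input is exhausted
                continue
            k = tup[key]
            tables[side][k].append(tup)          # build own side
            for other in tables[1 - side][k]:    # probe opposite side
                l, r = (tup, other) if side == 0 else (other, tup)
                yield {**l, **r}

orders = [{"id": 1, "item": "disk"}, {"id": 2, "item": "cpu"}]
custs = [{"id": 2, "name": "Ada"}, {"id": 1, "name": "Alan"}]
joined = list(symmetric_hash_join(orders, custs, "id"))
```

Because neither input is "inner" or "outer", a stall on one source never blocks results derivable from the other, and the first join tuples reach the user long before either input is fully read.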
- Our architecture extends previous innovations in adaptive execution
(such as query scrambling, mid-execution re-optimization, eddies, and choose
nodes), and we have experimentally demonstrated the benefits of our system
(see the papers below for more information).
The recent focus of the Tukwila project has been in several areas:
- Fully integrated support for efficient processing of XML data, based
on the x-scan operator. XML provides a common encoding for data from many
different sources; combined with standardization of schemas (DTDs) across
certain domains, it greatly reduces the needs for wrappers and even query
reformulation. The latest versions of Tukwila are built around an
adaptive query processing architecture for XML, and can seamlessly combine
XML and relational data into new XML content.
- Support for output of incremental results in a meaningful fashion.
For web-based applications, we would like to see output incrementally, and
as early as possible. (This is in contrast to the typical batch-oriented
paradigm, in which overall query completion time is key.)
- Providing "preview" answers to queries. In many cases, it may be
useful to quickly return some sort of approximate or partial answers
during a long-running query, so the user gets early feedback. We have
developed a general framework for expressing the semantics of preview
results, as well as optimization and execution strategies for "preview" queries.
- Development of techniques and algorithms for "convergent query
processing." Our convergent query processing framework allows the system
to continuously re-optimize a query at any point as it gets improved
information, for fairly complex classes of queries (including grouping).
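The convergent-processing loop can be sketched as follows. This is an illustrative toy, not Tukwila's optimizer: the "plan" is merely a predicate ordering, the predicates and estimates are invented, and a real engine would blend observed statistics rather than overwrite them.

```python
# Convergent query processing, in miniature: run the current plan on a
# batch of input, revise selectivity statistics from what was actually
# observed, re-optimize, and continue -- keeping all results produced
# so far rather than restarting the query.

def convergent_filter(tuples, predicates, estimates, batch_size=4):
    """Filter `tuples` through all `predicates`, re-choosing the predicate
    order between batches from observed selectivities. Returns the answers
    plus the sequence of plans used, to show mid-stream re-planning."""
    stats = dict(estimates)  # optimizer's initial (possibly wrong) guesses
    results, plans = [], []
    for i in range(0, len(tuples), batch_size):
        batch = tuples[i:i + batch_size]
        plan = sorted(predicates, key=stats.get)  # most selective first
        plans.append(tuple(plan))
        counts = {name: 0 for name in predicates}
        for t in batch:
            if all(predicates[name](t) for name in plan):
                results.append(t)
            for name in predicates:
                counts[name] += bool(predicates[name](t))
        for name in predicates:  # the "revise statistics" step
            stats[name] = counts[name] / len(batch)
    return results, plans

preds = {"even": lambda x: x % 2 == 0, "big": lambda x: x > 90}
# Initial estimates wrongly claim "even" is the more selective predicate;
# after one batch, observed statistics flip the plan mid-stream.
answers, plans = convergent_filter(list(range(12)), preds,
                                   {"even": 0.1, "big": 0.9})
```

After the first batch the system observes that `big` rejects everything while `even` passes half the tuples, so every subsequent batch runs under the corrected plan.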
- Zack Ives is responsible for the Tukwila execution engine and its
adaptive operation, as well as the optimizer for previewing query results.
His work largely relates to adaptive query processing, processing of XML
data, XML and zero-knowledge query optimization, and XML query languages.
- Hakim Weatherspoon worked on a web wrapper toolkit. Hakim is now a
graduate student.
- Marc Friedman constructed the original optimizer for the Tukwila
system. He recently graduated.
- Daniela Florescu (INRIA, France) advised during the original Tukwila
design stages.
- Rachel Pottinger has developed the Tukwila query reformulator, which
uses a new query reformulation algorithm, MiniCon, that is significantly more efficient than previous approaches.
Rachel's current work involves the construction of systems for managing
models (e.g. schemas).
- The x-scan operator, a new primitive for XML data integration
- The MiniCon query reformulation algorithm
- High-level system overview
- Tukwila query execution engine/optimizer
interface specifications (UPDATED for XML extensions)
- Spring 1999: CSE 544, using Tukwila as the core of the class project
- Zachary G. Ives, Alon Y. Halevy, Daniel S. Weld. Convergent Query
Processing. Submitted for publication, 2001.
- Zachary G. Ives, Alon Y. Halevy, Daniel S. Weld. An XML Query
Engine for Network-Bound Data. Submitted for publication, 2001.
- Rachel Pottinger, Alon Levy.
A Scalable Algorithm for Answering Queries Using Views. VLDB
2000, Cairo, Egypt.
- Zachary G. Ives, Alon Y. Levy, Daniel S. Weld, Daniela Florescu, Marc Friedman.
Adaptive Query Processing for Internet Applications. IEEE Data
Engineering Bulletin, Vol. 23 No. 2, June 2000.
- Zachary G. Ives, Alon Y. Levy, Daniel S. Weld.
Efficient Evaluation of Regular Path Expressions on Streaming XML Data.
Technical Report UW-CSE-2000-05-02, University of Washington.
- Zachary G. Ives, Daniela Florescu, Marc Friedman, Alon Levy, Daniel S. Weld.
An Adaptive Query Execution System for Data Integration. ACM
SIGMOD Conference on Management of Data, Philadelphia, PA, June 1-3, 1999.
This research has been funded in part by NSF grant 9978567.