The Tukwila Data Integration System
- A data integration system is an automated method for querying across
multiple heterogeneous databases in a uniform way. In essence, a
mediated schema is created to represent a particular application domain,
and data sources are mapped as views over the mediated schema. The user
poses a query over the mediated schema, and the data integration system
reformulates this into a query over the data sources and executes it. The
system then intelligently processes the query, reading data across the
network and responding to data source sizes, network conditions, and other
runtime factors.
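The mediated-schema idea can be sketched in a few lines. This is an illustrative toy, not Tukwila code: the relation names, sources, and data below are invented, and real source mappings (and the MiniCon algorithm discussed later) are far more general than this one-relation-per-source rewriting.

```python
# Toy mediated schema: two application-domain relations.
MEDIATED_SCHEMA = {"Book": ("title", "author", "year"),
                   "Forecast": ("city", "temp")}

# Each source is registered as a (trivial) view over one mediated relation.
SOURCE_VIEWS = {"amazon": "Book", "preprints": "Book", "noaa": "Forecast"}

# Toy extensions for each source; in reality these arrive over the network.
DATA = {
    "amazon": [{"title": "A", "author": "X", "year": 1995},
               {"title": "B", "author": "Y", "year": 2001}],
    "preprints": [{"title": "C", "author": "Z", "year": 2003}],
    "noaa": [{"city": "Seattle", "temp": 55}],
}

def reformulate(relation):
    """Rewrite a mediated-schema query into the set of sources whose
    views cover the queried relation."""
    return [name for name, rel in SOURCE_VIEWS.items() if rel == relation]

def execute(relation, predicate):
    """Run the reformulated query: scan each relevant source, apply the
    user's selection, and union the answers."""
    return [t for name in reformulate(relation)
            for t in DATA[name] if predicate(t)]

# The user asks over the mediated schema, never naming a source directly.
recent = execute("Book", lambda t: t["year"] >= 2001)
```

The user's query mentions only the mediated relation `Book`; the system decides which sources are relevant and unions their answers.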
- Query Processing Challenges
Query processing in data integration occurs over network-bound, autonomous
data sources ranging from conventional databases on the LAN or intranet to
web-based sources across the Internet. The Tukwila data integration system
is designed to scale up to the amounts of data transmissible across
intranets and the Internet (tens to hundreds of MBs), with large numbers of
data sources. This requires extensions to traditional optimization and
execution techniques for three reasons: there is an absence of quality
statistics about the data, data transfer rates are unpredictable and bursty,
and slow or unavailable data sources can often be replaced by overlapping or
mirrored sources. Additionally, the natural data format for data
integration is XML -- which creates new challenges in query processing and
optimization. (See the discussion of the
x-scan operator for more information.)
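The idea behind x-scan can be sketched as follows. This is a simplified illustration under assumed semantics, not the real operator: it matches only a fixed child path (the actual operator handles regular path expressions via finite automata), and all element names and input below are invented.

```python
# Sketch of x-scan's core idea: evaluate a path expression over *streaming*
# XML by running a matcher against parser events, emitting bindings as soon
# as matching nodes arrive, rather than building the whole document first.
import xml.etree.ElementTree as ET

def x_scan(xml_chunks, path):
    """Stream XML text chunks through a pull parser and yield the text of
    every element reached by `path` (a '/'-separated child path starting
    at the root), as soon as its end tag is seen."""
    steps = path.split("/")
    parser = ET.XMLPullParser(events=("start", "end"))
    depth = 0    # current element depth (root = 1)
    matched = 0  # how many leading path steps the current branch matches
    for chunk in xml_chunks:
        parser.feed(chunk)
        for event, elem in parser.read_events():
            if event == "start":
                depth += 1
                if (depth == matched + 1 and matched < len(steps)
                        and elem.tag == steps[matched]):
                    matched += 1
            else:  # "end"
                if depth == matched and matched == len(steps):
                    yield elem.text  # full path matched: emit a binding
                if depth == matched:
                    matched -= 1
                depth -= 1

# Data arrives in arbitrary network-sized chunks; results stream out early.
doc = ['<bib><book><title>Tukwila', ' paper</title></book>',
       '<book><title>MiniCon paper</title></book></bib>']
titles = list(x_scan(doc, "bib/book/title"))
```

Note that the first title is produced before the second `book` element has even arrived, which is the property that makes pipelined XML processing possible.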
- The Tukwila Solution
The Tukwila data integration system can make use of a mediated schema by
taking the original query and applying a highly efficient query
reformulation algorithm, MiniCon, which maps the input query from the mediated schema to the data
sources. Next, the query processing system takes a three-pronged approach
to providing efficient query answering: (1) use of operators with
flexible scheduling policies, to mask latencies, (2) overlapping of
query operations using pipelining, even for XML data, with the
x-scan operator, and (3) convergent query processing, which
allows the system to choose a query plan, execute it for some amount of
time, then revise statistics and cost estimates and generate an improved
plan -- all in mid-stream. Experiments have shown that the Tukwila approach
results in superior performance versus traditional techniques.
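Points (1) and (2) can be illustrated with a pipelined symmetric hash join, a sketch of the kind of operator involved rather than Tukwila's actual engine code (the relation names and round-robin scheduling below are illustrative assumptions):

```python
# Pipelined symmetric hash join: both inputs are hashed as tuples arrive,
# each new tuple probes the other side's table, and join results stream
# out immediately -- no input has to finish before answers appear.
from collections import defaultdict
from itertools import zip_longest

def symmetric_hash_join(left, right, key):
    """Join two iterables of dicts on attribute `key`. Reads alternate
    between inputs; a real engine would read whichever source has data
    ready, masking network latency on the other."""
    tables = (defaultdict(list), defaultdict(list))
    for l_tup, r_tup in zip_longest(left, right):
        for side, tup in ((0, l_tup), (1, r_tup)):
            if tup is None:          # this input is exhausted
                continue
            k = tup[key]
            tables[side][k].append(tup)          # build own side
            for other in tables[1 - side][k]:    # probe opposite side
                l, r = (tup, other) if side == 0 else (other, tup)
                yield {**l, **r}

orders = [{"id": 1, "item": "disk"}, {"id": 2, "item": "cpu"}]
custs = [{"id": 2, "name": "Ada"}, {"id": 1, "name": "Alan"}]
joined = list(symmetric_hash_join(orders, custs, "id"))
```

Because neither input is "inner" or "outer", a stall on one source never blocks results derivable from the other, and the first join tuples reach the user long before either input is fully read.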
- Our architecture extends previous innovations in adaptive execution
(such as query scrambling, mid-execution re-optimization, eddies, and choose
nodes), and we have experimentally demonstrated the benefits of our system
(see the papers below for more information).
The recent focus of the Tukwila project has been in several areas:
- Fully integrated support for efficient processing of XML data, based
on the x-scan operator. XML provides a common encoding for data from many
different sources; combined with standardization of schemas (DTDs) across
certain domains, it greatly reduces the needs for wrappers and even query
reformulation. The latest versions of Tukwila are built around an
adaptive query processing architecture for XML, and can seamlessly combine
XML and relational data into new XML content.
- Support for output of incremental results in a meaningful fashion.
For web-based applications, we would like to see output incrementally, and
as early as possible. (This is in contrast to the typical batch-oriented
paradigm, in which overall query completion time is key.)
- Providing "preview" answers to queries. In many cases, it may be
useful to quickly return some sort of approximate or partial answers
during a long-running query, so the user gets early feedback. We have
developed a general framework for expressing the semantics of preview
results, as well as optimization and execution strategies for "preview" queries.
- Development of techniques and algorithms for "convergent query
processing." Our convergent query processing framework allows the system
to continuously re-optimize a query at any point as it gets improved
information, for fairly complex classes of queries (including grouping).
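The convergent-processing loop can be sketched as follows. This is an illustrative toy, not Tukwila's optimizer: the "plan" is merely a predicate ordering, the predicates and estimates are invented, and a real engine would blend observed statistics rather than overwrite them.

```python
# Convergent query processing, in miniature: run the current plan on a
# batch of input, revise selectivity statistics from what was actually
# observed, re-optimize, and continue -- keeping all results produced
# so far rather than restarting the query.

def convergent_filter(tuples, predicates, estimates, batch_size=4):
    """Filter `tuples` through all `predicates`, re-choosing the predicate
    order between batches from observed selectivities. Returns the answers
    plus the sequence of plans used, to show mid-stream re-planning."""
    stats = dict(estimates)  # optimizer's initial (possibly wrong) guesses
    results, plans = [], []
    for i in range(0, len(tuples), batch_size):
        batch = tuples[i:i + batch_size]
        plan = sorted(predicates, key=stats.get)  # most selective first
        plans.append(tuple(plan))
        counts = {name: 0 for name in predicates}
        for t in batch:
            if all(predicates[name](t) for name in plan):
                results.append(t)
            for name in predicates:
                counts[name] += bool(predicates[name](t))
        for name in predicates:  # the "revise statistics" step
            stats[name] = counts[name] / len(batch)
    return results, plans

preds = {"even": lambda x: x % 2 == 0, "big": lambda x: x > 90}
# Initial estimates wrongly claim "even" is the more selective predicate;
# after one batch, observed statistics flip the plan mid-stream.
answers, plans = convergent_filter(list(range(12)), preds,
                                   {"even": 0.1, "big": 0.9})
```

After the first batch the system observes that `big` rejects everything while `even` passes half the tuples, so every subsequent batch runs under the corrected plan.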
- Zack Ives is responsible for the Tukwila execution engine and its
adaptive operation, as well as the optimizer for previewing query results.
His work largely relates to adaptive query processing, processing of XML
data, XML and zero-knowledge query optimization, and XML query languages.
- Hakim Weatherspoon worked on a web wrapper toolkit. Hakim is now a
graduate student.
- Marc Friedman constructed the original optimizer for the Tukwila
system. He recently graduated.
- Daniela Florescu (INRIA, France) advised during the original Tukwila
design stages.
- Rachel Pottinger has developed the Tukwila query reformulator, which
uses a new query reformulation algorithm, MiniCon, that is significantly more efficient than previous approaches.
Rachel's current work involves the construction of systems for managing
models (e.g. schemas).
- The x-scan operator, a new primitive for XML data integration
- The MiniCon query reformulation algorithm
- High-level system overview
- Tukwila query execution engine/optimizer
interface specifications (UPDATED for XML extensions)
- Spring 1999: CSE 544, using Tukwila as the core of the class project
- Zachary G. Ives, Alon Y. Halevy, Daniel S. Weld. Convergent Query
Processing. Submitted for publication, 2001.
- Zachary G. Ives, Alon Y. Halevy, Daniel S. Weld. An XML Query
Engine for Network-Bound Data. Submitted for publication, 2001.
- Rachel Pottinger, Alon Levy.
A Scalable Algorithm for Answering Queries Using Views. VLDB
2000, Cairo, Egypt.
- Zachary G. Ives, Alon Y. Levy, Daniel S. Weld, Daniela Florescu, Marc Friedman.
Adaptive Query Processing for Internet Applications. IEEE Data
Engineering Bulletin, Vol. 23 No. 2, June 2000.
- Zachary G. Ives, Alon Y. Levy, Daniel S. Weld.
Efficient Evaluation of Regular Path Expressions on Streaming XML Data.
Technical Report UW-CSE-2000-05-02, University of Washington.
- Zachary G. Ives, Daniela Florescu, Marc Friedman, Alon Levy, Daniel S. Weld.
An Adaptive Query Execution System for Data Integration. ACM
SIGMOD Conference on Management of Data, Philadelphia, PA, June 1-3, 1999.
This research has been funded in part by NSF grant 9978567.