Past Talks

Speaker: Brian Cooper (Yahoo! Research)

Title: PNUTS: Yahoo!'s Massive Scale Data Platform

Where: University of Washington, Seattle.
Computer Science and Engineering Department.
Paul Allen Center, CSE-605 (Database Lab).

When: 2:00 PM (Monday, April 6th, 2009)

Abstract:

I'll describe PNUTS, a massively parallel and geographically distributed database system for Yahoo!'s web applications. When we set out to design PNUTS, our goal was to build a database system that could scale to thousands of servers, but still provide useful DBMS features like indexes, transactions, query optimization, views, and so on. Of course, to reach that scale you have to give up some of the richness of those features, and I'll talk about the tradeoffs that we have faced and the decisions we've made.

PNUTS provides data storage organized as hashed or ordered tables, low latency for large numbers of concurrent requests including updates and queries, and novel per-record consistency guarantees. It is a hosted, centrally managed, and geographically distributed service, and utilizes automated load-balancing and failover to reduce operational complexity. The first version of the system is currently serving in production. I'll describe the motivation for PNUTS and the design and implementation of its table storage and replication layers, and then present experimental results. I'll also discuss experiences building a real production system out of research ideas, and how trying to build a system that actually had to work in production changed our vision and research approach to the system.

Short Bio:

Brian Cooper is a research scientist at Yahoo! Research. Before that he was an assistant professor at Georgia Tech, and before that he was a PhD student at Stanford. His interests are in building distributed systems, and in particular, distributed systems that do database-style management and processing of data. At Yahoo! he works on building very large distributed data storage and processing systems. In previous lives he has worked on self-adaptive peer-to-peer systems, distributed streaming event processing, reliable distributed archival data storage, and XML indexing.


Speaker: Michael Isard (Microsoft Research)

Title: DryadLINQ: distributed data-parallel computing using a high-level language

Where: University of Washington, Seattle.
Computer Science and Engineering Department.
Paul Allen Center, CSE-605 (Database Lab).

When: 12:30 PM (Thursday, February 5th, 2009)

Abstract:

DryadLINQ is a system and a set of language extensions that enable a new programming model for large scale distributed computing. It generalizes previous execution environments such as SQL, MapReduce, and Dryad in two ways: by adopting an expressive data model of strongly typed .NET objects; and by supporting general-purpose imperative and declarative operations on datasets within a traditional high-level programming language.

I will outline the design of DryadLINQ, including an introduction to the LINQ programming model, and discuss the tradeoffs in both programming models and implementation strategies that we made with DryadLINQ, compared with parallel SQL and MapReduce.

Short Bio:

Michael Isard received his DPhil, in computer vision, from Oxford University in 1998. In 1999 he started work at the Compaq Systems Research Center, and since 2002 has worked for Microsoft Research at their Silicon Valley Campus. He spent much of 2003 to 2005 working closely with the MSN Search product group on the design and implementation of their V1 search engine. His current research interests include large-scale distributed systems and programming models for parallel and distributed computing.


Speaker: Dan Olteanu (University of Oxford)

Title: Scalable Query Processing in Probabilistic Databases with SPROUT

Where: University of Washington, Seattle.
Computer Science and Engineering Department.
Paul Allen Center, CSE-605 (Database Lab).

When: 11:00 AM (Tuesday, January 13th, 2009)

Abstract:

In this talk I'll address the problem of query evaluation on probabilistic databases and present an efficient query evaluation technique based on Ordered Binary Decision Diagrams (OBDDs). The connection between various classes of tractable queries and OBDDs that have polynomial sizes and can be found in polynomial time is central to this work and will form the main part of the talk. This technique is implemented in SPROUT, a new query engine that extends the PostgreSQL engine with secondary-storage query evaluation algorithms. Preliminary experiments with 1GB of TPC-H data suggest that SPROUT is up to two orders of magnitude faster than state-of-the-art query evaluation techniques.

Short Bio:

Dan Olteanu joined Comlab in Sept 2007 as a University Lecturer [roughly equivalent to tenure-track Assistant Professor in North America]. Dan holds a Dipl.Ing. in Computer Science from Polytechnic University of Bucharest (Sept 2000) and a PhD in Computer Science from Ludwig Maximilian University of Munich (Feb 2005). Before joining Comlab, he was a postdoctoral researcher at Saarland University in Saarbruecken (April 2005 - Aug 2007), a visiting scientist at Cornell University in Ithaca (Fall 2006), and a temporary professor at Ruprecht Karl University in Heidelberg (Summer term 2007).


Speaker: David DeWitt (Microsoft Jim Gray Systems Lab & University of Wisconsin)

Title: Clustera: A Data-Centric Approach to Scalable Cluster Management

Where: University of Washington, Seattle.
Computer Science and Engineering Department.
Paul Allen Center, CSE-601 (Gates Commons).

When: 1:30 PM (Wednesday, November 19th, 2008)

Abstract:

Twenty-five years ago, when we built our first cluster management system using a collection of twenty VAX 11/750 computers, the idea of a compute cluster was an exotic concept. Today, clusters of 1,000 nodes are common and some of the biggest have in excess of 10,000 nodes. Such clusters are simply awash in data about machines, users, jobs, and files. Many of the tasks that such systems are asked to perform are very similar to database transactions. For example, the system must accept jobs from users and send them off to be executed. The system should not "drop" jobs or lose files due to hardware or software failures. The software must also allow users to stop failed computations or 'change their mind' and retract thousands of submitted but not yet completed jobs. Amazingly, no cluster management system that we are aware of uses a database system for managing its data.

In this talk I will describe Clustera, a new cluster management system we have been working on for the last three years. As one would expect from some database types, Clustera uses a relational DBMS to store all its operational data including information about jobs, users, machines, and files (executable, input, and output). One unique aspect of the Clustera design is its use of an application server (JBoss currently) in front of the relational DBMS. Application servers have a number of appealing capabilities. First, they can handle tens of thousands of clients. Second, they provide fault tolerance and scalability by running on multiple server nodes. Third, they multiplex connections to the database system to a level that the database system can comfortably support. Compute nodes in a Clustera cluster appear as web clients to the application server and make SOAP calls to submit requests for jobs to execute and to update status information that is stored in the relational database.

Extensibility is a second key goal of the Clustera project. Traditional cluster management systems such as Condor were targeted toward long-running, computationally intensive jobs. Newer systems such as Map-Reduce are targeted toward a specific type of data-intensive parallel computation. Parallel SQL database systems represent a third type of cluster management system. The Clustera framework was designed to handle each of these classes of jobs in a common execution and data framework.

Short Bio:

David DeWitt is the former department chair and emeritus professor of computer science at the University of Wisconsin, Madison. Professor DeWitt recently joined Microsoft and started the Microsoft Jim Gray Systems Lab. He received a B.A. degree from Colgate University in 1970 and a Ph.D. degree from the University of Michigan in 1976. Professor DeWitt is a member of the National Academy of Engineering. He was named a Fellow of the ACM in 1995. He received the ACM SIGMOD Innovations Award for his contributions to the database systems field in 1995. He has published over 100 technical papers.


Speaker: Laura Chiticariu (IBM Almaden Research Center)

Title: Systems for Tracing the Provenance of Data

Where: University of Washington, Seattle.
Computer Science and Engineering Department.
Paul Allen Center, CSE-605.

When: 2:30 PM (Friday, November 14th, 2008)

Abstract:

Provenance of data describes the origins, as well as the journey, of data throughout its life cycle. The ability to trace data provenance is crucial in today's information systems, where data is constantly created, copied, transformed and integrated. Provenance allows one to assess the quality and trustworthiness of the data, as well as understand and debug the transformations that data undergoes in such systems.

In this talk I will discuss two principled methods (and corresponding system implementations) for tracing the provenance of data in the context of two commonly used formalisms for specifying data transformations: SQL queries and schema mappings, respectively. Specifically, I will describe the DBNotes annotation management system, which traces data provenance over SQL queries, and the SPIDER schema mapping debugging system, which traces data provenance over schema mappings.

The type of provenance computed by DBNotes is known as where-provenance, whereas the type of provenance computed by SPIDER is an instance of how-provenance. Towards the end of the talk I will give a brief overview of three main notions of database provenance: why-, where- and how-provenance, which have been proposed and studied in recent years.

Short Bio:

Laura Chiticariu is a Research Staff Member in the Search and Analytics group at IBM Almaden Research Center. She received her PhD from UC Santa Cruz in September 2008.


Speaker: Kristen LeFevre (Univ. of Michigan)

Title: Privacy Protection in Data Publishing

Where: University of Washington, Seattle.
Computer Science and Engineering Department.
Paul Allen Center, CSE-605.

When: 2:30 PM (Friday, November 7th, 2008)

Abstract:

Numerous organizations collect, distribute, and publish personal data for purposes that include demographic and public health research. Protection of individual privacy is an important problem in this setting, and a variety of anonymization techniques have been developed that typically aim to satisfy certain privacy constraints (e.g., k-anonymity and l-diversity) with minimal impact on the quality of the resulting data.
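As a rough illustration of the k-anonymity constraint mentioned above, the following minimal sketch checks whether every combination of quasi-identifier values in a table is shared by at least k records. The function name, column names, and sample records are hypothetical and are not taken from the talk or from the speaker's tool.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs in at least k rows."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())

# Hypothetical, already-generalized records (ZIP codes and ages are coarsened).
records = [
    {"zip": "981**", "age": "20-29", "diagnosis": "flu"},
    {"zip": "981**", "age": "20-29", "diagnosis": "asthma"},
    {"zip": "982**", "age": "30-39", "diagnosis": "flu"},
    {"zip": "982**", "age": "30-39", "diagnosis": "diabetes"},
]

print(is_k_anonymous(records, ["zip", "age"], 2))  # True: each group has 2 rows
print(is_k_anonymous(records, ["zip", "age"], 3))  # False: no group has 3 rows
```

This only tests the constraint; choosing a generalization that satisfies it with minimal loss of data quality is the harder problem the abstract refers to.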

This talk will describe several contributions to this field. In particular, I will describe a scalable workload-aware anonymization tool, which is able to incorporate a class of target workloads, consisting of data mining tasks and queries, when anonymizing an input dataset. I will also briefly describe some extended privacy definitions that allow for the more flexible incorporation of instance-level adversarial background knowledge.

Finally, looking forward, I will describe several emerging data-intensive applications to which conventional definitions of privacy do not easily apply.

Bio:

Kristen LeFevre is an Assistant Professor in EECS at the University of Michigan, where she is a member of the database group and the software research lab. She received her Ph.D. from the University of Wisconsin - Madison in 2007.


Speaker: Chris Jermaine (Univ. of Florida)

Title: MCDB: The Monte Carlo Database System

Where: University of Washington, Seattle.
Computer Science and Engineering Department.
Paul Allen Center, CSE-605.

When: 2:30 PM (Monday, November 3rd, 2008)

Abstract:

Analysts working with large data sets often use statistical models to "guess" at unknown, inaccurate, or missing information associated with the data stored in a database. For example, an analyst for a manufacturer may wish to know, "What would my profits have been if I'd increased my margins by 5% last year?" The answer to this question naturally depends upon the extent to which the higher prices would have affected each customer's demand, which is undoubtedly guessed via the application of some statistical model.

In this talk, I'll describe MCDB, which is a prototype database system that is designed for just such a scenario. MCDB allows an analyst to attach arbitrary stochastic models to the database data in order to "guess" the values for unknown or inaccurate data, such as each customer's unseen demand function. These stochastic models are used to produce multiple possible database instances in Monte Carlo fashion (a.k.a. "possible worlds"), and the underlying database query is run over each instance. In this way, fine-grained stochastic models become first-class citizens within the database. This is in contrast to the "classical" paradigm, where high-level summary data are first extracted from the database, then taken as input into a separate statistical model which is then used for subsequent analysis.
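The Monte Carlo idea described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not MCDB itself: the sales data, the uniform "elasticity" demand model, and all function names are hypothetical, and revenue stands in for the profit question in the example.

```python
import random
import statistics

# Hypothetical data: (customer, last year's price, last year's units sold).
SALES = [("acme", 10.0, 100), ("globex", 12.0, 80), ("initech", 8.0, 150)]

def sample_demand(units, price_increase, rng):
    """Assumed stochastic model: demand shrinks by a randomly drawn elasticity."""
    elasticity = rng.uniform(0.5, 1.5)  # illustrative assumption only
    return max(0.0, units * (1.0 - elasticity * price_increase))

def query_revenue(instance):
    """The query that is run over one generated database instance."""
    return sum(price * units for _, price, units in instance)

def monte_carlo_revenue(sales, price_increase, n_worlds, seed=0):
    """Generate n_worlds possible database instances and run the query on each."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_worlds):
        world = [
            (cust, price * (1.0 + price_increase),
             sample_demand(units, price_increase, rng))
            for cust, price, units in sales
        ]
        results.append(query_revenue(world))
    return statistics.mean(results), statistics.stdev(results)

mean, spread = monte_carlo_revenue(SALES, 0.05, 1000)
print(f"estimated revenue: {mean:.2f} +/- {spread:.2f}")
```

The point of the sketch is that the query result is a distribution over possible worlds rather than a single number, so the analyst can report both an estimate and its spread.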

Bio:

Chris Jermaine is an associate professor in the CISE Department at the University of Florida, where he studies databases and data management with a group of exceptional students. He is the recipient of a 2008 Alfred P. Sloan Foundation Research Fellowship, a National Science Foundation CAREER award, and a 2007 ACM SIGMOD Best Paper Award. He received a BA from the Mathematics Department at UCSD, an MSc from the Computer Science and Engineering Department at OSU (his advisor at OSU was Renee Miller, who is now at Toronto), and a PhD from the College of Computing at Georgia Tech (his advisor at Georgia Tech was Ed Omiecinski). Chris grew up in Southern California. In his spare time, he enjoys running, gardening, and outdoor activities such as hiking, climbing, and whitewater boating. In one particular exploit, he and his wife floated a whitewater raft (home-made from scratch using a sewing machine, glue, and plastic) over 100 miles down the Nizina River (and beyond) in Alaska.


Speaker: Jingren Zhou

Title: SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets

Where: University of Washington, Seattle.
Computer Science and Engineering Department.
Paul Allen Center, CSE-605.

When: 2:30 PM (Wednesday, September 24th, 2008)

Abstract:

Companies providing cloud-scale services have an increasing need to store and analyze massive data sets such as search logs and click streams. For cost and performance reasons, processing is typically done on large clusters of shared-nothing commodity machines. It is imperative to develop a programming model that hides the complexity of the underlying system but provides flexibility by allowing users to extend functionality to meet a variety of requirements.

In this talk, we present a new declarative and extensible scripting language, SCOPE (Structured Computations Optimized for Parallel Execution), targeted for this type of massive data analysis. The language is designed for ease of use with no explicit parallelism, while being amenable to efficient parallel execution on large clusters. SCOPE borrows several features from SQL. Data is modeled as sets of rows composed of typed columns. The select statement is retained with inner joins, outer joins, and aggregation allowed. Users can easily define their own functions and implement their own versions of operators: extractors (parsing and constructing rows from a file), processors (row-wise processing), reducers (group-wise processing), and combiners (combining rows from two inputs). SCOPE supports nesting of expressions but also allows a computation to be specified as a series of steps, in a manner often preferred by programmers. We also describe how scripts are compiled into efficient, parallel execution plans and executed on large clusters.


Speaker: Uwe Roehm From University of Sydney

Title: Data Management for High-Throughput Gene Sequencing

Where: University of Washington, Seattle.
Computer Science and Engineering Department.
Paul Allen Center, CSE-605.

When: 11:00 AM (Thursday, July 10th, 2008)

Abstract:

With today's sequencing technology, it is possible to sequence the genome of an individual person within a few weeks for a fraction of the cost of the original Human Genome Project. At the same time, labs with several of these NextGen sequencers are faced with terabytes of data per week that have to be automatically processed and made available to scientists for further analysis. But despite the scale of the problem and the foreseeable widespread deployment of these instruments, the labs' data processing capabilities are amazingly underdeveloped: all data processing is still file-based and main-memory oriented, tools have to deal with a zoo of (proprietary) text formats, conceptual data models or clear abstraction layers are missing, the quality and provenance of data is captured only in an ad hoc fashion, etc. In a joint project with Microsoft, we studied two specific sequencing scenarios - digital gene expression studies and the 1000 Genomes Project - to explore the potential and the limitations of using today's relational database systems as the data processing platform. In particular, we were interested in the storage management for high-throughput sequence data and in leveraging SQL and stored procedures to implement the data analysis tasks inside the database, close to the data. This talk describes how we used several SQL Server features such as filestreams and CLR-based functions and aggregates in unconventional ways to prototype the mentioned scenarios with SQL Server 2008, and gives an overview of our findings about the scalability and performance of these more database-centric approaches.


Speaker: Christoph Koch From Cornell

Title: MayBMS: A Probabilistic Database Management System

Where: University of Washington, Seattle.
Computer Science and Engineering Department.
Paul Allen Center, CSE-605.

When: 11:00 AM (Thursday, July 3rd, 2008)

Abstract:

Databases that contain uncertain data arise naturally in many data management scenarios, such as Web information extraction, data cleaning, data integration, sensor data management, and scientific databases. There are currently no scalable systems for managing and querying such databases. In this talk I present MayBMS, a database management system for efficiently managing and processing large collections of uncertain data that is currently under development at Cornell. MayBMS is based on a clean yet expressive query language that captures many important use cases of probabilistic databases, including what-if queries and the conditioning of databases using new evidence. MayBMS employs a carefully designed succinct representation system for probabilistic databases called U-relations, which nicely unifies various approaches to representing uncertain data, such as c-tables, relational decomposition, and probabilistic graphical models. U-relations allow for the natural reuse of mature relational storage, indexing and query processing techniques to build scalable probabilistic database systems. In addition to the exact processing of probabilistic database queries on U-relations, I discuss the efficient approximation of expressive, compositional queries on probabilistic databases.

Bio:

Christoph Koch is an associate professor of computer science at Cornell University. He is interested in both the theoretical and systems-oriented aspects of data management, and currently works on managing uncertain data, community data management systems, data-driven games, data integration, and Web information extraction and management.


Speaker: Sam Madden From MIT

Title: Column-Oriented Databases: Where's the Beef?

Where: University of Washington, Seattle.
Computer Science and Engineering Department.
Paul Allen Center, CSE-605.

When: 11:30 AM (Friday, June 13th, 2008)

Abstract:

Vertical partitioning is a well-established technique for improving query processing performance in relational database systems. Surprisingly, the database community has recently unleashed a flurry of research projects (C-Store, MonetDB) and startup companies (Vertica, InfoBright, ParAccel) proposing "column-oriented databases", which appear to be nothing more than a conventional database with a fully vertically partitioned storage system. In this talk, I will describe how our work on the C-Store system goes beyond simple vertical partitioning. I will begin with an overview of column-oriented technology and its applications and then focus on the unusual aspects of the design of the storage system and query executor. I will also describe a series of experiments that show why vertical partitioning in a conventional database does not perform as well as a system designed from the ground up to support columns, showing that our academic prototype can achieve order-of-magnitude performance improvements over a commercial database on a recently proposed data warehousing benchmark.

Bio:

Samuel Madden is an Associate Professor in the EECS Department and CSAIL at MIT. He is a specialist in networked data management and database systems. He is the author of the TinyDB system for sensor network data collection, the co-creator of the CarTel mobile sensor network system for automobiles, one of the architects of the C-Store database system, and a co-founder of Vertica Systems, a database startup commercializing column-stores. He has published articles in top computer science conferences, including SIGMOD, SenSys, and OSDI, on data acquisition and processing, database optimization, query planning, and distributed databases. Madden received the NSF CAREER Award in 2004 and the Sloan Fellowship in 2006, was named one of Technology Review's Top 35 Under 35 in 2006 for his work in data management in sensor networks, and won best paper awards in VLDB 2004 and 2007 and MobiCom 2006.


Speaker: Chris Olston From Yahoo! Research

Title: Processing Web-Scale Data with Pig

Where: University of Washington, Seattle.
Computer Science and Engineering Department.
Paul Allen Center, CSE-605.

When: 10:30 AM - 11:30 AM (Tuesday, February 19th)

Abstract:

There is a growing need for ad-hoc analysis of extremely large data sets, especially at internet companies where innovation critically depends on being able to analyze terabytes of logs collected every day. Parallel database products, e.g., Teradata, offer a solution, but are usually prohibitively expensive at this scale. Besides, many of the people who analyze this data are entrenched procedural programmers, who find the declarative SQL style to be unnatural. The success of the more procedural map-reduce programming model, and its associated scalable implementations on commodity hardware, is evidence of the above. However, the map-reduce paradigm is too low-level and rigid, and leads to a great deal of custom user code that is hard to maintain and reuse. In this talk I will describe a new language called Pig Latin that we have designed to fit in a sweet spot between the declarative style of SQL and the low-level, procedural style of map-reduce. The accompanying system, Pig, is fully implemented, and compiles Pig Latin into physical plans that are executed over Hadoop, an open-source map-reduce implementation. Pig is an open-source Apache Incubator project, and is available for general use. The talk will also cover some of the other topics we are addressing in the Pig project, including: (1) data sampling and synthesis techniques to assist in query debugging, (2) how to schedule queries that can share work, (3) adaptive approaches to physical database design, and (4) adaptive data placement techniques.

Bio:

Christopher Olston is a senior research scientist at Yahoo! Research, after a stint as assistant professor at Carnegie Mellon University from 2003 to 2005. His research interests include data management and web search. Olston received his Ph.D. in 2003 from Stanford University, where he was supported by fellowship awards from the National Science Foundation and the Stanford Graduate Fellowship program. Prior to attending graduate school, he received the 1998 Computing Research Association Award for Outstanding Undergraduates for his work at UC Berkeley. Olston is an avid Cal fan but likes to rollerblade at Stanford.


Distinguished Lecturer Series

Speaker: Joe Hellerstein From UC Berkeley

Title: Declarative Networking: "What" is Next.

Where: University of Washington, Seattle.
Computer Science and Engineering Department.
Paul Allen Center, EEB-105.

When: 3:30 PM - 4:30 PM (Thursday, February 7)

Abstract:

Declarative languages allow programmers to say *what* they want, without worrying over the details of *how* to achieve it. These kinds of languages revolutionized data management decades ago (SQL, spreadsheets), but have had limited success in other aspects of computing. The story seems to be changing in recent years, however. One new chapter is work that my colleagues and I have been pursuing on the design and implementation of declarative languages and runtime systems for network protocol specification. Distributed Systems and Networking appear to be surprisingly natural domains for declarative specifications, and -- given recent interest in revisiting Internet Architecture from scratch -- these domains are ripe for a new programming methodology. The results of our first phase of research have been exciting: we have implemented complex networking infrastructure in 100x less code than traditional implementations, and our programs often match very closely (sometimes line-for-line) with pseudocode published by protocol inventors. As the work on core declarative networking has matured, a number of groups have begun pursuing related applications for declarative languages, including our own emerging work on hybrid protocol synthesis, distributed Machine Learning, and language metacompilation, as well as initial work by others on replication systems, modular robotics, security, distributed debugging, and consensus protocols. This talk will introduce the concepts of Declarative Networking, the state of the research agenda today, and new directions being pursued.

Bio:

Joseph M. Hellerstein is a Professor of Computer Science at the University of California, Berkeley, whose research focuses on data management and networking. His work has been recognized via awards including an Alfred P. Sloan Research Fellowship, MIT Technology Review's inaugural TR100 list, and two ACM-SIGMOD "Test of Time" awards. Key ideas from his research have been incorporated into commercial and open-source database software released by IBM, Oracle, and PostgreSQL. He has also held industrial posts including Director of Intel Research Berkeley and Chief Scientist of Cohera Corporation.


Speaker: Tova Milo from Tel Aviv University.

When: 11:00 AM - 12:00 PM (Friday, July 13th)

Where: University of Washington, Seattle.
Computer Science and Engineering Department.
Paul Allen Center, room CSE - 605.

Title: Querying and Monitoring Business Processes

Abstract:

In this talk we present BPQL, a novel query language for querying and monitoring business processes. The BPQL language is based on an intuitive model of business processes, an abstraction of the emerging BPEL (Business Process Execution Language) standard. It allows users to query business process specifications, as well as their run-time behavior, visually, in a manner closely analogous to how such processes are typically specified, and can be employed in a distributed setting, where process components may be provided by distinct providers (peers).

We describe here the query language as well as its underlying formal model. We consider the properties of the various language components and explain how they influenced the language design. In particular we distinguish features that can be efficiently supported, and those that incur a prohibitively high cost, or cannot be computed at all. We also present our implementation which complies with real life standards for business process specifications, XML, and Web services, and is used in the BPQL system.


Speaker: Anhai Doan from University of Wisconsin-Madison.

When: 3:00 - 4:00 PM (Friday, June 1st)

Where: University of Washington, Seattle.
Computer Science and Engineering Department.
Paul Allen Center, room CSE - 605.

Title: The Cimple Project on Community Information Management

Abstract:

In this talk I will give an overview of Cimple, a joint project between the University of Wisconsin-Madison and Yahoo! Research. Cimple develops a generic solution that crawls, extracts, and integrates data, to build structured "portals" for online communities. I will first describe the envisioned working of Cimple and our prototype, DBlife, which is a structured portal being developed for the database research community. Next, I describe the technical challenges underlying Cimple and our solution approaches. Finally, I discuss the connections between Cimple and research in data integration, information extraction, human computation, and Web data management. More information about Cimple can be found at http://www.cs.wisc.edu/~anhai/projects/cimple


Speaker: Deepak Patil from Microsoft.

When: 4:30 - 5:30PM (Monday, June 4th)

Where: University of Washington, Seattle.
Computer Science and Engineering Department.
Paul Allen Center, room CSE - 605.

Title: Scale and Manageability Challenges for high OLTP and VLDB Web Services

Abstract:

Microsoft’s Windows Live and MSN Web services are growing at a rapid pace. For the infrastructure that delivers these services to hundreds of millions of users in over 200 countries, the scale, quality, manageability, security, and performance challenges are daunting, yet exciting. Join the talk to hear a technical perspective on how Microsoft is taking on these challenges and maintaining its winning position. You will hear details on the scale of these services and on the technology and process challenges of delivering ‘able’ services from an operational point of view. As the company strives to maintain 99.99% availability and high transaction performance for its key web services such as Windows Live Messenger, Hotmail, Search, and Spaces, its Global Foundation Services division is leaving no stone unturned with regard to manageability investments, operational intelligence through data mining, and service security. As the division manages petabytes of storage, both structured and unstructured, it is devising newer ways to deal with the challenges of data management, manipulation, abstraction, mining, transfer, and security. Hear the details of all this and more.


Speaker: David Maier from Portland State University.

When: 10:30 - 11:30AM (Friday, April 27th)

Where: University of Washington, Seattle.
Computer Science and Engineering Department.
Paul Allen Center, room CSE - 605.

Title: My Database Does Grids: Generating Data Products in the GridField Model

Abstract:

Scientists’ ability to generate and store simulation results is outpacing their ability to analyze them via ad hoc programs. We believe such programs have an algebraic structure that can be exploited to improve reasoning and performance. In this talk, we present the GridField model that exposes this structure and can be used to express, optimize, and reason about data transformations over gridded datasets. Grid structures are first-class citizens in the model, and operators can manipulate data on both structured and unstructured grids. As simulation results are primarily write-once, our implementation stores data in a column-oriented format that saves space and enables efficient algorithms. We advocate a light-weight design: Our services access the same native data representations as the scientists use themselves and can therefore coexist with legacy applications. Our evaluation of applicability and performance involves datasets from oceanography, seismology, and medicine.

In this talk I will discuss the requirements for representing gridded datasets, present the GridField model for organizing and manipulating such data, illustrate the definition and optimization of data products in the model, and briefly report on an experimental evaluation.


Speaker: Patrick Valduriez from INRIA-Rennes.

When: 2:00 - 3:00 PM (Wednesday, April 18th)

Where: University of Washington, Seattle.
Computer Science and Engineering Department.
Paul Allen Center, room CSE - 605.

Title: Data Currency in Replicated DHTs (Joint work with Reza Akbarinia and Esther Pacitti, to appear at SIGMOD 2007.)

Abstract:

Distributed Hash Tables (DHTs) provide a scalable solution for data sharing in P2P systems. To ensure high data availability, DHTs typically rely on data replication, yet without data currency guarantees. Supporting data currency in replicated DHTs is difficult as it requires the ability to return a current replica despite peers leaving the network or concurrent updates. In this paper, we give a complete solution to this problem. We propose an Update Management Service (UMS) to deal with data availability and efficient retrieval of current replicas based on timestamping. For generating timestamps, we propose a Key-based Timestamping Service (KTS) which performs distributed timestamp generation using local counters. Through probabilistic analysis, we compute the expected number of replicas which UMS must retrieve for finding a current replica. Except for the cases where the availability of current replicas is very low, the expected number of retrieved replicas is typically small; e.g., if at least 35% of current replicas are available, then the expected number of retrieved replicas is less than 3. We validated our solution through implementation and experimentation over a 64-node cluster and evaluated its scalability through simulation up to 10,000 peers using SimJava. The results show the effectiveness of our solution. They also show that our algorithm used in UMS achieves major performance gains, in terms of response time and communication cost, compared with a baseline algorithm.
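
As a rough sanity check on the "35% ... less than 3" figure in the abstract, here is a minimal sketch under a simplifying assumption of our own, which is not necessarily the analysis used in the paper: each retrieved replica is independently current and available with probability p, so the number of retrievals until the first current replica follows a geometric distribution with mean 1/p. The function name and the optional cap on the number of replicas are hypothetical.

```python
from typing import Optional


def expected_retrievals(p_current: float, max_replicas: Optional[int] = None) -> float:
    """Expected number of replicas fetched until a current one is found.

    Assumption (ours, for illustration): each fetch independently returns a
    current replica with probability p_current.
    """
    if not 0.0 < p_current <= 1.0:
        raise ValueError("p_current must be in (0, 1]")
    if max_replicas is None:
        # Mean of a geometric distribution.
        return 1.0 / p_current
    if max_replicas < 1:
        raise ValueError("max_replicas must be at least 1")
    # With at most n replicas to try, the search stops after n fetches even
    # if none was current: E[min(G, n)] = (1 - (1 - p)^n) / p.
    q = 1.0 - p_current
    return (1.0 - q ** max_replicas) / p_current


if __name__ == "__main__":
    print(expected_retrievals(0.35))      # unbounded number of replicas
    print(expected_retrievals(0.35, 10))  # at most 10 replicas per key
```

Under this independence assumption, 1/0.35 is a little under 3, which is consistent with the number quoted in the abstract.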


Speaker: Daniel Abadi from MIT.

When: Friday, January 5th at 2pm.

Where: University of Washington, Seattle.
Computer Science and Engineering Department.
Paul Allen Center, room CSE - 605.

Title: Query Execution in Column-Oriented Database Systems

Abstract:

Recent research on column-oriented database systems (DBMSs) has shown that these systems can outperform existing row-oriented DBMSs by one to two orders of magnitude on read-mostly query workloads like those found in data warehouses, decision support, and customer relationship management systems. In this talk, I will discuss this exciting new class of database systems and will provide an overview of the C-Store system that we have developed over the past two years at MIT. I will then focus on the design of the column-oriented query execution engine I have developed. In particular, I will discuss the impact on query performance of tuple construction (stitching together attributes from multiple columns into a row-oriented "tuple") and operation on compressed data. Tuple construction allows column-oriented DBMSs to offer a standards-compliant relational database interface (e.g., ODBC or JDBC); however, if done at the wrong point in a query plan, a significant performance penalty is paid.

Similarly, data compression can improve query performance by an order of magnitude by trading cheap CPU cycles for expensive I/O bandwidth.
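
To illustrate what "operation on compressed data" can mean, here is a minimal sketch that evaluates aggregates directly on a run-length-encoded column without decompressing it. Run-length encoding is used only as a familiar example; the abstract does not say which compression schemes the engine uses, and all names below are hypothetical.

```python
from typing import Iterable, List, Tuple

Run = Tuple[int, int]  # (value, run_length)


def rle_encode(column: Iterable[int]) -> List[Run]:
    """Compress a column into (value, run_length) pairs."""
    runs: List[Run] = []
    for value in column:
        if runs and runs[-1][0] == value:
            # Extend the current run.
            runs[-1] = (value, runs[-1][1] + 1)
        else:
            # Start a new run.
            runs.append((value, 1))
    return runs


def rle_sum(runs: List[Run]) -> int:
    """SUM over the column: one multiplication per run, not one add per row."""
    return sum(value * length for value, length in runs)


def rle_count_equal(runs: List[Run], target: int) -> int:
    """COUNT(*) WHERE col = target, touching each run once."""
    return sum(length for value, length in runs if value == target)


if __name__ == "__main__":
    column = [5, 5, 5, 7, 7, 5]
    runs = rle_encode(column)
    print(runs, rle_sum(runs), rle_count_equal(runs, 5))
```

The work done is proportional to the number of runs rather than the number of rows, which is where the saving comes from on sorted or low-cardinality columns.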


Speaker: Mirek Riedewald from Cornell.

When: Friday, December 1st at 2pm.

Where: University of Washington, Seattle.
Computer Science and Engineering Department.
Paul Allen Center, room CSE - 605.

Title: Indexing for Function Approximation

Abstract:

The availability of information technology is fundamentally changing the face of modern science. As the Report of the NSF Blue-Ribbon Advisory Panel on Cyberinfrastructure states, "Scientists in many disciplines have begun revolutionizing their fields by using computers, digital data, and networks to replace and extend their traditional efforts. The calculations that can be performed and the information that can be archived and used are exploding." Efficient management of massive amounts of data is crucial for Cyberinfrastructure, also known as eScience, to succeed.

The research presented in this talk was motivated by scientific simulations. Simulation is one of the most powerful tools for studying and understanding real-world physical phenomena, but realistic mathematical models are often very complex and run for a large number of steps. It is infeasible to evaluate these models exactly at each step, and thus scientists trade accuracy for reduced simulation cost. We model high-dimensional function approximation (HFA) as a storage and retrieval problem, and we show that HFA defines a new class of applications for high-dimensional index structures. HFA imposes a mixed query-update workload on the index, which leads to novel tradeoffs between the efficiency of searches and the efficiency of updates. We present hardness results, and we investigate in detail one specific approach to HFA based on Taylor series expansions, analyzing the index design tradeoffs through a thorough experimental study.
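
The storage-and-retrieval view of function approximation can be sketched as follows. This is a minimal illustration under our own assumptions, not the method from the talk: stored points keep the exact function value and gradient, a query near a stored point is answered with a first-order Taylor expansion, and a miss triggers an exact evaluation plus an insert (the mixed query-update workload). A linear scan stands in for the high-dimensional index, and the fixed `radius` is a simplification of whatever accuracy test a real system would use. All names are hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Tuple[float, ...]
Entry = Tuple[Vector, float, Vector]  # (point x0, f(x0), gradient at x0)


class TaylorCache:
    def __init__(
        self,
        f: Callable[[Vector], float],
        grad: Callable[[Vector], Sequence[float]],
        radius: float,
    ) -> None:
        self.f = f
        self.grad = grad
        self.radius = radius
        self.entries: List[Entry] = []
        self.exact_evaluations = 0

    def query(self, point: Sequence[float]) -> float:
        x = tuple(point)
        # Search: find the nearest stored point (an index would replace this scan).
        nearest = min(self.entries, key=lambda e: math.dist(x, e[0]), default=None)
        if nearest is not None and math.dist(x, nearest[0]) <= self.radius:
            x0, f0, g0 = nearest
            # First-order Taylor expansion: f(x) ~ f(x0) + grad(x0) . (x - x0)
            return f0 + sum(g * (xi - x0i) for g, xi, x0i in zip(g0, x, x0))
        # Update: no stored point is close enough, so evaluate exactly and insert.
        self.exact_evaluations += 1
        f0 = self.f(x)
        self.entries.append((x, f0, tuple(self.grad(x))))
        return f0


if __name__ == "__main__":
    cache = TaylorCache(
        f=lambda x: x[0] ** 2 + x[1] ** 2,
        grad=lambda x: (2 * x[0], 2 * x[1]),
        radius=0.5,
    )
    print(cache.query((1.0, 1.0)))  # exact evaluation, inserted into the cache
    print(cache.query((1.1, 1.0)))  # answered approximately from the stored point
    print(cache.exact_evaluations)
```

Every miss is both a search and an insert, so an index tuned only for fast lookups can become the bottleneck; that is the search-versus-update tradeoff the abstract refers to.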


Speaker: David Andersen from CMU.

When: Friday, November 3rd at 2pm.

Where: University of Washington, Seattle.
Computer Science and Engineering Department.
Paul Allen Center, room CSE - 403.

Title: Easing the Pain of Network Data Analysis: Why network researchers need database help.

Abstract:

The "science" of networking research remains disconcertingly ad-hoc, continually reinventing the same analysis tools and different storage formats. The result of this state of being is considerable wasted effort, the inability to easily replicate prior analyses, and gratuitous format differences that make it difficult to compare data-sets.

With a view to lowering the barriers to exploratory network measurement, repeatable experimentation, and data sharing, I will present several challenges in collecting and archiving large volumes of network measurement data. This talk deliberately raises more questions than it answers, in the hope that the database community can find (and solve!) many of these interesting problems. What are the best architectures for managing and analyzing these volumes of data? Can conventional databases and query languages be adapted to deal with the common queries encountered in network data mining?

I will then present our early work on crafting a large-scale Internet data storage and analysis facility -- the Datapository -- designed to create a framework for collaboratively addressing these challenges. While we have not yet tackled many of the database problems listed, our experience so far with the Datapository as an analysis tool has been very encouraging. In many cases, we can reduce the major analysis components of contemporary Internet measurement research to one or a few SQL statements and a small amount of glue code to compose the analysis.


Speaker: Dennis Lee from Amazon.

When: Friday, April 14th at 11:30am-12:30pm.

Where: University of Washington, Seattle.
Computer Science and Engineering Department.
Paul Allen Center, room CSE - 605.

Title: Operating and Scaling a Website to Millions of Users.

Abstract:

I will relate my experience running multiple websites, all of which have to be up 24x7x365 while serving millions of hits an hour. Many things that are taken for granted on small sites become huge issues when scaled up to this magnitude: deployment, configuration management, machine setup, upgrades, random byzantine failures, resource management, persistence, and consistency. I will start by sketching the multi-tiered service architecture of Amazon's website. Then I will go through some of the lessons I learned in running the website: both things that seem easy but are not at this scale, and things that seem hard but lend themselves to very different solutions due to the nature of the application that I was managing.


Our first guest speaker: Shankar Pal from Microsoft Research.

When: Friday, February 17th at 1:30pm-3pm.

Title: XML Processing in SQL Server 2005.

Abstract:

SQL Server 2005 introduces support for XML as a rich data type within its relational infrastructure. XML instances are stored as byte sequences to support the XML data model faithfully. This raises new challenges for storage, query processing, indexing, and schema management.

This talk discusses the many innovations that have gone into the XML processor. A node labeling scheme called OrdPath captures document order and hierarchical relationships in a compressed binary representation while supporting insertion of new nodes at arbitrary positions in the XML tree. The query language supported is XQuery, a candidate recommendation from W3C, using the relational infrastructure with a handful of new operators. Several optimizations have been introduced for high performance. XML instances can be indexed in an edge-table-like format. Additionally, packaged indexes are available for optimizing different classes of XQuery workloads. XML schema evolution is supported in a novel way, without requiring an upgrade of existing data or disrupting existing applications.