The scientific data management landscape is changing. Improvements in instrumentation and simulation software are giving scientists access to data at an unprecedented scale. This data is increasingly being stored in data centers running thousands of commodity servers. This new environment creates significant data management challenges. In addition to efficient query processing, the magnitude of data and queries calls for new query management techniques such as runtime query control and intra-query fault tolerance.

In this project, we are developing new data management systems and techniques for enabling scientists to store, analyze, and share large volumes of data using cloud-computing environments.

 

People

Faculty

Graduate Students

Undergraduate Students

Alumni/ae

 

Projects and Publications

 

Astronomy Simulation Use-Case and Benchmark

As one concrete application scenario, we explore the emergent data management needs of the University of Washington’s “N-body Shop” group, which specializes in developing and using large-scale simulations (specifically, “N-body tree codes”). These simulations are used to investigate the formation and evolution of large-scale structure in the universe. The UW N-body Shop is representative of the current state of the art in astrophysical cosmological simulation. In 2008, the N-body Shop was the 10th largest consumer of NSF TeraGrid time, using 7.5 million CPU hours and generating 50 terabytes of raw data, with an additional 25 terabytes of post-processing information. Each simulation ranges in size from 55 GB to a few TB. We find that the required analysis involves three types of tasks: filtering and correlating the data at different times in the simulation, clustering data within one simulated timestep, and querying the clustered data. We are developing a benchmark comprising all three types of tasks. The publication below presents the first part of the benchmark along with some preliminary results from running this benchmark on Pig/Hadoop and a commercial relational database management system. The following sub-project discusses clustering. We have not yet addressed the challenge of querying clustered data.

Analyzing Massive Astrophysical Datasets: Can Pig/Hadoop or a Relational DBMS Help?
Sarah Loebman, Dylan Nunley, YongChul Kwon, Bill Howe, Magdalena Balazinska, and Jeffrey P. Gardner
To appear in IASDS 2009

The queries and data sets used in the paper are available here

 

Efficient Clustering Algorithms for Shared-Nothing Systems

In this project, we focus on one common yet challenging data analysis problem from the astronomy simulation domain: massive-scale data clustering. We study the performance and scalability of a clustering algorithm called Friends-of-Friends. This algorithm is designed to cluster points in a multi-dimensional space and is commonly used on simulation data to study galaxy formation and evolution.
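To make the clustering task concrete, here is a minimal single-machine sketch of Friends-of-Friends in Python. It is an illustration only, not the shared-nothing implementation studied in this project. It assumes the usual definition of the algorithm: two points are "friends" when they lie within a chosen linking length of each other, and a cluster is a set of points connected through a chain of friends. The function name, the `linking_length` parameter, and the inclusive distance test are our own choices for this example.

```python
from itertools import combinations
from math import dist


def friends_of_friends(points, linking_length):
    """Return clusters as sorted lists of point indices.

    Illustrative sketch: points within `linking_length` of each other are
    linked, and clusters are the connected components of those links.
    """
    parent = list(range(len(points)))

    def find(i):
        # Walk up to the root of i's set, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Naive O(n^2) scan over all pairs. Fine for a toy example, but far too
    # slow for simulation snapshots, which need spatial indexing or partitioning.
    for i, j in combinations(range(len(points)), 2):
        if dist(points[i], points[j]) <= linking_length:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    # Collect point indices by the root of the set they belong to.
    clusters = {}
    for i in range(len(points)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


points = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0),
          (5.0, 5.0, 5.0), (5.4, 5.0, 5.0), (10.0, 10.0, 10.0)]
print(friends_of_friends(points, linking_length=0.6))
```

Note that points 0 and 2 end up in the same cluster even though they are farther apart than the linking length, because point 1 is a friend of both. This transitive linking is what makes the algorithm hard to partition across machines: a single cluster can span any partition boundary.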

Scalable clustering algorithm for N-body simulations in a shared-nothing cluster
YongChul Kwon, Dylan Nunley, Jeffrey P. Gardner, Magdalena Balazinska, Bill Howe, and Sarah Loebman
University of Washington Technical Report. UW-CSE-09-06-01. June 2009

The queries and data sets used in the paper are available here

 

The Parallax Time-based Progress Indicator for Parallel Queries

In parallel query-processing environments, accurate, time-oriented progress indicators would be very useful to users, because queries take a very long time to complete and both inter- and intra-query execution times can vary widely. In these systems, query times depend not only on the query plans and the amount of data being processed, but also on the amount of parallelism available, the types of operators (often user-defined) that perform the processing, and the overall system load. None of the techniques used by existing tools or available in the literature provides a non-trivial progress indicator for parallel queries. In this project, we are building Parallax, an accurate, time-oriented progress indicator for parallel queries.
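As a rough illustration of how these factors interact, the following Python sketch computes a naive time-remaining estimate. It is not Parallax's estimator. It is a hypothetical baseline that assumes a query runs as a sequence of stages, that each stage's remaining work and per-worker processing rate are known, and that the rate stays constant. The `Stage` class and its field names are invented for this example, and system load is ignored entirely.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    """One sequential stage of a query (hypothetical model for illustration)."""
    remaining_tuples: int     # tuples this stage still has to process
    tuples_per_second: float  # assumed processing rate of a single worker
    workers: int              # workers assumed available to this stage


def estimate_remaining_seconds(stages):
    """Sum, over stages run one after another, of remaining work divided by
    the aggregate rate of the workers assigned to that stage."""
    total = 0.0
    for stage in stages:
        aggregate_rate = stage.tuples_per_second * stage.workers
        total += stage.remaining_tuples / aggregate_rate
    return total


stages = [
    Stage(remaining_tuples=1000, tuples_per_second=10.0, workers=4),
    Stage(remaining_tuples=600, tuples_per_second=20.0, workers=2),
]
print(estimate_remaining_seconds(stages))  # 25.0 s + 15.0 s = 40.0
```

Even this simple model shows why the problem is hard: the estimate is only as good as the assumed rates and worker counts, which in practice shift with user-defined operators, data skew, and competing load.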

Toward A Progress Indicator for Parallel Queries.
Kristi Morton, Abe Friesen, Magdalena Balazinska, Dan Grossman
University of Washington Technical Report. UW-CSE-09-07-01. July 2009

 

Dataset and Benchmark

Acknowledgments

The Nuage project is partially supported by NSF CAREER award IIS-0845397, NSF CRI grant CNS-0454425, an HP Labs Innovation Research Award, gifts from Microsoft Research, and Balazinska's Microsoft Research New Faculty Fellowship.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.