The scientific data management landscape is changing. Improvements in instrumentation and simulation software are giving scientists access to data at an unprecedented scale, and this data is increasingly stored in data centers running thousands of commodity servers. This new environment creates significant data management challenges. In addition to efficient query processing, the magnitude of the data and of the query workload calls for new query management techniques such as runtime query control and intra-query fault tolerance.
In this project, we are developing new data management systems and techniques for enabling scientists to store, analyze, and share large volumes of data using cloud-computing environments.
As one concrete application scenario, we explore the emerging data management needs of the University of Washington’s “N-body Shop” group, which specializes in the development and use of large-scale simulations (specifically, “N-body tree codes”). These simulations investigate the formation and evolution of large-scale structure in the universe. The UW N-body Shop is representative of the current state of the art in astrophysical cosmological simulation. In 2008, the N-body Shop was the 10th largest consumer of NSF TeraGrid time, using 7.5 million CPU hours and generating 50 terabytes of raw data, plus an additional 25 terabytes of post-processing information. Each simulation ranges in size from 55 GB to a few TB. We find that the required analysis involves three types of tasks: filtering and correlating the data at different times in the simulation, clustering data within a single simulated timestep, and querying the clustered data (small sketches of the first two tasks appear after the citations below). We are developing a benchmark comprising all three types of tasks. The publication below presents the first part of the benchmark along with preliminary results from running it on Pig/Hadoop and a commercial relational database management system. The following sub-project discusses clustering. We have not yet addressed the challenge of querying clustered data.
Analyzing Massive Astrophysical Datasets: Can Pig/Hadoop or a Relational DBMS Help?
Sarah Loebman, Dylan Nunley, YongChul Kwon, Bill Howe, Magdalena Balazinska, and Jeffrey P. Gardner
To appear in IASDS 2009
The queries and data sets used in the paper are available here
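To make the first type of task concrete, below is a minimal Python/pandas sketch of a "filter and correlate across timesteps" query. It is not the benchmark code, and the file and column names (snapshot_t10.csv, pid, density) are hypothetical stand-ins for real snapshot data, which is far too large for a single machine and is exactly what motivates running the benchmark on Pig/Hadoop or a parallel DBMS.

# Hedged sketch (not the benchmark code): one example of the first task type,
# "filter and correlate the data at different times in the simulation".
# Column names (pid, mass, density) and file names are hypothetical.
import pandas as pd

def new_dense_particles(early_csv, late_csv, density_cutoff):
    """Return particles that exceed a density cutoff in the late snapshot
    but not in the early one, joined on particle id."""
    early = pd.read_csv(early_csv)          # columns: pid, mass, density, ...
    late = pd.read_csv(late_csv)
    joined = late.merge(early, on="pid", suffixes=("_late", "_early"))
    mask = (joined["density_late"] >= density_cutoff) & \
           (joined["density_early"] < density_cutoff)
    return joined.loc[mask, ["pid", "density_early", "density_late"]]

if __name__ == "__main__":
    result = new_dense_particles("snapshot_t10.csv", "snapshot_t50.csv", 1e5)
    print(len(result), "particles crossed the density cutoff")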
Scalable clustering algorithm for N-body simulations in a shared-nothing cluster
YongChul Kwon, Dylan Nunley, Jeffrey P. Gardner, Magdalena Balazinska, Bill Howe, and Sarah Loebman
University of Washington Technical Report. UW-CSE-09-06-01. June 2009
The queries and data sets used in the paper are available here
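The clustering task groups the particles of a single timestep into larger structures. The sketch below shows a single-node, friends-of-friends-style grouping using a k-d tree and union-find; the linking length and the toy random points are placeholders, and the technical report above addresses what this sketch omits: partitioning the particles across a shared-nothing cluster and merging groups that straddle partition boundaries.

# Hedged sketch: single-node "friends-of-friends" style grouping, the kind of
# clustering applied within one simulated timestep. The distributed algorithm
# in the technical report handles partitioning and cross-partition merging,
# which this sketch does not.
import numpy as np
from scipy.spatial import cKDTree

def friends_of_friends(positions, linking_length):
    """Group particles connected by chains of neighbors closer than
    linking_length; returns a group label per particle."""
    tree = cKDTree(positions)
    pairs = tree.query_pairs(r=linking_length)   # all close particle pairs
    # Union-find over the neighbor pairs to get connected components.
    parent = np.arange(len(positions))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra
    return np.array([find(i) for i in range(len(positions))])

if __name__ == "__main__":
    pts = np.random.rand(10000, 3)               # toy stand-in for one timestep
    labels = friends_of_friends(pts, linking_length=0.01)
    print("groups found:", len(np.unique(labels)))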
In parallel query-processing environments, accurate, time-oriented progress indicators would be highly valuable: queries can take a very long time to complete, and both inter- and intra-query execution times exhibit high variance. In these systems, query times depend not only on the query plans and the amount of data being processed, but also on the amount of parallelism available, the types of operators (often user-defined) that perform the processing, and the overall system load. None of the techniques used by existing tools or available in the literature provides a non-trivial progress indicator for parallel queries. In this project, we are building Parallax, an accurate, time-oriented progress indicator for parallel queries. A simplified sketch of a time-oriented estimate appears after the citation below.
Toward A Progress Indicator for Parallel Queries.
Kristi Morton, Abe Friesen, Magdalena Balazinska, Dan Grossman
University of Washington Technical Report. UW-CSE-09-07-01. July 2009
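The sketch below illustrates only the general idea of a time-oriented progress estimate, not the Parallax algorithm itself: charge each pipeline for its remaining input tuples at its observed (or, for pipelines that have not started, estimated) processing rate, and sum the results. All statistics in the example are hypothetical values a runtime would collect.

# Hedged sketch of a time-oriented progress estimate (not the Parallax
# algorithm): each pipeline's remaining time is its unprocessed tuples
# divided by its processing rate. The numbers below are hypothetical
# runtime statistics.
from dataclasses import dataclass

@dataclass
class PipelineStats:
    tuples_total: float      # estimated input cardinality of the pipeline
    tuples_done: float       # tuples processed so far (0 if not started)
    tuples_per_sec: float    # observed or estimated processing rate

def estimate_remaining_seconds(pipelines):
    """Sum, over all pipelines, the time to process their remaining tuples."""
    remaining = 0.0
    for p in pipelines:
        left = max(p.tuples_total - p.tuples_done, 0.0)
        remaining += left / p.tuples_per_sec
    return remaining

if __name__ == "__main__":
    query = [
        PipelineStats(tuples_total=5e8, tuples_done=3e8, tuples_per_sec=2e6),
        PipelineStats(tuples_total=1e8, tuples_done=0.0, tuples_per_sec=1e6),
    ]
    print(f"estimated time remaining: {estimate_remaining_seconds(query):.0f} s")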
The Nuage project is partially supported by NSF CAREER award IIS-0845397, NSF CRI grant CNS-0454425, an HP Labs Innovation Research Award, gifts from Microsoft Research, and Balazinska's Microsoft Research New Faculty Fellowship.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.