The goal of NWDS is to bring together researchers and practitioners in the field of databases and data management systems working in the Pacific North-West.
One of our main activities is an annual meeting. See here for the list of past meetings.
The other main activity is a talk series with a variety of distinguished speakers from academia and industry. The details are below. Our past talks can be found on the NWDS YouTube channel. Please note that not all talks are recorded.
Instructions to sign up for the NWDS mailing list are at the bottom of this page.
Speaker: Anastasia Ailamaki
Where: University of Washington, Seattle.
Allen School of Computer Science and Engineering.
Paul G. Allen Center, CSE 291
When: Thursday, November 21st, 2024, 11am-12pm
Title: The New Memory Wall and how it changes database system design
Abstract: To bridge the ever-growing memory-processor speed gap (aka the "memory wall"), computer architects have introduced new levels of caching that trade capacity for speed, and database designers have developed cache-aware query processing algorithms since the late 1990s. Nowadays distributed query processing on the cloud is the norm; memory resources grow increasingly heterogeneous and disaggregated, mitigating the benefit of cache-aware query processing techniques. Recently, in contrast with traditional CPU-centric architectures, "memory-centric" systems that use memory pooling attract interest but also raise significant challenges as data move in unpredictable ways along a multi-dimensional memory hierarchy. Therefore, data movement emerges as a key performance bottleneck as it incurs a major cost in distributed query processing. In this talk, I will discuss the new memory wall and the challenges and opportunities it brings to database system design.
Bio: Anastasia Ailamaki is a Professor of Computer and Communication Sciences at the École Polytechnique Fédérale de Lausanne (EPFL), a visiting researcher at Google, and the co-founder and Chair of the Board of Directors of RAW Labs SA, a Swiss company developing systems to analyze heterogeneous big data from multiple sources efficiently. She earned a Ph.D. in Computer Science from the University of Wisconsin-Madison in 2000. She has received the 2019 ACM SIGMOD Edgar F. Codd Innovations Award and the 2020 VLDB Women in Database Research Award. She is also the recipient of an ERC Consolidator Award (2013), the Finmeccanica endowed chair from the Computer Science Department at Carnegie Mellon (2007), a European Young Investigator Award from the European Science Foundation (2007), an Alfred P. Sloan Research Fellowship (2005), an NSF CAREER award (2002), twelve best-paper awards and three Test-of-Time prizes at international scientific conferences. She has received the 2018 Nemitsas Prize in Computer Science by the President of Cyprus and the 2021 ARGO Innovation Award by the President of the Hellenic Republic. She is an ACM fellow, an IEEE fellow, a member of the Academia Europaea, and an elected member of the Swiss, the Belgian, the Greek, and the Cypriot National Research Councils.
Speaker: Laurel Orr
Where: University of Washington, Seattle.
Allen School of Computer Science and Engineering.
Paul G. Allen Center, CSE 291
When: Friday, April 12th, 2024, 2:30pm-3:30pm
Title: From Text2SQL to Automating BI: The Coming Wave of LLM Analytic Agents
Abstract: Large Language Models (LLMs) have seemingly promised to democratize and automate data analytics since the early days of GPT-3. While we’ve seen numerous works apply LLMs to solve components of the data analytics pipelines — e.g. data cleaning, schema matching, and text-to-SQL — we have yet to really see LLMs take over and automate enterprise analytics. What are we missing? We argue that to automate analytics, you need to automate the full end-to-end workflow including steps after the user finishes their analysis. In this talk, we’ll introduce a new data stack for enterprise analytics built on LLM agents. To make this new agent data stack work, we’ll discuss how you first need to focus on building single agents that can solve isolated analytics tasks for enterprise data. We’ll then investigate how you can coordinate and plan across individual agents to automate end-to-end workflows. While LLM agents are still in their infancy, we believe the coming wave of agents holds promise to both automate classically hard data management problems and leverage data management solutions.
Bio: Laurel Orr is a researcher at Numbers Station working on applying generative AI to data tasks. She graduated with a PhD in Databases and Data Management from the Paul G. Allen School of Computer Science and Engineering at the University of Washington and then was a PostDoc at Stanford working for Chris Ré in the HazyResearch Labs. Her research interests are broadly at the intersection of artificial intelligence, foundation models, and data management. She focuses on how to train, customize, and deploy foundation models for data tasks. This includes problems around data curation and management for RAG systems, efficient model training and inference for batch workloads, and prompting paradigms for high-performing, customized models.
Speaker: Jun Yang
Where: University of Washington, Seattle.
Allen School of Computer Science and Engineering.
Paul G. Allen Center, CSE 291
When: Thursday, April 11th, 2024, 3:30pm-4:30pm
Title: What Teaching Databases Taught me about Researching Databases
Abstract: Declarative querying and automatic optimization are the cornerstones of the success and longevity of database systems, yet these concepts often pose challenges for novice learners accustomed to different coding paradigms. The transition is further hampered by the lack of query debugging tools (be it for correctness or performance) compared to the plethora available for programming languages. The talk samples several systems that we build at Duke University to help students learn and debug database queries. These systems have not only helped scale up teaching and improve learning, but also inspired research on interesting and fundamental questions concerning databases. Furthermore, with the rise of generative AI, we argue that there is a heightened need for skills in scrutinizing and debugging AI-generated queries, and we outline several ongoing and future work directions aimed at addressing this emerging challenge.
Bio: Jun Yang is currently the Bishop-MacDermott Family Professor of Computer Science at Duke University. He joined Duke after receiving his Ph.D. from Stanford in 2001 and chaired the Department of Computer Science at Duke during 2020-2023. He has broad research interests in databases and data-intensive systems. He is a Trustee of the VLDB Endowment and served as the general co-chair of SIGMOD 2017 and the co-Editor-in-Chief of PVLDB during 2022-2023. He is a recipient of the CAREER Award, IBM Faculty Award, HP Labs Innovation Research Award, and Google Faculty Research Award. He has striven to connect research to his other passions, such as journalism, where he has worked on computational fact-checking since its nascent days, and education, where he has built a number of software tools for learning databases. He received the David and Janet Vaughan Brooks Teaching Award at Duke.
Speaker: Luca Scheerer
Where: University of Washington, Seattle.
Allen School of Computer Science and Engineering.
Paul G. Allen Center, CSE 291
When: Friday, March 29, 2024, 2:30pm-3:30pm
Title: QirK: Question Answering via Intermediate Representation on Knowledge Graphs
Abstract: QirK seeks to bridge the gap between the capabilities of Large Language Models (LLMs) and the structured interpretability of database systems, addressing the complexity of querying Knowledge Graphs. QirK enables users to interact with Knowledge Graphs by posing questions in natural language. This is achieved by mapping the input query to an intermediate representation (IR) via LLMs, then repairing it into a valid relational database query through semantic search on vector embeddings. By leveraging this IR, QirK can answer structurally complex questions that are still beyond the reach of emerging Large Language Models (LLMs) while ensuring complete and accurate results. This is a joint work by Luca Scheerer (ETH Zurich), Anton Lykov (UW), Moe Kayali (UW), Ilias Fountalis (RelationalAI), Nikolaos Vasiloglou (RelationalAI), Dan Olteanu (UZH), and Dan Suciu (UW).
Bio: Luca Scheerer is a second-year Computer Science MSc student at ETH Zurich specializing in data systems and the intersection of systems and machine learning. He has interned at Google working on their internal knowledge graph and at the market maker Citadel Securities.
Speaker: Kurt Stockinger
Where: University of Washington, Seattle.
Allen School of Computer Science and Engineering.
Paul G. Allen Center, CSE 291
When: Friday, March 1st, 2024, 2:30pm-3:30pm
Title: Querying Databases in Natural Language
Abstract: Being able to query relational databases in natural language can be considered one of the holy grails in database research. Especially with the rise of large language models we seem to be even closer to reaching this goal. However, are we there yet? When we look at the performance of the best systems using academic benchmarks, we might believe that we have arrived. However, when we evaluate how these systems perform in real-world applications, we realize that we still have a long way to the summit. In this talk we provide insights into the fascinating world of applying natural language processing and machine learning techniques to tackle this fundamental database problem. First, we explain how pretrained language models can be used to translate natural language questions into SQL. Afterward, we show how we can automatically generate training datasets when working with new databases where little to no training data is available. Finally, we address the limits of current systems when dealing with real-world applications and sketch research directions of how to tackle these challenges.
Bio: Kurt Stockinger is currently a Visiting Scholar at University of Washington. In his parallel life across the Atlantic, he is a Professor of Computer Science, Director of Studies in Data Science and Head of the Intelligent Information Systems Group. He is also an external lecturer at University of Zurich. Kurt Stockinger's research focuses on Data Science with emphasis on Big Data, Natural Language Query Processing, Query Optimization and Quantum Computing. Essentially, his research interests are at the intersection of databases, natural language processing and machine learning. Previously Kurt Stockinger worked at Credit Suisse in Zurich, Switzerland, at Lawrence Berkeley National Laboratory in Berkeley, California, at California Institute of Technology, California as well as at CERN in Geneva, Switzerland. He holds a Ph.D. in computer science from CERN / University of Vienna.
Speaker: Faisal Nawab
Where: University of Washington, Seattle.
Allen School of Computer Science and Engineering.
Paul G. Allen Center, CSE 291
When: Friday, February 2nd, 2024, 2:30pm-3:30pm
Title: Enabling Emerging Edge and IoT Applications with Edge-Cloud Data Management
Abstract: The potential of Edge and IoT applications encompasses realms like smart cities, mobility solutions, and immersive technologies. Yet, the actualization of these promising applications stumbles upon a fundamental impediment: the prevailing cloud data management technologies are often hosted on remote data centers. This architectural choice introduces daunting challenges, including substantial wide-area latency, burdensome connectivity and communication bandwidth demands, and regulatory constraints related to personal and sensitive data. This talk presents our research in introducing edge-cloud data management that provides a framework for managing data across edge nodes to overcome the limits of cloud-only data management. We encounter various challenges to achieving this vision such as managing the sheer amount of edge nodes, their sporadic availability, and device constraints in terms of compute, storage, and trust. To navigate these multifaceted challenges, our work redesigns distributed data management technologies to adapt to the edge environment. This includes introducing design concepts in the domains of hierarchical and asymmetric edge-cloud data management, decentralized edge coordination techniques, and edge-friendly mechanisms to maintain security and trust. The talk includes a demonstration of 'AnyLog'–an edge-cloud data management solution that integrates our research findings.
Bio: Faisal Nawab is an assistant professor in the computer science department at the University of California, Irvine. He is the director of EdgeLab, which is dedicated to building edge-cloud data management solutions for emerging edge and IoT applications. Faisal's research is influenced by practical industry problems through his involvement with the startup 'AnyLog' where he acts as the lead architect of designing an edge-cloud database. Faisal has received recognition for his work, winning the "Next-Generation Data Infrastructure" award from Facebook, being named the runner-up for the IEEE TEMS Blockchain Early-Career Award, and being awarded several NSF grants, and industry funding from Meta and Roblox.
Speaker: Pat Helland
Where: University of Washington, Seattle.
Allen School of Computer Science and Engineering.
Paul G. Allen Center, CSE 291
When: Friday, January 26th, 2024, 2:30pm-3:30pm
Title: I'm SO Glad I'm Uncoordinated!
Abstract: In my 44 years building software, technology trends have dramatically changed what's difficult and what's hard. In 1978, CPU, storage, and memory were precious and expensive but coordinating across work was effectively free. Running on a single server, networking was infinitely expensive as we had none. Now, there's an abundance of computation, memory, storage, and network with even more on the way! The only challenge is coordination. Year after year, the cost of coordinating gets larger in terms of instruction opportunities lost while waiting. The first half of the talk explains these changes and their impact on our systems. In response, there are many approaches to avoiding or minimizing the pain of coordination. We taxonomize these solutions and discuss how our systems are evolving and likely to evolve as the world changes around us. I am, indeed, a person who's uncoordinated and very likely to drop and/or break stuff. I've adapted to that in my personal life and spend a great deal of my professional life looking for ways our systems can avoid the need to coordinate.
Bio: Pat Helland has been building distributed systems, database systems, high-performance messaging systems, and multiprocessors since 1978, shortly after dropping out of UC Irvine without a bachelor's degree. That hasn't stopped him from having a passion for academics and publication. From 1982 to 1990, Pat was the chief architect for TMF (Transaction Monitoring Facility), the transaction logging and recovery systems for NonStop SQL, a message-based fault-tolerant system providing high-availability solutions for business critical solutions. In 1991, he moved to HaL Computers where he was chief architect for the Mercury Interconnect Architecture, a cache-coherent non-uniform memory architecture multiprocessor. In 1994, Pat moved to Microsoft to help the company develop a business providing enterprise software solutions. He was chief architect for MTS (Microsoft Transaction Server) and DTC (Distributed Transaction Coordinator). Starting in 2000, Pat began the SQL Service Broker project, a high-performance transactional exactly-once in-order message processing and app execution engine built deeply into Microsoft SQL Server 2005. From 2005-2007, he worked at Amazon on scalable enterprise solutions, scale-out user facing services, integrating product catalog feeds from millions of sellers, and highly-available eventually consistent storage. From 2007 to 2011, Pat was back at Microsoft working on a number of projects including Structured Streams in Cosmos. Structured streams kept metadata within the "big data" streams that were typically 10s of terabytes in size. This metadata allowed affinitized placement within the cluster as well as efficient joins across multiple streams. On launch, this doubled the work performed within the 250PB store. 
Pat also did the initial design for Baja, the distributed transaction support for a distributed event-processing engine implemented as an LSM atop structured streams providing transactional updates targeting the ingestion of "the entire web in one table" with changes visible in seconds. Starting in 2012, Pat has worked at Salesforce on database technology running within cloud environments. His current interests include latency bounding of online enterprise-grade transaction systems in the face of jitter, the management of metastability in complex environments, and zero-downtime upgrades to databases and stateful applications. In his spare time, Pat regularly writes for ACM Queue, Communications of the ACM, and various conferences. He has been deeply involved in the organization of the HPTS (High Performance Transactions Systems - www.hpts.ws) workshop since 1985. His blog is at pathelland.substack.com and he parsimoniously tweets with the handle @pathelland.
Speaker: Jin Wang
Where: University of Washington, Seattle.
Allen School of Computer Science and Engineering.
Paul G. Allen Center, CSE 291
When: Friday, January 19th, 2024, 2:30pm-3:30pm
Title: Towards End-to-end Data Pipeline for Effective Data Science
Abstract: Nowadays data-driven approaches have become a mainstream research methodology in multiple communities. To support effective and scalable data science applications on the ever growing datasets, researchers from both academic and industrial fields have made great efforts in building end-to-end data pipelines. In this talk, I will present my efforts in improving two essential components of an end-to-end data pipeline: data preparation and data processing. First, I will present a unified self-supervised learning paradigm that can improve the performance of a variety of data preparation tasks, such as dataset discovery, table annotation and entity matching. Next, I will introduce my work in optimizing parallel recursive queries to support analytical workloads in data processing. Finally, I will conclude with the vision for future work of data pipelines.
Bio: Jin Wang is a research scientist and research lead at Megagon Labs. Before that he obtained his PhD degree in Computer Science from the University of California, Los Angeles in July 2020. His research interests lie in the broad area of data management and data science. In particular, his research focuses on Database systems, Datalog, Data Integration and Table Representation Learning. His work appears in leading conferences and journals of data management such as SIGMOD, VLDB, ICDE and VLDB Journal.
Speaker: Juliana Freire
Where: University of Washington, Seattle.
Allen School of Computer Science and Engineering.
Paul G. Allen Center, CSE 291
When: Monday, March 6th, 2023, 1:30pm-2:30pm
Title: Dataset Search for Data Discovery, Augmentation, and Explanation
Abstract: Recent years have seen an explosion in our ability to collect and catalog immense amounts of data about our environment, society, and populace. Moreover, with the push towards transparency and open data, scientists, governments, and organizations are increasingly making structured data available on the Web and in various repositories and data lakes. Combined with advances in analytics and machine learning, the availability of such data should in theory allow us to make progress on many of our most important scientific and societal questions. However, this opportunity is often missed due to a central technical barrier: it is currently nearly impossible for domain experts to weed through the vast amount of available information to discover datasets that are needed for their specific application. While search engines have addressed the discovery problem for Web documents, there are many new challenges involved in supporting the discovery of structured data---from crawling the Web in search of datasets, to the need for dataset-oriented queries and new strategies to rank and display results. I will discuss these challenges and present our recent work in this area. In particular, I will introduce a new class of data-relationship queries that, given a dataset, identifies related datasets; I will describe a collection of methods that efficiently support different kinds of relationships that can be used for data explanation and augmentation; and I will demonstrate Auctus, an open-source dataset search engine that we have developed at the NYU VIDA Center.
Bio: Juliana Freire is a Professor of Computer Science and Data Science at New York University. She was the elected chair of the ACM Special Interest Group on Management of Data (SIGMOD), served as a council member of the Computing Research Association’s Computing Community Consortium (CCC), and was the NYU lead investigator for the Moore-Sloan Data Science Environment. She develops methods and systems that enable a wide range of users to obtain trustworthy insights from data. This spans topics in large-scale data analysis and integration, visualization, machine learning, provenance management, and web information discovery, and different application areas, including urban analytics, predictive modeling, and computational reproducibility. Freire has co-authored over 200 technical papers (including 11 award-winning publications), several open-source systems, and is an inventor of 12 U.S. patents. She is an ACM Fellow, a AAAS Fellow, and a recipient of an NSF CAREER, two IBM Faculty awards, and a Google Faculty Research award. She received the ACM SIGMOD Contributions Award in 2020. Her research has been funded by the National Science Foundation, DARPA, Department of Energy, National Institutes of Health, Sloan Foundation, Gordon and Betty Moore Foundation, W. M. Keck Foundation, Google, Amazon, AT&T Research, Microsoft Research, Yahoo! and IBM. She received a B.S. degree in computer science from the Federal University of Ceara (Brazil), and M.Sc. and Ph.D. degrees in computer science from the State University of New York at Stony Brook.
Speaker: Sean J. Taylor
Where: University of Washington, Seattle.
Allen School of Computer Science and Engineering.
Paul G. Allen Center, CSE 291
When: Monday, February 27th, 2023, 1:30pm-2:30pm
Title: When Do We Need Causal Inference in Data Science?
Abstract: The most common applications of causal inference to business decision-making are in two main areas: product experiments which inform launch decisions and algorithmic policies based on machine learning models. These applications focus on the special case where interventions are relatively cheap. However, practical analytics tasks encountered by many data scientists and analysts (where interventions are usually not possible) are currently underserved by causal inference. I review the tasks we tend to encounter in practice, discuss how the causal inference lens can change the results, and speculate about the barriers to adoption of these ideas in organizations.
Speaker: Alvitta Ottley
Where: University of Washington, Seattle.
Allen School of Computer Science and Engineering.
Paul G. Allen Center, CSE 291
When: Monday, February 13th, 2023, 1:30pm-2:30pm
Title: Improving Human-Machine Partnership Through Observational Learning
Abstract: There is a fast-growing interest in analyzing user interaction to create adaptive systems that can assist or collaborate on data analysis. However, the first step for an intelligent visualization response is understanding the user. Dr. Ottley’s work uses an observational learning framework, akin to humans learning concepts like language and behavior naturally through observations, often with no explicit feedback. The goal is to enable computers to infer user attributes and strategies by observing their interactions with a system. In this talk, Dr. Ottley summarizes her lab's work on user modeling for data visualization and gives a snapshot of the current research achievements and what is possible in the near and distant future. Then, she presents techniques for capturing and predicting user behavior, focusing on inferring attention, personality, biases, and knowledge by analyzing log data. Finally, Dr. Ottley highlights the significant roadblocks and future directions for visualization research.
Bio: Dr. Alvitta Ottley is an Assistant Professor in the Computer Science & Engineering Department at Washington University in St. Louis, Missouri, USA. She also holds a courtesy appointment in the Psychological and Brain Sciences Department. Her research uses interdisciplinary approaches to solve problems such as how best to display information for effective decision-making and how to design human-in-the-loop visual analytics interfaces that are more attuned to the way people think. Dr. Ottley received an NSF CRII Award in 2018 for using visualization to support medical decision-making, the NSF CAREER Award for creating context-aware visual analytics systems, and the 2022 EuroVis Early Career Award. In addition, her work has appeared in leading conferences and journals such as CHI, VIS, and TVCG, achieving best paper and honorable mention awards.
Speaker: Emre Kiciman
Where: University of Washington, Seattle.
Allen School of Computer Science and Engineering.
Paul G. Allen Center, CSE 291
When: Monday, February 6th, 2023, 1:30pm-2:30pm
Title: Modeling the Data-Generating Process is Necessary for Out-of-Distribution Generalization
Abstract: At Microsoft Research, we are working to broaden the usage of causal AI, especially for decision-making applications, through both fundamental research and practical tooling. In this talk, I'll briefly introduce the PyWhy open-source tools and ecosystem and the fundamental research challenges we are prioritizing based on our experiences with causal AI: better elicitation of the domain knowledge and causal assumptions necessary for a valid causal analysis; the need for better validation and trustworthiness of causal analyses; and the extension of causal analysis methods to support analysis over high-dimensional, unstructured data, such as images and text. I'll spend the bulk of the talk deep-diving into our recent research towards the latter topic, connecting causal graphs and the statistical independences they encode with the loss functions and constraints imposed by invariant representation learning approaches for domain generalization. Based on the causal relationships between spurious attributes and the classification label, we obtain realizations of the canonical causal graph that characterize common distribution shifts and show that each shift entails different independence constraints over observed variables. This work explains why no single current method performs consistently across all kinds of distribution shifts, and leads to a new algorithm, Causally Adaptive Constraint Minimization (CACM), that adaptively identifies and applies the correct independence constraint for regularization. Extensive experiments show that adaptive dataset-dependent constraints lead to the highest accuracy on unseen domains, demonstrating the criticality of modeling the causal relationships inherent in the data-generating process.
Bio: See here.
Speaker: Sudeepa Roy
Where: University of Washington, Seattle.
Allen School of Computer Science and Engineering.
Paul G. Allen Center
When: Monday, January 30th, 2023, 1:30pm-2:30pm
Title: Toward Interpretable and Actionable Data Analysis with Query Debugging and Causal Inference
Abstract: In today’s data-driven world, users in different fields routinely collect, study, and make decisions supported by data. This motivates development of new techniques to help users from various backgrounds and levels of expertise process data, extract useful information and insights from data, and subsequently make sound decisions. In this talk, I will describe some of our work toward interpretable and actionable data analysis focusing on two steps of the data analysis pipeline. First, I will discuss generating explanations to help new programmers and students debug wrong queries and write correct relational queries. Then, I will talk about our research on connecting data management research with causal inference research to enable causal analysis and hypothetical reasoning for large complex data, and conclude with future research directions.
Bio: Sudeepa Roy is an Associate Professor in Computer Science at Duke University. She works broadly in data management, with a focus on the foundational aspects of big data analysis, which includes causality and explanations for big data, data repair, query optimization, probabilistic databases, and database theory. Before joining Duke in 2015, she did a postdoc at the University of Washington, and obtained her Ph.D. from the University of Pennsylvania. She is a recipient of the VLDB Early Career Research Contributions Award, an NSF CAREER Award, and a Google Ph.D. fellowship in structured data. She is a co-director of the Almost Matching Exactly (AME) lab for interpretable causal inference at Duke.
Speaker: Anna Fariha
Where: University of Washington, Seattle.
Allen School of Computer Science and Engineering.
Paul G. Allen Center
When: Monday, November 22nd, 2021, 2:30pm-3:30pm
Title: Blame the data, not the system: how data constraints can help in trustworthy machine learning and explain causes of data-system malfunction.
Abstract: The core of modern data-driven systems comprises models learned from large datasets, and they are usually optimized to target particular data and workloads. While these data-driven systems have seen wide adoption and success, their reliability and proper function hinge on the data's continued conformance to the system's initial settings and assumptions. My research focuses on designing mechanisms to assess the trustworthiness of a system's inferences and explain causes of system malfunction due to data nonconformance. The key idea here is that since data is central to data-driven systems, it can guide us to determine whether predictions made by an ML model can be trusted, and to expose the cause of a system's unexpected behavior. In this talk, I will talk about mechanisms and explanation frameworks to facilitate trusting and understanding outcomes involving data and data systems.
Bio: I am a Researcher at Microsoft. I obtained my Ph.D. from the University of Massachusetts, Amherst under the supervision of Alexandra Meliou. My primary area of research revolves around data management; but, the application areas of my research have been interdisciplinary, spanning from program synthesis and software engineering to machine learning, natural language processing, and human-computer interaction. I am interested in designing mechanisms for enhancing system usability, by developing intelligent tools towards boosting end-user productivity, and developing mechanisms for explaining system behavior ranging from traditional systems to opaque, data-driven systems.
Speaker: Tim Kraska
Where: University of Washington, Seattle.
Allen School of Computer Science and Engineering.
Paul G. Allen Center
When: Monday, May 24th, 2021, 9am - 10am
Title: Towards Instance-Optimized Data Systems
Abstract: Recently, there has been a lot of excitement around ML-enhanced (or learned) algorithms and data structures. For example, there has been work on applying machine learning to improve query optimization, indexing, storage layouts, scheduling, log-structured merge trees, sorting, compression, sketches, among many other things. Arguably, the motivation behind these techniques is similar: machine learning is used to model the data and/or workload in order to derive a more efficient algorithm or data structure. Ultimately, what these techniques will allow us to build are “instance-optimized” systems; systems that self-adjust to a given workload and data distribution to provide unprecedented performance and avoid the need for tuning by an administrator. In this talk, I will provide an overview of the opportunities and limitations of learned index structures, storage layouts, and query optimization techniques we have been developing in my group, and how we are integrating these techniques to build a first instance-optimized database system.
Bio: Tim Kraska is an Associate Professor of Electrical Engineering and Computer Science in MIT's Computer Science and Artificial Intelligence Laboratory, co-director of the Data System and AI Lab at MIT (DSAIL@CSAIL), and co-founder of Einblick Analytics. Currently, his research focuses on building systems for machine learning, and using machine learning for systems. Before joining MIT, Tim was an Assistant Professor at Brown, spent time at Google Brain, and was a PostDoc in the AMPLab at UC Berkeley after he got his PhD from ETH Zurich. Tim is a 2017 Alfred P. Sloan Research Fellow in computer science and received several awards including the VLDB Early Career Research Contribution Award, the VMware Systems Research Award, the university-wide Early Career Research Achievement Award at Brown University, an NSF CAREER Award, as well as several best paper and demo awards at VLDB and ICDE.
Speaker: Aaron Elmore
Where: University of Washington, Seattle.
Allen School of Computer Science and Engineering.
Paul G. Allen Center
When: Monday, April 12th, 2021, 11am-12:15pm
Title: CrocodileDB: Resource Efficient Database Execution
Abstract: The coming end of Moore’s law requires that data systems be more judicious with computation and resources as the growth in data outpaces the availability of computational resources. Current database systems are eager and aggressively consume resources to immediately and quickly complete the task at hand. Intelligently deferring a task to a later point in time can increase result reuse, reduce work that might later be invalidated, or avoid unnecessary work altogether. In this talk I will introduce CrocodileDB, a resource-efficient database system that automatically optimizes deferment based on user-specification and workload prediction. CrocodileDB integrates new ways of specifying timing information, new query execution policies, new task schedulers, and new data loading schemes. In particular, this talk will highlight two new query execution paradigms, Intermittent Query Processing and Incremental-Aware Query Execution.
Bio: Aaron J. Elmore is an Assistant Professor in the Department of Computer Science, and the College of the University of Chicago. Aaron was previously a Postdoctoral Associate at MIT. Aaron's thesis on Elasticity Primitives for Database-as-a-Service was completed at the University of California, Santa Barbara. His recent research interests focus on building data systems that address the growing data deluge. He is currently an associate editor for SIGMOD Record, and has served as co-chair for the SIGMOD demonstration track, the inaugural SIGMOD student research competition, and as VLDB proceedings editor.
Listed in reverse chronological order. Click here for abstracts.
Please sign up for the NWDS mailing list here. We use this list primarily to send announcements for upcoming events. After you register, you can send mail to that list at nwds at cs.washington.edu.
To become a member, please contact Magda or Dan.
The North-West Database Society was founded on January 1st, 2006 by Dan Suciu and Magdalena Balazinska. It is inspired by the New-England Database Society.