Northwest Database Society (NWDS)

Mission Statement

The goal of NWDS is to bring together researchers and practitioners in the field of databases and data management systems working in the Pacific Northwest.

One of our main activities is an annual meeting. The list of past meetings is available here.

The other main activity is a talk series with a variety of distinguished speakers from academia and industry. The details are below. Our past talks can be found on the NWDS YouTube channel. Please note that not all talks are recorded.

Instructions to sign up for the NWDS mailing list are at the bottom of this page.

Spring 2025

Speaker: Richard Wesley

Where: University of Washington, Seattle.
Allen School of Computer Science and Engineering.
Paul G. Allen Center, CSE 291

When: Friday, May 2nd, 2025, 2:30pm-3:20pm

Title: Working at DuckDB

Abstract: DuckDB is a relatively new embedded analytic database. Although it was developed at a research institution (CWI), it is now out in the wild as an open source project with commercial support through DuckDB Labs BV. In this talk, one of the core team will explain what it is, why it is, and how interactions between the user base and the research community have influenced his work on the ordered data subsystems.

Bio: Richard is an IBM brat who started programming in the mid-1970s. After a brief flirtation with pure math, he moved to Seattle in 1989, where he worked in the software industry, specializing in digital signal processing. In 2004, he joined visualization pioneer Tableau, where he worked on connectivity, query compilation, and performance, including being the lead developer for the Tableau Data Engine. He tried to retire in 2019 but spent so much time hacking on DuckDB that Hannes tracked him down and invited him to join the DuckDB Labs team, where he focuses on temporal data processing.

Recording

Speaker: Jianguo Wang

Where: University of Washington, Seattle.
Allen School of Computer Science and Engineering.
Paul G. Allen Center, CSE 291

When: Friday, May 2nd, 2025, 2:30pm-3:20pm

Title: Database Systems for LLMs: Vector Databases and Beyond

Abstract: Vector databases have recently emerged as a hot topic due to the widespread interest in LLMs, where vector databases provide the relevant context that enables LLMs to generate more accurate responses. Current vector databases can be broadly categorized into two types: specialized and integrated. Specialized vector databases are explicitly designed for managing vector data, while integrated vector databases support vector search within an existing database system. While specialized vector databases are interesting, there is a significant customer base interested in integrated vector databases for various reasons, such as reluctance to move data out, the desire to link vector embeddings with their source data, and the need for advanced vector search capabilities. However, integrated vector databases face challenges in performance and interoperability. In this talk, I will share our recent experience in building integrated vector databases within two important classes of databases: Relational Databases and Graph Databases. I will show how we address the performance and interoperability challenges, resulting in much more powerful database systems that support advanced RAGs. Next, I will present other challenges in vector databases along with our ongoing work. Finally, I will discuss the broader role of database systems in the era of LLMs and explore how to build future databases that extend beyond vector databases to better support LLMs.

Bio: Jianguo Wang is an Assistant Professor of Computer Science at Purdue University. He obtained his Ph.D. from the University of California, San Diego. He has worked or interned at Zilliz, Amazon AWS, Microsoft Research, Oracle, and Samsung on various database systems. His current research interests include database systems for the cloud and LLMs, especially Disaggregated Databases and Vector Databases. He regularly publishes and serves as a program committee member at premier database conferences such as SIGMOD, VLDB, and ICDE. He also served as a panel moderator for the VLDB'24 panel on vector databases. His research has won multiple awards, including the ACM SIGMOD Research Highlight Award, the NSF CAREER Award, and the IEEE TCDE Rising Star Award.

Recording

Speaker: Bailu Ding

Where: University of Washington, Seattle.
Allen School of Computer Science and Engineering.
Paul G. Allen Center, CSE 291

When: Friday, April 11th, 2025, 2:30pm-3:20pm

Title: Overview of Volcano and Cascades Query Optimizer

Abstract: The performance of a query crucially depends on the ability of the query optimizer to choose a good execution plan from a large space of alternatives. With the discovery of algebraic transformation rules, extensibility is a key requirement for query optimizers. This talk will give an overview of extensible query optimizers, focusing on the Volcano/Cascades frameworks that are used by several relational database systems in the industry.

Bio: Bailu Ding is a Principal Researcher in the Data Systems group at Microsoft Research. She has been working on query processing and query optimization. Her recent work includes leveraging machine learning for database systems and increasing the efficiency of AI applications with vector search. She co-authored the book Extensible Query Optimizers in Practice, published by Foundations and Trends® in Databases (https://www.nowpublishers.com/article/Details/DBS-077).

Recording

Winter 2025

Speaker: Dan Olteanu

Where: University of Washington, Seattle.
Allen School of Computer Science and Engineering.
Paul G. Allen Center, CSE 291

When: Friday, January 10th, 2025, 2:30pm-3:20pm

Title: Factorized Databases

Abstract: In this talk I will explain the foundations of factorized databases and overview some of their applications. Factorized databases are compressed yet lossless representations of relational data that allow for efficient processing in the compressed domain. They are relational algebra expressions built using the union operator, the Cartesian product operator and data values. By exploiting the distributivity of product over union, they avoid the redundancy in the tabular representation of relational data. Factorized representations of query results can be computed directly from the input database and in time proportional to their sizes and the input database size. Since their introduction about a decade ago, there has been great progress on the theory, systems and applications of factorized databases to: relational query processing, provenance management, probabilistic databases, incremental view maintenance, graph databases, and in-database machine learning.
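As a small illustration of the distributivity idea described in the abstract, here is a minimal Python sketch. It is not code from the talk: the tuple-based encoding of expressions, the helper names `enumerate_tuples`, `size`, and `union_of`, and the sample values are all assumptions made for this example. It builds a factorized expression from unions, products, and data values, expands it into the flat relation it represents, and compares the number of stored values in each form.

```python
from itertools import product

# Assumed encoding for this sketch: an expression is a (kind, argument) pair.
#   ("val", x)             a single data value
#   ("union", [e1, ...])   union of sub-expressions over the same attributes
#   ("prod", [e1, ...])    Cartesian product of sub-expressions

def enumerate_tuples(expr):
    """Expand a factorized expression into the flat list of tuples it represents."""
    kind, arg = expr
    if kind == "val":
        return [(arg,)]
    if kind == "union":
        return [t for child in arg for t in enumerate_tuples(child)]
    if kind == "prod":
        parts = [enumerate_tuples(child) for child in arg]
        # Concatenate one tuple from each factor, for every combination.
        return [sum(combo, ()) for combo in product(*parts)]
    raise ValueError(f"unknown node kind: {kind}")

def size(expr):
    """Count the data values stored in the factorized expression."""
    kind, arg = expr
    if kind == "val":
        return 1
    return sum(size(child) for child in arg)

def union_of(values):
    """Build a union of single-value nodes."""
    return ("union", [("val", v) for v in values])

# (pizza U salad) x (mon U tue U wed): distributivity of product over union
# lets 5 stored values stand for 6 tuples holding 12 values in total.
factorized = ("prod", [union_of(["pizza", "salad"]),
                       union_of(["mon", "tue", "wed"])])
flat = enumerate_tuples(factorized)
flat_values = sum(len(t) for t in flat)

print(len(flat), flat_values, size(factorized))
```

The gap between the flat and factorized sizes grows with the number of values in each union, which is the redundancy the abstract refers to.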

Bio: Dan Olteanu is a professor at the University of Zurich, where he leads the Data Systems and Theory group (https://www.ifi.uzh.ch/en/dast.html), and a computer scientist at RelationalAI (https://relational.ai). He currently works on incremental view maintenance, cardinality estimation, in-database machine learning and linear algebra, adaptive query processing, and fact attribution in query answering.

Recording

Fall 2024

Speaker: Anastasia Ailamaki

Where: University of Washington, Seattle.
Allen School of Computer Science and Engineering.
Paul G. Allen Center, CSE 291

When: Thursday, November 21st, 2024, 11am-12pm

Title: The New Memory Wall and how it changes database system design

Abstract: To bridge the ever-growing memory-processor speed gap (aka the "memory wall"), computer architects have introduced new levels of caching that trade capacity for speed, and database designers have developed cache-aware query processing algorithms since the late 1990s. Nowadays distributed query processing on the cloud is the norm; memory resources grow increasingly heterogeneous and disaggregated, diminishing the benefit of cache-aware query processing techniques. Recently, in contrast with traditional CPU-centric architectures, "memory-centric" systems that use memory pooling attract interest but also raise significant challenges as data move in unpredictable ways along a multi-dimensional memory hierarchy. Therefore, data movement emerges as a key performance bottleneck as it incurs a major cost in distributed query processing. In this talk, I will discuss the new memory wall and the challenges and opportunities it brings to database system design.

Bio: Anastasia Ailamaki is a Professor of Computer and Communication Sciences at the École Polytechnique Fédérale de Lausanne (EPFL), a visiting researcher at Google, and the co-founder and Chair of the Board of Directors of RAW Labs SA, a Swiss company developing systems to analyze heterogeneous big data from multiple sources efficiently. She earned a Ph.D. in Computer Science from the University of Wisconsin-Madison in 2000. She has received the 2019 ACM SIGMOD Edgar F. Codd Innovations Award and the 2020 VLDB Women in Database Research Award. She is also the recipient of an ERC Consolidator Award (2013), the Finmeccanica endowed chair from the Computer Science Department at Carnegie Mellon (2007), a European Young Investigator Award from the European Science Foundation (2007), an Alfred P. Sloan Research Fellowship (2005), an NSF CAREER award (2002), twelve best-paper awards and three Test-of-Time prizes at international scientific conferences. She has received the 2018 Nemitsas Prize in Computer Science by the President of Cyprus and the 2021 ARGO Innovation Award by the President of the Hellenic Republic. She is an ACM fellow, an IEEE fellow, a member of the Academia Europaea, and an elected member of the Swiss, the Belgian, the Greek, and the Cypriot National Research Councils.

Spring 2024

Speaker: Laurel Orr

Where: University of Washington, Seattle.
Allen School of Computer Science and Engineering.
Paul G. Allen Center, CSE 291

When: Friday, April 12th, 2024, 2:30pm-3:30pm

Title: From Text2SQL to Automating BI: The Coming Wave of LLM Analytic Agents

Abstract: Large Language Models (LLMs) have seemingly promised to democratize and automate data analytics since the early days of GPT-3. While we’ve seen numerous works apply LLMs to solve components of the data analytics pipelines — e.g. data cleaning, schema matching, and text-to-SQL — we have yet to really see LLMs take over and automate enterprise analytics. What are we missing? We argue that to automate analytics, you need to automate the full end-to-end workflow including steps after the user finishes their analysis. In this talk, we’ll introduce a new data stack for enterprise analytics built on LLM agents. To make this new agent data stack work, we’ll discuss how you first need to focus on building single agents that can solve isolated analytics tasks for enterprise data. We’ll then investigate how you can coordinate and plan across individual agents to automate end-to-end workflows. While LLM agents are still in their infancy, we believe the coming wave of agents holds promise to both automate classically hard data management problems and leverage data management solutions.

Bio: Laurel Orr is a researcher at Numbers Station working on applying generative AI to data tasks. She graduated with a PhD in Databases and Data Management from the Paul G. Allen School of Computer Science and Engineering at the University of Washington and then was a PostDoc at Stanford working for Chris Ré in the Hazy Research lab. Her research interests are broadly at the intersection of artificial intelligence, foundation models, and data management. She focuses on how to train, customize, and deploy foundation models for data tasks. This includes problems around data curation and management for RAG systems, efficient model training and inference for batch workloads, and prompting paradigms for high-performing, customized models.

Recording

Speaker: Jun Yang

Where: University of Washington, Seattle.
Allen School of Computer Science and Engineering.
Paul G. Allen Center, CSE 291

When: Thursday, April 11th, 2024, 3:30pm-4:30pm

Title: What Teaching Databases Taught me about Researching Databases

Abstract: Declarative querying and automatic optimization are the cornerstones of the success and longevity of database systems, yet these concepts often pose challenges for novice learners accustomed to different coding paradigms. The transition is further hampered by the lack of query debugging tools (be it for correctness or performance) compared to the plethora available for programming languages. The talk samples several systems that we have built at Duke University to help students learn and debug database queries. These systems have not only helped scale up teaching and improve learning, but also inspired research on interesting and fundamental questions concerning databases. Furthermore, with the rise of generative AI, we argue that there is a heightened need for skills in scrutinizing and debugging AI-generated queries, and we outline several ongoing and future work directions aimed at addressing this emerging challenge.

Bio: Jun Yang is currently the Bishop-MacDermott Family Professor of Computer Science at Duke University. He joined Duke after receiving his Ph.D. from Stanford in 2001 and chaired the Department of Computer Science at Duke during 2020-2023. He has broad research interests in databases and data-intensive systems. He is a Trustee of the VLDB Endowment and served as the general co-chair of SIGMOD 2017 and the co-Editor-in-Chief of PVLDB during 2022-2023. He is a recipient of the CAREER Award, IBM Faculty Award, HP Labs Innovation Research Award, and Google Faculty Research Award. He has striven to connect research to his other passions, such as journalism, where he has worked on computational fact-checking since its nascent days, and education, where he has built a number of software tools for learning databases. He received the David and Janet Vaughan Brooks Teaching Award at Duke.

Recording

Speaker: Luca Scheerer

Where: University of Washington, Seattle.
Allen School of Computer Science and Engineering.
Paul G. Allen Center, CSE 291

When: Friday, March 29th, 2024, 2:30pm-3:30pm

Title: QirK: Question Answering via Intermediate Representation on Knowledge Graphs

Abstract: QirK seeks to bridge the gap between the capabilities of Large Language Models (LLMs) and the structured interpretability of database systems, addressing the complexity of querying Knowledge Graphs. QirK enables users to interact with Knowledge Graphs by posing questions in natural language. This is achieved by mapping the input query to an intermediate representation (IR) via LLMs, then repairing it into a valid relational database query through semantic search on vector embeddings. By leveraging this IR, QirK can answer structurally complex questions that are still beyond the reach of emerging LLMs while ensuring complete and accurate results. This is joint work by Luca Scheerer (ETH Zurich), Anton Lykov (UW), Moe Kayali (UW), Ilias Fountalis (RelationalAI), Nikolaos Vasiloglou (RelationalAI), Dan Olteanu (UZH), and Dan Suciu (UW).

Bio: Luca Scheerer is a second-year Computer Science MSc student at ETH Zurich specializing in data systems and the intersection of systems and machine learning. He has interned at Google, working on their internal knowledge graph, and at the market maker Citadel Securities.

Recording

Winter 2024

Speaker: Kurt Stockinger

Where: University of Washington, Seattle.
Allen School of Computer Science and Engineering.
Paul G. Allen Center, CSE 291

When: Friday, March 1st, 2024, 2:30pm-3:30pm

Title: Querying Databases in Natural Language

Abstract: Being able to query relational databases in natural language can be considered one of the holy grails in database research. Especially with the rise of large language models, we seem to be even closer to reaching this goal. However, are we there yet? When we look at the performance of the best systems using academic benchmarks, we might believe that we have arrived. However, when we evaluate how these systems perform in real-world applications, we realize that we still have a long way to the summit. In this talk we provide insights into the fascinating world of applying natural language processing and machine learning techniques to tackle this fundamental database problem. First, we explain how pretrained language models can be used to translate natural language questions into SQL. Afterward, we show how we can automatically generate training datasets when working with new databases where little to no training data is available. Finally, we address the limits of current systems when dealing with real-world applications and sketch research directions of how to tackle these challenges.

Bio: Kurt Stockinger is currently a Visiting Scholar at the University of Washington. In his parallel life across the Atlantic, he is a Professor of Computer Science, Director of Studies in Data Science and Head of the Intelligent Information Systems Group. He is also an external lecturer at the University of Zurich. Kurt Stockinger's research focuses on Data Science with emphasis on Big Data, Natural Language Query Processing, Query Optimization and Quantum Computing. Essentially, his research interests are at the intersection of databases, natural language processing and machine learning. Previously Kurt Stockinger worked at Credit Suisse in Zurich, Switzerland, at Lawrence Berkeley National Laboratory in Berkeley, California, at the California Institute of Technology, California, as well as at CERN in Geneva, Switzerland. He holds a Ph.D. in computer science from CERN / University of Vienna.

Recording

Speaker: Faisal Nawab

Where: University of Washington, Seattle.
Allen School of Computer Science and Engineering.
Paul G. Allen Center, CSE 291

When: Friday, February 2nd, 2024, 2:30pm-3:30pm

Title: Enabling Emerging Edge and IoT Applications with Edge-Cloud Data Management

Abstract: The potential of Edge and IoT applications encompasses realms like smart cities, mobility solutions, and immersive technologies. Yet, the actualization of these promising applications stumbles upon a fundamental impediment: the prevailing cloud data management technologies are often hosted on remote data centers. This architectural choice introduces daunting challenges, including substantial wide-area latency, burdensome connectivity and communication bandwidth demands, and regulatory constraints related to personal and sensitive data. This talk presents our research in introducing edge-cloud data management that provides a framework for managing data across edge nodes to overcome the limits of cloud-only data management. We encounter various challenges to achieving this vision, such as managing the sheer number of edge nodes, their sporadic availability, and device constraints in terms of compute, storage, and trust. To navigate these multifaceted challenges, our work redesigns distributed data management technologies to adapt to the edge environment. This includes introducing design concepts in the domains of hierarchical and asymmetric edge-cloud data management, decentralized edge coordination techniques, and edge-friendly mechanisms to maintain security and trust. The talk includes a demonstration of 'AnyLog', an edge-cloud data management solution that integrates our research findings.

Bio: Faisal Nawab is an assistant professor in the computer science department at the University of California, Irvine. He is the director of EdgeLab, which is dedicated to building edge-cloud data management solutions for emerging edge and IoT applications. Faisal's research is influenced by practical industry problems through his involvement with the startup 'AnyLog' where he acts as the lead architect of designing an edge-cloud database. Faisal has received recognition for his work, winning the "Next-Generation Data Infrastructure" award from Facebook, being named the runner-up for the IEEE TEMS Blockchain Early-Career Award, and being awarded several NSF grants, and industry funding from Meta and Roblox.

Speaker: Pat Helland

Where: University of Washington, Seattle.
Allen School of Computer Science and Engineering.
Paul G. Allen Center, CSE 291

When: Friday, January 26th, 2024, 2:30pm-3:30pm

Title: I'm SO Glad I'm Uncoordinated!

Abstract: In my 44 years building software, technology trends have dramatically changed what's easy and what's hard. In 1978, CPU, storage, and memory were precious and expensive but coordinating across work was effectively free. Running on a single server, networking was infinitely expensive as we had none. Now, there's an abundance of computation, memory, storage, and network with even more on the way! The only challenge is coordination. Year after year, the cost of coordinating gets larger in terms of instruction opportunities lost while waiting. The first half of the talk explains these changes and their impact on our systems. In response, there are many approaches to avoiding or minimizing the pain of coordination. We taxonomize these solutions and discuss how our systems are evolving and likely to evolve as the world changes around us. I am, indeed, a person who's uncoordinated and very likely to drop and/or break stuff. I've adapted to that in my personal life and spend a great deal of my professional life looking for ways our systems can avoid the need to coordinate.

Bio: Pat Helland has been building distributed systems, database systems, high-performance messaging systems, and multiprocessors since 1978, shortly after dropping out of UC Irvine without a bachelor's degree. That hasn't stopped him from having a passion for academics and publication. From 1982 to 1990, Pat was the chief architect for TMF (Transaction Monitoring Facility), the transaction logging and recovery systems for NonStop SQL, a message-based fault-tolerant system providing high-availability solutions for business-critical solutions. In 1991, he moved to HaL Computers where he was chief architect for the Mercury Interconnect Architecture, a cache-coherent non-uniform memory architecture multiprocessor. In 1994, Pat moved to Microsoft to help the company develop a business providing enterprise software solutions. He was chief architect for MTS (Microsoft Transaction Server) and DTC (Distributed Transaction Coordinator). Starting in 2000, Pat began the SQL Service Broker project, a high-performance transactional exactly-once in-order message processing and app execution engine built deeply into Microsoft SQL Server 2005. From 2005 to 2007, he worked at Amazon on scalable enterprise solutions, scale-out user-facing services, integrating product catalog feeds from millions of sellers, and highly-available eventually consistent storage. From 2007 to 2011, Pat was back at Microsoft working on a number of projects including Structured Streams in Cosmos. Structured streams kept metadata within the "big data" streams that were typically tens of terabytes in size. This metadata allowed affinitized placement within the cluster as well as efficient joins across multiple streams. On launch, this doubled the work performed within the 250PB store.
Pat also did the initial design for Baja, the distributed transaction support for a distributed event-processing engine implemented as an LSM atop structured streams providing transactional updates targeting the ingestion of "the entire web in one table" with changes visible in seconds. Starting in 2012, Pat has worked at Salesforce on database technology running within cloud environments. His current interests include latency bounding of online enterprise-grade transaction systems in the face of jitter, the management of metastability in complex environments, and zero-downtime upgrades to databases and stateful applications. In his spare time, Pat regularly writes for ACM Queue, Communications of the ACM, and various conferences. He has been deeply involved in the organization of the HPTS (High Performance Transactions Systems - www.hpts.ws) workshop since 1985. His blog is at pathelland.substack.com and he parsimoniously tweets with the handle @pathelland.

Recording

Speaker: Jin Wang

Where: University of Washington, Seattle.
Allen School of Computer Science and Engineering.
Paul G. Allen Center, CSE 291

When: Friday, January 19th, 2024, 2:30pm-3:30pm

Title: Towards End-to-end Data Pipeline for Effective Data Science

Abstract: Nowadays, data-driven approaches have become a mainstream research methodology in multiple communities. To support effective and scalable data science applications on ever-growing datasets, researchers from both academia and industry have made great efforts in building end-to-end data pipelines. In this talk, I will present my efforts in improving two essential components of an end-to-end data pipeline: data preparation and data processing. First, I will present a unified self-supervised learning paradigm that can improve the performance of a variety of data preparation tasks, such as dataset discovery, table annotation and entity matching. Next, I will introduce my work in optimizing parallel recursive queries to support analytical workloads in data processing. Finally, I will conclude with the vision for future work on data pipelines.

Bio: Jin Wang is a research scientist and research lead at Megagon Labs. Before that, he obtained his PhD in Computer Science from the University of California, Los Angeles in July 2020. His research interests lie in the broad area of data management and data science. In particular, his research focuses on Database Systems, Datalog, Data Integration and Table Representation Learning. His work appears in leading conferences and journals of data management such as SIGMOD, VLDB, ICDE and VLDB Journal.

Winter 2023

Speaker: Juliana Freire

Where: University of Washington, Seattle.
Allen School of Computer Science and Engineering.
Paul G. Allen Center, CSE 291

When: Monday, March 6th, 2023, 1:30pm-2:30pm

Title: Dataset Search for Data Discovery, Augmentation, and Explanation

Abstract: Recent years have seen an explosion in our ability to collect and catalog immense amounts of data about our environment, society, and populace. Moreover, with the push towards transparency and open data, scientists, governments, and organizations are increasingly making structured data available on the Web and in various repositories and data lakes. Combined with advances in analytics and machine learning, the availability of such data should in theory allow us to make progress on many of our most important scientific and societal questions. However, this opportunity is often missed due to a central technical barrier: it is currently nearly impossible for domain experts to weed through the vast amount of available information to discover datasets that are needed for their specific application. While search engines have addressed the discovery problem for Web documents, there are many new challenges involved in supporting the discovery of structured data---from crawling the Web in search of datasets, to the need for dataset-oriented queries and new strategies to rank and display results. I will discuss these challenges and present our recent work in this area. In particular, I will introduce a new class of data-relationship queries that, given a dataset, identifies related datasets; I will describe a collection of methods that efficiently support different kinds of relationships that can be used for data explanation and augmentation; and I will demonstrate Auctus, an open-source dataset search engine that we have developed at the NYU VIDA Center.

Bio: Juliana Freire is a Professor of Computer Science and Data Science at New York University. She was the elected chair of the ACM Special Interest Group on Management of Data (SIGMOD), served as a council member of the Computing Research Association’s Computing Community Consortium (CCC), and was the NYU lead investigator for the Moore-Sloan Data Science Environment. She develops methods and systems that enable a wide range of users to obtain trustworthy insights from data. This spans topics in large-scale data analysis and integration, visualization, machine learning, provenance management, and web information discovery, and different application areas, including urban analytics, predictive modeling, and computational reproducibility. Freire has co-authored over 200 technical papers (including 11 award-winning publications), several open-source systems, and is an inventor of 12 U.S. patents. She is an ACM Fellow, a AAAS Fellow, and a recipient of an NSF CAREER, two IBM Faculty awards, and a Google Faculty Research award. She received the ACM SIGMOD Contributions Award in 2020. Her research has been funded by the National Science Foundation, DARPA, Department of Energy, National Institutes of Health, Sloan Foundation, Gordon and Betty Moore Foundation, W. M. Keck Foundation, Google, Amazon, AT&T Research, Microsoft Research, Yahoo! and IBM. She received a B.S. degree in computer science from the Federal University of Ceara (Brazil), and M.Sc. and Ph.D. degrees in computer science from the State University of New York at Stony Brook.

Recording

Speaker: Sean J. Taylor

Where: University of Washington, Seattle.
Allen School of Computer Science and Engineering.
Paul G. Allen Center, CSE 291

When: Monday, February 27th, 2023, 1:30pm-2:30pm

Title: When Do We Need Causal Inference in Data Science?

Abstract: The most common applications of causal inference to business decision-making are in two main areas: product experiments which inform launch decisions and algorithmic policies based on machine learning models. These applications focus on the special case where interventions are relatively cheap. However, practical analytics tasks encountered by many data scientists and analysts (where interventions are usually not possible) are currently underserved by causal inference. I review the tasks we tend to encounter in practice, discuss how the causal inference lens can change the results, and speculate about the barriers to adoption of these ideas in organizations.

Recording

Speaker: Alvitta Ottley

Where: University of Washington, Seattle.
Allen School of Computer Science and Engineering.
Paul G. Allen Center, CSE 291

When: Monday, February 13th, 2023, 1:30pm-2:30pm

Title: Improving Human-Machine Partnership Through Observational Learning

Abstract: There is a fast-growing interest in analyzing user interaction to create adaptive systems that can assist or collaborate on data analysis. However, the first step for an intelligent visualization response is understanding the user. Dr. Ottley’s work uses an observational learning framework, akin to humans learning concepts like language and behavior naturally through observations, often with no explicit feedback. The goal is to enable computers to infer user attributes and strategies by observing their interactions with a system. In this talk, Dr. Ottley summarizes her lab's work on user modeling for data visualization and gives a snapshot of the current research achievements and what is possible in the near and distant future. Then, she presents techniques for capturing and predicting user behavior, focusing on inferring attention, personality, biases, and knowledge by analyzing log data. Finally, Dr. Ottley highlights the significant roadblocks and future directions for visualization research.

Bio: Dr. Alvitta Ottley is an Assistant Professor in the Computer Science & Engineering Department at Washington University in St. Louis, Missouri, USA. She also holds a courtesy appointment in the Psychological and Brain Sciences Department. Her research uses interdisciplinary approaches to solve problems such as how best to display information for effective decision-making and how to design human-in-the-loop visual analytics interfaces that are more attuned to the way people think. Dr. Ottley received an NSF CRII Award in 2018 for using visualization to support medical decision-making, the NSF CAREER Award for creating context-aware visual analytics systems, and the 2022 EuroVis Early Career Award. In addition, her work has appeared in leading conferences and journals such as CHI, VIS, and TVCG, earning best paper and honorable mention awards.

Recording

Speaker: Emre Kiciman

Where: University of Washington, Seattle.
Allen School of Computer Science and Engineering.
Paul G. Allen Center, CSE 291

When: Monday, February 6th, 2023, 1:30pm-2:30pm

Title: Modeling the Data-Generating Process is Necessary for Out-of-Distribution Generalization

Abstract: At Microsoft Research, we are working to broaden the usage of causal AI, especially for decision-making applications, through both fundamental research and practical tooling. In this talk, I'll briefly introduce the PyWhy open-source tools and ecosystem and the fundamental research challenges we are prioritizing based on our experiences with causal AI: better elicitation of the domain knowledge and causal assumptions necessary for a valid causal analysis; the need for better validation and trustworthiness of causal analyses; and the extension of causal analysis methods to support analysis over high-dimensional, unstructured data, such as images and text. I'll spend the bulk of the talk deep-diving into our recent research towards the latter topic, connecting causal graphs and the statistical independences they encode with the loss functions and constraints imposed by invariant representation learning approaches for domain generalization. Based on the causal relationships between spurious attributes and the classification label, we obtain realizations of the canonical causal graph that characterize common distribution shifts and show that each shift entails different independence constraints over observed variables. This work explains why no single current method performs consistently across all kinds of distribution shifts, and leads to a new algorithm, Causally Adaptive Constraint Minimization (CACM), that adaptively identifies and applies the correct independence constraint for regularization. Extensive experiments show that adaptive dataset-dependent constraints lead to the highest accuracy on unseen domains, demonstrating the criticality of modeling the causal relationships inherent in the data-generating process.

Bio: See here.

Recording

Speaker: Sudeepa Roy

Where: University of Washington, Seattle.
Allen School of Computer Science and Engineering.
Paul G. Allen Center

When: Monday, January 30th, 2023, 1:30pm-2:30pm

Title: Toward Interpretable and Actionable Data Analysis with Query Debugging and Causal Inference

Abstract: In today’s data-driven world, users in different fields routinely collect, study, and make decisions supported by data. This motivates development of new techniques to help users from various backgrounds and levels of expertise process data, extract useful information and insights from data, and subsequently make sound decisions. In this talk, I will describe some of our work toward interpretable and actionable data analysis focusing on two steps of the data analysis pipeline. First, I will discuss generating explanations to help new programmers and students debug wrong queries and write correct relational queries. Then, I will talk about our research on connecting data management research with causal inference research to enable causal analysis and hypothetical reasoning for large complex data, and conclude with future research directions.

Bio: Sudeepa Roy is an Associate Professor in Computer Science at Duke University. She works broadly in data management, with a focus on the foundational aspects of big data analysis, which includes causality and explanations for big data, data repair, query optimization, probabilistic databases, and database theory. Before joining Duke in 2015, she did a postdoc at the University of Washington, and obtained her Ph.D. from the University of Pennsylvania. She is a recipient of the VLDB Early Career Research Contributions Award, an NSF CAREER Award, and a Google Ph.D. fellowship in structured data. She is a co-director of the Almost Matching Exactly (AME) lab for interpretable causal inference at Duke.

Fall 2021

Speaker: Anna Fariha

Where: University of Washington, Seattle.
Allen School of Computer Science and Engineering.
Paul G. Allen Center

When: Monday, November 22nd, 2021, 2:30pm-3:30pm

Title: Blame the data, not the system: how data constraints can help in trustworthy machine learning and explain causes of data-system malfunction.

Abstract: The core of modern data-driven systems comprises models learned from large datasets, and they are usually optimized to target particular data and workloads. While these data-driven systems have seen wide adoption and success, their reliability and proper function hinge on the data's continued conformance to the system's initial settings and assumptions. My research focuses on designing mechanisms to assess the trustworthiness of a system's inferences and explain causes of system malfunction due to data nonconformance. The key idea here is that since data is central to data-driven systems, it can guide us to determine whether predictions made by an ML model can be trusted, and to expose the cause of a system's unexpected behavior. In this talk, I will talk about mechanisms and explanation frameworks to facilitate trusting and understanding outcomes involving data and data systems.

Bio: I am a Researcher at Microsoft. I obtained my Ph.D. from the University of Massachusetts, Amherst under the supervision of Alexandra Meliou. My primary area of research revolves around data management, but the application areas of my research have been interdisciplinary, spanning from program synthesis and software engineering to machine learning, natural language processing, and human-computer interaction. I am interested in designing mechanisms for enhancing system usability, by developing intelligent tools towards boosting end-user productivity, and developing mechanisms for explaining system behavior ranging from traditional systems to opaque, data-driven systems.

Recording

Spring 2021

Speaker: Tim Kraska

Where: University of Washington, Seattle.
Allen School of Computer Science and Engineering.
Paul G. Allen Center

When: Monday, May 24th, 2021, 9am-10am

Title: Towards Instance-Optimized Data Systems

Abstract: Recently, there has been a lot of excitement around ML-enhanced (or learned) algorithms and data structures. For example, there has been work on applying machine learning to improve query optimization, indexing, storage layouts, scheduling, log-structured merge trees, sorting, compression, sketches, among many other things. Arguably, the motivation behind these techniques is similar: machine learning is used to model the data and/or workload in order to derive a more efficient algorithm or data structure. Ultimately, what these techniques will allow us to build are “instance-optimized” systems; systems that self-adjust to a given workload and data distribution to provide unprecedented performance and avoid the need for tuning by an administrator. In this talk, I will provide an overview of the opportunities and limitations of learned index structures, storage layouts, and query optimization techniques we have been developing in my group, and how we are integrating these techniques to build a first instance-optimized database system.

Bio: Tim Kraska is an Associate Professor of Electrical Engineering and Computer Science in MIT's Computer Science and Artificial Intelligence Laboratory, co-director of the Data Systems and AI Lab at MIT (DSAIL@CSAIL), and co-founder of Einblick Analytics. Currently, his research focuses on building systems for machine learning, and using machine learning for systems. Before joining MIT, Tim was an Assistant Professor at Brown, spent time at Google Brain, and was a PostDoc in the AMPLab at UC Berkeley after he got his PhD from ETH Zurich. Tim is a 2017 Alfred P. Sloan Research Fellow in computer science and received several awards including the VLDB Early Career Research Contribution Award, the VMware Systems Research Award, the university-wide Early Career Research Achievement Award at Brown University, an NSF CAREER Award, as well as several best paper and demo awards at VLDB and ICDE.

Recording

Speaker: Aaron Elmore

Where: University of Washington, Seattle.
Allen School of Computer Science and Engineering.
Paul G. Allen Center

When: Monday, April 12th, 2021, 11am-12:15pm

Title: CrocodileDB: Resource Efficient Database Execution

Abstract: The coming end of Moore’s law requires that data systems be more judicious with computation and resources as the growth in data outpaces the availability of computational resources. Current database systems are eager and aggressively consume resources to immediately and quickly complete the task at hand. Intelligently deferring a task to a later point in time can increase result reuse, reduce work that might later be invalidated, or avoid unnecessary work altogether. In this talk I will introduce CrocodileDB, a resource-efficient database system that automatically optimizes deferment based on user-specification and workload prediction. CrocodileDB integrates new ways of specifying timing information, new query execution policies, new task schedulers, and new data loading schemes. In particular, this talk will highlight two new query execution paradigms, Intermittent Query Processing and Incremental-Aware Query Execution.

Bio: Aaron J. Elmore is an Assistant Professor in the Department of Computer Science and the College at the University of Chicago. Aaron was previously a Postdoctoral Associate at MIT. Aaron's thesis on Elasticity Primitives for Database-as-a-Service was completed at the University of California, Santa Barbara. His recent research interests focus on building data systems that address the growing data deluge. He is currently an associate editor for SIGMOD Record, and has served as co-chair for the SIGMOD demonstration track and the inaugural SIGMOD student research competition, and as VLDB proceedings editor.

Past Talks

Listed in reverse chronological order. Click here for abstracts.

Winter 2021

Hossein Ahmadi and Aleksandras Surna, Google [abstract]
Lin Ma, Carnegie Mellon University [abstract]
Jialin Ding, MIT [abstract]
Laurel Orr, Stanford University [abstract]
Rebecca Taft, CockroachDB [abstract]

Fall 2019

Daniel Ting, Tableau [abstract]

Spring 2019

Graham Cormode, University of Warwick [abstract][video]
Sebastian Breß, TU Berlin [abstract]
Jonas Traub, TU Berlin [abstract]
Andreas Kunft, TU Berlin [abstract]
Martin Kiefer, TU Berlin [abstract]
Niv Dayan, Harvard University [abstract]

Winter 2019

Karthik Ramachandra, Microsoft Research India [abstract][video]
Michael Cafarella, University of Michigan [abstract][video]
Paris Koutris, University of Wisconsin-Madison [abstract][video]
Azza Abouzied, New York University, Abu Dhabi [abstract][video]

Fall 2018

Daniel Harrison, Cockroach Labs [abstract] [video] [slides]
Spyros Blanas, Ohio State University [abstract]
Arun Kumar, University of California, San Diego [abstract] [video]

Summer 2018

Holger Pirk, Imperial College London [abstract] [video]
Stratos Idreos, Harvard University [abstract] [video]

Winter 2018

NWDS Annual Meeting

Fall 2017

Oliver Kennedy, University at Buffalo [abstract] [video]
Neal Fachan, Qumulo [abstract]
Gerome Miklau, University of Massachusetts Amherst [abstract] [video]

Spring 2017

Frank McSherry [abstract] [video]

Winter 2017

Gang Luo, University of Washington [abstract] [video]
Tim Kraska, Brown University [abstract]

Fall 2016

Dharma Shukla, Microsoft [abstract] [video (internal)]
Olga Papaemmanouil, Brandeis University [abstract] [video][slides]
Immanuel Trummer, Cornell University [abstract] [video][slides]

Spring 2016

David Chu, Microsoft Research [abstract]
Craig Chambers, Google [abstract]
Daisy Zhe Wang, UFL [abstract]
Angel Viña, CEO, Denodo Technologies [abstract] [video (internal)]
Xin Luna Dong, Google [abstract] [video (internal)]

Winter 2016

Fatma Özcan, IBM Almaden Research Center [abstract]
Sudipta Sengupta, Microsoft Research [abstract] [video]

Fall 2015

Yannis Papakonstantinou, UCSD [abstract] [video]
Sailesh Krishnamurthy, Amazon [abstract] [video]
Mehul Shah, Amazon [abstract]
Daniel von Dincklage, Google [abstract] [video]
Jennie Duggan, Northwestern University [abstract] [selected slides]

Earlier talks

Atri Rudra, University at Buffalo [abstract]
Anant Bhardwaj, MIT [abstract] [slides]
Barzan Mozafari, University of Michigan [abstract] [slides]
Mike Cafarella, University of Michigan [abstract]
Dan Olteanu, University of Oxford [abstract] [slides]
Andy Pavlo, CMU [abstract] [slides]
Tim Kraska, Brown University [abstract] [slides]
Donald Kossmann, ETH Zurich [abstract]
Hiroaki Shiokawa, NTT [abstract] [slides]
Molham Aref and Todd Veldhuizen, LogicBlox [abstract] [slides]
Darrick S Sogabe and Doug Brown, Teradata [abstract] [slides]
Volker Markl and his students, TU-Berlin [abstract] [slides]
Ricardo Baeza-Yates, Yahoo! Research [abstract] [video] [slides]
Christopher Re, University of Wisconsin [abstract]
Chris Lintott, University of Oxford [abstract] [video] [slides]
Alon Halevy, Google [abstract] [video] [slides]
Aaron Kimball, Odiago Inc. [abstract] [video] [slides]
Ashraf Aboulnaga, University of Waterloo [abstract] [video]
Rhonda Baldwin, Greenplum, a division of EMC [abstract] [video]
Boon Thau Loo, University of Pennsylvania [abstract] [video]
Philip A. Bernstein, Microsoft Research. [abstract] [video]
Luna Dong, AT&T Labs - Research. [abstract] [video] [slides]
Shivnath Babu, Duke University. [abstract] [video]
Yanif Ahmad, Johns Hopkins University. [abstract] [video]
Matthias Brantner and William Candillon, 28msec. [abstract] [video]
Jeff Ullman, Stanford University. [abstract] [video]
Leo Bertossi, Carleton University. [abstract] [video] [slides]
Christian Liensberger, Microsoft Corporation. [abstract] [video] [slides]
Surajit Chaudhuri, Microsoft Research. [abstract]
Sergey Melnik, Google. [abstract] [video]
Michael Kallay, Microsoft
Daniel Abadi, Yale University. [abstract] [video]
David Maier, Portland State University.
Alan Gates, Yahoo!
Phil A. Bernstein, Microsoft Research
Nilesh Dalvi, Yahoo! Research
Benny Kimelfeld, IBM Almaden
Brian Cooper, Yahoo! Research
Michael Isard, Microsoft Research
Dan Olteanu, University of Oxford
David DeWitt, Microsoft Jim Gray Systems Lab
Laura Chiticariu, IBM Almaden
Kristen LeFevre, University of Michigan
Chris Jermaine, University of Florida
Jingren Zhou, Microsoft Research
Uwe Roehm, University of Sydney
Christoph Koch, Cornell
Sam Madden, MIT
Chris Olston, Yahoo! Research
Joseph M. Hellerstein, UC Berkeley
Tova Milo, Tel Aviv University
AnHai Doan, University of Wisconsin-Madison
Deepak Patil, Microsoft
David Maier, Portland State University
Patrick Valduriez, INRIA-Rennes
Daniel Abadi, MIT
Mirek Riedewald, Cornell
David Anderson, CMU
Dennis Lee, Amazon
Shankar Pal, Microsoft Research

Mailing List

Please sign up for the NWDS mailing list here. We use this list primarily to send announcements for upcoming events. After you register, you can send mail to that list at nwds at cs.washington.edu.

To become a member, please contact Magda or Dan.

History

The Northwest Database Society was founded on January 1st, 2006 by Dan Suciu and Magdalena Balazinska. It is inspired by the New England Database Society.