Friday, July 19, 2019
Bill & Melinda Gates Center For Computer Science & Engineering
Zillow Conference Center
University of Washington
3800 E Stevens Way NE
Seattle, WA 98195
15 min - From Open Satellite Imagery to Emergency Response - Valentina Staneva
15 min - Data Management for Debugging Deep Learning Models over Images - Parmita Mehta
15 min - Data Management for Modern Video Analytics - Brandon Haynes
15 min - Towards Efficient Querying of Rich Video Content - Maureen Daum
15 min - Discovering Acyclic Schemas - Batya Kenig
15 min - A Layered Aggregate Engine for Analytics Workloads - Max Schleich
15 min - Pessimistic Cardinality Estimation - Walter Cai
15 min - Relational Causal Models - Babak Salimi (by Zoom)
ABSTRACT: In this talk, Molham Aref will make the case for a first-principles approach to machine learning over relational databases that exploits recent developments in database systems and theory. The input to learning classification and regression models is defined by feature extraction queries over relational databases. He casts the machine learning problem as a database problem by decomposing the learning task into a batch of aggregates over the feature extraction query and by computing this batch over the input database. The performance of this approach benefits tremendously from structural properties of the relational data and of the feature extraction query; such properties may be algebraic (semi-ring), combinatorial (hypertree width), or statistical (sampling). This translates to several orders-of-magnitude speed-ups over state-of-the-art systems.
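To make the "learning as a batch of aggregates" idea concrete, here is a minimal, hypothetical sketch (not RelationalAI's implementation): the sufficient statistics for a one-parameter least-squares model are expressed as SUM aggregates over a feature extraction join query and computed inside the database. The table and column names are invented for illustration.

```python
# Hypothetical sketch: least-squares regression decomposed into SUM
# aggregates over a feature extraction query, evaluated in-database.
# Schema and data are made up for illustration.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales(store INTEGER, units REAL);
CREATE TABLE stores(store INTEGER, sqft REAL);
INSERT INTO sales VALUES (1, 10), (1, 12), (2, 20);
INSERT INTO stores VALUES (1, 100), (2, 200);
""")

# Feature extraction query: join sales with store features.
# For the model units ~ a * sqft, the learning task reduces to two
# aggregates: sum(x*x) and sum(x*y) over the join result.
sxx, sxy = con.execute("""
SELECT SUM(sqft * sqft), SUM(sqft * units)
FROM sales JOIN stores USING (store)
""").fetchone()

a = sxy / sxx  # closed-form least-squares slope (no intercept term)
```

The point of the decomposition is that the aggregates, not the (potentially huge) join result, are what the learner needs, so structural properties of the query can be exploited to avoid materializing the join.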
This work is based on collaboration with Hung Q. Ngo (RelationalAI), Mahmoud Abo-Khamis (RelationalAI), Ryan Curtin (RelationalAI), Dan Olteanu (Oxford), Maximilian Schleich (Oxford), Ben Moseley (CMU), XuanLong Nguyen (Michigan), and other members of the RelationalAI team and faculty network.
BIO: Molham Aref is the Chief Executive Officer of RelationalAI. He has more than 28 years of experience leading organisations that develop and implement high-value machine learning and artificial intelligence solutions across various industries. Prior to RelationalAI, he was CEO of LogicBlox and Predictix (now Infor) and of Optimi (now Ericsson), and a co-founder of Brickstream (now FLIR). He has also held senior leadership positions at HNC Software (now FICO) and Retek (now Oracle).
One of the key bottlenecks in building machine learning systems is creating and managing the massive training datasets that today’s models learn from. In this talk, I will describe my work on data management systems that let users specify training datasets in higher-level, faster, and more flexible ways, leading to applications that can be built in hours or days, rather than months or years.
I will start by describing Snorkel, an open-source system for programmatically labeling training data that has been deployed by major technology companies, academic labs, and government agencies. In Snorkel, rather than hand-labeling training data, users write labeling functions which label data using heuristic strategies such as pattern matching, distant supervision, and other models. These labeling functions can have noisy, conflicting, and correlated outputs, which Snorkel models and combines into clean training labels. We solve this novel data cleaning problem without any ground truth labels using a matrix-completion style approach, which we show has strong consistency guarantees, and demonstrate that Snorkel leads to impactful gains in applications ranging from knowledge base construction to medical imaging.
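The labeling-function idea can be illustrated with a toy sketch. This is not Snorkel's actual API: the label constants, function names, and the majority-vote combiner below are invented for illustration, and the combiner is a deliberately naive stand-in for Snorkel's matrix-completion-style model of labeling-function accuracies and correlations.

```python
# Toy sketch of programmatic labeling in the style described above.
# Labels: 1 = SPAM, 0 = NOT_SPAM, -1 = ABSTAIN (the function declines to vote).
import re
from collections import Counter

ABSTAIN, NOT_SPAM, SPAM = -1, 0, 1

def lf_keyword(text):
    # Pattern-matching heuristic: spammy keywords.
    return SPAM if re.search(r"free|winner|prize", text, re.I) else ABSTAIN

def lf_short(text):
    # Heuristic: very short messages are usually legitimate.
    return NOT_SPAM if len(text.split()) < 5 else ABSTAIN

def lf_url(text):
    # Heuristic: unencrypted links are a spam signal.
    return SPAM if "http://" in text else ABSTAIN

LFS = [lf_keyword, lf_short, lf_url]

def majority_vote(text):
    # Naive combiner: majority vote over non-abstaining labeling
    # functions. (Snorkel instead learns LF accuracies and correlations
    # without ground truth and produces probabilistic training labels.)
    votes = [v for v in (lf(text) for lf in LFS) if v != ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN
```

Each labeling function is noisy on its own; the value of the system is in modeling and denoising their conflicting outputs at scale.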
Next, I will give an overview of two other systems that accelerate training data creation and management: TANDA, a system for optimizing and managing data augmentation strategies, wherein a labeled dataset is artificially expanded by transforming data points; and MeTaL, a system for integrating training labels across multiple related tasks. I will conclude by outlining future research directions for further accelerating and democratizing machine learning workflows, such as higher-level interfaces and massively multi-task frameworks.
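The augmentation idea in TANDA can be sketched in a few lines. This is a hedged illustration, not TANDA itself: TANDA learns how to compose and tune user-provided transformation functions, whereas the toy `augment` helper below (all names invented) just applies fixed label-preserving transformations at random.

```python
# Toy sketch of data augmentation: artificially expanding a labeled
# dataset by applying label-preserving transformations to data points.
import random

def add_noise(x, rng):
    # Perturb each feature slightly; assumed not to change the label.
    return [v + rng.gauss(0, 0.01) for v in x]

def scale(x, rng):
    # Rescale all features by a small random factor.
    s = rng.uniform(0.9, 1.1)
    return [v * s for v in x]

def augment(dataset, transforms, copies, seed=0):
    # Return the original examples plus `copies` transformed variants
    # of each, keeping the label y unchanged.
    rng = random.Random(seed)
    out = list(dataset)
    for x, y in dataset:
        for _ in range(copies):
            t = rng.choice(transforms)
            out.append((t(x, rng), y))
    return out

data = [([1.0, 2.0], 0), ([3.0, 4.0], 1)]
aug = augment(data, [add_noise, scale], copies=2)
# 2 originals + 2 variants each -> 6 labeled examples
```

The hard part, which the system addresses, is choosing and composing transformations so the expanded dataset actually improves the downstream model.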
There is an increasing need to bring machine learning to a wide diversity of hardware devices, from the datacenter to the edge. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms such as mobile phones, embedded devices, and accelerators (e.g., FPGAs, ASICs) requires significant manual effort. In this talk, we will present TVM, an end-to-end optimizing compiler stack that deploys deep learning models on diverse hardware back-ends with performance competitive with state-of-the-art hand-tuned libraries for low-power CPUs, mobile GPUs, and server-class GPUs.