Analysis, Visualization & Discovery
Wednesday February 4, 3:30pm — 075 Johnson Hall

DeepDive: A Data System for Macroscopic Science

Christopher Ré
Assistant Professor, Stanford University

ABSTRACT

Many pressing questions in science are macroscopic: answering them requires a scientist to integrate information from many data sources. Often, these data sources are documents that contain natural language text, tables, and figures. Such documents contain valuable information, but they are difficult for machines to understand unambiguously. This talk describes DeepDive, a statistical system for extracting and integrating information from such documents. For tasks in paleobiology, DeepDive-based systems are surpassing human volunteers in data quantity, recall, and precision. The talk covers recent applications of DeepDive and its technical core. One core technical issue is efficient statistical inference. In particular, the talk will describe our recent Hogwild! and DimmWitted engines, which explore a fundamental tension between statistical efficiency (the number of steps until convergence) and hardware efficiency (how efficiently each of those steps runs).

DeepDive is open source and available from DeepDive.stanford.edu.

BIO

Christopher (Chris) Ré is an assistant professor in the Department of Computer Science at Stanford University and a Robert N. Noyce Family Faculty Scholar. The goal of his work is to enable users and developers to build applications that more deeply understand and exploit data. Chris received his PhD from the University of Washington in Seattle under the supervision of Dan Suciu. For his PhD work in probabilistic data management, Chris received the SIGMOD 2010 Jim Gray Dissertation Award. He then spent four wonderful years on the faculty of the University of Wisconsin, Madison, before moving to Stanford in 2013. He helped discover the first join algorithm with worst-case optimal running time, which won the best paper award at PODS 2012. He also helped develop a framework for feature engineering that won the best paper award at SIGMOD 2014. In addition, work from his group has been incorporated into scientific efforts, including the IceCube neutrino detector and PaleoDeepDive, and into Cloudera's Impala and products from Oracle, Pivotal, and Microsoft's Adam. He received an NSF CAREER Award in 2011, an Alfred P. Sloan Fellowship in 2013, and a Moore Data Driven Investigator Award in 2014.