Principal investigator: Jun Yang, Duke University
Undergraduate students supported through REU supplements:
Recent technological advances have enabled the collection of massive amounts of data in science, commerce, and society. These large, high-resolution datasets have brought us closer than ever before to solving important problems such as decoding human genomes and coping with climate change. Meanwhile, the exponential growth in the amount of data has created an urgent and difficult technical challenge. Many existing data analysis tools still assume that datasets fit in main memory; when applied to massive datasets, they become unacceptably slow because of excessive disk input/output (I/O) operations.
Across application domains, much of advanced data analysis is done with programs custom-developed by statisticians. Unfortunately, progress has been hindered by the lack of easy-to-use statistical computing environments that support I/O-efficient execution of programs over large datasets. There have been many approaches toward I/O-efficiency, but none has gained traction with the statistical computing community because of difficult challenges ranging from efficiency to usability. Disk-based storage engines and I/O-efficient function libraries provide only a partial solution, because many sources of I/O-inefficiency in a program remain at a higher, inter-operation level: e.g., how (potentially large) intermediate results are passed between operations, how much performance can be gained by deferring and reordering operations, etc. Database systems seem to be a natural solution, with I/O-efficiency and a high-level language (SQL) enabling many high-level optimizations. However, much work in integrating databases and statistical computing remains database-centric, forcing statisticians to learn unfamiliar languages and deal with their impedance mismatch with the host language.
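To make the inter-operation inefficiency concrete, here is a minimal Python sketch (illustrative only, with hypothetical function names; not RIOT code): evaluating an expression operation by operation materializes full-size intermediate results, each of which, on disk-resident data, would cost a full write and a full re-read, whereas a pipelined evaluation streams the inputs in chunks and fuses the operations so no full-size intermediate ever exists.

```python
def eager_sum_of_products(a, b, c):
    """Eager, operation-at-a-time evaluation: each step materializes
    a full-size intermediate before the next step runs. On data too
    large for memory, t1 and t2 would each be written to and re-read
    from disk."""
    t1 = [x * y for x, y in zip(a, b)]   # intermediate 1, input-sized
    t2 = [x + y for x, y in zip(t1, c)]  # intermediate 2, input-sized
    return sum(t2)

def pipelined_sum_of_products(a, b, c, chunk=1024):
    """Pipelined evaluation: process the inputs chunk by chunk,
    fusing multiply, add, and sum, so only one chunk of data is
    in flight at a time and no full-size intermediate exists."""
    total = 0
    for i in range(0, len(a), chunk):
        total += sum(x * y + z for x, y, z in
                     zip(a[i:i+chunk], b[i:i+chunk], c[i:i+chunk]))
    return total
```

Both functions compute the same answer; the difference is purely in how much intermediate state is materialized, which is exactly the kind of inter-operation decision a storage engine or function library alone cannot make.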
To make a practical impact on the statistical computing community, we postulate that a better approach is to make it completely transparent to users how I/O-efficiency is achieved. Transparency means no SQL or other new language to learn. Transparency means that existing code should run without modification and automatically gain I/O-efficiency. This project, nicknamed RIOT, seeks to extend R---an open-source statistical computing environment widely used by statisticians---to transparently provide efficient I/O. Achieving transparency is challenging; RIOT plans to do so with an end-to-end solution that addresses issues on all fronts in an innovative way: I/O-efficient algorithms, pipelined execution to avoid intermediate results, I/O-cost-driven expression optimization through aggressively deferred evaluation, smart storage and materialization options, and seamless integration of these features with an interpreted host language. Initial results with a prototype system called RIOT-DB have been promising, and serve as a proof of concept for the work ahead.
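One ingredient above, aggressively deferred evaluation, can be sketched as follows (a toy Python illustration with names of our own choosing, not RIOT's actual design): arithmetic on a lazy vector builds an expression tree instead of computing anything, so by the time a result is forced, an optimizer could see the whole expression and plan its I/O, all without the user writing anything but ordinary operators.

```python
class Deferred:
    """A lazily evaluated vector expression. Overloaded operators
    record an expression tree; nothing is computed until .value()
    forces the result, which is when an optimizer would get to
    rewrite and schedule the whole expression."""
    def __init__(self, op, *args):
        self.op, self.args = op, args

    def __add__(self, other):
        return Deferred('+', self, other)

    def __mul__(self, other):
        return Deferred('*', self, other)

    def value(self):
        if self.op == 'leaf':          # base case: raw data
            return self.args[0]
        left = self.args[0].value()
        right = self.args[1].value()
        if self.op == '+':
            return [x + y for x, y in zip(left, right)]
        return [x * y for x, y in zip(left, right)]

def vec(data):
    """Wrap raw data as a leaf of the expression tree."""
    return Deferred('leaf', data)

# Operators only accumulate structure...
e = vec([1, 2, 3]) * vec([4, 5, 6]) + vec([1, 1, 1])
# ...until the result is actually demanded.
print(e.value())  # -> [5, 11, 19]
```

The point of the sketch is the transparency: the user writes plain `*` and `+`, and deferral happens behind the operator overloads, which is how such techniques can be integrated into an interpreted host language without asking users to change their code.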
Though the RIOT project targets R, it also addresses the general, longstanding problem of integrating database-style querying with programming languages, which has recently regained much attention. We also note that many interesting, fundamental issues are common to automatic parallelization and to achieving I/O-efficiency transparently; RIOT plans to draw on work from the programming languages and high-performance computing communities.
In the first year of the project, we invested most of our effort in prototyping and development activities. Experience gained from these activities has provided us with valuable insights and ideas for planning out the remainder of the project. We have completed the development of RIOT-DB, a proof-of-concept prototype system that maps data and computation in R to an underlying database system, and made the source code available (see below). We have also demonstrated a working prototype of the next generation of RIOT at ICDE 2010. A detailed description of our contributions can be found below in our 2009-2010 project report.
In the second and third years of the project, we expanded into several focused problems, including building a better storage engine and designing better execution and optimization frameworks for RIOT, extending database systems to support matrix storage and computation, parallelization using GPU and cloud, and leveraging SSDs for better performance. A detailed description of our contributions can be found below in our project reports for 2010-2011 and 2011-2012.
In the fourth and final years of the project, we wrapped up our work on the RIOT storage and optimization engines and turned our attention to optimizations for SSDs and parallelization in the cloud. Work on SSDs led to a novel index structure and a new data permutation algorithm; work on cloud parallelization led to a system called Cumulon, which has since branched off into a project of its own. A detailed description of our contributions can be found below in our 2012-2013 report and the final report.