Principle investigator: Jun Yang, Duke UniversityFaculty
"Big data" have been growing in volume and diversity at an explosive rate, bringing enormous potential for transforming science and society. Driven by the desire to convert messy data into insights, analysis has become increasingly statistical, and there are more people than ever interested in analyzing big data. The rise of cloud computing in recent years, exemplified by the popularity of services such as Amazon EC2, offers a promising possibility for supporting big-data analytics. Its "pay-as-you-go" business model is especially attractive: users gain on-demand access to computing resources while avoiding hardware acquisition and maintenance costs.
However, it remains frustratingly difficult for many scientists and statisticians to use the cloud for any nontrivial statistical analysis of big data. First, developing efficient statistical computing programs requires a great deal of expertise and effort. Popular cloud programming platforms, such as Hadoop, require users to code and think in low-level, platform-specific ways, and, in many cases, resort to extensive manual tuning to achieve acceptable performance. Second, deploying such programs in the cloud is hard. Users are faced with a maddening array of choices, ranging from resource provisioning (e.g., type and number of machines to request on Amazon EC2), software configuration (e.g., number of parallel execution slots per machine for Hadoop), to execution parameters and implementation alternatives. Some of these choices can be critical to meeting deadlines and staying within budget, but current systems offer little help to users in making such choices.
This project aims to build Cumulon, an end-to-end solution for making statistical computing over big data easier and more efficient in the cloud. When developing data analysis programs, users will be able to think and code in a declarative fashion, without being concerned with how to map data and computation onto specific hardware and software platforms. When deploying such programs, Cumulon will present users with best "plans" meeting their requirements, along with information that is actually helpful in making decisions---in terms of completion time and monetary cost. For example, given a target completion time, Cumulon can suggest the best plan on Amazon EC2 that minimizes the expected total cost. A plan encodes choices of not only implementation alternatives and execution parameters, but also cluster resource and configuration parameters. This project will develop effective cost modeling and efficient optimization techniques for the vast search space of possible plans. Once a plan is chosen, Cumulon automatically takes care of all details, including reserving hardware, configuring software, and executing the program.Cumulon also addresses the challenges of uncertainty and extensibility. Because of inevitable uncertainty in performance predictions, Cumulon will treat uncertainty as part of user preference, allowing users to consider, for example, only plans that complete on time with high certainty. Cumulon is designed to be "deeply" extensible: developers can contribute new computational primitives to Cumulon in a way that extends not only its functionality but also its optimizability. Cumulon will feature a performance trace repository, which collects data from past deployments and uses them to improve cost modeling and optimization; in this sense, Cumulon uses big data to help tame big data.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding organizations.