Projects‎ > ‎

DDM: Derived Data Maintenance

Supported by National Science Foundation CAREER Award IIS-0238386, CAREER: Techniques and Applications of Derived Data Maintenance (abstract)

Principle investigator: Jun Yang, Duke University

Alumni

  • Christopher N. Bond (BS with High Distinction, Spring 2005)
  • Badrish Chandramouli (PhD, Summer 2008; first employment: Microsoft Research)
  • Gregory Filpus (Undergraduate Student, Summer 2006)
  • Hao He (PhD, Summer 2007; first employment: Google Inc.)
  • Adam Silberstein (PhD, Spring 2007; first employment: Yahoo! Research)
  • Congi Wu (Undergraduate Student, Summer 2006 and Spring 2008)
  • Junyi Xie (PhD, Fall 2007; first employment: Oracle Corp.)

Introduction

The focus of this project is on techniques and applications of derived data maintenance. Derived data is the result of applying some transformation, structural or computational, to base data. The use of derived data to facilitate access to base data is a recurring technique in many areas of computer science. Examples of derived data include caches, replicas, indexes, materialized views, synopses, etc. Regardless of the varying forms, purposes, complexity, and accuracy of derived data, it must be maintained when base data is updated. Thus, derived data maintenance is a fundamental problem in computer science. It is also an evolving problem: existing techniques are constantly challenged by the explosive growth in data volume and number of data producers and consumers, and by increasing diversity in data formats.

Traditionally, derived data maintenance has been tackled separately in different contexts, e.g., index updates and materialized view maintenance in databases, cache coherence and replication protocols in distributed systems. Although they share the same underlying theme, these techniques have been developed and applied largely disjointly. Newer and more complex data management tasks, however, call for creative combinations of the traditionally separate ideas. Semantic caching, which has received tremendous interests recently for its applications in caching dynamic Web contents, is a good example of incorporating the idea of materialized views into a cache. With "outside-the-box" thinking such as semantic caching, we seek to discover more techniques that combine multiple flavors of derived data to provide better solutions to problems.

Progress

In the first year of this project, we made progress on the following specific research problems: (1) caching for view maintenance; (2) caching for stream data processing; (3) caching for XML indexing; (4) incremental maintenance of XML structural indexes. A detailed description of our contributions can be found below in our 2003-2004 project report.

In the second year of this project, we broadened our study of derived data maintenance to continuous queries and subscriptions, and began investigating derived data in a network setting. We made contributions to the following specific research problems: (1) caching for stream data processing (continued from the first year); (2) incremental maintenance of order-based XML labeling; (3) asymmetric batch incremental view maintenance; (4) scalable continuous query processing; (5) querying networked data. A detailed description of our contributions can be found below in our 2004-2005 project report.

In the third year of this project, we continued to expand the applications of our derived data maintenance techniques in the following areas: (1) derived data in scalable continuous query processing (continued from the second year); (2) derived data in wide-area network application (including continued efforts on networked data querying from the second year and new thrust on wide-area publish/subscribe); (3) derived data for graph reachability; (4) derived data in wireless sensor networks; (5) asymmetric batch incremental view maintenance (continued from the second year). A detailed description of our contributions can be found below in our 2005-2006 project report.

In the fourth year of this project, we continued to investigate problems related to derived data in a number of areas: (1) query suspend and resume (continued from work started in the second year); (2) derived data for queries over graph-structured data (continued from earlier work on reachability, with emphasis shifted from supporting low-level primitives to user-level queries); (3) derived data in scalable continuous query processing (continued from work in the third year but now with a more practical slant); (4) derived data in computational biology workflows; (5) derived data in wireless sensor networks (continued from the third year). A detailed description of our contributions can be found below in our 2006-2007 project report.

In the fifth (and final) year of the project, we worked on the following problems: (1) derived data in scalable continuous query processing (continued from work in previous years); (2) derived data in wireless sensor networks (continued from work in previous years); (3) recovering data from derived data that is noisy, incomplete, summarized, or perturbed; (4) derived data in computational biology workflows (continued from the fourth year); (5) derived data in information extraction. A detailed description of our contributions can be found below in our final project report.

In addition, we have been actively applying derived data techniques to areas beyond computer science. Specifically, we have been working with a group of computational immunologists led by Dr. Thomas B. Kepler at Duke University on developing a system for tracking lineage, dependency, and versioning of derived datasets in computational biology workflows. Also, together with other computer scientists, we have been collaborating with a group of ecologists and statisticians led by Dr. James S. Clark at the Duke University School of Environment on developing a wireless sensor network in Duke Forest to study how various environmental variables influence forest growth.

A number of our contributions have been published in premier conferences, including 7 full papers in SIGMOD 2004-2007, one demo paper in SIGMOD 2008, 4 full papers in VLDB 2006-2008, 9 full papers in ICDE 2003-2008, one paper each in CIKM 2005, ESA 2005, ISAAC 2005, DASFAA 2006, and CIDR 2007. For detailed descriptions of these contributions please refer to our project reports and publications below.

  • Project report (HTML) and poster (PDF) for academic year 2003-2004.
  • Project report (PDF) for academic year 2004-2005.
  • Project report (PDF) for academic year 2005-2006.
  • Project update (PDF), January 2007.
  • Project report (PDF) for academic year 2006-2007.
  • Final project report (PDF).

Publications

  • Badrish Chandramouli and Jun Yang. "End-to-End Support for Joins in Large-Scale Publish/Subscribe Systems." In Proceedings of the 34th International Conference on Very Large Data Bases (VLDB '08), Auckland, New Zealand, August 2008. Acceptance rate: 16.5%.
    Available for download: paper.
  • Badrish Chandramouli, Jun Yang, Pankaj K. Agarwal, Albert Yu, and Ying Zheng. "ProSem: Scalable Wide-Area Publish/Subscribe." In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD '08), Vancouver, Canada, June 2008. System demonstration description. Acceptance rate: 31.9%.
    Available for download: paper.
  • Fei Chen, AnHai Doan, Jun Yang, and Raghu Ramakrishnan. "Efficient Information Extraction over Evolving Text Data." In Proceedings of the 24th International Conference on Data Engineering (ICDE '08), Cancun, Mexico, April 2008. Acceptance rate: 12.1%.
    Available for download: paper.
  • Junyi Xie, Jun Yang, Yuguo Chen, Haixun Wang, and Philip S. Yu. "A Sampling-Based Approach to Information Recovery." In Proceedings of the 24th International Conference on Data Engineering (ICDE '08), Cancun, Mexico, April 2008. Acceptance rate: 19.2%.
    Available for download: paper.
  • Badrish Chandramouli, Jeff M. Phillips, and Jun Yang. "Value-Based Notification Conditions in Large-Scale Publish/Subscribe Systems." In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB '07), Vienna, Austria, September 2007. Acceptance rate: 16.4%.
    Available for download: paper.
  • Adam Silberstein, Gavino Puggioni, Alan Gelfand, Kamesh Munagala, and Jun Yang. "Suppression and Failures in Sensor Networks: A Bayesian Approach." In Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB '07), Vienna, Austria, September 2007. Acceptance rate: 16.4%.
    Available for download: paper.
  • Badrish Chandramouli, Christopher N. Bond, Shivnath Babu, and Jun Yang. "Query Suspend and Resume." In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data (SIGMOD '07), Beijing, China, June 2007. Acceptance rate: 14.6%.
    Available for download: paper and technical report.
  • Hao He, Haixun Wang, Jun Yang, and Philip S. Yu. "BLINKS: Ranked Keyword Searches on Graphs." In Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data (SIGMOD '07), Beijing, China, June 2007. Acceptance rate: 14.6%.
    Available for download: paper and technical report.
  • Badrish Chandramouli, Christopher N. Bond, Shivnath Babu, and Jun Yang. "On Suspending and Resuming Dataflows." In Proceedings of the 23rd International Conference on Data Engineering (ICDE '07), Istanbul, Turkey, April 2007. Poster paper. Acceptance rate: 27.6%. Results in this paper are subsumed by those in the paper titled "Query Suspend and Resume" in the 2007 ACM SIGMOD International Conference on Management of Data (SIGMOD '07).
  • Adam Silberstein and Jun Yang. "Multiple Aggregation for In-Network Control of Sensors." In Proceedings of the 23rd International Conference on Data Engineering (ICDE '07), Istanbul, Turkey, April 2007. Acceptance rate: 18.5%.
    Available for download: paper and technical report.
  • Adam Silberstein, Rebecca Braynard, Gregory Filpus, Gavino Puggioni, Alan Gelfand, Kamesh Munagala, and Jun Yang. "Data-Driven Processing in Sensor Networks." In Proceedings of the 3rd Biennial Conference on Innovative Data Systems Research (CIDR '07), Asilomar, California, USA, January 2007.
    Available for download: paper.
  • Pankaj K. Agarwal, Junyi Xie, Jun Yang, and Hai Yu. "Scalable Continuous Query Processing by Tracking Hotspots." In Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB '06), Seoul, Korea, September 2006. Acceptance rate: 13.8%. Results in this paper are subsumed by those in the paper titled "Input-Sensitive Scalable Continuous Join Query Processing" in ACM Transactions on Database Systems (TODS) 2009.
    Available for download: paper and technical report.
  • Badrish Chandramouli, Junyi Xie, and Jun Yang. "On the Database/Network Interface in Large-Scale Publish/Subscribe Systems." In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data (SIGMOD '06), Chicago, Illinois, USA, June 2006. Acceptance rate: 13.0%.
    Available for download: paper and technical report.
  • Adam Silberstein, Rebecca Braynard, and Jun Yang. "Constraint-Chaining: On Energy-Efficient Continuous Monitoring in Sensor Networks." In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data (SIGMOD '06), Chicago, Illinois, USA, June 2006. Acceptance rate: 13.0%.
    Available for download: paper.
  • Adam Silberstein, Kamesh Munagala, and Jun Yang. "Energy-Efficient Monitoring of Extreme Values in Sensor Networks." In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data (SIGMOD '06), Chicago, Illinois, USA, June 2006. Acceptance rate: 13.0%.
    Available for download: paper.
  • Junyi Xie and Jun Yang. "A Survey of Join Processing in Data Streams." Chapter in Data Streams: Models and Algorithms. Charu Aggarwal, ed. Springer. 2006. Invited contribution.
    Available for download: chapter.
  • Badrish Chandramouli, Jun Yang, and Amin Vahdat. "Distributed Network Querying with Bounded Approximate Caching." In Proceedings of the 11th International Conference on Database Systems for Advanced Applications (DASFAA '06), Singapore, April 2006. Acceptance rate: 24.5%.
    Available for download: paper and technical report.
  • Adam Silberstein, Rebecca Braynard, Carla Ellis, Kamesh Munagala, and Jun Yang. "A Sampling-Based Approach to Optimizing Top-k Queries in Sensor Networks." In Proceedings of the 22nd International Conference on Data Engineering (ICDE '06), Atlanta, Georgia, USA, April 2006. Acceptance rate: 19.5%.
    Available for download: paper.
  • Haixun Wang, Hao He, Jun Yang, Philip S. Yu, and Jeffrey Xu Yu. "Dual Labeling: Answering Graph Reachability Queries in Constant Time." In Proceedings of the 22nd International Conference on Data Engineering (ICDE '06), Atlanta, Georgia, USA, April 2006. Acceptance rate: 19.5%.
    Available for download: paper.
  • Adam Silberstein, Rebecca Braynard, and Jun Yang. "Energy-Efficient Continuous Isoline Queries in Sensor Networks." In Proceedings of the 22nd International Conference on Data Engineering (ICDE '06), Atlanta, Georgia, USA, April 2006. Poster paper. Acceptance rate: 31.1%. Results in this paper are subsumed by those in the paper titled "Constraint-Chaining: On Energy-Efficient Continuous Monitoring in Sensor Networks" in the 2006 ACM SIGMOD International Conference on Management of Data (SIGMOD '06).
    Available for download: paper.
  • Pankaj K. Agarwal, Junyi Xie, Jun Yang, and Hai Yu. "Monitoring Continuous Band-Join Queries over Dynamic Data." In Proceedings of the 16th Annual International Symposium on Algorithms and Computation (ISAAC '05), Sanya, Hainan, China, December 2005.
    Available for download: paper.
  • Hao He, Haixun Wang, Jun Yang, and Philip S. Yu. "Compact Reachability Labeling for Graph-Structured Data." In Proceedings of the 14th ACM International Conference on Information and Knowledge Management (CIKM '05), Bremen, Germany, November 2005. Acceptance rate: 17.9%.
    Available for download: paper and full version.
  • Kamesh Munagala, Jun Yang, and Hai Yu. "Online View Maintenance Under a Response-Time Constraint." In Proceedings of the 13th Annual European Symposium on Algorithms (ESA '05), Mallorca, Spain, October 2005.
    Available for download: paper.
  • Junyi Xie, Jun Yang, and Yuguo Chen. "On Joining and Caching Stochastic Streams." In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data (SIGMOD '05), Baltimore, Maryland, USA, June 2005. Acceptance rate: 15.1%.
    Available for download: paper and technical report.
  • Adam Silberstein, Hao He, Ke Yi, and Jun Yang. "BOXes: Efficient Maintenance of Order-Based Labeling for Dynamic XML Data." In Proceedings of the 21st International Conference on Data Engineering (ICDE '05), Tokyo, Japan, April 2005. Acceptance rate: 12.9%.
    Available for download: paper and full version.
  • Hao He, Junyi Xie, Jun Yang, and Hai Yu. "Asymmetric Batch Incremental View Maintenance." In Proceedings of the 21st International Conference on Data Engineering (ICDE '05), Tokyo, Japan, April 2005. Acceptance rate: 12.9%.
    Available for download: paper.
  • Ke Yi, Hao He, Ioana Stanoi, and Jun Yang. "Incremental Maintenance of XML Structural Indexes." In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data (SIGMOD '04), Paris, France, June 2004. Acceptance rate: 16.0%.
    Available for download: paper.
  • Hao He and Jun Yang. "Multiresolution Indexing of XML for Frequent Queries." In Proceedings of the 20th International Conference on Data Engineering (ICDE '04), Boston, Massachusetts, USA, March 2004. Acceptance rate: 14.3%.
    Available for download: paper and full version.
  • Ke Yi, Hai Yu, Jun Yang, Gangqiang Xia, and Yuguo Chen. "Efficient Maintenance of Materialized Top-k Views." In Proceedings of the 19th International Conference on Data Engineering (ICDE '03), Bangalore, India, March 2003. Acceptance rate: 13.5%.
    Available for download: paper and full version.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.