
Perturbation Analysis of Database Queries

Supported by National Science Foundation Award IIS-1408846: III: Medium: Collaborative Research: From Answering Questions to Questioning Answers (and Questions)---Perturbation Analysis of Database Queries

Duke University (NSF-IIS-1408846)
Stanford University (NSF-IIS-1408915)
University of Texas at Arlington (NSF-IIS-1408928)

People

Investigators

Collaborators
Graduate Students
Undergraduate Students
  • At Duke: Emre Sonmez
Alumni:
  • From Duke:
    • Undergraduate: Seokhyun (Alex) Song (2014-15), Eric Wu (2014), Kevin Wu (2014), Peggi Li (2014), Andrew Shim (2014), Yubo Tian (2016), Charles Xu (2016)
    • MS: Rohit Paravastu (2012), Rozemary Scarlat (2012)
    • PhD: You (Will) Wu (2015; first employment: Google)

Overview

In the age of data ubiquity, decision making is increasingly driven by data. Database queries are often used to identify issues, debate strategies, make choices, and explain decisions. How these queries are formulated can significantly influence the decision-making process. A poor choice of query parameters, whether intentional or accidental, may give a biased view of the underlying data and lead to decisions that are wrong, misguided, or "brittle" when reality deviates from assumptions. Database research has in the past focused on how to answer queries, but has not devoted much attention to how queries impact decision making, or how to formulate "good" queries from the outset. This project aims to fill this void. The key insight is perturbation analysis of database queries---i.e., studying how perturbations of the query form and parameters affect the query result. For example, if slight perturbations of a query lead to very different results, they expose potential pitfalls in decision making. In general, perturbation analysis of database queries reveals how queries affect the robustness and objectivity of decisions, and helps decision makers identify "good" queries on which to base their decisions.

Intellectual Merit: This project plans to carry out a systematic study of perturbation analysis of database queries. On the modeling front, the project proposes the query response surface (QRS) over the query's parametric space as a framework for perturbation analysis. Intuitive notions of query "goodness" (for the purpose of supporting decisions), such as fairness and robustness, can be formulated as statistical, geometric, and topological properties of the QRS. The framework also allows practical problems to be formulated in terms of the QRS. For example, a brittle decision can be illustrated by identifying its pitfalls, which can be cast as an optimization problem of searching the QRS for slight perturbations with large result deviations; the problem of finding "good" queries that will influence a decision can be cast as that of finding points with desired properties in the relevant region of the QRS. On the algorithmic front, fundamental research problems arise in coping with the complexity of the QRS and the vast space of perturbations. Although perturbations of data have been studied extensively, considering perturbations of queries poses novel challenges and compounds existing ones. The project will develop both efficient representations of the QRS and fast algorithms for exploring and analyzing it, using scalable techniques for indexing, optimization, and incremental evaluation that rely on sampling, approximation, and geometric insights. On the systems and applications front, this project plans to deliver the core features of perturbation analysis as a web service with a public API, and to address the associated design and scalability challenges. The project will produce a general-purpose website for applying perturbation analysis of database queries, as well as websites customized for several domains of public interest. The websites will include a facet-driven interface and features that support collaboration and dissemination.
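
To make the modeling idea concrete, the following is a minimal sketch, not the project's implementation. It assumes a toy in-memory table, a hypothetical two-parameter query (the average value over a window of years), and a brute-force scan of a small parameter grid. The names ROWS, query, response_surface, and most_fragile_perturbation are all illustrative assumptions. The sketch evaluates the query at every grid point to obtain a discrete stand-in for the QRS, then searches the neighborhood of a chosen query for the slight perturbation whose result deviates most.

```python
from itertools import product

# Hypothetical toy data: (year, value) pairs, invented for illustration only.
ROWS = [(2001, 10.0), (2002, 12.0), (2003, 11.0), (2004, 30.0),
        (2005, 13.0), (2006, 12.0), (2007, 14.0), (2008, 15.0)]


def query(start, width):
    """Parameterized query: average value over years start .. start + width - 1."""
    vals = [v for (y, v) in ROWS if start <= y < start + width]
    return sum(vals) / len(vals) if vals else None


def response_surface(starts, widths):
    """Evaluate the query at every grid point of the (start, width) parameter space."""
    return {(s, w): query(s, w) for s, w in product(starts, widths)}


def most_fragile_perturbation(surface, point, radius=1):
    """Among grid points within `radius` (Chebyshev distance) of `point`,
    return the one whose result deviates most from the result at `point`,
    together with the size of that deviation."""
    base = surface[point]
    best, best_dev = point, 0.0
    for (s, w), result in surface.items():
        if result is None or (s, w) == point:
            continue
        if max(abs(s - point[0]), abs(w - point[1])) <= radius:
            dev = abs(result - base)
            if dev > best_dev:
                best, best_dev = (s, w), dev
    return best, best_dev
```

On this toy data, building the surface with response_surface(range(2001, 2007), range(1, 4)) and calling most_fragile_perturbation(surface, (2005, 2)) flags the neighboring query (2004, 1): shifting the window by one year and shortening it by one year moves the average from 12.5 to 30.0. A brute-force scan like this does not scale to realistic parameter spaces, which is where the indexing, sampling, and approximation techniques mentioned above would come in.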

Broader Impacts: In today's data-driven society, there is increasing demand for the proposed research in many application domains such as public policy, urban planning, business intelligence, and health care. This project will significantly expand the functionality of database systems, making them easier to use (and harder to misuse) for a new generation of data-driven decision makers, especially those outside the traditional "data-heavy" disciplines such as computer science and statistics. This project will develop courses, seminars, and workshops targeting this much broader population of data-driven decision makers, to help train them in data and quantitative analysis, and in interpreting results critically.

Key Words: perturbations of database queries, perturbation analysis, sensitivity analysis

Progress

  • Year 1 project report [PDF]

Publications

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding organizations.


Links to Other Project Products

  • ClaimBuster is a machine learning tool that helps find political claims in text sources that are worth fact-checking. We plan to use it to find check-worthy factual statements in the 2016 presidential debates, and later to extend it to other types of campaigns. Ultimately, ClaimBuster can free fact-checkers from the time-consuming task of finding checkable claims, enabling them to spend more time investigating the claims themselves.
    • ClaimBuster Data Collection Website: This is the website for collecting labeled data for ClaimBuster. We collected the transcripts of all presidential debates in history and extracted 20,788 sentences by presidential candidates. We use this website to collect labels for all sentences.
  • Database of Transportation Datasets: This website is a repository of valuable datasets and journalism related to the vehicles we drive and the infrastructure that supports us. We want to provide a useful public service for those curious about data and transportation, and a place that showcases the best practices and examples in journalism and analysis. We are continuing to collect and curate such datasets.
