Research Interests

My interests have always centered on software and software engineering, which is probably what led me to study at Cal Poly, an engineering school.

I attended Cal Poly at an interesting time: the computer science department was in the midst of developing a software engineering major, the first of its kind in California.  With engineering on everyone's minds (especially the faculty's), nearly every class seemed unusually introspective, with discussions of at least some aspect of the software engineering process.

Between studying in that environment and working as a software developer, I developed a strong interest in the entire software development process, with a special interest in software tools.  My continuing focus has been on tools that improve automation and economy in software development.

Specification Inference

My dissertation research centered on reverse engineering specifications from software programs.  Specifications, often missing or incomplete, are critical for maintenance tasks, program understanding, and, with the advent of scalable verification and bug-finding technology, error detection: bug-finding tools must know what to look for.

This problem is a bit unintuitive: it seems impossible to infer a correct specification about an entire class of programs solely from examples. In practice, however, we can shift our focus from entire systems to specifications of more tractable components, like object-oriented APIs.  In this restricted setting, certain beliefs, like the idea that frequent behavior in a program is likely to be correct, can lead us to likely specifications.

Symbolic Specification Miner

This is a tool that infers temporal properties.  It works by matching a behavioral template to elements of your program.  For example, you might pose a query of the form "What pairs of elements in my program often alternate during execution?"  The tool would observe your program and return, for example, "executions of the function lock alternate with unlock" as a probable specification.
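As a rough illustration (the event names and trace below are hypothetical, not output of the tool), the alternation template can be checked against an observed event trace like this:

```python
# Hypothetical sketch: check whether two events strictly alternate in a trace,
# the core of the alternation template.  Names are illustrative.

def alternates(trace, a, b):
    """True if a and b strictly alternate (a first) in trace,
    ignoring unrelated events."""
    expected = a
    for event in trace:
        if event not in (a, b):
            continue          # other program events are irrelevant
        if event != expected:
            return False      # out-of-order occurrence breaks the pattern
        expected = b if event == a else a
    return expected == a      # every a must be matched by a closing b

# A trace from one observed execution, with unrelated calls interleaved:
trace = ["lock", "read", "unlock", "lock", "write", "unlock"]
print(alternates(trace, "lock", "unlock"))  # True: a likely specification
```

A real miner would run such checks over many traces and report only the pairs that alternate consistently.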

This work extends previous work in the area with a new, highly scalable symbolic algorithm that allows us to search for much richer specifications.  Efficiently implemented using Binary Decision Diagrams (BDDs), this tool can quickly find specifications that involve more than two program elements.

For example, it can find triples of functions in your program that seem to behave like a "resource", in which calls to some function (akin to a read function) are consistently straddled by two other functions (akin to open and close functions).  This can often reveal surprising and important relationships in your code.
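A minimal sketch of that "resource" check over a single trace, with hypothetical function names (the real algorithm searches symbolically over all candidate triples at once):

```python
# Illustrative sketch of the "resource" template: function r behaves like a
# read that only occurs between matching calls to o (open) and c (close).
# Function names and the trace are hypothetical; nesting is not modeled.

def is_resource(trace, o, r, c):
    """True if every r occurs inside an o...c region and regions
    are properly opened and closed."""
    inside = False
    for event in trace:
        if event == o:
            if inside:
                return False   # double open
            inside = True
        elif event == c:
            if not inside:
                return False   # close without a matching open
            inside = False
        elif event == r and not inside:
            return False       # read outside any open/close region
    return not inside          # nothing left dangling

trace = ["fopen", "fread", "fread", "fclose", "fopen", "fclose"]
print(is_resource(trace, "fopen", "fread", "fclose"))  # True
```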

Symbolic Mining of Temporal Specifications
Paper with Zhendong Su
Javert: Fully Automatic Mining of General Temporal Properties
Paper with Zhendong Su

Javert

Javert is also a tool for inferring temporal properties, but it is unique in that it does not require the user to specify templates (or any other configuration, for that matter).  Under the hood, it works by using the above symbolic miner to find instances of a fixed set of axiomatic micropatterns, which it then composes into larger, general specifications using sound inference rules.

The process operates much like a classic formal system: the micropatterns form axiom schemata, and theorems are formed deductively through applications of our inference rules.  The "theorems" of the system correspond to rich, likely true specifications.  In practice, examples from larger programs include true properties with tens of states and transitions.
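The composition step can be pictured with a toy example (these are not Javert's actual inference rules, and the state names are invented): two micropattern automata that share an event are merged into a single larger machine:

```python
# Toy illustration of deductive composition: merge two small specification
# automata that share an event into one richer machine.  Each spec is a list
# of (src_state, event, dst_state) transitions; everything here is invented.

def compose(spec1, spec2):
    """Union of transitions; shared states and events glue the machines."""
    return sorted(set(spec1) | set(spec2))

# Micropattern 1: open must precede read.  Micropattern 2: read, then close.
open_read  = [("s0", "open", "s1"), ("s1", "read", "s1")]
read_close = [("s1", "read", "s1"), ("s1", "close", "s0")]

for transition in compose(open_read, read_close):
    print(transition)
# The merged machine describes the richer property open (read)* close.
```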

OCD

OCD is a tool that directly finds likely temporal bugs.  It is a dynamic analysis tool that observes the execution of a program, simultaneously learning temporal properties (using the above techniques) and enforcing them, finding bugs in the process.  Like Javert, it requires no user configuration, and it takes this a step further by implementing a closed-loop, self-tuning system that sets all important parameters on the fly.

Perhaps the most exciting aspect of OCD is its unique ability to automatically find project-specific, semantic errors in a code base: it is actively learning about a program as it runs, not just checking known properties.
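A heavily simplified sketch of the learn-and-enforce loop, assuming a single hypothetical alternation property and an invented confidence threshold (the real system tunes such parameters on the fly):

```python
# Hedged sketch of learning while enforcing: the monitor builds confidence
# that two events alternate, and reports a violation only once the property
# has held often enough.  All names and the threshold are illustrative.

class OnlineMonitor:
    def __init__(self, a, b, threshold=3):
        self.a, self.b = a, b
        self.expected = a
        self.confirmations = 0      # completed a-b round trips observed
        self.threshold = threshold  # confidence needed before reporting
        self.violations = []

    def observe(self, event):
        if event not in (self.a, self.b):
            return
        if event == self.expected:
            if event == self.b:
                self.confirmations += 1
            self.expected = self.b if event == self.a else self.a
        elif self.confirmations >= self.threshold:
            self.violations.append(event)  # learned property was violated
        else:
            self.confirmations = 0  # property untrue so far; keep learning

monitor = OnlineMonitor("lock", "unlock")
for event in ["lock", "unlock"] * 4 + ["unlock"]:
    monitor.observe(event)
print(monitor.violations)  # ['unlock']: a double unlock, caught after learning
```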

Online Inference and Enforcement of Temporal Properties
Paper with Zhendong Su

Analysis of Source Code

I enjoy writing software, and I even enjoy reading and debugging it.  Aside from my dissertation research, I have also worked on several other projects that relate more directly to program source code.

Scalable Detection of Semantic Clones
Paper with Lingxiao Jiang and Zhendong Su

PDG-Deckard

Code clones are chunks of source code in a project that are similar.  They are usually formed by "copying and pasting" during development and can make maintenance difficult.  Academic research on clones includes a wide variety of clone detection tools and algorithms that can find copied code in large software projects.

PDG-Deckard is a new clone detection tool that looks deeply into a program to find clones: instead of just considering the visible syntax, it uses program dependence graphs (PDGs) to trace flows of data through the program and find semantically-related segments of source code.

Previous research into PDG-based clone detection suffered from very limited scalability; this tool easily scales to large projects containing several million lines of code.
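The Deckard family of detectors summarizes code fragments as vectors of parse-tree node counts and flags fragments whose vectors are close.  As a toy analogue (using Python's own parser as a stand-in for the tool's actual front end, with invented fragments and no threshold tuning):

```python
# Deckard-style sketch, greatly simplified: summarize each fragment as a
# vector of AST node-type counts; fragments with nearby vectors are clone
# candidates.  Fragments below are illustrative.

from collections import Counter
import ast

def char_vector(code):
    """Count AST node types in a Python fragment."""
    return Counter(type(n).__name__ for n in ast.walk(ast.parse(code)))

def distance(v1, v2):
    """L1 distance between two count vectors."""
    return sum(abs(v1[k] - v2[k]) for k in set(v1) | set(v2))

a = "x = 0\nfor i in range(10):\n    x += i"
b = "y = 0\nfor j in range(20):\n    y += j"   # a renamed copy of a
c = "print('hello')"

print(distance(char_vector(a), char_vector(b)))  # 0: identical tree shapes
print(distance(char_vector(a), char_vector(c)))  # large: not clones
```

Renaming variables or changing constants leaves the vector unchanged, which is why this style of detector catches more than literal copy-paste.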

DejaVu

DejaVu is a tool that locates code clone bugs.  Jointly developed with Microsoft Research, DejaVu works by finding instances of copied code that have diverged over time.  The central idea is that errors happen when a developer fixes a bug in copied code: the bug exists in every copy, but since copied code is difficult to keep track of, the fix is applied inconsistently.

The key challenges in this work included a) scaling up to 100 million lines of code and b) maintaining a high signal-to-noise ratio, i.e., not raising an excessive number of false alarms.
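To make the "inconsistent fix" idea concrete, here is a toy sketch (the snippets are invented, and the real tool works at much larger scale): surface the lines where two near-identical copies diverge:

```python
# Illustrative sketch: given two near-identical copies of a routine, report
# the lines present in only one of them -- a null check missing from one copy
# is a candidate inconsistent bug fix.  Snippets are hypothetical.

import difflib

copy_fixed = """p = lookup(key)
if p is None:
    return None
return p.value"""

copy_stale = """p = lookup(key)
return p.value"""

divergence = [line for line in difflib.ndiff(copy_fixed.splitlines(),
                                             copy_stale.splitlines())
              if line.startswith("- ")]
print(divergence)  # the lines present only in the fixed copy
```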

The work was ultimately very successful: DejaVu was able to locate thousands of bugs in a pre-production code base at Microsoft, all in just a few hours.

Scalable and Systematic Detection of Buggy Inconsistencies
Paper with Yuan Yu, Junfeng Yang, Moises Goldszmidt, and Zhendong Su
A Study of the Uniqueness of Source Code
Paper with Zhendong Su

A Study of the Uniqueness of Software

This is a large-scale empirical study of software that answers the question "Just how unique is the source code we are writing?"  The experimental setup is simple: we took 6,000 open source projects (about a half-billion lines of code) and we tried to "assemble" each project from chunks of the others.
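A toy version of that assembly measurement, assuming line-level windows (the actual study worked at token granularity, across many window sizes, and at vastly larger scale):

```python
# Illustrative sketch: what fraction of one file's n-line windows already
# appear somewhere in a corpus of other files?  All snippets are invented.

def windows(lines, n):
    """All contiguous n-line windows of a file, as hashable tuples."""
    return {tuple(lines[i:i + n]) for i in range(len(lines) - n + 1)}

def redundancy(target, corpus, n=2):
    """Fraction of target's n-line windows found in the corpus."""
    seen = set()
    for other in corpus:
        seen |= windows(other, n)
    w = windows(target, n)
    return len(w & seen) / len(w) if w else 0.0

corpus = [["a = open(f)", "b = a.read()", "a.close()"],
          ["log.info(x)", "b = a.read()", "a.close()"]]
target = ["a = open(f)", "b = a.read()", "print(b)"]
print(redundancy(target, corpus))  # 0.5: half the windows are not unique
```

The higher this fraction, the less "unique" the target file is, and the more of it a sufficiently clever tool could in principle assemble from existing code.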

The thinking is that the "not so unique" parts of software would form ideal candidates for automation, possibly through new developer tools (think a super-powered ctrl-space in Eclipse that completes functions) or through automatic programming techniques like genetic programming.

This study was a substantial systems undertaking: the system collected nearly one million measurements over the course of several months.