Kim, S. and Ernst, M. D. 2007. Which warnings should I fix first? In Proceedings of the 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (Dubrovnik, Croatia, September 03-07, 2007). ESEC-FSE '07. ACM, New York, NY, 45-54. DOI=http://doi.acm.org/10.1145/1287624.1287633
Roberto's Review
The paper addresses two drawbacks of static analysis tools: the high rate of false positives (warnings that do not correspond to real bugs) and the ineffectiveness of the warning prioritization built into these tools.
The presented solution is a history-based warning prioritization (HWP) technique. The basic idea is to mine the software change history and the warning removal history as a means to prioritize warning categories, treating warning instances as bug predictors. To do so, the prioritization algorithm increases the weight of a warning category whenever one of its warning instances is removed by a software change; when the change fixes a bug, the weight increases more than for a regular change.
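A minimal sketch of this kind of category-weight update, assuming a single training parameter alpha whose values above 0.5 reward fix-change removals more than regular-change removals; the event encoding and the exact increments are illustrative assumptions, not the authors' precise formulation:

```python
from collections import defaultdict

def train_weights(removal_events, alpha=0.8):
    """Accumulate a weight per warning category from warning-removal events.

    removal_events: iterable of (category, is_fix_change) pairs, one per
    warning instance removed by a change in the training history.
    """
    weights = defaultdict(float)
    for category, is_fix_change in removal_events:
        # Removal by a bug-fix change is stronger evidence that the category
        # predicts real bugs, so it earns the larger increment (assuming
        # alpha > 0.5, consistent with the description above).
        weights[category] += alpha if is_fix_change else 1.0 - alpha
    return dict(weights)

# Hypothetical usage: categories are then ranked by their learned weight.
events = [("NULL_DEREF", True), ("NULL_DEREF", False), ("UNUSED_IMPORT", False)]
ranking = sorted(train_weights(events).items(), key=lambda kv: kv[1], reverse=True)
```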
The main contributions of the paper are the following. It shows the limitations of tool warning prioritization using false positive rates extracted from software history data. It derives a warning prioritization algorithm, inspired by machine learning techniques, that uses fix-changes and warning removal history as input. Unlike previous work, the algorithm is generic, i.e., applicable to any warning category. It also operates at a fine level of granularity, identifying true and false positives at the level of individual lines of code.
On the other hand, although it is interesting to derive false positive rates from history data, this does not replace the expert as the authority on which warnings are real bugs and which are not. As a result, the evaluation seems a bit biased, since a technique based on software history is evaluated only against software history. Another issue is that the false positive rate is measured only at revision n/2. Although all subsequent revisions are taken into account to mark buggy lines of code, measuring at other revisions as well might be more appropriate. Finally, the choice of alpha, the parameter of the prioritization training algorithm, was somewhat arbitrary.
I believe this paper is interesting for showing that software history data can be an important factor in prioritizing bug warnings. It exposes the limitations of tool-assigned priorities and opens a line of research that could exploit software history to improve warning priorities in static analysis tools. I also think other data sources, such as source code complexity, could be used to infer the priority of such warnings.
Some questions that might be interesting to discuss would be:
- Which factors would play a role in determining the priority of warnings in static analysis tools?
- Is the indirect measure of true and false positive warning rates based on software history appropriate?
- How would you design an experiment to use software history to measure false positive rates of warning messages?
- Do you think the training algorithm provides a good heuristic to change warning priority weights? How would you do it differently?
Alex's Review
Problem
Automatic bug-finding tools such as FindBugs, JLint, and PMD generate large numbers of warnings, but most of these warnings are not useful. The prioritization of warnings provided by the tools does little to help the user find the important warnings that indicate real bugs.
Contributions
- A technique for estimating which warnings generated by a bug-finding tool applied to a historical code revision are false positives: if a warning occurred on a known "bug-related" line, it is judged to be a valid warning; otherwise, it is judged to be a false positive. A line is "bug-related" iff it is modified or removed in any future revision which appears (from its changelog) to be a bugfix. (A sketch of this heuristic follows the list.)
- Doing bug marking on a line-by-line basis is claimed to be novel; previous bug-prediction approaches apparently did their training based on the properties of higher-level entities such as functions, modules and files.
- An algorithm for prioritizing categories of warnings based on the frequency with which those categories of warnings tend to lead to actual bugfixes in the revision history.
- Finer-grained prioritization levels than in existing bug tools.
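As a rough illustration of the line-marking heuristic from the first bullet above, here is a small sketch; the keyword-based fix detection and the per-revision encoding are simplifying assumptions, not the paper's exact procedure:

```python
import re

# Hypothetical stand-ins: the paper identifies fix-changes from change-log
# messages and touched lines from version-control diffs; both are simplified
# here into a keyword test and a precomputed "touched" set per revision.
FIX_PATTERN = re.compile(r"\b(fix|bug|patch)\b", re.IGNORECASE)

def is_bug_fix_revision(log_message):
    return bool(FIX_PATTERN.search(log_message))

def bug_related_lines(future_revisions):
    """Union of line numbers modified or removed by any future fix revision.

    Each revision is a dict: {"log": str, "touched": set of line numbers
    in the file under analysis}.
    """
    related = set()
    for rev in future_revisions:
        if is_bug_fix_revision(rev["log"]):
            related |= rev["touched"]
    return related

def classify_warning(warning_line, related):
    """A warning on a bug-related line is judged valid; otherwise a false positive."""
    return "valid" if warning_line in related else "false positive"
```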
Weaknesses
- Using a history-based warning validity metric to judge the precision of both the existing tool warning prioritization and their novel history-based warning prioritization seems liable to bias the evaluation towards a favourable result for their technique. Their result would be more convincing if they also used some other measurement to compare the prioritizations (perhaps even a study with real developers).
- There are some points where citations are poorly used or missing:
- At the beginning of Section 2, a paper by Rutar et al. is cited in support of the claim that four bug-finding tools are "widely used". But Rutar et al.'s paper is a comparison of how the tools behave on various Java programs; although it states they are "well-known", it does not show how widely they are used.
- "...it is widely reported that false positive rates for the first-presented warnings are critical to user acceptance of a bug-finding tool." (pp. 52-53) No citations of where this has been widely reported, although I suspect it comes from Kremenek and Engler's z-ranking paper (p. 296, item 1).
- If Kremenek and Engler's abovementioned claim ("In our experience, and from discussions with other practitioners, users tend to immediately discard a tool if the first two or three error reports are false positives, giving these first few slots an enormous importance") is generally true, the HWP results in Figure 9 don't look very good, since it appears that the first ~5 top warnings for Columba and Lucene (and the first ~2 for Scarab) are false positives.
- (Minor) This paper could have benefited from better proofreading; the level of typos and basic mistakes is high enough to be annoying.
Questions
- The authors use large revision sets (> 1000 revisions, > 500 used as training set) to train their tool. It would have been interesting to see a more detailed exploration of how the effectiveness of their algorithm varies depending on the size of the revision history. How well do you think it would work for smaller revision histories?
- It would have been interesting to see whether the category weightings they obtained varied considerably from codebase to codebase, or whether any general trends emerged. (By applying their analysis to lots of codebases, could we draw any conclusions about which categories of warning seem more useful in general?)
Belief
The approach seems like a good idea; historical bug fix information is a reasonable way to demonstrate that certain classes of warnings have proven important. However, I find the evaluation somewhat weak since (as mentioned above) the authors used a history-based precision measurement to judge the effectiveness of their history-based prioritization technique against tool prioritization, possibly biasing the evaluation towards a favourable result for their HWP approach. I wouldn't be convinced that their approach works well without further evaluation using different precision measurement techniques.