Kim, S. and Ernst, M. D. 2007. Which warnings should I fix first? In Proceedings of the 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (Dubrovnik, Croatia, September 03-07, 2007). ESEC-FSE '07. ACM, New York, NY, 45-54. DOI=http://doi.acm.org/10.1145/1287624.1287633
Problem
Automatic bug-finding tools such as FindBugs, JLint, and PMD generate large numbers of warnings, but most of these warnings are not useful. The prioritization provided by the tools does little to help the user find the warnings that indicate real bugs.
Contributions
- A technique for estimating which warnings generated by a bug-finding tool on a historical code revision are false positives: if a warning occurs on a known "bug-related" line, it is judged to be a valid warning; otherwise it is judged to be a false positive. A line is "bug-related" iff it is modified or removed in any future revision whose changelog suggests it is a bugfix. (A sketch of this classification follows the list.)
- Doing bug marking on a line-by-line basis is claimed to be novel; previous bug-prediction approaches apparently did their training based on the properties of higher-level entities such as functions, modules and files.
- An algorithm for prioritizing categories of warnings by the frequency with which warnings in each category historically led to actual bugfixes in the revision history (also illustrated in the sketch below).
- Finer-grained prioritization levels than in existing bug-finding tools.
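
To make the first and third contributions concrete, here is a minimal Python sketch of the idea as summarized above: historical warnings are classified as true or false positives by checking whether they sit on lines later touched by a bug-fix revision, and per-category weights derived from those classifications are used to rank new warnings. The data shapes (the Warning record, the (changelog, changed_lines) pairs) and the changelog keyword heuristic are illustrative assumptions, not the authors' exact implementation.

```python
from collections import Counter
from dataclasses import dataclass

# Illustrative data shape; field names are assumptions, not the paper's.
@dataclass(frozen=True)
class Warning:
    category: str   # e.g. a FindBugs/JLint/PMD warning category
    file: str
    line: int

def is_bug_fix(changelog: str) -> bool:
    """Heuristic in the spirit of the paper: a revision whose changelog
    mentions fix/bug keywords is treated as a bug-fix revision."""
    msg = changelog.lower()
    return any(kw in msg for kw in ("fix", "bug", "defect", "patch"))

def bug_related_lines(future_revisions):
    """Collect (file, line) pairs that are modified or removed by any
    future revision that looks like a bug fix. `future_revisions` is an
    iterable of (changelog, changed_lines) pairs, where changed_lines is
    a set of (file, line) pairs in the old revision's coordinates
    (assumed to come from diff tooling not shown here)."""
    related = set()
    for changelog, changed_lines in future_revisions:
        if is_bug_fix(changelog):
            related.update(changed_lines)
    return related

def category_weights(training_warnings, related):
    """For each warning category, estimate the fraction of historical
    warnings that landed on bug-related lines. A higher weight means the
    category was more likely to indicate a real bug."""
    hits, totals = Counter(), Counter()
    for w in training_warnings:
        totals[w.category] += 1
        if (w.file, w.line) in related:
            hits[w.category] += 1
    return {cat: hits[cat] / totals[cat] for cat in totals}

def prioritize(new_warnings, weights, default=0.0):
    """Rank fresh warnings so that categories which historically
    predicted bug fixes come first."""
    return sorted(new_warnings,
                  key=lambda w: weights.get(w.category, default),
                  reverse=True)
```

This sketch ranks purely by per-category historical precision; the paper's actual prioritization is finer-grained than this, but the training loop above captures the core of the history-based approach.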
Weaknesses
- Using a history-based warning-validity metric to judge the precision of both the existing tools' warning prioritization and their novel history-based prioritization seems liable to bias the evaluation towards a favourable result for their technique. The result would be more convincing if they also compared the prioritizations using some other measurement (perhaps even a study with real developers).
- (Minor) This paper could have benefited from better proofreading; the number of typos is high enough to be annoying.
Questions
- The authors use large revision sets (> 1000 revisions, > 500 used as training set) to train their tool. It would have been interesting to see a more detailed exploration of how the effectiveness of their algorithm varies depending on the size of the revision history. How well do you think it would work for smaller revision histories?
- It would have been interesting to see whether the category weightings they obtained varied considerably from codebase to codebase, or whether any general trends emerged. (By applying their analysis to lots of codebases, could we draw any conclusions about which categories of warning seem more useful in general?)
Belief