Kim, S. and Ernst, M. D. 2007. Which warnings should I fix first? In Proceedings of the 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (Dubrovnik, Croatia, September 03 - 07, 2007). ESEC-FSE '07. ACM, New York, NY, 45-54. DOI= http://doi.acm.org/10.1145/1287624.1287633
Weaknesses
- Using a history-based warning validity metric to judge the precision of both the tools' built-in warning prioritization and their novel history-based warning prioritization seems liable to bias the evaluation towards a favourable result for their technique. Their result would be more convincing if they also compared the prioritizations using some other measurement (perhaps even a study with real developers).
- There are some points where citations are poorly used or missing:
  - At the beginning of Section 2, a paper by Rutar et al. is cited in support of the claim that four bug-finding tools are "widely used". But Rutar et al.'s paper is a comparison of how the tools work on various Java programs; although it states the tools are "well-known", it doesn't show how widely they are used.
- "...it is widely reported that false positive rates for the first-presented warnings are critical to user acceptance of a bug-finding tool." (pp. 52-53) No citations of where this has been widely reported, although I suspect it comes from Kremenek and Engler's z-ranking paper
(p. 296, item 1).
  - If Kremenek and Engler's above-mentioned claim ("In our experience, and from discussions with other practitioners, users tend to immediately discard a tool if the first two or three error reports are false positives, giving these first few slots an enormous importance") is generally true, the HWP results in Figure 9 don't look very good, since it appears that the first ~5 top warnings for Columba and Lucene (and the first ~2 for Scarab) are false positives.
- (Minor) This paper could have benefited from better proofreading; the level of typos and basic mistakes is high enough to be annoying.

Questions

Belief

The approach seems like a good idea; historical bug fix information is a reasonable way to demonstrate that certain classes of warnings have proven important. However, I find the evaluation somewhat weak since (as mentioned above) the authors used a history-based precision measurement to judge the effectiveness of their history-based prioritization technique against tool prioritization, possibly biasing the evaluation towards a favourable result for their HWP approach. I wouldn't be convinced that their approach works well without further evaluation using different precision measurement techniques.
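
For concreteness, here is a minimal sketch (mine, not the authors') of the kind of history-based weighting HWP builds on: warning categories whose instances tend to disappear during bug-fix changes get higher weight, and current warnings are ranked by their category's weight. The change-history representation and the exact weighting formula are simplifying assumptions on my part; the paper's actual algorithm differs in its details.

```python
from collections import Counter

# Rough sketch of history-based warning prioritization (my simplification,
# not the paper's exact algorithm). A warning category is weighted by the
# fraction of its historical removals that occurred in bug-fix changes.

def history_based_weights(change_history):
    """change_history: iterable of (removed_categories, is_bug_fix) pairs,
    one per commit, where removed_categories lists the categories of the
    warning instances that the commit eliminated."""
    fix_removals, all_removals = Counter(), Counter()
    for removed_categories, is_bug_fix in change_history:
        for category in removed_categories:
            all_removals[category] += 1
            if is_bug_fix:
                fix_removals[category] += 1
    return {cat: fix_removals[cat] / count for cat, count in all_removals.items()}

def prioritize(warnings, weights):
    """Rank current warnings so categories with higher historical weight come first.
    warnings: iterable of (warning_id, category) pairs."""
    return sorted(warnings, key=lambda w: weights.get(w[1], 0.0), reverse=True)

# Hypothetical example: "NULL_DEREF" warnings were removed only by bug fixes,
# so they outrank "STYLE" warnings, some of which vanished in ordinary edits.
history = [(["NULL_DEREF", "STYLE"], True), (["STYLE"], False), (["NULL_DEREF"], True)]
weights = history_based_weights(history)
print(prioritize([("w1", "STYLE"), ("w2", "NULL_DEREF")], weights))
```

The circularity concern raised above is visible even in this sketch: if the same fix-change history both sets these weights and defines which top-ranked warnings count as "true positives" in the evaluation, a favourable precision result for HWP is partly built in.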