CHI 2017 Papers and Notes
Reviews of submission #1311: "High Costs and Small Benefits: A Field Study of How Users Experience Operating System Upgrades"

------------------------ Submission 1311, Review 1 ------------------------

Reviewer: primary (1AC)
Overall rating: 3.5 (scale is 0.5..5; 5 is best)
Expertise: 3 (Knowledgeable)
Recommendation: Between neutral and possibly accept; 3.5
Award Nomination: If accepted, this paper would not be among the top 20% of papers presented at CHI

Your Assessment of this Paper's Contribution to HCI

This paper contributes to knowledge about how users experience operating system upgrades.

The Review

I share the reviewers' concerns about the limitations of this research: the limited sample size, the biased sample (highly educated, technical), and the difficulty of drawing strong conclusions given the diversity of upgrade types and devices in the second study. That said, I think the paper has several strengths: it is well written, the methodology (sample aside) is well thought out and valid, and the authors do reflect on the limitations of their approach.

1AC: The Meta-Review

All of the reviewers agree that this is an interesting topic area: we do not yet have a good understanding of how users experience operating system upgrades. All reviewers share concerns about the limited sample size and the biased nature of the sample, particularly for the second study. Reviewers are divided on whether the limitations resulting from the sample size, bias, and breadth of platforms in the second study significantly limit the contribution of the paper [R2, R3], or whether the paper still makes a sufficient contribution given the limited research done in this area [R4]. The authors need to better justify why their study provides a strong enough contribution for a CHI paper given the limitations of the sample that was recruited. In addition to the sample concerns, the following items should be addressed in a rebuttal:

- provide justification for the use of only counts for valence/emotion [R2]
- provide more reflection on the findings in relation to memory and the peak-end effect [R2] and re-evaluate the claims being made [R5]
- re-evaluate the claim that individual differences are at play because OS and type of upgrade do not seem to influence the changes experienced; given the limited sample size and device breadth/overlap, can the findings justify this claim? [R3]
- contrast Study 1 findings with those from Tudor Dumitraș' group at U. Maryland, which has data on software update installation rates and delays worldwide [R4]
- in the intro, the extensive reliance on news articles makes it hard to discern what is academic fact vs. popular press [R4]

Post PC meeting: A third AC was added to this paper due to the variation in scores (both among the reviewers and between the 1AC and 2AC). The 3AC read the paper and gave a positive review. After discussion among the ACs, there was general agreement that the paper made a strong enough contribution, and it was recommended for acceptance at CHI. The authors received a great deal of feedback in the reviews and should implement all changes described in the rebuttal, as well as address any other comments raised in the reviews, so that the final version of the paper is improved. I look forward to seeing this paper at CHI!

Rebuttal response

I thank the authors for their rebuttal. The reviewers' scores remain unchanged post-rebuttal, and this paper will be discussed at the PC meeting.
------------------------ Submission 1311, Review 5 ------------------------

Reviewer: secondary (2AC)
Overall rating: 4 (scale is 0.5..5; 5 is best)
Expertise: 2 (Passing Knowledge)
Recommendation: Possibly Accept: I would argue for accepting this paper; 4.0
Award Nomination:

Your Assessment of this Paper's Contribution to HCI

In my view the main contribution to HCI is in the design recommendations. However, some of the recommendations can be inferred from earlier work, and perhaps others could have been uncovered from an interview study with a few users, as they seem fairly obvious. Given the small and biased sample, I'm not sure what broader contribution is made.

The Review

I agree with some of the other reviewers that the sample is problematic, and that problems with sample size and possible bias limit the generalizability of the results.

Given that grounded theory was used with two coders, I'm surprised that no assessment of coding reliability was reported: "Two members of the research team initially coded all the data from the observations separately, then compared the results and discussed the categories. We grouped high-level themes emerging from the data after multiple iterations and discussed them regularly with the rest of the research team."

I also had a problem with how the peak-end effect was handled: "These results expand previous findings about peak-end effects and duration neglect in relation to memory." I don't really see what the claimed (Conclusions) expansion of previous findings regarding the peak-end effect is in this case. "Only two participants mentioned a peak moment, three mentioned the ending of the process, and three the beginning. However, when asked specifically about the ending, most participants (8) had incomplete recollections." I don't think one should be making major theoretical claims based on the comments of a few people.

Overall, I think the paper could be shortened considerably, with the key findings summarized in one or two tables.

1AC: The Meta-Review

Rebuttal response

I had a look at the rebuttal. I was concerned that the authors didn't do any assessment of the reliability of their coding. The response was as follows: "Our lightweight grounded theory approach is more in line with thematic analysis, where inter-coder reliability (2AC) is uncommon. Our goal was not to make any statistical claims. As for counts of valence/emotions (R2), we were careful to not over-interpret them. They give an idea of the relative frequency, but we make no further claims."

I wonder about a study that doesn't seek to make any statistical claims, but leaving that issue aside, I would also think that one should always be interested in whether or not coding is reliable. If the coding depends on who is doing the coding or what the time of day is, then I think there is a problem. Why would anyone report unreliable data, and if you don't check for reliability, how can you know the data is reliable?

My other main concern is the treatment of the peak-end effect. I don't see how one can make theoretical claims based on text excerpts from 2 or 3 people. I also think that the paper is verbose and that the long list of text excerpts could be shortened. However, the other reviewers were generally more positive about the paper, with one reviewer giving a rating of 4.5.
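[Editor's aside: as a minimal sketch of the inter-coder reliability check discussed above (and recommended in the 3AC review below), the snippet computes Cohen's kappa, agreement between two coders corrected for chance, over a doubly-coded sample of expressions. The valence labels are hypothetical illustrations, not data from the submission.]

    # Minimal sketch of an inter-coder reliability check via Cohen's kappa.
    # The two coders' valence labels below are hypothetical, purely for
    # illustration; they are not taken from the paper under review.
    from collections import Counter

    coder_a = ["negative", "negative", "positive", "neutral", "negative", "positive"]
    coder_b = ["negative", "neutral", "positive", "neutral", "negative", "negative"]

    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n  # raw agreement

    # Agreement expected by chance, from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n**2

    kappa = (observed - expected) / (1 - expected)
    print(f"observed agreement = {observed:.2f}, kappa = {kappa:.2f}")

Commonly cited rules of thumb treat kappa above roughly 0.6 as substantial agreement; a much lower value would suggest another iteration on the codebook before reporting counts.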
------------------------ Submission 1311, Review 6 ------------------------

Reviewer: secondary (2AC)
Overall rating: 4 (scale is 0.5..5; 5 is best)
Expertise: 3 (Knowledgeable)
Recommendation: Possibly Accept: I would argue for accepting this paper; 4.0
Award Nomination: If accepted, this paper would not be among the top 20% of papers presented at CHI

Your Assessment of this Paper's Contribution to HCI

This paper contributes a nice mixed-methods approach to analysing the way that people experience and remember OS updates. A value of the paper is the mixed methods and multiple studies, despite the fact that each could be improved.

The Review

I was brought in to look at this paper as a 3AC because of the difference of opinion between the two main ACs. In particular, there were concerns about whether we could reliably accept the results because of a) the low N, and b) the lack of an inter-rater reliability check. Overall, while both could be improved (a larger N, or the inclusion of an IRR kappa score) to give us more confidence in the results, I found that a) the mixture of techniques used and the good choice of methods, and b) the process described for the iterative refinement of the codebook were enough to give me confidence. Plenty of papers that present qualitative analyses do not take the IRR step, but plenty do; I would recommend you do this in the future as a good validity/reliability check to strengthen confidence in your results (take a sample of expressions and check that an independent reviewer would classify them the same way).

I don't think it's necessarily a problem that the N=14 participants are all doing different software updates, although, again, more people would be better; you are doing a 4-week diary study with these people, which is a very rich insight. Together, the two studies and the discussion with experts form a really valuable combined approach.

One thing, however: I would highly recommend you remove the words 'grounded theory' from your paper. It is a frequent mistake to equate qualitative coding, or emergent thematic analysis, with grounded theory, but Grounded Theory is a much bigger methodological approach that includes thematic coding. I myself have made this mistake in the past (calling it GT), and it would be better that you don't perpetuate it. You are doing both inductive and deductive thematic analysis, given that you use frameworks from the literature for some aspects of it.

So in conclusion, I recommend accept, and I am happy that this aligns with the high scores given by the most expert reviewers.

1AC: The Meta-Review

Rebuttal response

------------------------ Submission 1311, Review 2 ------------------------

Reviewer: external
Overall rating: 2.5 (scale is 0.5..5; 5 is best)
Expertise: 2 (Passing Knowledge)
Recommendation: Between possibly reject and neutral; 2.5
Award Nomination: If accepted, this paper would not be among the top 20% of papers presented at CHI

Your Assessment of this Paper's Contribution to HCI

This paper examines operating system upgrades from a user experience perspective. Most work related to operating system upgrades concerns issues of security, so this paper provides a unique perspective and contribution. The paper contains recommendations to improve the user experience of system upgrade processes.

The Review

Significance of contribution / Originality of work: This paper examines operating system upgrades from a user experience perspective.
Most work related to operating system upgrades concerns issues of security, so this paper provides a unique perspective and contribution. The paper contains recommendations to improve the user experience of system upgrade processes.

Validity of work: The data collection approach used observations, interviews, and a diary study. The work would have benefitted from more users for each device type in the diary study (e.g., only 2 iOS users). The data analysis used grounded theory and some theories of emotion from psychology to group and categorize the data. However, the analysis consists mostly of counts of occurrences of valence types and emotion types, and the use of counts alone is not sufficiently justified.

Presentation clarity: The paper is clear and well written.

Relevant previous work: Prior work is adequately reviewed.

Other recommendations:
- The background section on memory and peak-end effects is not sufficiently detailed to clearly indicate its relevance to the user experience of updates.
- More users for each device type would have been preferable (e.g., only 2 iOS users).
- Table 3 has an extra "1" in the last row, not corresponding to a listed positive emotion.
- Additional data analysis methods, beyond counts of occurrences, would have been preferable.
- Insufficient reflection on the findings in relation to memory and peak-end effects. Why might you have found a stronger impact of duration than expected?
- Several typos or missing words on the last page.

Rebuttal response

------------------------ Submission 1311, Review 3 ------------------------

Reviewer: external
Overall rating: 3.5 (scale is 0.5..5; 5 is best)
Expertise: 3 (Knowledgeable)
Recommendation: Between neutral and possibly accept; 3.5
Award Nomination: If accepted, this paper would not be among the top 20% of papers presented at CHI

Your Assessment of this Paper's Contribution to HCI

The paper covers a critical area: the user experience of software (specifically OS) upgrades. The authors expand on previous work by Vaniea and colleagues by extracting users' perceptions of the costs and benefits of upgrading and coming up with some interesting design recommendations.

The Review

This paper aims to extend the work of Vaniea by doing an in-depth analysis of the user experience of software updates, specifically OS updates. The authors performed two studies: the first was a simple retrospective study looking at users' OS X upgrade history and determining how long after the initial release they upgraded; the second, larger study looked at the details of the upgrade experience of 14 participants, including surveys, observation of the actual upgrade, and a post-upgrade diary study. In both studies, participants had above-average education, and probably above-average technical expertise (the level of expertise in the first study is unclear, except that 51% reported having very high technical skills).

I really struggled with this paper. Overall, I believe the area is of critical import and the general approach is sound. There is no doubt that the security risks associated with software that has known defects are significant, and user attitudes against updating (based on previous poor experiences) are one reason there are so many unpatched and vulnerable systems on the internet. So, understanding user attitudes and improving the user experience by providing design suggestions based on well-designed studies could have significant impact.
However, the real flaw in this paper, which the authors point out themselves, is the biased and small samples. Even in the first study, which included 394 upgrades, the authors suggest that their results might be conservative because "very high technical skills ... correlate with faster installations". This is certainly something that could be tested statistically if there were enough data with a broad enough range of skill levels. The situation gets even worse in the second study, which had only 14 participants, involved multiple platforms and multiple upgrade types (major vs. minor), and was (if anything) even more biased in terms of education and technical skill levels. In fact, 50% of the participants had a background in computer science!

At one point, the authors state: "The OS they upgraded or the type of upgrade they performed (minor vs. major) do not seem to influence the changes they experienced, suggesting that individual differences are at play instead." However, looking at the table of users, OSes, and upgrade types, there is so little overlap (only P3, P6, and P14 performed multiple upgrades) that I'm not sure how they can make that statement.

Nevertheless, there were some interesting results that are well worth pursuing with larger, more balanced samples. In particular, I was struck by the fact that 11 of the 14 participants mentioned the duration of the process rather than any of the particular pain points in the upgrade itself. The result that the upgrade was worth it in the end, despite the perception that costs outweighed benefits early on, is surprising and suggests that perhaps some form of accommodation is happening.

Overall, this is a paper that outlines a nice methodological approach but suffers from too small a sample spread over too many upgrade types (14 users, 17 upgrades, 5 platforms, 2 upgrade types). I'm sympathetic to the authors' plight of recruiting volunteers, but I can't help but believe that this seriously limits the value of the work. Fourteen users on one platform and one upgrade type might provide more value, or many more users (with a broader range of educational backgrounds and technical expertise) would significantly strengthen the effort.

Minor edits: page 10: "follwed" should be "followed"; "principales" should be "principles".

Rebuttal response

I have read the authors' rebuttal and appreciate their detailed responses to the concerns raised by myself and the other reviewers. I do think there is definite value in this paper, but I remain troubled by some of the sampling issues. In the rebuttal, the authors state that "previous work suggests that expertise does not affect issues experienced during an upgrade," citing the 2016 Vaniea paper. However, the Vaniea sample, while certainly biased in terms of IT expertise, was not as biased in terms of education. Even then, the statement they are referencing, "Generally we found that people with more technical experience had similar issues to those with less experience, though the language they used when describing the issues tended to be more detailed," offered no evidence to support the claim. The differences in installation rates between experts and non-experts mentioned by Ion (2015) are substantiated more strongly; however, it's important to recognize that the "expertise" being measured there is computer security expertise, not more general IT experience or education.
After reading the Nappa study, however, I accept that it provides good evidence of increased rates of installing security patches in the three categories it focuses on, and you should certainly reference it in your paper. The authors state that they "believe the long recruitment is a result in and of itself". I'm not sure what they mean by that; perhaps that OS updates are hard? I'm afraid I'm unconvinced by their assertion that 14 subjects spread across 6 different cases are enough. My major concern is not the 14 subjects; it's that the 14 subjects are performing 6 different tasks, with very little overlap between the tasks. For example, in the authors' discussion of the duration of the upgrade, they describe long wait times frustrating users. I deduce (although in the case of P14 I don't know) that this only referred to the major upgrades. Spreading the population across multiple upgrade types and multiple platforms significantly weakens the conclusions.

In the end, I remain moderately upbeat about this work, and while I certainly appreciate the time the authors have put into their rebuttal, my major concerns with Study 2 in particular remain.

------------------------ Submission 1311, Review 4 ------------------------

Reviewer: external
Overall rating: 4.5 (scale is 0.5..5; 5 is best)
Expertise: 4 (Expert)
Recommendation: Between possibly accept and strong accept; 4.5
Award Nomination:

Your Assessment of this Paper's Contribution to HCI

The work explores how people experience and form opinions about software updates. The authors conduct two large studies and one smaller one. The first study uses log file information to measure how long users delay major upgrades to their Apple software. The second was an in-depth observation of users installing software, followed by an experience-sampling study. The final study was a short set of interviews with system administrators about how they manage updates and why updating is important.

The Review

Overall I think this work should be accepted to CHI. I think it fits the goals of CHI well and presents valuable information. There are several notable limitations (see below); however, given the lack of research in the area, I feel the work makes a significant contribution and has the potential to help future researchers and designers improve the usability of software updating.

Pros

* The three studies ask good questions aimed at the primary issues with updates. The strongest study here is really Study 2, where the authors combined a participant observation with a longer-term experience sample to get a better sense of both what the update was like and how people's experience with it continued over the next several weeks. Given the reliance of prior work on retrospective methods, I liked that the authors tested how people related the update a month later.

* The work does a good job of acknowledging prior work and then building on or complementing it. The authors do a good job of pointing out where their results are in line with or differ from prior work. Given the somewhat small sample sizes, this balance with prior work adds strength to the paper.

* The observations in the second study are well described, using clear language and pulling out key facts. The takeaways of the work are clear and easy to identify.

* I am a bit on the fence about Study 1. On the one hand, the authors are targeting a good question, and their technical methodology here appears sound.
The results are also interesting to read; even if the sample is small, the issues the authors are seeing are very real and help highlight the problem for the remainder of the paper. The primary issue is with the sampling. Participants were not compensated and are likely a biased sample, which means that while the authors can get some interesting data, they cannot really answer the question of how long it takes to get software upgraded. There exist more comprehensive studies on this topic; for example, Tudor Dumitraș' group at U. Maryland has been looking at software update installation rates and delays worldwide.

Cons

* The number of news stories cited in the introduction was high. I understand that there is limited work in the area, and in many cases the authors are citing a news story simply to establish that some software exists at all. But it became difficult to differentiate between academic fact and a news article saying that something had been annoying for some users.

* Participants were uncompensated in all the studies. Given the high ask in the first two studies, I imagine that the lack of compensation impacted the choice of participants; namely, participants who believe in helping scientists out of goodwill are likely the majority of the participants. The samples used here are biased in many directions, including the above, as well as higher education levels and connection to campus life. The biases are more of an issue in the first study, as its goal was to quantify how long people delay upgrades. The second study has fewer issues, since its goal was more to observe what updates looked like for users.

* It is somewhat of a minor point, but the interchangeable use of "update" and "upgrade" was very jarring to me when reading. I am so used to "upgrade" meaning a large change, like Windows 7 to Windows 8, while an "update" tends to mean a smaller change, like a regular Windows patch or a small adjustment to Chrome.

Minor issues

* p1: "or because of security flaws" <- something odd happened to this sentence
* p4: "expresssions"

Rebuttal response