CHI 2017 Papers and Notes
Reviews of submission #1311: "High Costs and Small Benefits: A Field Study of How Users Experience Operating System Upgrades"

------------------------ Submission 1311, Review 1 ------------------------

Reviewer: primary (1AC)
Overall rating: 3.5 (scale is 0.5..5; 5 is best)
Expertise: 3 (Knowledgeable)
Recommendation: Between neutral and possibly accept; 3.5
Award Nomination: If accepted, this paper would not be among the top 20% of papers presented at CHI

Your Assessment of this Paper's Contribution to HCI

This paper contributes to knowledge about how users experience operating system upgrades.

The Review

I share the reviewers' concerns about the limitations of this research: the limited sample size, the biased sample (highly educated, technical), and the difficulty of drawing strong conclusions given the diversity of upgrade types and devices in the second study. That said, I think the paper has several strengths: it is well written, the methodology (sample aside) is well thought out and valid, and the authors do reflect on the limitations of their approach.

1AC: The Meta-Review

All of the reviewers agree that this is an interesting topic area: we do not yet have a good understanding of how users experience operating system upgrades. All reviewers share concerns about the limited sample size and the biased nature of the sample, particularly for the second study. Reviewers are divided on whether the limitations resulting from the sample size, bias, and breadth of platforms in the second study significantly limit the contribution of the paper [R2, R3], or whether the paper still makes a sufficient contribution given the limited research done in this area [R4]. The authors need to better justify why their study provides a strong enough contribution for a CHI paper given the limitations of the sample that was recruited. In addition to the sample concerns, the following items should be addressed in a rebuttal:

- provide justification for the use of only counts for valence/emotion [R2]
- provide more reflection on the findings in relation to memory and the peak-end effect [R2] and re-evaluate the claims being made [R5]
- re-evaluate the claim that individual differences are at play because OS and type of upgrade do not seem to influence the changes experienced; given the limited sample size and device breadth/overlap, can the findings justify this claim? [R3]
- contrast Study 1 findings with those from Tudor Dumitraș' group at U. Maryland, which has data on software update installation rates and delays worldwide [R4]
- in the intro, the extensive reliance on news articles makes it hard to discern what is academic fact vs. popular press [R4]

Post PC meeting: A third AC was added to this paper due to the variation in scores (both among the reviewers and between the 1AC and 2AC). The 3AC read the paper and gave a positive review. After discussion among the ACs, there was general agreement that the paper made a strong enough contribution, and it was recommended for acceptance at CHI. The authors received a great deal of feedback in the reviews and should implement all changes described in the rebuttal, as well as address any other comments raised in the reviews, so that the final version of the paper is improved. I look forward to seeing this paper at CHI!

Rebuttal response

I thank the authors for their rebuttal. The reviewers' scores remain unchanged post-rebuttal, and this paper will be discussed at the PC meeting.
------------------------ Submission 1311, Review 5 ------------------------

Reviewer: secondary (2AC)
Overall rating: 4 (scale is 0.5..5; 5 is best)
Expertise: 2 (Passing Knowledge)
Recommendation: Possibly Accept: I would argue for accepting this paper; 4.0
Award Nomination:

Your Assessment of this Paper's Contribution to HCI

In my view the main contribution to HCI is in the design recommendations. However, some of the recommendations can be inferred from earlier work, and perhaps others could have been uncovered from an interview study with a few users, as they seem fairly obvious. Given the small and biased sample, I'm not sure what broader contribution is made.

The Review

I agree with some of the other reviewers that the sample is problematic, and that problems with sample size and possible bias limit the generalizability of the results.

Given that grounded theory was used with two coders, I'm surprised that no assessment of coding reliability was reported: "Two members of the research team initially coded all the data from the observations separately, then compared the results and discussed the categories. We grouped high-level themes emerging from the data after multiple iterations and discussed them regularly with the rest of the research team."

I also had a problem with how the peak-end effect was handled: "These results expand previous findings about peak-end effects and duration neglect in relation to memory." I don't really see what the claimed (Conclusions) expansion of previous findings regarding the peak-end effect is in this case. "Only two participants mentioned a peak moment, three mentioned the ending of the process, and three the beginning. However, when asked specifically about the ending, most participants (8) had incomplete recollections." I don't think one should be making major theoretical claims based on the comments of a few people.

Overall, I think the paper could be shortened considerably, with the key findings summarized in one or two tables.

1AC: The Meta-Review

Rebuttal response

I had a look at the rebuttal. I was concerned that the authors didn't do any assessment of the reliability of their coding. The response was as follows: "Our lightweight grounded theory approach is more in line with thematic analysis, where inter-coder reliability (2AC) is uncommon. Our goal was not to make any statistical claims. As for counts of valence/emotions (R2), we were careful to not over-interpret them. They give an idea of the relative frequency, but we make no further claims."

I wonder about a study that doesn't seek to make any statistical claims, but leaving that issue aside, I would also think that one should always be interested in whether or not coding is reliable. If the coding depends on who is doing the coding or what the time of day is, then I think there is a problem. Why would anyone report unreliable data, and if you don't check for reliability, how can you know the data is reliable?

My other main concern is the treatment of the peak-end effect. I don't see how one can make theoretical claims based on text excerpts from 2 or 3 people. I also think that the paper is verbose and that the long list of text excerpts could be shortened. However, the other reviewers were generally more positive about the paper, with one reviewer giving a rating of 4.5.
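[Editor's aside: as a minimal sketch of the inter-coder reliability check discussed above (and recommended in the 3AC review below), the snippet computes Cohen's kappa, agreement between two coders corrected for chance, over a doubly-coded sample of expressions. The valence labels are hypothetical illustrations, not data from the submission.]

    # Minimal sketch of an inter-coder reliability check via Cohen's kappa.
    # The two coders' valence labels below are hypothetical, purely for
    # illustration; they are not taken from the paper under review.
    from collections import Counter

    coder_a = ["negative", "negative", "positive", "neutral", "negative", "positive"]
    coder_b = ["negative", "neutral", "positive", "neutral", "negative", "negative"]

    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n  # raw agreement

    # Agreement expected by chance, from each coder's marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n**2

    kappa = (observed - expected) / (1 - expected)
    print(f"observed agreement = {observed:.2f}, kappa = {kappa:.2f}")

Commonly cited rules of thumb treat kappa above roughly 0.6 as substantial agreement; a much lower value would suggest another iteration on the codebook before reporting counts.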
------------------------ Submission 1311, Review 6 ------------------------

Reviewer: secondary (2AC)
Overall rating: 4 (scale is 0.5..5; 5 is best)
Expertise: 3 (Knowledgeable)
Recommendation: Possibly Accept: I would argue for accepting this paper; 4.0
Award Nomination: If accepted, this paper would not be among the top 20% of papers presented at CHI

Your Assessment of this Paper's Contribution to HCI

This paper contributes a nice mixed-methods approach to analysing the way that people experience and remember OS updates. A value of the paper is the mixed methods and multiple studies, despite the fact that each could be improved.

The Review

I was brought in to look at this paper as a 3AC because of the difference of opinion between the two main ACs. In particular, there were concerns about whether we could reliably accept the results because of a) the low N, and b) the lack of an inter-rater reliability check. Overall, while both could be improved (a larger N, or the inclusion of an IRR kappa score) to give us more confidence in the results, I found that a) the mixture of techniques used and the good choice of methods, and b) the process described for the iterative refinement of the codebook were enough to give me confidence. Plenty of papers that present qualitative analyses do not take the IRR step, but plenty do; I would recommend you do this in the future as a good validity/reliability check to strengthen confidence in your results (take a sample of expressions and check that an independent reviewer would classify them the same way).

I don't think it's necessarily a problem that the N=14 participants are all doing different software updates, although, again, more people would be better; you are doing a 4-week diary study with these people, which is a very rich insight. Together, the two studies and the discussion with experts form a really valuable combined approach.

One thing, however: I would highly recommend you remove the words 'grounded theory' from your paper. It is a frequent mistake to equate qualitative coding, or emergent thematic analysis, with grounded theory, but Grounded Theory is a much bigger methodological approach that includes thematic coding. I myself have made this mistake in the past (calling it GT), and it would be better that you don't perpetuate it. You are doing both inductive and deductive thematic analysis, given that you use frameworks from the literature for some aspects of it.

So in conclusion, I recommend accept, and I am happy that this aligns with the high scores given by the most expert reviewers.

1AC: The Meta-Review

Rebuttal response

------------------------ Submission 1311, Review 2 ------------------------

Reviewer: external
Overall rating: 2.5 (scale is 0.5..5; 5 is best)
Expertise: 2 (Passing Knowledge)
Recommendation: Between possibly reject and neutral; 2.5
Award Nomination: If accepted, this paper would not be among the top 20% of papers presented at CHI

Your Assessment of this Paper's Contribution to HCI

This paper examines operating system upgrades from a user experience perspective. Most work related to operating system upgrades concerns issues of security, so this paper provides a unique perspective and contribution. The paper contains recommendations to improve the user experience of system upgrade processes.

The Review

Significance of contribution / Originality of work: This paper examines operating system upgrades from a user experience perspective.
Most work related to operating system upgrades concerns issues of security, so this paper provides a unique perspective and contribution. The paper contains recommendations to improve the user experience of system upgrade processes.

Validity of work: The data collection approach used observations, interviews, and a diary study. The work would have benefitted from more users for each device type in the diary study (e.g., only 2 iOS users). The data analysis used grounded theory and some theories of emotion from psychology to group and categorize the data. However, the analysis consists mostly of counts of occurrences of valence types and emotion types, and the use of counts alone is not sufficiently justified.

Presentation clarity: The paper is clear and well written.

Relevant previous work: Prior work is adequately reviewed.

Other recommendations:
- The background section on memory and peak-end effects is not sufficiently detailed to clearly indicate its relevance to the user experience of updates.
- More users for each device type would have been preferable (e.g., only 2 iOS users).
- Table 3 has an extra "1" in the last row, not corresponding to a listed positive emotion.
- Additional data analysis methods, beyond counts of occurrences, would have been preferable.
- Insufficient reflection on the findings in relation to memory and peak-end effects. Why might you have found a stronger impact of duration than expected?
- Several typos or missing words on the last page.

Rebuttal response

------------------------ Submission 1311, Review 3 ------------------------

Reviewer: external
Overall rating: 3.5 (scale is 0.5..5; 5 is best)
Expertise: 3 (Knowledgeable)
Recommendation: Between neutral and possibly accept; 3.5
Award Nomination: If accepted, this paper would not be among the top 20% of papers presented at CHI

Your Assessment of this Paper's Contribution to HCI

The paper covers a critical area: the user experience of software (specifically OS) upgrades. The authors expand on previous work by Vaniea and colleagues by extracting users' perceptions of the costs and benefits of upgrading and coming up with some interesting design recommendations.

The Review

This paper aims to extend the work of Vaniea by doing an in-depth analysis of the user experience of software updates, specifically OS updates. The authors performed two studies: the first was a simple retrospective study looking at users' OS X upgrade history and determining how long after the initial release they upgraded; the second, larger study looked at the details of the upgrade experience of 14 participants, including surveys, observation of the actual upgrade, and a post-upgrade diary study. In both studies, participants had above-average education, and probably above-average technical expertise (the level of expertise in the first study is unclear, except that 51% reported having very high technical skills).

I really struggled with this paper. Overall, I believe the area is of critical import and the general approach is sound. There is no doubt that the security risks associated with software that has known defects are significant, and user attitudes against updating (based on previous poor experiences) are one reason there are so many unpatched and vulnerable systems on the internet. So, understanding user attitudes and improving the user experience by providing design suggestions based on well-designed studies could have significant impact.
However, the real flaw in this paper, which the authors point out themselves, is the biased and small samples. Even in the first study, which included 394 upgrades, the authors suggest that their results might be conservative because "very high technical skills ... correlate with faster installations". This is certainly something that could be tested statistically if there were enough data with a broad enough range of skill levels. The situation gets even worse in the second study, which had only 14 participants, involved multiple platforms and multiple upgrade types (major vs. minor), and was (if anything) even more biased in terms of education and technical skill levels. In fact, 50% of the participants had a background in computer science!

At one point, the authors state: "The OS they upgraded or the type of upgrade they performed (minor vs. major) do not seem to influence the changes they experienced, suggesting that individual differences are at play instead." However, looking at the table of users, OSes, and upgrade types, there is so little overlap (only P3, P6, and P14 performed multiple upgrades) that I'm not sure how they can make that statement.

Nevertheless, there were some interesting results that are well worth pursuing with larger, more balanced samples. In particular, I was struck by the fact that 11 of the 14 participants mentioned the duration of the process rather than any of the particular pain points in the upgrade itself. The result that the upgrade was worth it in the end, despite the perception that costs outweighed benefits early on, is surprising and suggests that perhaps some form of accommodation is happening.

Overall, this is a paper that outlines a nice methodological approach but suffers from too small a sample spread over too many upgrade types (14 users, 17 upgrades, 5 platforms, 2 upgrade types). I'm sympathetic to the authors' plight of recruiting volunteers, but I can't help but believe that this seriously limits the value of the work. Fourteen users on one platform and one upgrade type might provide more value, or many more users (with a broader range of educational backgrounds and technical expertise) would significantly strengthen the effort.

Minor edits: page 10: "follwed" should be "followed"; "principales" should be "principles".

Rebuttal response

I have read the authors' rebuttal and appreciate their detailed responses to the concerns raised by myself and the other reviewers. I do think there is definite value in this paper, but I remain troubled by some of the sampling issues. In the rebuttal, the authors state that "previous work suggests that expertise does not affect issues experienced during an upgrade," citing the 2016 Vaniea paper. However, the Vaniea sample, while certainly biased in terms of IT expertise, was not as biased in terms of education. Even then, the statement they are referencing, "Generally we found that people with more technical experience had similar issues to those with less experience, though the language they used when describing the issues tended to be more detailed," offered no evidence to support the claim. The differences in installation rates between experts and non-experts mentioned by Ion (2015) are substantiated more strongly; however, it's important to recognize that the "expertise" being measured there is computer security expertise, not more general IT experience or education.
After reading the Nappa study, however, I accept that it provides good evidence of increased rates of installing security patches in the three categories it focuses on, and you should certainly reference it in your paper. The authors state that they "believe the long recruitment is a result in and of itself". I'm not sure what they mean by that; perhaps that OS updates are hard? I'm afraid I'm unconvinced by their assertion that 14 subjects spread across 6 different cases are enough. My major concern is not the 14 subjects; it's that the 14 subjects are performing 6 different tasks, with very little overlap between the tasks. For example, in the authors' discussion of the duration of the upgrade, they describe long wait times frustrating users. I deduce (although in the case of P14 I don't know) that this only referred to the major upgrades. Spreading the population across multiple upgrade types and multiple platforms significantly weakens the conclusions.

In the end, I remain moderately upbeat about this work, and while I certainly appreciate the time the authors have put into their rebuttal, my major concerns with Study 2 in particular remain.

------------------------ Submission 1311, Review 4 ------------------------

Reviewer: external
Overall rating: 4.5 (scale is 0.5..5; 5 is best)
Expertise: 4 (Expert)
Recommendation: Between possibly accept and strong accept; 4.5
Award Nomination:

Your Assessment of this Paper's Contribution to HCI

The work explores how people experience and form opinions about software updates. The authors conduct two large studies and one smaller one. The first study uses log file information to measure how long users delay major upgrades to their Apple software. The second was an in-depth observation of users installing software, followed by an experience-sampling study. The final study was a short set of interviews with system administrators about how they manage updates and why updating is important.

The Review

Overall I think this work should be accepted to CHI. I think it fits the goals of CHI well and presents valuable information. There are several notable limitations (see below); however, given the lack of research in the area, I feel the work makes a significant contribution and has the potential to help future researchers and designers improve the usability of software updating.

Pros

* The three studies ask good questions aimed at the primary issues with updates. The strongest study here is really Study 2, where the authors combined a participant observation with a longer-term experience sample to get a better sense of both what the update was like and how people's experience with it continued over the next several weeks. Given the reliance of prior work on retrospective methods, I liked that the authors tested how people related the update a month later.

* The work does a good job of acknowledging prior work and then building on or complementing it. The authors do a good job of pointing out where their results are in line with or differ from prior work. Given the somewhat small sample sizes, this balance with prior work adds strength to the paper.

* The observations in the second study are well described, using clear language and pulling out key facts. The takeaways of the work are clear and easy to identify.

* I am a bit on the fence about Study 1. On the one hand, the authors are targeting a good question, and their technical methodology here appears sound.
The results are also interesting to read; even if the sample is small, the issues the authors are seeing are very real and help highlight the problem for the remainder of the paper. The primary issue is with the sampling. Participants were not compensated and are likely a biased sample, which means that while the authors can get some interesting data, they cannot really answer the question of how long it takes to get software upgraded. There exist more comprehensive studies on this topic; for example, Tudor Dumitraș' group at U. Maryland has been looking at software update installation rates and delays worldwide.

Cons

* The number of news stories cited in the introduction was high. I understand that there is limited work in the area, and in many cases the authors are citing a news story simply to establish that some software exists at all. But it became difficult to differentiate between academic fact and a news article saying that something had been annoying for some users.

* Participants were uncompensated in all the studies. Given the high ask in the first two studies, I imagine that the lack of compensation impacted the choice of participants; namely, participants who believe in helping scientists out of goodwill are likely the majority of the participants. The samples used here are biased in many directions, including the above, as well as higher education levels and connection to campus life. The biases are more of an issue in the first study, as its goal was to quantify how long people delay upgrades. The second study has fewer issues, since its goal was more to observe what updates looked like for users.

* It is somewhat of a minor point, but the interchangeable use of "update" and "upgrade" was very jarring to me when reading. I am so used to "upgrade" meaning a large change, like Windows 7 to Windows 8, while an "update" tends to mean a smaller change, like a regular Windows patch or a small adjustment to Chrome.

Minor issues

* p1: "or because of security flaws" <- something odd happened to this sentence
* p4: "expresssions"

Rebuttal response