Summative Usability Testing | Usability Body of Knowledge

Summative usability testing is summative evaluation of a product with representative users and tasks designed to measure the usability (defined as effectiveness, efficiency and satisfaction) of the complete product.

Summative usability testing is used to obtain measures to establish a usability benchmark or to compare results with usability requirements. The usability requirements should be task-based, and should tie directly to product requirements, including results from analytic tools such as personas, scenarios, and task analysis. Testing may validate a number of objective and subjective characteristics, including task completion, time on task, error rates, and user satisfaction.

The main purpose of a summative test is to evaluate a product through defined measures, rather than to diagnose and correct specific design problems, as in formative evaluation. The procedure is similar to a controlled experiment, testing the product in a controlled environment. However, it is common to note usability problems that occur during testing, and to interview the participant after the task to understand those problems.

Authoritative References

Dumas, J.S. & Redish, J.C. (1999). A practical guide to usability testing. Exeter: intellect.
Tullis, T. & Albert, B. (2008). Measuring the user experience. San Francisco: Morgan Kaufmann.
Dumas, J.S. (2002). User-based evaluations. In The human-computer interaction handbook: fundamentals, evolving technologies and emerging applications. Mahwah: Lawrence Erlbaum Associates, Inc.
Martin, D.W. (2003). Doing Psychology Experiments. Belmont: Wadsworth Publishing.

Web Resources

Measuring Usability web site
ISO 9241-11 Guidance on usability
ISO/IEC 25062:2006 “Common Industry Format (CIF) for usability test reports”
When 100% Really Isn’t 100%: Improving the Accuracy of Small-Sample Estimates of Completion Rates. Journal of Usability Studies, Issue 3, Volume 1, May 2006, pp. 136-150
Macleod, M., Bowden, R., Bevan, N. & Curson, I. (1997). The MUSiC performance measurement method. Behaviour & Information Technology, Volume 16.

Published Studies

Quesenbery, W. (2004). Defining a summative usability test for voting systems - A report from the UPA 2004 workshop on voting and usability. http://www.slideshare.net/whitepapers/defining-a-summative-usability-test-for-voting-systems
Johnston, G. & Johnson, C.R. (2002). If You Build It Will They Come: Validity and Reliability in User Interaction and Design. UPA 2002 Conference.

Detailed Description

Originators/Popularizers

The theoretical background of this method lies in scientific experiments, especially those used in the social sciences and psychology. In such an experiment, hypotheses are tested by manipulating an independent variable in a controlled environment. The effects of this manipulation on one or more dependent variables are then measured and statistically analyzed. Such experiments were first transferred to usability testing in the early 1980s, so it is hard to pinpoint exactly when ‘summative testing’ was formally developed out of these methods. An important distinction is from a user test that tries to identify usability problems but does not support statistical analysis of quantitative measurements; such a test is often referred to as informal testing or formative evaluation. Summative usability testing is sometimes also referred to as user performance testing or formal evaluation, and it tries to fulfil the requirements of a scientific experiment.

History

As described above, the roots of this method lie in the experimental methods of the social sciences and psychology, which long predate usability testing. The adaptation of the method to usability testing began in the early 1980s and has evolved gradually since then, so many definitions and slightly different approaches exist. The MUSiC project (1993) can be seen as one important step in formalizing the method for software evaluation.

International standards

ISO 9241-11 standardizes usability measures and also provides a general procedure for summative usability testing. The Common Industry Format for Usability Test Reports (now ISO/IEC 25062) also marks an important step in the development of the method, since it formalizes the method's output.

ISO 20282 parts 2, 3 and 4 contain summative test methods to measure the ease of operation and installation of everyday products.

Benefits, Advantages and Disadvantages

Summary Advantages

The method offers reliable empirical data and can therefore be used to test hypotheses.
The central usability measures of effectiveness, efficiency and user satisfaction can be measured.
It can also detect more complex usability flaws that less formal methods would be unlikely to detect.
A correctly carried out summative test can simulate the real use of a product.
It can be used to underline marketing statements with empirical evidence.

Summary Drawbacks

Recruiting and testing the large number of participants required to get reliable data can be time-consuming and expensive.
It provides limited support for improving a product, since finding usability flaws is not the main focus.
The reliability of the results depends to a large extent on the correct planning, execution and analysis.
It can be difficult for people not involved in the study to rate the reliability and validity of a summative test.

Appropriate Uses

To establish a benchmark.
To find out whether requirements have been achieved.
To compare results with a competing product, interaction technique or earlier version.

The main goal of the method is to measure the usability of a product. This makes it possible to check whether usability goals are met and to compare the product with competing products or with earlier or different versions of it. The usual measures are efficiency, effectiveness and user satisfaction, which are normally assessed by recording task completion times, success rate or accuracy, and subjective user ratings derived from questionnaires.

As the term summative evaluation suggests, the method should mainly be applied in the later stages of development. This allows real tasks to be used and, since the product being evaluated is complete or nearly complete, excludes possible confounding variables such as system crashes or incomplete functionality. It is also used after development, e.g. to test whether usability goals were met or for marketing purposes (testing against a competing product).

How To

This description highlights issues that are important for summative usability testing. A more detailed description of How to Do It can be found in Usability testing.

Procedure

Prerequisites/Constraints

Equipment: The requirements for a usability lab to run a summative test vary greatly. A mobile lab, meaning a laptop computer with recording software and a webcam, can be sufficient; however, large laboratories that include observation rooms for usability experts and developers can have advantages. The most important requirement is a controlled environment in which the experiment takes place. Various software products, such as TechSmith Morae or Noldus Observer, record audio, video and screen activity for detailed post-analysis.
Participants: It is important to select participants from the expected target group. University researchers often rely on students as participants for cost reasons. In many cases this is not sufficient, especially if the expected end user is a specialist in their field. The number of participants depends on the experimental design (see below).
Knowledge requirements: The evaluator needs several kinds of knowledge. First, the evaluator should have basic knowledge of experimental designs and statistical analysis (see, for example, David Martin’s “Doing Psychology Experiments” for an introduction). The reliability of a study depends to a large extent on the quality of the planning, execution and analysis. Furthermore, the evaluator should not be involved in the development of the product, in order to avoid bias. Experience is also key to obtaining valid and reliable results.

Planning

Define variables, hypotheses and measurements:
Independent variable: what is varied between conditions, e.g. the different products or interfaces that will be compared.
Dependent variable: anything that is measured, e.g. task time or task accuracy.
Hypothesis: a pre-defined assumption about the relation between independent and dependent variables, needed for statistical analysis.
Define the experimental design. If several conditions will be compared (e.g. different products), a more complex design has to be chosen (e.g. a within-subjects, between-subjects or mixed design). The design also influences the number of participants: twelve participants per condition are often considered the minimum needed for statistical analysis. In most cases a within-subjects design will be appropriate. In this design, each subject works with every level of the independent variable, e.g. all of the interfaces or products being compared. To control for learning effects, interfaces and tasks have to be counterbalanced, meaning that the order in which conditions are presented is varied across participants so that all possible orders occur in the experiment. If asymmetric learning effects occur, a within-subjects design is not appropriate and the evaluator has to switch to a between-subjects design, which uses a different group of participants for each level of the independent variable. The drawback is that more participants are needed and that subjective questions asking users to compare the systems are not possible.
Define the tasks to be carried out by the user during the test. Tasks should be derived from the user-centred development process, e.g. from scenarios or use cases. When comparing two or more products, make sure that each task goal can be achieved with all products.
Define abort criteria for each task (e.g. a maximum time).
Select or create pre- and post-test questionnaires. Pre-test questionnaires are used to collect demographic data and prior knowledge; a satisfaction questionnaire can be used as the post-test questionnaire to assess users’ subjective satisfaction.
Select users who are representative of each user group.
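
As an illustration of the counterbalancing step in a within-subjects design, the following Python sketch assigns presentation orders to participants. The function names, participant IDs and interface labels are hypothetical and chosen only for this example; it assumes full counterbalancing, in which every possible order of the conditions is used.

```python
from itertools import permutations


def full_counterbalance(conditions):
    """Return every possible presentation order of the given conditions."""
    return [list(order) for order in permutations(conditions)]


def assign_orders(participant_ids, conditions):
    """Cycle through all orders so that each order is used equally often
    when the number of participants is a multiple of the number of orders."""
    orders = full_counterbalance(conditions)
    return {
        pid: orders[index % len(orders)]
        for index, pid in enumerate(participant_ids)
    }


# Hypothetical study: 12 participants, two interfaces to compare.
participants = [f"P{n:02d}" for n in range(1, 13)]
schedule = assign_orders(participants, ["Interface A", "Interface B"])

for pid, order in schedule.items():
    print(pid, "->", " then ".join(order))
```

With two conditions there are only two orders, so six participants start with each interface. The number of orders grows factorially with the number of conditions, so with many conditions a reduced scheme such as a Latin square is commonly used instead.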

Running

The standard procedure can be divided into three parts.

First, subjects are asked to fill in a pre-test questionnaire covering demographic data, prior knowledge and other information relevant to the study. All variables collected can be useful for the statistical analysis.
In the second phase, subjects work with the system(s) (independent variable) and try to solve realistic tasks. Common dependent variables measured in this process are task time and task accuracy. Interaction steps or other system-specific measures are also possible.
The third phase is used for subjective questionnaires, mostly standardized satisfaction questionnaires such as QUIS, SUMI, SUS or AttrakDiff, often supplemented with questions about the specific system.
Duration is typically one hour.
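
As an example of turning a post-test questionnaire into a satisfaction measure, the sketch below scores one participant's SUS responses. The text above names SUS but does not describe its scoring; the rule used here is the standard one published with the questionnaire (ten items rated 1 to 5, odd items positively worded, even items negatively worded, result scaled to 0-100). The function name and sample ratings are hypothetical.

```python
def sus_score(responses):
    """Compute a System Usability Scale score (0-100) for one participant.

    `responses` holds the ten item ratings (integers 1-5) in questionnaire order.
    """
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("SUS needs exactly ten ratings between 1 and 5")

    total = 0
    for item_number, rating in enumerate(responses, start=1):
        if item_number % 2 == 1:
            total += rating - 1   # odd items: higher rating is better
        else:
            total += 5 - rating   # even items: lower rating is better
    return total * 2.5            # scale the 0-40 sum to 0-100


# Hypothetical ratings from a single participant.
print(sus_score([4, 2, 5, 1, 4, 2, 4, 2, 5, 1]))
```

Scores from all participants in a condition can then be averaged and compared between conditions like any other dependent variable.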

Participants and Other Stakeholders

Participants should be representative of the user population for whom the application is being designed.

Materials Needed

Running product/prototype
Observer/logging software (makes it easier to measure task time, etc.)
Task descriptions for the user to carry out
Interview guidelines and questionnaires for participants
Consent form
If required: non-disclosure agreement to be signed by participants

Common Problems

Mistakes in the experimental design and in the statistical analysis are quite common. Consulting the psychology literature on experimental methods helps to avoid them.

Data Analysis Approach

Statistical procedures such as analysis of variance (ANOVA) or chi-square testing are commonly used to test the statistical significance of the reported results (e.g. to determine whether a difference in task completion time between two products is due to chance or to the difference between the products).
The results should be reported with confidence intervals.
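
The following Python sketch illustrates both points for task completion data, using only the standard library. It is one possible approach under stated assumptions, not a prescribed procedure: the confidence interval uses the adjusted Wald method (the approach examined in the small-sample completion-rate article listed under Web Resources), and the comparison uses a Pearson chi-square test on a 2x2 table without continuity correction. Function names and the sample counts are hypothetical.

```python
from math import erfc, sqrt
from statistics import NormalDist


def adjusted_wald_ci(successes, n, confidence=0.95):
    """Adjusted Wald confidence interval for a task completion rate."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n_adj = n + z ** 2                          # adjusted sample size
    p_adj = (successes + z ** 2 / 2) / n_adj    # adjusted proportion
    margin = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)


def chi_square_2x2(success_a, fail_a, success_b, fail_b):
    """Pearson chi-square test (1 degree of freedom) comparing the
    completion counts of two products. Returns (statistic, p_value)."""
    n = success_a + fail_a + success_b + fail_b
    row_a, row_b = success_a + fail_a, success_b + fail_b
    col_s, col_f = success_a + success_b, fail_a + fail_b
    if 0 in (row_a, row_b, col_s, col_f):
        raise ValueError("every row and column needs at least one observation")
    statistic = (
        n * (success_a * fail_b - fail_a * success_b) ** 2
        / (row_a * row_b * col_s * col_f)
    )
    # For 1 degree of freedom the upper tail equals erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical result: all 12 participants completed the task.
low, high = adjusted_wald_ci(12, 12)
print(f"completion rate 100%, 95% CI {low:.0%} to {high:.0%}")

# Hypothetical comparison: product A 10/12 completed, product B 5/12.
statistic, p_value = chi_square_2x2(10, 2, 5, 7)
print(f"chi-square = {statistic:.2f}, p = {p_value:.3f}")
```

The first example shows why an observed completion rate of 100% from twelve participants should not be reported as a bare figure: the interval still extends well below 100%. With cell counts as small as in the second example, the chi-square approximation is rough, and an exact test may be more appropriate.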

Next Steps

Results can be used to decide whether a product is ready for deployment (usability goals met) or should be redesigned or improved in some aspects. They can also be used for marketing purposes (e.g. our product is better than product X).

Special Considerations

Costs and Scalability

Costs and time requirements are high. In most cases, sessions have to be run with a single subject, meaning one participant per experimental session. Since a minimum of 12 participants is needed, the time demand is considerable.

Ethical and Legal Considerations

Participants should sign an informed consent form that specifies what will be done with the data. Data should be used only for analysis and not distributed in any way.

Related Links