1EdTech Guidelines for Developing
Accessible Learning Applications

version 1.0 white paper

9. Guidelines for Accessible Testing and Assessment

The guidelines in this section are intended to promote access to tests and assessments by people with disabilities. We use the terms "assessment" and "test" virtually interchangeably, though "assessment" is often considered a more general term that includes tests. While distance learning presents important new opportunities for delivery of tests and assessments, much of this section pertains to almost any computer-based delivery or even non-computer-based delivery methods.

Since online assessment or testing takes place in a context where media is presented, even if that media is just text, most of the principles and guidelines pertaining to content described in other sections of this document also apply here. This section outlines some of the differences between the needs of testing and assessment and those of content-only delivery.

Much work has been done throughout the world towards making media accessible to all, including these very guidelines. Testing and assessment involve a step beyond media because they are more intimately tied in with processes, activities, and organizations. Many of the central problems in achieving widespread accessibility of online assessment systems are, at this stage, problems needing research to clarify the issues and identify appropriate technological solutions. This section presents the state of current issues and some principles that can be identified to point the way for future work.

This document is addressed to various stakeholders who provide the assessment, including assessment designers, content authors, validation bodies, organizations providing assessment delivery, and so forth. Different stakeholders may be involved in different parts of the process and concerned with very different kinds of assessment with different constraints. For example, a test designed to detect dyslexia in young children and a self test for Computer Science undergraduates to diagnose their own programming skills may need to handle accessibility very differently.

As discussed below, one overarching consideration is that in high-stakes testing, where the consequences for individuals are great, the potential effects of accessibility provisions on the validity of the testing must be considered very carefully.

There are two broad classes of assessment and their accessibility requirements differ.

Low-Stakes Assessment

Low-stakes assessment is a form of assessment encompassed by the immediate process of learning, often in a very short feedback loop, such as exercises or quizzes. Sometimes this is called "formative" assessment or even just "feedback". The essential characteristics are immediacy and the lack of serious consequences contingent on performance.

Making low-stakes assessment accessible means providing equivalent content modalities and access mechanisms wherever possible. For example, buttons in a form or other interactive controls should not be labeled only with images and without accompanying text. A good solution would provide the label information in a form that allows for runtime generation of equivalents or alternative renderings. Controls must also be accessible through the keyboard as well as the mouse. The overall goal is to make systems and content as flexible and adaptable as possible. All of the solutions explained elsewhere in this guide apply at this media level. The goals of this kind of assessment are intimately tied up with learning. In general, accessibility requirements can be in conflict with the need to ensure assessments are valid, but here we are chiefly concerned with learning and so this is not a serious issue.

High-Stakes Assessment

High-stakes assessment has consequences that may make a serious impact on the life-course of the participant. An example might be a university entrance examination.

It is important that a high-stakes assessment be fair to all candidates and not offer advantages to one group over another. The goal is to make systems and content as accessible as possible while adhering to validity requirements. Ideally, for example, an examination candidate should not be excluded from an examination or have performance disadvantaged where the candidate's disability is unrelated to the skills or knowledge being tested.

Assessment providers need to be able to control, in a flexible manner, which alternatives are available in order to protect the validity of inferences made from the test scores. It may be that allowing one set of alternatives has no consequences for fairness for particular candidates for that assessment while allowing others does. What is needed is fine control and rigor in the use of alternatives, and building in this kind of control can affect design choices for the content as well as for the delivery system and authoring tools. Taking the images and alternatives problem described above as an example, one delivery system may provide for particular alternatives to be enabled or disabled under system control at runtime whilst another may not. For some groups of candidates and assessments this may or may not be important, but the widest reuse of accessible assessment material for high-stakes assessment can only be obtained with the provision of an appropriate mechanism. There are, however, many other considerations in building in such a mechanism.
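As an illustration only, the kind of runtime control described above might be sketched as follows. The class names (Item, AlternativePolicy), the alternative kinds, and the sample item are all hypothetical and are not drawn from any existing delivery system; the sketch assumes that each assessment carries a policy listing the alternatives an assessor has enabled.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """One assessment item with a default rendering and named alternatives."""
    item_id: str
    default: str
    alternatives: dict[str, str] = field(default_factory=dict)


@dataclass
class AlternativePolicy:
    """Hypothetical assessor-controlled set of alternative kinds that a
    delivery system may offer for one assessment."""
    enabled: set[str] = field(default_factory=set)

    def available(self, item: Item) -> dict[str, str]:
        # Only alternatives the assessor has enabled are offered at runtime.
        return {kind: body for kind, body in item.alternatives.items()
                if kind in self.enabled}


item = Item(
    item_id="q-07",
    default="image: figure-7.png",
    alternatives={
        "text_description": "A harbor at dusk with three fishing boats.",
        "tactile_graphic": "raised-line sheet 7",
    },
)

# Assumed policies: a practice quiz enables every alternative, while an
# examination withholds the text description to protect validity.
low_stakes = AlternativePolicy(enabled={"text_description", "tactile_graphic"})
high_stakes = AlternativePolicy(enabled={"tactile_graphic"})

print(sorted(low_stakes.available(item)))
print(sorted(high_stakes.available(item)))
```

The same item content is reused under both policies; only the control information differs, which is the property that supports reuse of accessible material across low-stakes and high-stakes settings.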

Developers and educators should make good use of alternative media and delivery technologies when designing assessments, systems, and tools. They will likely need to allow assessors greater control over alternatives as well as better use of accommodations and easier integration of users with disabilities. Accommodations are changes in the content, format, and/or administration procedure for individuals who are unable to take assessments under standard (or default) conditions.

9.1 Testing and Assessment Challenges

There are several challenges in developing and delivering tests and assessments that can be used by people with disabilities, including:

tests and assessments too often lack important accessibility features.
accessibility features may conflict with validity considerations.
some testing organizations may lack a coherent rationale for test access decisions.
reuse of test content may be limited by a variety of factors.

9.2 Principles for the Accessibility of Testing and Assessment

The following are four principles to help guide the implementation of solutions for ensuring access to testing and assessment by people with disabilities. Developers should:

Consider accessibility standards from the earliest design stages.
Follow standards for testing and assessment that promote test validity.
Build a coherent rationale for the assessment.
Develop a reuse strategy that includes test content, designs, and more.

The relative emphasis among these principles is different for low-stakes than for high-stakes assessment.

9.2.1 Consider Accessibility Standards from the Earliest Design Stages

Test developers should consider accessibility standards from the earliest design stages. For example, they need to consider including text to accompany images as well as captions and auditory descriptions to accompany video. Such alternative content is usually best created by those individuals who are best acquainted with the original content. Designing alternative content early in the process can help eliminate costly retrofits. Details about such accessibility standards are found throughout other parts of these guidelines.

9.2.2 Follow Standards for Testing and Assessment That Promote Test Validity

Validity is a preeminent concern in tests and assessments. The Standards for Educational and Psychological Testing (hereafter called Standards), published by the American Educational Research Association (AERA), the American Psychological Association (APA), and the National Council on Measurement in Education (NCME), defines validity as:

The degree to which accumulated evidence and theory support specific interpretations of test scores entailed by proposed uses of a test (Standards, 1999).

According to Willingham and Cole (1997):

Validity is the all-encompassing technical standard for judging the quality of the assessment process. Validity includes, for example, the accuracy with which a test measures what it purports to measure, how well it serves its intended function, other consequences of test use, and comparability of the assessment process for different examinees. (p. 228).

Validity is a major theme in chapter 10 of the Standards, which focuses on testing individuals with disabilities. Standard 10.1 (the first of 12 disability-related standards) is essentially about validity. It states: "In testing individuals with disabilities, test developers, test administrators, and test users should take steps to ensure that the test score inferences accurately reflect the intended construct rather than any disabilities and their associated characteristics extraneous to the intent of the measurement" (Standards, 1999, p. 106).

While accessibility features are essential for overcoming threats to validity, some accessibility features can actually pose threats to validity. Often, the nature of the threat and the proper solution can be ascertained through basic logical analysis. For example, if a physical disability would prevent a person from recording their responses on a math test when administered with the default response method, then the incompatibility between their disability and the default response format constitutes a major threat to validity. Providing an alternative response method, such as using assistive technology (AT) or a human scribe to record the answers, would likely be a proper accommodation. Providing such an accommodation could improve validity.

But sometimes an accessibility feature can diminish validity. Suppose, for example, that an art test is intended to measure a student's ability to write descriptions of artwork (such as paintings), and that the questions present visually displayed pictorial images of the artwork as prompts. Standards for accessibility might require that text descriptions of the artwork be made available to the test taker. However, to do so would make it virtually impossible to detect the test taker's proficiency in producing the text descriptions, thereby posing a threat to validity. For a test taker who is blind, a better approach might be to provide a tactile graphic of the artwork rather than the text description. Alternatively, if the questions with pictorial images constitute only a small portion of the total art test, one might consider not administering such questions to test takers who are blind. Thus, for any given test and test taker, not every accessibility feature necessarily promotes validity. Accessibility features need to be examined for their impact on validity.

Another example concerns the use of a readaloud accommodation in a test of reading comprehension. The test performance of a person who is blind might benefit from having content read aloud to them by synthesized speech or by prerecorded or live human speech. However, suppose that this test of reading comprehension includes decoding words from letters as part of the construct (intent of measurement). In this case, a readaloud accommodation would essentially prevent student performance from serving as a source of evidence about decoding ability, since hearing the words spoken would not require decoding in order to answer questions correctly. This scenario would suggest that providing the readaloud accommodation may invalidate the test and that, therefore, the readaloud accommodation should not be allowed. However, the decision about whether to actually allow a readaloud accommodation might involve additional considerations. For example, one might ask additional questions:

How central is decoding to what the test is attempting to measure? If decoding is not central, then this factor might tend to favor allowing the accommodation.
What does research say about what this particular test actually measures (as opposed to what it is intended to measure)? If research showed that known differences in decoding ability did not actually affect performance on this test by people without disabilities, this might suggest that the test was not actually measuring decoding (even though the test was intended to measure it). These circumstances would tend to favor allowing the readaloud accommodation for the student who is blind, simply as a matter of equity. (However, this discrepancy between the intended and actual influence of decoding on test performance for nondisabled test takers would suggest that decoding is under-represented in the test scores, which constitutes another validity-related problem.)
Are there other ways of making the desired inferences about proficiency? If the student reads Braille, then presenting the test in Braille format might be a better alternative to a readaloud accommodation, since the test taker would need to possess decoding skill in order to perform well.

The need to examine many factors in making decisions regarding test takers with disabilities points to the need to be able to call on experts when needed. As suggested earlier, the rigor required for accommodation decisions will tend to be greater in high-stakes than in low-stakes testing. Thus, in many instances, basic logical analysis of what the test is intended to measure will make clear what kinds of accommodations will promote validity; in other cases, a more careful analysis and consultation with experts may be essential.

9.2.3 Build a Coherent Evidentiary Argument for the Assessment

To the extent possible, one should seek to build a coherent rationale for the claims that one makes about the proficiencies (i.e., knowledge, skills, and abilities) of test takers, including test takers with disabilities. Evidence-centered assessment design (Mislevy, Steinberg, & Almond, in press; Hansen, Mislevy, & Steinberg, in press), a design method that emphasizes explicit evidentiary argument, may help strengthen the technical quality of a test or assessment. In the case of individuals with disabilities, this argument must encompass not only claims about student proficiencies, but also about comparability of scores based on different formats or conditions of administration for people with disabilities. Evidence-centered design focuses on developing a coherent argument involving:

(a) the specific claims that one wishes to make about test-taker proficiencies,

(b) the test-taker behaviors that reveal those proficiencies,

(c) the tasks that elicit those behaviors, and

(d) the provisions that address alternative explanations.

Consider, for example, a person who consistently answers test questions incorrectly. Generally, one would claim that the test taker lacks the targeted proficiency. However, an alternative explanation for poor performance would be that the test taker has a disability (e.g., blindness) that prevents the test taker from manifesting their true proficiency level under the default testing conditions (regular-sized visually-displayed text). Addressing such an alternative explanation might involve ensuring that (a) candidates are made aware how they can request testing accommodations, (b) the requests for accommodations are examined by a panel of qualified experts, and (c) if the accommodation is approved, the test taker has adequate opportunity to become familiar with the accommodation before using it in the actual test. Obviously, the efforts to address alternative explanations are all the more important in high-stakes settings. Note that the activities to address alternative explanations are diverse and that they often occur in the pre-assessment phase of testing, before the candidate has answered any questions.

Evidence-centered assessment design, particularly its domain modeling stage, can provide a systematic approach for examining accommodations and their appropriateness. Provision of appropriate accommodations seeks to match test-taker background characteristics (e.g., blindness, knowledge of Braille) to task features (e.g., readaloud or Braille versions) that are intended to overcome access hurdles without conflicting with the claims that one wishes to make about test-taker proficiencies. The explicit modeling of proficiencies (both those targeted for measurement and those not targeted) can assist in identifying features of the task performance situation that are likely to result in sound measurement.
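The matching described above can be illustrated with a small sketch built on the readaloud and Braille example from section 9.2.2. Everything here is an assumption made for illustration: the Accommodation record, the characteristic and proficiency labels, and the two-entry catalog are hypothetical, and the function is only a first screening step, not a substitute for review by qualified experts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accommodation:
    name: str
    requires: frozenset[str]  # test-taker characteristics needed to use it
    bypasses: frozenset[str]  # proficiencies it removes the need to exercise


def candidate_accommodations(accommodations, test_taker, targeted):
    """Return the names of accommodations that the test taker can use and
    that do not bypass any proficiency targeted for measurement."""
    return [
        a.name
        for a in accommodations
        if a.requires <= test_taker and not (a.bypasses & targeted)
    ]


# Hypothetical catalog: readaloud removes the need to decode text, while a
# Braille version requires that the test taker reads Braille.
CATALOG = [
    Accommodation("readaloud", frozenset(), frozenset({"decoding"})),
    Accommodation("braille", frozenset({"reads_braille"}), frozenset()),
]

targeted = frozenset({"decoding", "comprehension"})
braille_reader = frozenset({"blind", "reads_braille"})
non_braille_reader = frozenset({"blind"})

print(candidate_accommodations(CATALOG, braille_reader, targeted))
print(candidate_accommodations(CATALOG, non_braille_reader, targeted))
```

When decoding is part of the targeted construct, the sketch offers only the Braille version to the Braille reader and nothing to the test taker who does not read Braille. An empty result is a signal that the case needs the more careful analysis discussed earlier, not an automatic refusal.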

Evidence-centered design can also make explicit the constraints on the reporting of test scores. Social policy considerations, for example, can be important constraints on reporting of test results. Reporting information about accommodations might help a test score user interpret scores (Lewis, Patz, Sheinker, & Barton, 2002; Standard 10.11 of Standards, 1999). Yet laws and other rulings designed to address the possibility of discrimination against test takers with disabilities may restrict the inclusion of certain information in reports. Evidence-centered design's support for analyzing, modeling, and representing social policy, legal, and other constraints can promote the cohesion of various parts of an assessment argument.

As in any design process, the quality of the outputs from an evidence-centered assessment design process is dependent on the quality of the inputs. Some information that would be useful as input to the design process, such as studies of the impact of accommodations on test performance and validity, can be difficult to obtain due to factors such as small sample sizes, variability in examinees, and variability in accommodations (NRC, 1997; Standards, 1999; Pitoniak & Royer, 2001). While an evidence-centered approach cannot replace the generation of new knowledge through research and development, it can help leverage the value of existing knowledge through sharing and reuse between and among test developers and stakeholders and across related testing applications.

9.2.4 Develop a Reuse Strategy That Includes Test Content, Designs, and More

Reuse of assessment designs and of test content is a fundamental strategy for promoting efficient and valid assessments. Striving for reuse impels one to identify commonalities across assessment settings, audiences, and delivery systems.

One important kind of reuse that deserves careful attention is reuse of test content. For example, one of several goals in the design of an "accessible" test might be that the text portion of the test content could be reused across several presentation formats such as (a) visually displayed text for test takers who are deaf or nondisabled, (b) synthesized speech for test takers who are blind, and (c) Braille for individuals who are deaf-blind or blind. Ideally, there would be nothing peculiar to a specific presentation format that would require a different source text for each of these presentation formats. While complete reuse of test content may be a good design goal, one also needs to be ready for the possibility that content -- text content in this example -- might not be completely re-usable across test formats. For example, acronyms that might make sense in print to a sighted test taker may voice improperly in synthesized speech used by a person who is blind. Ensuring proper access by the person who is blind and receiving content via audio might require, for example, that letters be separated by spaces to facilitate proper voicing. In some content areas, different formats may require supplementary media such as tactile (raised-line) drawings of math figures and the on-screen content may need to alert the user about the availability of such. Specialized directions may be necessary for different formats. These kinds of differences between formats may reduce certain kinds of test content reuse.
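The acronym case above lends itself to a short sketch in which a single source text is kept and the speech-specific spacing is derived at rendering time. The function name, the regular expression, and the sample sentence are illustrative assumptions. The rule is deliberately crude: it spaces out every all-capitals word of two or more letters, so acronyms that are normally pronounced as words would need an exception list.

```python
import re

# Assumption: an acronym is any whole word made of two or more capital letters.
_ACRONYM = re.compile(r"\b[A-Z]{2,}\b")


def space_acronyms(text: str) -> str:
    """Separate the letters of all-capitals words with spaces so that a
    speech synthesizer is more likely to voice them letter by letter."""
    return _ACRONYM.sub(lambda m: " ".join(m.group(0)), text)


source = "Convert the RGB value to CMYK."

visual_text = source                  # the print rendering uses the source as is
speech_text = space_acronyms(source)  # the speech rendering is derived from it

print(speech_text)  # Convert the R G B value to C M Y K.
```

Deriving the speech rendering from the shared source keeps the text content reusable, while the format-specific adjustment lives in the delivery step.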

It follows that a sensible strategy for reuse should include test content but should also extend further, for example, to include reuse of assessment designs. As suggested earlier, an assessment design that comprises an explicit argument can promote greater reuse. Furthermore, strategies for reuse should not overlook the reuse of personnel and other test development and delivery resources.

9.3 Delivery and Authoring Tool Principles for Testing and Assessment

Delivery and authoring tools need to support mechanisms for authoring, storing, and using accessible assessment content. Many of these mechanisms are identical to those described throughout this document, but some concerns are specific to the authoring and delivery of assessments.

Note: This is for work in a future version, but may be expected to include:

information about purpose.
information about accessibility properties of content.
control information for enabling and disabling media and other alternatives.
possibly the means for recording data about actual execution uses of alternatives (tracking).
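Because this part of the guidelines is deferred to a future version, the following is only a speculative sketch of how the four kinds of information listed above might sit together in one record. All class and field names are hypothetical and do not come from any published specification.

```python
from dataclasses import dataclass, field


@dataclass
class AlternativeRecord:
    kind: str             # e.g. "text_description"
    purpose: str          # information about purpose
    modality: str         # accessibility property, e.g. "textual", "auditory"
    enabled: bool = True  # control information for enabling and disabling
    uses: int = 0         # tracking of actual uses during delivery


@dataclass
class ItemRecord:
    item_id: str
    purpose: str
    alternatives: list[AlternativeRecord] = field(default_factory=list)

    def use(self, kind: str) -> bool:
        """Count one use of an alternative; return False when the
        alternative is absent or has been disabled."""
        for alt in self.alternatives:
            if alt.kind == kind and alt.enabled:
                alt.uses += 1
                return True
        return False


record = ItemRecord(
    item_id="q-12",
    purpose="summative",
    alternatives=[
        AlternativeRecord("text_description", "describe the figure", "textual"),
        AlternativeRecord("readaloud", "voice the stem", "auditory", enabled=False),
    ],
)

print(record.use("text_description"))  # True
print(record.use("readaloud"))         # False
print(record.alternatives[0].uses)     # 1
```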

9.4 Content Development Principles for Testing and Assessment

Enabling flexible, accessible assessment content will require encoding additional information in the meta-data about each assessment item and its alternatives. Other concerns are specific to the development of assessments.

9.4.1 Provide Information about Purpose

Note: This is for work in a future version, but may be expected to include purpose information which is needed at multiple levels including:

the level of an assessment.
the level of an item so as to allow selection of appropriate items.
the level of a media alternative.

9.4.2 Provide Information about Accessibility of Content at Varying Levels

Note: This is for work in a future version.

Resources:

AERA, APA, & NCME. [Standards] (1999). Standards for Educational and Psychological Testing. Washington DC: American Educational Research Association, American Psychological Association, National Council on Measurement in Education.

BSI BS 7988 A Code of Practice for the use of information technology for the delivery of assessments. The document is available to order from British Standards Institution (BSI).

Lewis, D. M., Patz, R. J., Sheinker, A., & Barton, K. (2002). Reconciling standardization and accommodation: Inclusive norms and reporting using a taxonomy for testing accommodations. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, Louisiana.

Mislevy, R.J., Steinberg, L.S., & Almond, R.G. (in press). On the structure of educational assessments. Measurement: Interdisciplinary Research and Commentary.

National Research Council [NRC] (1997). Educating one and all: Students with disabilities and standards-based reform. National Academy Press.

Pitoniak, M. J., & Royer, J. M. (2001). Testing accommodations for examinees with disabilities: A review of psychometric, legal, and social policy issues. Review of Educational Research, 71 (1), 53-104.

Willingham, W. W., & Cole, N. S. (1997). Gender and fair assessment. Mahwah, New Jersey: Lawrence Erlbaum Associates.

Willingham, W. W., Ragosta, M., Bennett, R. E., Braun, H., Rock, D. A., & Powers, D. E. (1988). Testing handicapped people. Needham Heights, Mass.: Allyn and Bacon.
