Thinking Critically about Performance Assessment and Education Reform
Richard Shepard and Dean W. Owen Jr.
The recent and widespread educational reform movements now underway in the U.S. represent a significant shift in contemporary thinking with regard to the fundamental ways in which children are educated and assessed. Nationally funded initiatives such as Goals 2000 have echoed many of these changes in response to concerns that the U.S. may fall behind or lose its competitive edge in what has become an increasingly interdependent and capitalistic global economy. Other national organizations have called for an emphasis on specific aspects of reform; for example, the National Council of Teachers of Mathematics and the American Psychological Association have promoted critical thinking and problem-solving skills, cooperative learning, and learner-centered principles as part of their recommendations for improving the quality of public education (e.g., APA, 1993; NCTM, 1989). Still others have focused attention on the need for improved writing skills (e.g., Young & Fulwiler, 1986) and the use of alternative forms of assessment (e.g., Gifford & O'Connor, 1992). The combined effect of these varied forces on statewide reform has been a set of reorganizational efforts which are often a disjointed collection of philosophies, instructional methods, and assessment tools.
In light of this, it seems likely that a variety of unintended consequences and contradictory outcomes will emerge from such reform initiatives. The widespread acceptance and application of so-called performance assessments, for both individual and institutional evaluation, represents a particularly volatile aspect of reform. While such measures have been touted as more accurate assessors of complex learning, it would be imprudent to embrace them without critically examining their potentially counterproductive side effects.
The purpose of this paper is to examine several of the assumptions underlying education reform in general, and performance assessment in particular, and to highlight areas of practical or theoretical inconsistency. Specific assumptions to be addressed will include (1) the belief that present reform efforts are completely new and original, (2) the belief that all students can become proficient in all areas of academic performance, (3) the belief that students should be evaluated using authentic, real-life tasks, and (4) the belief that all instruction and evaluation should be geared toward deeper, more complex forms of learning.
Assumption #1: Current assessment initiatives are unique, contemporary innovations.
The institution of American education has evolved much like the geology of the earth. Rather than a slow and progressive change, American education seems to have undergone a series of tumultuous upheavals with dormant periods in between. This is most certainly true if one were to consider the emergence of periodic "reform" movements which seem to arise in response to legislative or social pressures. Cuban (1990) describes public officials' eagerness to reform schools as having continued largely unabated during this century and especially since World War II. In fact the reform movement may be clearly traced to the efforts of pedagogical reformers nearly a century and a half ago when criticism of teachers' practices began to appear during the 1840's and 1850's (Cuban, 1990).
Willson (1991) characterized the last three decades as a period of high stakes testing which began during the 1970's when the issue of minimal competency and basic skills arose as a target for scrutiny. Students were denied diplomas, teachers and administrators were evaluated on the basis of class means, and superintendents had their districts compared. But, as Resnick and Resnick (1992) indicated, the link between testing and educational reform is not new. It has been a feature of efforts to improve American schools since at least the end of the nineteenth century. Since this link is so firmly established, testing theory and practice have been further developed and refined with each subsequent reform movement.
During the 1980's, the focus shifted as the result of the use of nationally standardized tests, especially the Scholastic Aptitude Test (SAT), as the basis for many reports documenting the failure of the American Educational System. Out of this movement came a more critical view of the assessment procedure which was accused of focusing on relatively simple basic skills using the traditional multiple choice format. It was argued that such tests fail to measure higher order or more complex forms of learning; and that the form of the test, although convenient, objective, and cost efficient, limited the students to primitive responses which were not good and complete indicators of complex achievement.
A possible solution to this limitation was found in the adoption of the concepts of "competency-based education," "mastery learning," and "outcome-based education." As with so many educational concepts, these are neither new nor innovative but have re-emerged on a number of occasions. Many of these concepts were developed during the 1920's and came to influence educational and assessment practices throughout the 1930's (Towers, 1992). Although they largely disappeared during the 1940's and 50's, it would seem that once again these terms are re-emerging to influence the thinking of a new generation of legislators, policy makers, school administrators, and teachers who are being asked to rethink the way they teach and assess student performance.
Assumption #2: All students can be academically successful.
This highly democratic notion is a pillar of most reform efforts and, presumably, countermands traditional instructional and assessment practices. Spady and Marshall (1991), for example, defend their "outcome-based" approach on the grounds that all students have the ability to learn and succeed if they are given enough time and instructional support. In their view, current methods favor the brightest, fastest, and most advantaged simply for the sake of organizational and administrative expedience. While this view is intuitively appealing, it may be more idealistic than realistic. The notion that "All men are created equal" may be quite different from "All men are equal"; more than a century of research in American psychology provides evidence of significant differences in ability, aptitude, interest, and personality among individuals. As Hanna (1993) has pointed out, individual differences among students exist and the magnitude of these differences can be quite large. In fact, quality instruction may actually exaggerate these differences rather than diminish them (Feldt & Brennan, 1989).
Traditional standardized tests have also been criticized recently on similar grounds; current psychometric approaches to test construction emphasize the "static assessment of differences rather than the assessment of changes due to learning" (Shepard, 1991, p. 6). According to Taylor (1994), this "measurement" model of testing is based on the theory of individual differences which "requires differentiation between and ranking of people rather than the establishment of clear standards or expectations for learners. Excellence is determined by whether an examinee outranks other examinees" (p. 242). By contrast, performance-based methods compare students against pre-determined standards which, in theory, allow all to be successful.
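The distinction between the two scoring models can be made concrete with a minimal sketch. The student names, scores, and cut score below are hypothetical and are not drawn from any study cited here; the function names are likewise illustrative assumptions. The first function ranks examinees against one another, so by construction someone must finish last; the second compares each examinee to a fixed standard, so all may succeed.

```python
from typing import Dict


def norm_referenced(scores: Dict[str, float]) -> Dict[str, float]:
    """Percentile rank: percentage of the OTHER examinees each student outscores.

    Assumes at least two examinees; ranking one person alone is undefined.
    """
    if len(scores) < 2:
        raise ValueError("norm-referenced scoring needs at least two examinees")
    others = len(scores) - 1
    return {
        name: 100.0 * sum(other < s for other in scores.values()) / others
        for name, s in scores.items()
    }


def criterion_referenced(scores: Dict[str, float], standard: float) -> Dict[str, bool]:
    """Each student meets the standard or not, independent of classmates."""
    return {name: s >= standard for name, s in scores.items()}


# Hypothetical class of four and a hypothetical cut score of 75.
scores = {"Ana": 82, "Ben": 90, "Cal": 78, "Dee": 95}
ranks = norm_referenced(scores)
passed = criterion_referenced(scores, standard=75)

for name in scores:
    print(f"{name}: percentile rank {ranks[name]:.0f}, meets standard: {passed[name]}")
```

In this hypothetical class every student meets the standard, yet the ranking still places one student at the bottom. This is the contrast that the "success for all" position relies upon.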
Embedded within the "success for all" philosophy is the belief that students must be provided with instructional and assessment opportunities which maximize their unique learning styles and strengths (e.g., Minnesota State Department of Education, 1991). Gardner's (1983) theory of multiple intelligences addresses this issue directly and has been used by many reformers as a guiding framework in the design of new teaching methods and testing instruments.
Gardner (1992) has been critical of traditional assessments for their over-reliance on "logical-mathematical" and "linguistic" competencies at the expense of other forms of intelligence. Newer assessment methods, such as portfolios, have been promoted as one way to accommodate individual differences by providing students with multiple modes to express what has been learned through writing, audio/video recordings, or artistic forms of expression (Archbald & Newmann, 1992).
The contradiction here is that, although performance-based measures (e.g., portfolios) are believed to be more authentic and to provide a broader spectrum of student performances, the reality is that students must be proficient at expressing themselves principally through the medium of writing. Due in part to the writing-across-the-curriculum movement begun in the 1980's (Young & Fulwiler, 1986), the belief that students must be able to communicate their ideas via writing has become an integral part of virtually all performance-based tasks. The heavy reliance on this single capability may actually limit student expression and introduce a clear bias in favor of those students possessing these particular language skills.
Rather than expanding opportunities to demonstrate learning through alternative mediums, in practice, most performance measures restrict students to an even more specialized form of linguistic intelligence. Teachers in mathematics are already decrying this practice with many of their most able students receiving poor evaluations simply because they are unable to translate what they "know" into writing. Bias against individuals with different cultural backgrounds may also be exacerbated through the use of such performance-based measures (see, for example, Dunbar, Koretz, & Hoover, 1991; Zwick, Donoghue, & Grima, 1993).
Assumption #3: Student assessment should engage students in real-life, authentic situations.
A direct consequence of this assumption has been the rejection of objectively scored, recognition-type methods of assessment (i.e., multiple choice). Because they are indirect measures of learning that tend to emphasize rote memorization and recognition memory, many reformers have viewed multiple choice tests as antithetical to the "thinking curriculum" (Resnick & Resnick, 1992). But considering the problems currently being encountered with the validation of performance measures (e.g., Shavelson, Baxter, & Pine, 1992) and the fact that multiple choice tests still correlate at around mid-range values (r = .5) with performance-based measures (Willson, 1991), the total rejection of objective tests may be premature. Objective measures are still the most efficient means of testing for content coverage (an inherent weakness of performance tests) and may even be adapted so as to reflect the thought processes underlying a particular response (Gronlund, 1985; Norris, 1989).
Due in part to the objections of those who suggest that traditional norm-referenced, objectively scored tests are neither direct nor adequate measures of complex learning, many reformers have advocated performance-based or "authentic" assessment as a more realistic approach to measuring an individual's academic growth. Wiggins (1989), for example, argues for authentic tests by stating that, "All tests should involve students in the actual challenges, standards, and habits needed for success in the academic disciplines or in the workplace: conducting original research, analyzing the research of others in the service of one's research, arguing critically, and synthesizing divergent viewpoints. Within reasonable and reachable limits, a real test replicates the authentic intellectual challenges facing a person in the field. (Such tests are usually also the most engaging.)" (p. 706).
The evaluation of authentic tasks typically emphasizes the "process" a student went through en route to a solution or finished product rather than the correctness of the responses (Brown, Campione, Webber, & McGilly, 1992; Resnick & Resnick, 1992; Taylor, 1994). An emphasis is also placed on students working collaboratively and cooperatively on projects rather than independently and competitively (Archbald & Newmann, 1992).
An interesting philosophical conflict arises when one compares the impetus for the desired outcomes of reform efforts with the underlying principles which are guiding the actual implementation. Consider, for example, the importance of personally constructed meaning and "process" in performance-based instruments and the greater reliance on group interaction and cooperation. Because the implicit message to students is much more humanistic than traditional approaches (i.e., promoting personal development and sensitivity to the strengths and needs of others), one may reasonably ask if such methods do not, in fact, promote attitudes which are inherently antagonistic to the realities of a capitalistic, competition-oriented society. Indeed, students have lamented for generations the disparity between the idealistic (and often unrealistic) world of school and the hard realities of the world beyond the classroom. Future students may once again be disillusioned to find that employers are not as interested in the process used as they are in the accuracy and relative quality of an individual's work. Economic mandates such as "the bottom line" and "survival of the fittest" may contrast greatly with the reform methods designed to simulate authentic situations. While the deleterious effects of a competitive school environment have been emphasized by many, it may, in some respects, be a more "authentic" atmosphere in which to educate future capitalists who will live and work in an increasingly competitive society.
Assumption #4: Testing and instruction should be comprised of activities which promote deeper, more complex learning.
As yet another reaction against traditional methods of testing and instruction, this assumption rejects the view of learning as a gradual accumulation of bits of information and skills which build into more complex skills and understanding. According to this perspective, the behavioristically oriented beliefs of traditional psychometrics (Shepard, 1991; Willson, 1989) should be supplanted by a more cognitive approach to curriculum design and assessment. As Resnick and Resnick (1992) point out, complex thinking and reasoning are required even when learning the most basic and elementary of skills.
While it cannot be denied that education should focus more of its time and resources on critical thinking and application types of learning, the wholesale rejection of the "superficial" forms of learning (i.e., knowledge and comprehension level; see Bloom, Engelhart, Furst, Hill, & Krathwohl, 1956) overlooks the realities of how and what individuals "know" about the world around them. Under ideal conditions, it might be desirable for people to integrate, synthesize, and apply new learning, but the "authentic" reality is that most knowledge is acquired through simple and repeated exposure to information. Every day millions of adults read newspapers and magazines, or attend to some form of electronic media to learn about current events. Often the resultant learning from these activities is nothing more than an awareness of a recent political, social, or scientific development; it is neither temporally feasible nor cognitively expedient to comprehend all that one is exposed to at a deep, application level of understanding. This does not, however, mean that all such learning is useless to the individual. For example, nearly everyone has some casual familiarity with basic facets of everyday life such as the operation of an automobile or even the human body. Despite this familiarity, few are qualified to prescribe drugs, perform surgery, or even change the oil in their automobile. However, this general familiarity does allow most people to recognize problem situations and then to seek the advice and assistance of an expert.
The issue here is that, in an effort to promote more complex learning, proponents of reform may leave students, parents, and teachers with a false or negative view of simple, more factual forms of learning. It cannot be emphasized strongly enough that a large proportion of what people know could be categorized as superficial or knowledge-level learning (see Bloom et al., 1956) and that this type of understanding is neither good nor bad but simply a fact of cognitive life. Exposure to a wide range of information provides the background or context for deeper, more complex forms of understanding. Quite often, it is the combining or connecting of factual knowledge which results in new insights. To preach an application-only philosophy of education disregards decades of research in verbal learning which has demonstrated that humans are uniquely equipped to absorb massive amounts of information through the medium of language (e.g., Ausubel, 1980; Vygotsky, 1962). More recent theories of learning and memory also acknowledge the importance of repeated exposure to stimuli and the making of certain responses "automatic" (e.g., Logan, 1988; 1990). This type of learning becomes problematic only when instruction and testing limit children to this level of understanding exclusively.
Conclusion
While occasional re-evaluation of educational systems is both necessary and highly desirable, the recent reform movement would appear to have been largely initiated by political demands for increased accountability and may have been adopted in a hasty fashion. In an article discussing the beliefs of traditional measurement specialists, Shepard (1991) makes the following statement: "My argument is that hidden assumptions about learning should be examined precisely because they are covert. What we believe about learning and the intended effect of testing on learning should be considered directly, not 'smuggled in' by the adoption of a popular test theory." (p. 9). Interestingly, this argument is also an apt summary of the present article: unless the tacitly held assumptions of reformers are thoroughly articulated and integrated, and their unintended as well as intended consequences projected, the latest round of reform runs the risk of becoming fragmented, directionless, and vulnerable to the caprices of special interest groups. Well-intentioned movements of the past have been victimized by reactionary forces for similar reasons, with many excellent programs being lost because of miscommunication and confusion.
The purpose of this paper was not to indict the current reform efforts nor to advocate a return to traditional education but to critically examine certain fundamental assumptions which underpin these movements. Unless these issues are examined and discussed in an open and honest fashion, the potential for unanticipated and contradictory outcomes will remain high. At present, the reform movement in American education is not unlike the situation facing the Hebrews just after crossing the Red Sea. After taking those first tentative steps away from the "bondage" of traditional methods and philosophies, restructuring efforts are at a critical crossroad. Without prudent reflection and clearly refined goals the entire reform movement runs the risk of either returning to the "servitude" of its former masters or wandering aimlessly in a scholastic wilderness of confusion and self-doubt for another generation.
References
American Psychological Association (1993). Learner-centered psychological principles: Guidelines for school redesign and reform. Washington, D.C.: American Psychological Association.
Archbald, D.A. & Newmann, F.M. (1992). Approaches to assessing academic achievement. In H. Berlak, F. Newmann, E. Adams, D. Archbald, T. Burgess, J. Raven & T. Romberg (Eds.), Toward a new science of educational testing and assessment (pp. 139-180). Albany, NY: SUNY Press.
Ausubel, D.P. (1980). The facilitation of meaningful verbal learning in the classroom. In J. Hartley (Ed.), The psychology of written communication. New York: Nichols.
Bloom, B.S., Engelhart, M.D., Furst, E.J., Hill, W.H. & Krathwohl, D.R. (1956). Taxonomy of educational objectives: Handbook I. Cognitive domain. New York: McKay.
Brown, A.L., Campione, J.C., Webber, L.S., & McGilly, K. (1992). Interactive learning environments: A new look at assessment and instruction. In B.R. Gifford and M.C. O'Connor (Eds.), Changing assessments: Alternative views of aptitude, achievement, and instruction (pp. 121-211). Norwell, MA: Kluwer Academic Publishers.
Cuban, L. (1990). Reforming again, again, and again. Educational Researcher, 19 (1), 3-13.
Dunbar, S.B., Koretz, D., & Hoover, H.D. (1991). Quality control in the development and use of performance assessments. Applied Measurement in Education, 4, 289-304.
Feldt, L. S. & Brennan, R. L. (1989). Reliability. In R. L. Linn (Ed.), Educational Measurement (3rd ed.). Washington, DC: American Council on Education.
Gardner, H. (1983). Frames of Mind. New York: Basic Books.
Gardner, H. (1992). Assessment in context: The alternative to standardized testing. In B.R. Gifford & M. O'Connor (Eds.), Changing Assessments: Alternative Views of Aptitude, Achievement, and Instruction (pp. 77-119). Norwell, MA: Kluwer Academic Publishers.
Gifford, B.R. & O'Connor, M.C. (Eds.). (1992). Changing assessments: Alternative views of aptitude, achievement, and instruction. Boston, MA: Kluwer Academic Publishers.
Gronlund, N.E. (1985). Measurement and Evaluation in Teaching. New York, NY: MacMillan Publishing Company.
Hanna, G.S. (1993). Better Teaching Through Better Measurement. Fort Worth, TX: Harcourt Brace Jovanovich College Publishers.
Logan, G.D. (1988). Toward an instance theory of automatization. Psychological Review, 95, 492-527.
Logan, G.D. (1990). Repetition priming and automaticity: Common underlying mechanisms? Cognitive Psychology, 22, 1-35.
Minnesota Department of Education (1991). Introduction to education that is outcome-based. St. Paul, MN: State of Minnesota Printing Office.
National Council of Teachers of Mathematics (1989). Curriculum and evaluation standards for school mathematics. Reston, VA: National Council of Teachers of Mathematics.
Norris, S.P. (1989). Can we test validly for critical thinking? Educational Researcher, 18 (9), 21-26.
Resnick, L.B. & Resnick, D.P. (1992). Assessing the thinking curriculum: New tools for educational reform. In B. Gifford & M. O'Connor (Eds.), Changing Assessments: Alternative Views of Aptitude, Achievement, and Instruction (pp. 37-75). Norwell, MA: Kluwer Academic Publishers.
Shavelson, R.J., Baxter, G.P., & Pine, J. (1992). Performance assessments: Political rhetoric and measurement reality. Educational Researcher, 21 (4), 22-27.
Shepard, L.A. (1991). Psychometricians' beliefs about learning. Educational Researcher, 20 (7), 2-16.
Spady, W.G. & Marshall, K.J. (1991). Beyond traditional outcome-based education. Educational Leadership, 49, 67-72.
Taylor, C. (1994). Assessment for measurement or standards: The peril and promise of large-scale assessment reform. American Educational Research Journal, 31 (2), 231-262.
Towers, J. (1992). Some concerns about outcome-based education. Journal of Research and Development in Education, 25, 89-95.
Vygotsky, L.S. (1962). Thought and language. E. Hanfmann & G. Vakar (Trans.). Cambridge, MA: MIT Press.
Wiggins, G. (1989). A true test: Toward more authentic and equitable assessment. Phi Delta Kappan, 70, 703-713.
Willson, V.L. (1989). Cognitive and developmental effects on item performance in intelligence and achievement tests for young children. Journal of Educational Measurement, 26, 103-119.
Willson, V.L. (1991). Performance assessment, psychometric theory, and cognitive learning theory: Ships crossing in the night. Contemporary Education, 62 (4), 250-254.
Young, A. & Fulwiler, T. (1986). Writing Across the Disciplines. Upper Montclair, NJ: Boynton/Cook.
Zwick, R., Donoghue, J.R., & Grima, A. (1993). Assessment of differential item functioning for performance tasks. Journal of Educational Measurement, 30 (3), 233-251.