LANGUAGE ASSESSMENT: Should We be Testing for Teamwork?
Karen Stanley, editor
As students, we often complained about bad tests and occasionally appreciated tests that we felt were fair or (even!) contributed to the process of learning. What none of us ever did was think about what it took to create a good test.
Now, however, as ESL/EFL practitioners, we have been made aware by research and experience of the complexities of designing language assessment instruments. Basic issues of validity and reliability, important as they are, are just part of test construction. Over the last 20 years, as language assessment methodologies have evolved, a wide range of other considerations have emerged. These include such issues as authenticity, potential political (mis)use of test results, the responsibility of the test developer, purpose(s) to which a test is put, and alternative methods of assessment.
As the concept of teamwork has been increasingly recognized to be an important success factor in many professional environments, it has also acquired importance in a variety of academic curricula and become a criterion in hiring decisions. With this increased focus, it is no surprise that such ability has also been considered relative to language assessment. To what degree, then, should the ability to work collaboratively be incorporated into assessment instruments?
What follows are selected posts from an April 2001 discussion of this topic on the LTEST-L email list for language testing and assessment. Contributors whose email addresses are listed welcome comments from readers. [-1-]
Hameed Esmaeili <HameedE@aol.com>
1. You may have noticed that a new game has been added to the Olympics – Synchronized Diving: two divers dive together in harmony and are beautiful to watch. They receive separate and joint scores for their performance. However, it is the team, the two together, that is ranked. This is in addition to a lot of other games where group or teamwork counts. Teamwork has been there since the first Olympics, but it seems that it is becoming more and more attractive and consequently a major trend. This shows humans’ interests. They see some kind of beauty in it. Is this beauty, this pure authenticity in modern life, applied and reflected in language tests?
The answer is: No.
2. There are numerous articles, studies, theories in applied linguistics where teamwork, collaboration, grouping, joint papers, joint projects, group learning, recasting, … are emphasized. This is now a major trend in the field of language studies. Is this trend applied and reflected in language tests (at least in most of them)?
The answer is: No.
3. When one is admitted to a college or university (even to elementary, middle, and high schools), most of the time they find themselves as a member of a team. Most professors, instructors, and teachers like their students to work with each other, do joint projects, and meanwhile learn from each other.
Is this ‘future need’ included in ‘needs analyses’?
The answer is: No.
Is this real life situation applied and reflected in language tests?
The answer is: No.
There seems to be a disparity and gap between teaching and testing.
4. Looking at people’s resumes, one notices a new frequently noted qualification: Ability to work with others as a team. Is this highly appreciated qualification applied and reflected in language tests?
The answer is: No.
5. This generation, ‘Generation X’, as they are called, have their own particular interests. These talented young people make up the majority of those who take language tests. Are their interests (those that should be highlighted and encouraged by educators) applied and reflected in language tests?
The answer is: No.
6. Is ‘creativity’ applied in language tests? …
So, what is the problem? If it is believed that language tests should reflect real-life situations, be authentic, and motivate, why don’t language testers include these in their tests? One might say that the major problem (obstacle) is that ‘animal’, construct validity. But is it really construct validity or the way it is conceptualized that acts as a barrier? The latter is correct. [-2-]
What will language tests look like in the future?
The current theoretical frameworks and the way ‘validity’ in general and ‘construct validity’ in particular are viewed need to be revisited and revised.
I hope there will be a future professional battlefield of theories to see that ‘validity’ is indeed a dynamic concept.
University of Iowa
While I agree with some of your points, I think you cast your net a bit too far in condemning much of language testing for not including ‘teamwork.’ You neglect the very important issue of test purpose. Very often, we are assessing someone in order to make a decision about that individual, e.g., admission into a university or program, placement into a program, graduation, certification, etc. In so doing, we might set up an assessment scenario that requires the candidate to interact with others or to at least simulate such interaction, i.e., engage in ‘teamwork’ activity. But the ultimate purpose for the assessment and decision from the test score are both undertaken with respect to the individual test taker.
So I guess my reaction to your posting is that any consideration of construct validity must take into account the purpose and context of the assessment, where very often we want to interpret scores and/or make decisions about an individual.
Daniel Eignor <deignor@ETS.ORG>
Hameed and others, this sounds very much like issues that I’ve heard Pam Moss at the University of Michigan discuss in a variety of contexts, but usually in the context of portfolio assessment. I’m not sure it’s construct validity issues that are the problem here, but rather the commonly used psychometric models, which are built on independent tasks and independent (but standardized) performances.
Hameed Esmaeili <HameedE@aol.com>
Dan Eignor said:
… this sounds very much like issues that I’ve heard Pam Moss at the University of Michigan discuss in a variety of contexts, but usually in the context of portfolio assessment. I’m not sure it’s construct validity issues that are the problem here, but rather the commonly used psychometric models, which are built on independent tasks and independent (but standardized) performances.
I wish I knew more about what Pam Moss is discussing. We all know what is done now has a theoretical foundation and it is easy to trace it. There are some views prior to applying any psychometric model, and those views are what I tried to address. [-3-]
Moss, P.A. (1996). Enlarging the dialogue in educational measurement: Voices from interpretive research traditions. Educational Researcher, 25, 20-28.
Moss, P.A. (1994). Can there be validity without reliability? Educational Researcher, 23, 5-12.
Moss, P.A. (1992). Shifting conceptions of validity in educational measurement: Implications for performance assessment. Review of Educational Research, 62, 229-258.
Also, I believe some other members of LTRC have directly discussed the issues covered in the above manuscripts with Pam and may be able to suggest other sources to read. Finally, the particular comment I made about psychometric models is based in part on material in a section of Pam’s 1994 Educational Researcher article where she compares hermeneutic and psychometric approaches.
Dennis Roberts <dmr@PSU.EDU>
seems like one can argue … for the most part … that if one is not able to demonstrate some skill on one’s own … then, it is kind of hard to think that their contribution to a GROUP activity would just suddenly blossom …
or, if we think about it in the reverse … what if a person could ONLY perform and contribute WHEN being a member of a group? is this person going to have many many problems engaging in societal activities? probably yes …
now, i am not going to push this too far but … it starts with the individual … and flows TO group settings …
tests that force you to act on your own behalf … ARE realistic (if they are good tests) … but, the main difficulty is that we ALSO need some balance with how people can work in more group oriented situations …
this is not making tests more realistic … it is including OTHER kinds OF realistic tests in our overall assessment plans …
balance is the key …
Hameed Esmaeili <HameedE@aol.com>
…balance is the key …
This is a very good point. To make this balance happen, what should be done? We should see whether the current theoretical frameworks in assessment allow us to do so and allow us to be more flexible in developing language tests that better reflect real-life situations. [-4-]
At the time of this post, at The Department of Linguistics and Modern English Language, Lancaster University, UK
Just to add to Craig’s answer to Hameed Esmaeili.
We have to remember that we are testing *language* and that other skills (such as ability to work in a team) must not end up as the be-all and end-all of the assessment.
Where possible, it is certainly true that an oral test should motivate students by comprising tasks which are similar to those that are so important in real life. Working on a task in a group may be very realistic and may provide plenty of scope for the use of a wide range of language activity, but we must remember that these tasks, and any ensuing interaction among the candidates, are vehicles for language use; we are testing people’s ability to use language, not their ability to work with others in a team.
Caroline Clapham said:
‘we are testing people’s ability to use language, not their ability to work with others in a team’.
And why are we not assessing what people will really be doing in their future? So, based on current frameworks, we think that we would be assessing different constructs, and that was the point that I tried to make. To see if we are really right in defining construct in language studies.
‘We have to remember that we are testing *language* and that other skills (such as ability to work in a team) must not end up as the be-all and end-all of the assessment.’
I do agree with you and this is a very good point. We all know that ‘language’ is used by people, in contexts, and for some specific purposes. And that makes it difficult to deal with ‘language’ while not including other things involved. And this again goes back to our understanding of construct.
Hameed Esmaeili wrote:
> Caroline Clapham said:
> ‘We have to remember that we are testing *language* and that other skills (such as ability to work in a team) must not end up as the be-all and end-all of the assessment.’
> I do agree with you and this is a very good point. We all know that ‘language’ is used by people, in contexts, and for some specific purposes. And that makes it difficult to deal with ‘language’ while not including other things involved. And this again goes back to our understanding of construct.
Most language testing DOES include the contexts and specific purposes you are referring to. These are important considerations in determining the test ‘tasks’ which encourage the performances (on the part of test takers) that best manifest the language abilities underlying them. Through construct definition, we clearly and specifically define such abilities, their components, and their relationship to each other, so that one testing situation can be distinguished from other seemingly similar testing situations. So, these “other things” you are referring to ARE involved in language testing; it is just that the ultimate purpose is to test the “underlying ability to use language”. [-5-]
And, this makes the analogy you draw between sports competitions and language testing questionable, because in such games it is the product/performance per se which is being evaluated and scored. Skaters, for example, lose marks if they slip while landing a jump regardless of how capable they are in their moves and how many times they have got it right in their previous attempts; it is just that single performance that counts. In most language testing this is rarely the case.
Andrew Finch <email@example.com>
Kyungpook National University, Korea
>we are testing people’s ability to use
>language, not their ability to work with others in a team
I think Hameed has raised an interesting issue here. It’s not a new concept that language is an act of communication that takes place in a social context. We are preparing people to negotiate acts of communication in such contexts, and we must test those skills.
Language learning must involve education of the whole person and testing must focus on this aspect of learning. Many tests are designed to imitate the “real” world; however, scientific research has shown that in this world of business and social life, cooperation is a much more effective survival strategy than competition. With cooperation, everyone can be successful; with competition, there must be a winner and a loser, and the winner must watch his/her back as the competition continues.
Vygotsky has shown that a group of students can learn together what none of them can learn individually. What is the problem with helping students acquire social learning skills and critical thinking skills as part of their classroom experience? This seems to me much more important than the particular subject matter, and the testing must be about those skills. Testing language ability is only 10% successful as a predictor of success in university, so let’s focus on the larger picture. There seems to be a suspicion about individuals not pulling their weight in the cooperative learning situation, and getting through on the backs of others. If so, it means that social responsibility has not been a part of the classroom experience. [-6-]
John H.A.L. de Jong <John.HAL.dejong@WXS.NL>
Language Testing Services, Netherlands
Assessment in the future…? Or brave new world?
Indeed, almost all human activity involves social skills. Doctors, lawyers, stockbrokers, teachers, priests and p…’s all need these skills to go about their business successfully.
There are more skills that could enhance the probability of success in many trades and occupations, such as mathematical literacy, psychological insight, and spatial skills. In addition, being creative (but not too adventurous), determined (but not too stubborn), audacious (but not too daring), open-minded (but not too liberal) will help to gain acceptance from colleagues and clients.
What is more, research has shown that physical appearance is a good predictor of success: a tall (but not too tall), slim (but not too skinny), healthy (but not overly muscular or tanned) appearance will certainly help to open more doors and speed up progress on the career path.
And who knows, maybe even intelligence would help.
So should we then include all these aspects in tests for barristers, physicians, bankers, politicians, language teachers, university students and US presidents?
Berscheid, E. and E. Walster. 1974. “Physical Attractiveness.” In Advances in Experimental Social Psychology, ed. L. Berkowitz, 7: 157-215. New York: Academic Press.
Collins, M. and L. Zebrowitz. 1995. “The Contributions of Appearance to Occupational Outcomes in Civilian and Military Settings.” Journal of Applied Social Psychology 25: 129-163.
Mazur, A., J. Mazur, and C. Keating. 1984. “Military Rank Attainment of a West Point Class: Effects of Cadets’ Physical Features.” American Journal of Sociology 90: 125-150.
Patzer, Gordon. 1985. The Physical Attractiveness Phenomenon. New York: Plenum.
Perusse, D. 1993. “Cultural and Reproductive Success in Industrial Societies: Testing the Relationship at the Proximate and Ultimate Levels.” Behavioral and Brain Sciences 16: 267-322.
In my current incarnation as a visiting professor, teaching curriculum and language testing to graduate students in applied linguistics and M.B.A. types here in Thailand, I suggested (only half in jest) they had better hone their golf chatter, party small talk, and funny repartees, because that’s where the promotions and getting ahead really are decided. [-7-]
When I was a grad student at OISE/UT, I had no hesitation to suggest to Alister Cumming’s class that if I really wanted to test the language ability of someone who wanted to be a Russian spy, I would hire a 12 year old Russian boy or girl to talk to the candidate for 5 minutes and make the judgement.
I was appalled during final exams here in Thailand at the scorn the professors exhibited when they found, outside the exam room, all these crib sheets written in Thai for the English reading test. Here was an entire group of future doctors, engineers, and nurses, pooling their resources to make sure everyone did well, and all the teachers did was to suggest it was “cheating”. To me it was a natural response to the challenge to get ahead in a non-competitive, cooperative, helpful spirit.
Why every year are there newly published methodologies in language approaches and pedagogy, with fancy titles, flashy colourful books, catchy “isms”, while language tests are stodgy white paper and HB pencil things that haven’t really changed since …
Yeah, I know…
In his comment on this issue, Andrew Finch said, amongst other things:
‘Testing language ability is only 10% successful as a predictor of success in university, so let’s focus on the larger picture.’
Actually, since academic success is caused by so many interacting factors (eg academic knowledge, diligence, study skills, personality, physical comfort, freedom from homesickness etc etc etc) this is a good reason for us NOT to try to ‘focus on the larger picture’. (In fact, the mixed metaphor implied by ‘focusing’ on a ‘larger picture’ indicates the danger!) I think that (L2) language testers will do a greater service to those in universities who use their test results by NOT trying to go beyond their remit and their competence. Obviously, their tests should build in – in a general and generalisable sense – an element of what Keith Johnson would call ‘real operating conditions’, but what admissions tutors need is a description of what the language performance of a candidate is likely to be in a range of situations. It is up to them to decide how this profile matches their needs, and might combine with other factors from the ‘larger picture’ to presage success or struggle. [-8-]
Dan Douglas <dandoug@IASTATE.EDU>
We’re certainly getting into Tim McNamara’s notion of a continuum between a “weak” and “strong” sense of second language performance tests (McNamara 1996, p. 43). As you know, Tim suggests that performance tests toward the strong end of the continuum will represent real-world tasks (e.g., working in a team, using a specific language as the medium) and performance judged on real-world criteria (e.g., ability to successfully work in a team). He argues that this is not strictly a language test at all, since the focus of assessment is not language ability but rather successful performance, and is in any case problematic because, owing to a lack of psychological (and often material) context, such situations cannot be fully simulated in test conditions. This amounts to what I have referred to as the “there is no airplane” syndrome, with reference to a test of English for air traffic control.
Nancy Hornberger has also written about this issue in her discussion of her own experience using her English-Spanish interlanguage to get her driving license renewed in Peru (Hornberger 1989). She in fact got the new license after many problems and frustrations, but observes: “…my communicative competence in these events resides not in the fact of my obtaining the license I set out to get (which would be a kind of performance criterion) but rather in the knowledge and ability that allowed me to suit my language use to the events in which I found myself” (p. 228).
We all know that there’s many a slip between cup and lip and sometimes events just get out of hand in the real world for reasons entirely removed from our own communicative competence (or whatever type of competence) – when the “plane” the air traffic control test-taker is “controlling” “crashes,” does this suggest a lack of communicative language ability on her part? Perhaps; perhaps a strategic breakdown; perhaps “pilot error”; perhaps “mechanical failure”. As language testers, we need to focus on communicative language ability as the object of assessment, closer to the “weak” end of McNamara’s performance assessment continuum, and upon the test-taker’s ability to assess a communicative situation established by rich input in a test session, set a communicative goal for dealing with the situation, and then marshal language resources to achieve the goal. Our goal as testers should always be to make inferences about the state of test-takers’ language ability and their ability to use language in situations that share characteristics of real-life language situations (you’ve all read Bachman and Palmer 1996, right?).
Of course, all this begs the question of what the ability is like that underlies communicative performances – whether there’s a multi-componential competence that must be adapted, as Hornberger suggests, to deal with constantly changing situations, or whether there are in some currently ineffable sense, multiple competencies related to contexts of use. But that’s for another L-TEST thread – in perhaps about 50 years… [-9-]
Tim McNamara <firstname.lastname@example.org>
The University of Melbourne
As Dan says, this debate is very interesting to me as it touches on things I’ve been thinking about ever since I developed the contextualized test of English for health professionals that has been used in Australia since 1987 (the Occupational English Test). I had a chance to compare the assessment of communication skills in role-play simulations of workplace communicative tasks (e.g. with patients) that I was proposing with similar assessments of doctors-in-training in a local medical school. I found that the criteria used in the latter were radically different. I raised this with the professor in charge of the medical school assessment and she responded, ‘Oh, I see, you’re interested in language, not in communication.’ This was a blow for me as the goal of my test development project was a procedure that would ‘test the ability of candidates to communicate effectively in the workplace’. I still don’t really see a rationale for distinguishing the communication skills of native and non-native speakers in this setting, as they have the same onerous responsibility of communicating effectively with patients and colleagues to ensure adequate health care. This means I’m not sure we should be restricting ourselves to the ‘weak’ end of the continuum I proposed, but it also means that the criteria for assessment become critically important, as John points out. We should be using our best understanding of the nature of face-to-face communication as the basis for deciding on these criteria. In this we should be looking to colleagues outside our own field to help us understand – and what they tell us, for example about the shared burden of communicative success, is problematic for procedures designed to report on ‘individual language ability’. I’m not sure where we go, but I do know we have a problem.
© Copyright rests with authors. Please cite TESL-EJ appropriately.
Editor’s Note: Dashed numbers in square brackets indicate the end of each page in the paginated ASCII version of this article, which is the definitive edition. Please use these page numbers when citing this work.