
Appendix A: developing proficiency descriptors

the criteria. Systematic piloting and feedback may follow in order to refine the wording. Groups of raters may discuss performances in relation to the definitions, and the definitions in relation to sample performances. This is the traditional way proficiency scales have been developed (Wilds 1975; Ingram 1985; Liskin-Gasparro 1984; Lowe 1985, 1986).

Qualitative methods:

These methods all involve small workshops with groups of informants and a qualitative rather than statistical interpretation of the information obtained.

No 4. Key concepts: formulation: Once a draft scale exists, a simple technique is to chop up the scale and ask informants typical of the people who will use the scale to (a) put the definitions in what they think is the right order, (b) explain why they think that, and then, once the difference between their order and the intended order has been revealed, (c) identify what key points were helping them, or confusing them. A refinement is sometimes to remove a level, giving a secondary task to identify where the gap between two levels indicates that a level is missing between them. The Eurocentres certification scales were developed in this way.

No 5. Key concepts: performances: Descriptors are matched to typical performances at those band levels to ensure coherence between what was described and what occurred. Some of the Cambridge examination guides take teachers through this process, comparing wordings on scales to grades awarded to particular scripts. The IELTS (International English Language Testing System) descriptors were developed by asking groups of experienced raters to identify ‘key sample scripts’ for each level, and then agree the ‘key features’ of each script. Features felt to be characteristic of different levels are then identified in discussion and incorporated in the descriptors (Alderson 1991; Shohamy et al. 1992).

No 6. Primary trait: Performances (usually written) are sorted by individual informants into rank order. A common rank order is then negotiated. The principle on which the scripts have actually been sorted is then identified and described at each level – taking care to highlight features salient at a particular level. What has been described is the trait (feature, construct) which determines the rank order (Mullis 1980). A common variant is to sort into a certain number of piles, rather than into rank order. There is also an interesting multi-dimensional variant on the classic approach. In this version, one first determines through the identification of key features (No 5 above) what the most significant traits are. Then one sorts the samples into order for each trait separately. Thus at the end one has an analytic, multiple trait scale rather than a holistic, primary trait one.
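As a purely illustrative aside (not part of Mullis’s procedure), the move from individual rank orders to a common one can be pictured with a small calculation over invented data: averaging the ranks given by each informant shows where agreement already exists, leaving genuine disagreements to be settled in discussion.

    # Minimal sketch: three informants' rank orders of the same five scripts (1 = best).
    # In practice the common order is negotiated in discussion; averaging the ranks is
    # just one crude way to see where the informants already agree.
    rankings = {
        "informant_1": {"script_A": 1, "script_B": 2, "script_C": 3, "script_D": 4, "script_E": 5},
        "informant_2": {"script_A": 2, "script_B": 1, "script_C": 3, "script_D": 5, "script_E": 4},
        "informant_3": {"script_A": 1, "script_B": 3, "script_C": 2, "script_D": 4, "script_E": 5},
    }

    scripts = list(next(iter(rankings.values())))
    mean_rank = {s: sum(r[s] for r in rankings.values()) / len(rankings) for s in scripts}

    # Candidate common rank order, best first; large disagreements would be discussed.
    for script in sorted(scripts, key=mean_rank.get):
        print(script, round(mean_rank[script], 2))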

No 7. Binary decisions: Another variant of the primary trait method is to first sort representative samples into piles by levels. Then in a discussion focusing on the boundaries between levels, one identifies key features (as in No 5 above). However, the feature concerned is then formulated as a short criterion question with a Yes/No answer. A tree of binary choices is thus built up. This offers the assessor an algorithm of decisions to follow (Upshur and Turner 1995).
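To make the idea of ‘an algorithm of decisions’ concrete, the sketch below walks a small tree of Yes/No criterion questions, each corresponding to a boundary between adjacent levels. The questions themselves are invented placeholders for illustration, not those used by Upshur and Turner (1995).

    # Illustrative sketch only: the criterion questions are invented placeholders.

    def assign_band(script: dict) -> int:
        """Walk a short tree of Yes/No criterion questions and return a band (1-4)."""
        if not script["conveys_basic_message"]:      # boundary between bands 1 and 2
            return 1
        if not script["ideas_are_organised"]:        # boundary between bands 2 and 3
            return 2
        if script["errors_obscure_meaning"]:         # boundary between bands 3 and 4
            return 3
        return 4

    # Example: a rater answers the three questions for one performance.
    sample = {
        "conveys_basic_message": True,
        "ideas_are_organised": True,
        "errors_obscure_meaning": False,
    }
    print(assign_band(sample))  # -> 4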

No 8. Comparative judgements: Groups discuss pairs of performances, stating which is better – and why. In this way the categories in the metalanguage used by the raters are identified, as are the salient features operating at each level. These features can then be formulated into descriptors (Pollitt and Murray 1996).

No 9. Sorting tasks: Once draft descriptors exist, informants can be asked to sort them into piles according to categories they are supposed to describe and/or according to levels. Informants can also be asked to comment on, edit/amend and/or reject descriptors, and to identify which are particularly clear, useful, relevant, etc. The descriptor pool on which the set of illustrative scales was based was developed and edited in this way (Smith and Kendall 1963; North 1996/2000).

Quantitative methods:

These methods involve a considerable amount of statistical analysis and careful interpretation of the results.

No 10. Discriminant analysis: First, a set of performance samples which have already been rated (preferably by a team) is subjected to a detailed discourse analysis. This qualitative analysis identifies and counts the incidence of different qualitative features. Multiple regression is then used to determine which of the identified features appear to be significant in determining the rating which the assessors gave. Those key features are then incorporated in formulating descriptors for each level (Fulcher 1996).
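A minimal sketch of the statistical step, using invented feature counts and ratings (Fulcher’s actual analysis is considerably more elaborate): an ordinary least-squares regression indicates which of the counted discourse features carry weight in predicting the rating awarded.

    import numpy as np

    # Hypothetical coded data: one row per performance sample; the feature labels
    # (connectors, hesitations, self-corrections) are invented for illustration.
    features = np.array([
        [12, 3, 1],
        [20, 1, 0],
        [ 5, 8, 4],
        [15, 2, 2],
        [ 8, 6, 3],
    ], dtype=float)
    ratings = np.array([3.0, 4.0, 1.0, 3.0, 2.0])   # ratings already awarded by the team

    # Ordinary least squares: which feature counts best explain the ratings?
    X = np.column_stack([np.ones(len(ratings)), features])   # add an intercept column
    coefs, *_ = np.linalg.lstsq(X, ratings, rcond=None)

    for name, b in zip(["intercept", "connectors", "hesitations", "self-corrections"], coefs):
        print(f"{name:>16}: {b:+.2f}")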

No 11. Multidimensional scaling: Despite the name, this is a descriptive technique to identify key features and the relationship between them. Performances are rated with an analytic scale of several categories. The output from the analysis technique demonstrates which categories were actually decisive in determining level, and provides a diagram mapping the proximity or distance of the different categories to each other. This is thus a research technique to identify and validate salient criteria (Chaloub-Deville 1995).
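The following sketch shows the general shape of such an analysis with invented analytic ratings and category names (not Chaloub-Deville’s data): distances between rating categories are computed from how they were used across performances and then mapped into two dimensions, so that categories which behave similarly appear close together.

    import numpy as np
    from scipy.spatial.distance import pdist, squareform
    from sklearn.manifold import MDS

    # Hypothetical analytic ratings: rows = performances, columns = rating categories.
    categories = ["range", "accuracy", "fluency", "coherence"]
    ratings = np.array([
        [3, 2, 3, 3],
        [4, 4, 3, 4],
        [2, 2, 1, 2],
        [5, 4, 4, 5],
        [3, 3, 2, 3],
    ], dtype=float)

    # Distance between categories: how differently was each pair used across performances?
    dist = squareform(pdist(ratings.T, metric="euclidean"))

    # Map the categories into two dimensions; proximity reflects similarity of use.
    coords = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(dist)

    for name, (x, y) in zip(categories, coords):
        print(f"{name:>10}: ({x:+.2f}, {y:+.2f})")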

No 12. Item response theory (IRT) or ‘latent trait’ analysis: IRT offers a family of measurement or scaling models. The most straightforward and robust one is the Rasch model, named after George Rasch, the Danish mathematician. IRT is a development from probability theory and is used mainly to determine the difficulty of individual test items in an item bank. If you are advanced, your chances of answering an elementary question are very high; if you are elementary, your chances of answering an advanced item are very low. This simple fact is developed into a scaling methodology with the Rasch model, which can be used to calibrate items to the same scale. A development of the approach allows it to be used to scale descriptors of communicative proficiency as well as test items.
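The intuition above corresponds to a simple formula: under the Rasch model the probability of a correct response depends only on the difference between the person’s ability and the item’s difficulty, both expressed in logits on a common scale. A minimal illustration:

    import math

    def rasch_probability(ability: float, difficulty: float) -> float:
        """Rasch model: probability of a correct response, given person ability
        and item difficulty expressed in logits on the same scale."""
        return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

    # An advanced learner (ability +2 logits) meets an elementary item (difficulty -2):
    print(round(rasch_probability(2.0, -2.0), 2))   # ~0.98 - success is very likely
    # An elementary learner meets an advanced item:
    print(round(rasch_probability(-2.0, 2.0), 2))   # ~0.02 - success is very unlikely
    # When ability equals difficulty, the probability is exactly 0.5:
    print(rasch_probability(1.0, 1.0))              # 0.5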

In a Rasch analysis, different tests or questionnaires can be formed into an overlapping chain through the employment of ‘anchor items’, which are common to adjacent forms. In the diagram below, the anchor items are shaded grey. In this way, forms can be targeted to particular groups of learners, yet linked into a common scale. Care must, however, be taken in this process, since the model distorts results for the high scores and low scores on each form.

[Diagram: Test A, Test B and Test C formed into an overlapping chain, with the anchor items common to adjacent forms shaded grey.]
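The linking step can be illustrated with a toy calculation using invented difficulty values: because the anchor items should have the same difficulty whichever form they appear on, the average discrepancy between their two sets of estimates gives the constant needed to place one form on the other form’s scale. This is a simple mean-shift link; operational calibrations use dedicated Rasch software and also check that the anchors behave consistently.

    # Hypothetical item difficulties (in logits) from two separately calibrated forms.
    # Items a1-a3 are the anchor items that appear on both Test A and Test B.
    test_a = {"a1": -1.2, "a2": -0.4, "a3": 0.3, "easy1": -2.0, "easy2": -1.6}
    test_b = {"a1": -2.1, "a2": -1.3, "a3": -0.6, "hard1": 1.4, "hard2": 2.0}

    anchors = ["a1", "a2", "a3"]

    # Mean-shift linking: the average discrepancy on the anchors gives the constant
    # needed to move Test B's difficulty values onto Test A's scale.
    shift = sum(test_a[i] - test_b[i] for i in anchors) / len(anchors)

    test_b_on_a_scale = {item: value + shift for item, value in test_b.items()}
    print(f"linking constant: {shift:+.2f}")
    print({item: round(value, 2) for item, value in test_b_on_a_scale.items()})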

 

 

 

 

 

The advantage of this approach is that it can provide sample-free, scale-free measurement: scaling that is independent of the samples or the tests/questionnaires used in the analysis. Scale values are provided which remain constant for future groups, provided those future subjects can be considered to belong to the same statistical population. Systematic shifts in values (e.g. due to curriculum change or to assessor training) can be identified and adjusted for. Systematic variation between types of learners can be quantified and adjusted for (Wright and Masters 1982; Linacre 1989).

There are a number of ways in which Rasch analysis can be employed to scale descriptors:

(a) Data from qualitative techniques Nos 6, 7 or 8 can be put onto an arithmetic scale with Rasch.

(b) Tests can be developed to operationalise proficiency descriptors in particular test items. Those test items can then be scaled with Rasch and their scale values taken to indicate the relative difficulty of the descriptors (Brown et al. 1992; Carroll 1993; Masters 1994; Kirsch 1995; Kirsch and Mosenthal 1995).

(c) Descriptors can be used as questionnaire items for teacher assessment of their learners (Can he/she do X?). In this way the descriptors can be calibrated directly onto an arithmetic scale in the same way that test items are calibrated in item banks.

(d) The scales of descriptors included in Chapters 3, 4 and 5 were developed in this way. The projects described in Appendices B, C and D have used Rasch to scale descriptors, and to equate the resulting scales of descriptors to each other.

In addition to its use in the development of a scale, Rasch can also be used to analyse the way in which the bands on an assessment scale are actually used. This can help to highlight loose wording, underuse of a band, or overuse of a band, and so inform revision (Davidson 1992; Milanovic et al. 1996; Stansfield and Kenyon 1996; Tyndall and Kenyon 1996).
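Option (c) above can be pictured with a toy calculation: each descriptor becomes a Yes/No questionnaire item, and one difficulty value per descriptor and one ability value per learner are estimated from the pattern of responses. The data and the deliberately crude estimation loop below are invented for illustration; operational calibrations of this kind are carried out with dedicated Rasch software and deal with perfect scores, fit statistics and so on.

    import numpy as np

    # Hypothetical "Can he/she do X?" data: rows = learners, columns = descriptors;
    # 1 = the teacher answered Yes, 0 = No.  Real data sets are of course far larger.
    responses = np.array([
        [1, 1, 1, 0, 0],
        [1, 1, 0, 0, 0],
        [1, 1, 1, 1, 0],
        [1, 0, 0, 0, 0],
        [0, 1, 1, 1, 1],
        [1, 1, 1, 1, 0],
    ])

    n_learners, n_items = responses.shape
    ability = np.zeros(n_learners)      # learner parameters (logits)
    difficulty = np.zeros(n_items)      # descriptor parameters (logits)

    # Very small joint maximum-likelihood loop (gradient ascent on the Rasch likelihood).
    for _ in range(200):
        p = 1.0 / (1.0 + np.exp(-(ability[:, None] - difficulty[None, :])))
        ability += 0.1 * (responses - p).sum(axis=1)     # rises if a learner succeeds more than predicted
        difficulty -= 0.1 * (responses - p).sum(axis=0)  # rises if an item is endorsed less than predicted
        difficulty -= difficulty.mean()                  # fix the origin of the scale

    for i, d in enumerate(difficulty, start=1):
        print(f"descriptor {i}: difficulty {d:+.2f}")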


Users of the Framework may wish to consider and where appropriate state:

the extent to which grades awarded in their system are given shared meaning through common definitions;

which of the methods outlined above, or which other methods, are used to develop such definitions.


Gipps, C. 1994: Beyond testing. London: Falmer Press.
    Promotion of teacher ‘standards-oriented assessment’ in relation to common reference points built up by networking. Discussion of problems caused by vague descriptors in the English National Curriculum. Cross-curricular.

Kirsch, I.S. 1995: Literacy performance on three scales: definitions and results. In Literacy, economy and society: Results of the first international literacy survey. Paris: Organisation for Economic Co-operation and Development (OECD): 27–53.
    Simple non-technical report on a sophisticated use of Rasch to produce a scale of levels from test data. Method developed to predict and explain the difficulty of new test items from the tasks and competences involved – i.e. in relation to a framework.

Kirsch, I.S. and Mosenthal, P.B. 1995: Interpreting the IEA reading literacy scales. In Binkley, M., Rust, K. and Winglee, M. (eds.) Methodological issues in comparative educational studies: The case of the IEA reading literacy study. Washington, D.C.: US Department of Education, National Center for Education Statistics: 135–192.
    More detailed and technical version of the above, tracing the development of the method through three related projects.

Linacre, J.M. 1989: Multi-faceted Measurement. Chicago: MESA Press.
    Seminal breakthrough in statistics allowing the severity of examiners to be taken into account in reporting a result from an assessment. Applied in the project to develop the illustrative descriptors to check the relationship of levels to school years.

Liskin-Gasparro, J.E. 1984: The ACTFL proficiency guidelines: Gateway to testing and curriculum. In: Foreign Language Annals 17/5, 475–489.
    Outline of the purposes and development of the American ACTFL scale from its parent Foreign Service Institute (FSI) scale.

Lowe, P. 1985: The ILR proficiency scale as a synthesising research principle: the view from the mountain. In: James, C.J. (ed.): Foreign Language Proficiency in the Classroom and Beyond. Lincolnwood (Ill.): National Textbook Company.
    Detailed description of the development of the US Interagency Language Roundtable (ILR) scale from the FSI parent. Functions of the scale.

Lowe, P. 1986: Proficiency: panacea, framework, process? A reply to Kramsch, Schulz, and particularly, to Bachman and Savignon. In: Modern Language Journal 70/4, 391–397.
    Defence of a system that worked well – in a specific context – against academic criticism prompted by the spread of the scale and its interviewing methodology to education (with ACTFL).

Masters, G. 1994: Profiles and assessment. Curriculum Perspectives 14/1: 48–52.
    Brief report on the way Rasch has been used to scale test results and teacher assessments to create a curriculum profiling system in Australia.

Milanovic, M., Saville, N., Pollitt, A. and Cook, A. 1996: Developing rating scales for CASE: Theoretical concerns and analyses. In Cumming, A. and Berwick, R. (eds.) Validation in language testing. Clevedon, Avon: Multilingual Matters: 15–38.
    Classic account of the use of Rasch to refine a rating scale used with a speaking test – reducing the number of levels on the scale to the number assessors could use effectively.

Mullis, I.V.S. 1981: Using the primary trait system for evaluating writing. Manuscript No. 10-W-51. Princeton, N.J.: Educational Testing Service.
    Classic account of the primary trait methodology in mother tongue writing to develop an assessment scale.

North, B. 1993: The development of descriptors on scales of proficiency: perspectives, problems, and a possible methodology. NFLC Occasional Paper, National Foreign Language Center, Washington D.C., April 1993.
    Critique of the content and development methodology of traditional proficiency scales. Proposal for a project to develop the illustrative descriptors with teachers and scale them with Rasch from teacher assessments.

North, B. 1994: Scales of language proficiency: a survey of some existing systems. Strasbourg: Council of Europe CC-LANG (94) 24.
    Comprehensive survey of curriculum scales and rating scales later analysed and used as the starting point for the project to develop the illustrative descriptors.

North, B. 1996/2000: The development of a common framework scale of language proficiency. PhD thesis, Thames Valley University. Reprinted 2000: New York, Peter Lang.
    Discussion of proficiency scales and how models of competence and language use relate to scales. Detailed account of the development steps in the project which produced the illustrative descriptors – problems encountered, solutions found.

North, B. forthcoming: Scales for rating language performance in language tests: descriptive models, formulation styles and presentation formats. TOEFL Research Paper. Princeton, NJ: Educational Testing Service.
    Detailed analysis and historical survey of the types of rating scales used with speaking and writing tests: advantages, disadvantages, pitfalls, etc.

North, B. and Schneider, G. 1998: Scaling descriptors for language proficiency scales. Language Testing 15/2: 217–262.
    Overview of the project which produced the illustrative descriptors. Discusses results and stability of the scale. Examples of instruments and products in an appendix.

Pollitt, A. and Murray, N.L. 1996: What raters really pay attention to. In Milanovic, M. and Saville, N. (eds.) Performance testing, cognition and assessment. Studies in Language Testing 3. Selected papers from the 15th Language Testing Research Colloquium, Cambridge and Arnhem, 2–4 August 1993. Cambridge: University of Cambridge Local Examinations Syndicate: 74–91.
    Interesting methodological article linking repertory grid analysis to a simple scaling technique to identify what raters focus on at different levels of proficiency.

Scarino, A. 1996: Issues in planning, describing and monitoring long-term progress in language learning. In Proceedings of the AFMLTA 10th National Languages Conference: 67–75.
    Criticises the use of vague wording and lack of information about how well learners perform in typical UK and Australian curriculum profile statements for teacher assessment.

Scarino, A. 1997: Analysing the language of frameworks of outcomes for foreign language learning. In Proceedings of the AFMLTA 11th National Languages Conference: 241–258.
    As above.

Schneider, G. and North, B. 1999: ‘In anderen Sprachen kann ich …’ Skalen zur Beschreibung, Beurteilung und Selbsteinschätzung der fremdsprachlichen Kommunikationsfähigkeit. Bern/Aarau: NFP 33/SKBF (Umsetzungsbericht).
    Short report on the project which produced the illustrative scales. Also introduces the Swiss version of the Portfolio (40 page A5).

Schneider, G. and North, B. 2000: ‘Dans d’autres langues, je suis capable de …’ Echelles pour la description, l’évaluation et l’auto-évaluation des compétences en langues étrangères. Berne/Aarau: PNR33/CSRE (rapport de valorisation).
    As above.

Schneider, G. and North, B. 2000: Fremdsprachen können – was heisst das? Skalen zur Beschreibung, Beurteilung und Selbsteinschätzung der fremdsprachlichen Kommunikationsfähigkeit. Chur/Zürich: Verlag Rüegger AG.
    Full report on the project which produced the illustrative scales. Straightforward chapter on scaling in English. Also introduces the Swiss version of the Portfolio.

Shohamy, E., Gordon, C.M. and Kraemer, R. 1992: The effect of raters’ background and training on the reliability of direct writing tests. Modern Language Journal 76: 27–33.
    Simple account of a basic, qualitative method of developing an analytic writing scale. Led to astonishing inter-rater reliability between untrained non-professionals.

Skehan, P. 1984: Issues in the testing of English for specific purposes. In: Language Testing 1/2, 202–220.
    Criticises the norm-referencing and relative wording of the ELTS scales.

Smith, P.C. and Kendall, J.M. 1963: Retranslation of expectations: an approach to the construction of unambiguous anchors for rating scales. In: Journal of Applied Psychology 47/2.
    The first approach to scaling descriptors rather than just writing scales. Seminal. Very difficult to read.

Stansfield, C.W. and Kenyon, D.M. 1996: Comparing the scaling of speaking tasks by language teachers and the ACTFL guidelines. In Cumming, A. and Berwick, R. (eds.) Validation in language testing. Clevedon, Avon: Multilingual Matters: 124–153.
    Use of Rasch scaling to confirm the rank order of tasks which appear in the ACTFL guidelines. Interesting methodological study which inspired the approach taken in the project to develop the illustrative descriptors.

Takala, S. and Kaftandjieva, F. (forthcoming): Council of Europe scales of language proficiency: A validation study. In Alderson, J.C. (ed.) Case studies of the use of the Common European Framework. Council of Europe.
    Report on the use of a further development of the Rasch model to scale language self-assessments in relation to adaptations of the illustrative descriptors. Context: DIALANG project; trials in relation to Finnish.

Tyndall, B. and Kenyon, D. 1996: Validation of a new holistic rating scale using Rasch multifaceted analysis. In Cumming, A. and Berwick, R. (eds.) Validation in language testing. Clevedon, Avon: Multilingual Matters: 9–57.
    Simple account of the validation of a scale for ESL placement interviews at university entrance. Classic use of multi-faceted Rasch to identify training needs.

Upshur, J. and Turner, C. 1995: Constructing rating scales for second language tests. English Language Teaching Journal 49/1: 3–12.
    Sophisticated further development of the primary trait technique to produce charts of binary decisions. Very relevant to the school sector.

Wilds, C.P. 1975: The oral interview test. In: Spolsky, B. and Jones, R. (eds.) Testing language proficiency. Washington, D.C.: Center for Applied Linguistics, 29–44.
    The original coming out of the original language proficiency rating scale. Worth a careful read to spot nuances lost in most interview approaches since then.
