WHEADON, CHRISTOPHER BRIAN (2011) An Item Response Theory Approach to the Maintenance of
Standards in Public Examinations in England. Doctoral thesis, Durham University.
Abstract
Every year outcomes from public examinations in the UK rise: politicians congratulate pupils on their hard-earned achievement; the media question whether this achievement is real; those responsible for administering the examinations defend their standards; various subject councils and employers decry the failings of candidates with high grades; admissions officers from the elite universities report their struggle with the decrease in discrimination in grades achieved; and academics debate what it means to compare standards from one year to the next. The debate cannot be easily resolved because examination results are put to many purposes, some of which are more suited to certain definitions of comparability than others. In procedural terms, however, it should be relatively straightforward to evaluate the strength of the evidence that is put forward on the comparability of standards against various definitions.
Broadly, solely in terms of discrimination, the statistical evidence for the maintenance of standards over time and between qualifications can be evaluated by reference to measures such as model fit, significance and effect size. An evaluation of the literature suggests that predictive statistical models, where employed in the maintenance of standards to meet definitions of cohort referencing, tend to be robust. Beyond discrimination, measures of performance standards are required to support inferences drawn from grades about what candidates can actually do. These are, and have been for many years, underpinned by processes reliant on human judgement. An evaluation of the literature suggests that judgement provides very weak evidence and is subject to unknown bias. The combination of statistical and judgemental evidence is poorly specified, has no theoretical basis and is therefore impossible to evaluate. If anything more than pure cohort referencing is required from public examinations in the UK, there is clearly a need to explore models with a sound theoretical basis whose evidence can be evaluated in terms of model fit, significance and effect size.
The task of maintaining a performance standard can essentially be reduced, under test theory, to making comparisons between persons that are independent of the items on the basis of which those comparisons are made. Test theory, however, has been applied only sparingly to comparability issues in UK public examinations. This study considers which test theory model would be best suited to the examinations in use in the UK, examines issues of model fit under frequentist and Bayesian frameworks, compares the results from different test equating methods, and considers the practical issues of implementing a test equating design under the given constraints of the UK examination system.
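The item-free person comparison referred to above can be illustrated with the dichotomous Rasch model; the following is a sketch in standard notation (θ_p for person ability, b_i for item difficulty, neither taken from the thesis itself):

```latex
% Dichotomous Rasch model: probability that person p answers item i correctly
\[
P(X_{pi} = 1 \mid \theta_p, b_i) \;=\; \frac{\exp(\theta_p - b_i)}{1 + \exp(\theta_p - b_i)}
\]
% The log-odds difference between two persons p and q on any common item i
% depends only on their abilities, not on which item was used:
\[
\log\frac{P(X_{pi}=1)}{P(X_{pi}=0)} \;-\; \log\frac{P(X_{qi}=1)}{P(X_{qi}=0)} \;=\; \theta_p - \theta_q
\]
```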
To begin with, the Rasch model and the One Parameter Logistic Model (OPLM) were fitted to operational data gathered from examinations in a range of subject domains where marking reliability would not be considered a potential confound. In this framework the Rasch model requirement of a single discrimination parameter across items appeared overly restrictive. Further, potential issues with model fit were highlighted relating to dimensionality, guessing and weak local independence. More complex models were therefore pursued under a Bayesian framework. Posterior Predictive Model Checking procedures and the Deviance Information Criterion confirmed that a model which allowed discrimination to vary across items, such as the two-parameter Item Response Theory model, would produce better model predictions. Use of the Multi-Class Mixture Rasch Model suggested that multidimensionality due to a confounding speededness factor could result in misleading inferences being drawn from unidimensional models. The Testlet Response Theory model showed enhanced predictions where weak local independence was correctly specified; however, it proved difficult to specify where this weak local independence was to be expected. When tests from one of the examinations particularly affected by speededness were equated, the OPLM proved more robust to the confounding speededness factor than the Rasch model.
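The models compared here differ mainly in how item discrimination is handled. A minimal sketch of the two-parameter logistic (2PL) form, with a_i as a generic symbol for discrimination (not notation from the thesis), shows the parameter the Rasch model constrains to be equal across items:

```latex
% Two-parameter logistic (2PL) IRT model: discrimination a_i varies across items
\[
P(X_{pi} = 1 \mid \theta_p, a_i, b_i) \;=\; \frac{\exp\bigl(a_i(\theta_p - b_i)\bigr)}{1 + \exp\bigl(a_i(\theta_p - b_i)\bigr)}
\]
% Setting a_i equal to a common constant for all items recovers the Rasch model;
% the OPLM instead fixes the a_i to pre-specified integer values rather than estimating them.
```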
A Post-equating Non-Equivalent Groups Design was then set up as an experiment using a set of relatively simple Science examinations and candidates at a later stage in their programme of study than those who would take the live examinations in order to understand some of the practical issues involved in equating designs. The study found that item parameters were not stable across samples due to context effects, school effects and maturity effects. These results were partly due to the scale of study, which, though small, still produced reasonably sensible outcomes. It is suggested that more care paid to the context of linking items, their underlying construct, and the sampling of schools would yield more robust results. Finally, a qualitative exploration of views related to test equating designs suggested that teachers, pupils and examiners would not reject the possibility of embedding equating items into live tests.
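As an illustration of the item-parameter linking such a design relies on, here is a minimal Python sketch of the mean-sigma method for placing new-form item difficulties on a reference scale via common (anchor) items; the function name and the numbers are illustrative only and are not drawn from the thesis (for Rasch-calibrated forms a simple shift would suffice, but the general form is shown):

```python
import numpy as np

def mean_sigma_constants(anchor_ref, anchor_new):
    """Mean-sigma linking: constants A, B such that b_ref ~ A * b_new + B
    for the anchor items shared by the reference and new forms."""
    A = np.std(anchor_ref, ddof=1) / np.std(anchor_new, ddof=1)
    B = np.mean(anchor_ref) - A * np.mean(anchor_new)
    return A, B

# Illustrative anchor-item difficulties estimated separately on each form
anchor_ref = np.array([-1.2, -0.4, 0.1, 0.8, 1.5])   # reference-form scale
anchor_new = np.array([-1.0, -0.1, 0.4, 1.1, 1.9])   # new-form scale

A, B = mean_sigma_constants(anchor_ref, anchor_new)

# Place all new-form item difficulties on the reference scale
new_form_b = np.array([-1.6, -0.7, 0.0, 0.6, 1.3, 2.0])
linked_b = A * new_form_b + B
print(f"A = {A:.3f}, B = {B:.3f}")
print("Linked difficulties:", np.round(linked_b, 3))
```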
For examinations where marking reliability is not considered an issue, the results reported here suggest that the use of test theory could provide a unified theoretical framework for the maintenance of standards in UK public examinations, one which would allow the strength of the evidence presented to be evaluated. This would represent a substantial improvement over the current situation, in which no comprehensive or coherent evaluation can be made. The time and investment required to introduce such a framework, however, are also substantial: a suitable technical infrastructure is required, as well as psychometric expertise. The alternative is to revert to an examinations system that is essentially cohort referenced and focuses on discrimination between candidates in any one year, rather than attempting, as it cannot successfully do, to assure the quality of performance standards from one year to the next.
Item Type: Thesis (Doctoral)
Award: Doctor of Philosophy
Keywords: Item Response Theory; Examinations; Assessment; A-levels; GCSEs; Rasch; test-equating; IRT; OPLM; WinBUGS; Bayesian modelling
Faculty and Department: Faculty of Social Sciences and Health > Education, School of
Thesis Date: 2011
Copyright: Copyright of this thesis is held by the author
Deposited On: 28 Mar 2011 10:23