Aslantas, Ismail (2021) How good are value-added measures of teacher performance? A review and empirical investigation using data from Turkey. Doctoral thesis, Durham University.
Education systems all over the world aim to provide good-quality education for their citizens. This requires a good supply of quality teachers. The role of teachers is now more complex than ever before. Consequently, evaluating the quality of a teacher has also become more complex. While we may feel that we know intuitively what an effective teacher looks like, there is little consensus on how best to measure or capture the essence of a good teacher. Classroom observation protocols, interviews and surveys with teachers and pupils are commonly used to assess teachers. Increasingly, governments and schools are using standardised pupil test scores in teacher performance appraisal, estimating how much difference teachers make to student attainment by comparing the progress their students make. This is seen as perhaps more objective or fair because students' test scores are considered objective measures. Such evaluations of teachers, based on what are known as value-added models (VAMs), are increasingly used to measure teacher effectiveness for high-stakes decisions, such as teachers' salaries and promotions. Teachers are rewarded or penalised based on these value-added measures.
VAMs have attracted considerable attention in recent years. Many researchers have raised concerns about their validity and reliability. There are also concerns about VAMs' ability to predict the effectiveness of teachers consistently. Value-added measures of teachers are known to vary from year to year and from subject to subject. Different value-added models can also produce different estimates of teacher performance depending on the student achievement test scores used.
This study adds to the current debates by examining the stability of VAMs to see whether teacher effectiveness can be predicted consistently using different parameters, such as observable student, teacher/classroom and school characteristics, the number of student test scores obtained over time, and the data analysis methods used. Value-added measures can only be useful in estimating teacher effectiveness if they produce consistent results for the same teachers across time for different students.
This new research begins with a systematic review of the existing literature examining the stability of different VAMs as a measure of teacher performance. Of 1,439 results, 50 studies met the inclusion criteria for the synthesis. Each of these studies was given a padlock rating for the trustworthiness of its findings based on four criteria (such as research design and threats to validity), using a bespoke assessment tool. Studies were rated from 0 (very weak) to 4 padlocks (the most secure that can be expected). Since the main research question (the stability of estimates) is descriptive, correlational/comparative studies are appropriate in design, and most studies retrieved were of this kind. Some of these were rated the highest, as they were large-scale, allowed random teacher-student allocation, and had low attrition. The majority of the studies in the review were rated 3 padlocks because they employed administrative/panel data in which students are not randomly assigned to teachers, and/or were smaller or had higher attrition.
The strongest studies revealed that using one prior attainment score is sufficient to predict teacher performance. Using additional prior test scores does not increase the stability of value-added teacher effectiveness estimates consistently. Including student, school and teacher/classroom-level variables adds little to the predictive power of teacher performance assessment models. This suggests that these variables are not good predictors of teacher effectiveness. The systematic review found no evidence that any particular data analysis method is better in its ability to estimate teachers' effectiveness reliably.
Most studies in the review were conducted in the US using national administrative data. To see if the findings also apply in other contexts, longitudinal data on five teaching subjects (maths, Turkish, science, history, and English) from one province in Turkey was then used to test the stability of value-added estimates. The data included 35,435 Grade 8 students (age 13-14, equivalent to Year 9 in the UK), matched to 1,027 teachers. To test how much of the progress in students' academic achievement can be attributed to a teacher from one year to the next, a series of regression analyses was run. Models included contextual predictors at the student, school, and teacher/classroom level.
Consistent with the findings of the systematic review, the results show that the best predictor of students' later test scores is their prior attainment. Using additional years' test scores instead of a single prior-year attainment score contributed little to improving value-added teacher effectiveness estimates. Including other factors, such as student, teacher, and school characteristics, in the model also explains very little of the variation in students' test scores once prior attainment is taken into account (although the data on teacher characteristics was limited in the dataset). Correlation analyses suggested that there was no meaningful relationship between teacher effectiveness scores and the teacher/classroom-level variables.
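The "prior attainment dominates" finding can be illustrated with a minimal regression sketch. The data and effect sizes below are synthetic illustrative assumptions, not the thesis's Turkish dataset: once a prior-year score is in the model, adding a further covariate barely moves the explained variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Synthetic cohort (all numbers are illustrative assumptions).
prior = rng.normal(50, 10, n)             # prior-year attainment score
covariate = rng.normal(0, 1, n)           # e.g. a student background variable
current = 0.8 * prior + 0.5 * covariate + rng.normal(0, 5, n)

def r2(X, y):
    """R-squared of an OLS fit with intercept, via least squares."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

r2_prior = r2(prior, current)
r2_full = r2(np.column_stack([prior, covariate]), current)

print(f"R^2, prior attainment only: {r2_prior:.3f}")
print(f"R^2, prior + covariate:     {r2_full:.3f}")  # gain is marginal
```

Under these assumptions the extra covariate raises R-squared by well under one percentage point, mirroring the pattern reported above.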
Interestingly, teacher experience, regardless of whether it refers to their total experience or only that in their current schools, is negatively related to teacher effectiveness scores. In other words, more experienced teachers tend to have lower effectiveness scores on the value-added estimate. There was no evidence that teachers are more effective in smaller classes. Only a modest correlation was found between class size and teacher effectiveness. Intriguingly, students in large classes tend to have more “effective” teachers in value-added terms (except in history), although the difference is minimal.
The analysis also found that teachers’ previous effectiveness scores had little or no relationship with their current effectiveness scores, regardless of teaching subject. Consistent with the literature in the review, this study also found that teacher effectiveness scores based on value-added estimates vary substantially across years. This means that the same teacher can be considered “effective” in one year and “ineffective” in another. This casts doubt on the reliability and meaningfulness of value-added measures.
As with previous studies in the systematic review, there is no evidence from the Turkish data that any single value-added approach is superior to any other approach regarding the ability to consistently estimate teachers’ effectiveness. There is no advantage in using more sophisticated statistical models.
The findings of this study suggest that regardless of the number of test scores, the variables used, or the data analysis methods, there is no consistent or reliable way of measuring teacher effectiveness. This highlights the danger of using value-added models to measure teacher effectiveness. Studies suggested that some of the inconsistencies could be the result of measurement error and the timing of the test. There is, therefore, a risk of misclassifying teachers as “effective” or “ineffective”. Some teachers may be deemed “effective” on one test but not another simply based on when the tests are scheduled. These findings have important implications for policy and practice. Value-added models should not be used to make high-stakes personnel decisions. They may have some value for research purposes, or to provide formative feedback to headteachers about a class or a teacher as part of a larger set of evidence.
One major limitation of VAMs is that they measure teacher performance using tests designed to measure student performance. The assumption is that student performance is directly related to teacher quality. While there has been a lot of research on developing teacher quality, measuring teacher quality is itself problematic, and it has been one of the leading issues in education policy. A critical question is not how effective teachers are, but what the purpose of evaluating teacher performance is. If such an exercise aims to differentiate “effective” from “ineffective” teachers, then, given that there is no reliable method, and no method that has been robustly tested and shown to work in identifying effective teachers, why are we still doing it? To improve teachers’ effectiveness and keep them updated with robustly tested and proven teaching approaches, it might be better to provide teachers with training and professional development covering pedagogic skills, social and personal relationship skills, behavioural management, and subject knowledge. Assuming that classroom teachers have gone through teacher training and are certified, they should be qualified to teach. If they are not deemed “effective”, that is perhaps a failure of the selection and training process more than of the quality of the individual teacher.
Another major limitation of VAMs is that they are comparative and zero-sum. For a teacher to be deemed effective, another must be deemed ineffective. Thus, if all teachers were actually effective (or ineffective), a VAM would still assess up to half of them to be ineffective (or effective). They are not fit for purpose.
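The zero-sum point is simple arithmetic: because value-added compares each teacher to the cohort average, roughly half must fall below it no matter how much progress every class makes. A toy illustration (the 8-10 point gain range is an arbitrary assumption):

```python
import numpy as np

rng = np.random.default_rng(2)

# Suppose every teacher genuinely adds 8-10 points of progress: all are
# "effective" in absolute terms, yet a relative VAM still labels those
# below the cohort mean as "ineffective".
gains = rng.uniform(8, 10, 1000)
labelled_ineffective = np.mean(gains < gains.mean())
print(f"share labelled ineffective: {labelled_ineffective:.2f}")
```

About half the cohort is flagged, purely as an artefact of the relative metric.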
Item Type: Thesis (Doctoral)
Award: Doctor of Philosophy
Faculty and Department: Faculty of Social Sciences and Health > Education, School of
Copyright: Copyright of this thesis is held by the author
Deposited On: 12 Oct 2021 11:18