The use of AES for high-stakes testing in education has generated significant backlash, with opponents pointing to research that computers cannot yet grade writing accurately and arguing that their use for such purposes promotes teaching writing in reductive ways. As early as 1982, a UNIX program called Writer's Workbench was able to offer punctuation, spelling, and grammar advice. The Criterion Online Writing Evaluation Service uses the e-rater engine to provide both scores and targeted feedback.
The Intelligent Essay Assessor is now a product from Pearson Educational Technologies and is used for scoring within a number of commercial products and state and national exams. Lawrence Rudner has done some work with Bayesian scoring and developed a system called BETSY (Bayesian Essay Test Scoring sYstem).
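A Bayesian approach like Rudner's treats essay scoring as text classification: estimate how likely each word is under each score category, then assign the score with the highest posterior probability. The sketch below is an illustrative multinomial naive Bayes scorer with add-one smoothing, not BETSY's actual implementation; the function names and the use of raw word counts as features are assumptions for the example.

```python
# Illustrative naive Bayes essay scorer (not BETSY itself):
# estimate per-score word likelihoods, then pick the most probable score.
from collections import Counter, defaultdict
import math

def train(essays, scores):
    """Count word frequencies per score category and collect the vocabulary."""
    word_counts = defaultdict(Counter)   # score -> word frequencies
    score_counts = Counter(scores)       # score -> number of training essays
    vocab = set()
    for text, score in zip(essays, scores):
        words = text.lower().split()
        word_counts[score].update(words)
        vocab.update(words)
    return word_counts, score_counts, vocab

def classify(text, word_counts, score_counts, vocab):
    """Return the score with the highest posterior (log) probability."""
    total = sum(score_counts.values())
    best_score, best_logp = None, float("-inf")
    for score in score_counts:
        logp = math.log(score_counts[score] / total)          # prior
        denom = sum(word_counts[score].values()) + len(vocab)  # add-one smoothing
        for w in text.lower().split():
            logp += math.log((word_counts[score][w] + 1) / denom)
        if logp > best_logp:
            best_score, best_logp = score, logp
    return best_score
```

In practice a real system would use far richer features than raw word counts, but the structure of the classifier is the same: a small number of discrete grade categories, each with its own learned likelihood model.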
Rising education costs have led to pressure to hold the educational system accountable for results by imposing standards.
The advance of information technology promises to measure educational achievement at reduced cost. By 1990, desktop computers had become so powerful and so widespread that AES was a practical possibility. Eventually, Page sold PEG to Measurement Incorporated.
The objective of AES is to classify a large set of textual entities into a small number of discrete categories, corresponding to the possible grades (for example, the numbers 1 to 6).
Therefore, it can be considered a problem of statistical classification.

Before computers entered the picture, high-stakes essays were typically given scores by two trained human raters. If the scores differed by more than one point, a third, more experienced rater would settle the disagreement. In this system, there is an easy way to measure reliability: inter-rater agreement. If raters do not consistently agree within one point, their training may be at fault. Agreement is reported as three figures, each a percent of the total number of essays scored: exact agreement (the two raters gave the essay the same score), adjacent agreement (the raters differed by at most one point; this includes exact agreement), and extreme disagreement (the raters differed by more than two points). Expert human graders were found to achieve exact agreement on 53% to 81% of all essays, and adjacent agreement on 97% to 100%.

The intent of the Automated Student Assessment Prize (ASAP) competition was to demonstrate that AES can be as reliable as human raters, or more so. This competition also hosted a separate demonstration among nine AES vendors on a subset of the ASAP data.

Several factors have contributed to a growing interest in AES. Among them are cost, accountability, standards, and technology.
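The three agreement figures described above are simple proportions over pairs of scores. A minimal sketch (the function name and the sample scores are illustrative):

```python
def agreement_figures(rater1, rater2):
    """Return (exact, adjacent, extreme) agreement rates as percents.

    exact:    the two raters gave the same score
    adjacent: the raters differed by at most one point (includes exact)
    extreme:  the raters differed by more than two points
    """
    n = len(rater1)
    exact = sum(a == b for a, b in zip(rater1, rater2))
    adjacent = sum(abs(a - b) <= 1 for a, b in zip(rater1, rater2))
    extreme = sum(abs(a - b) > 2 for a, b in zip(rater1, rater2))
    return 100 * exact / n, 100 * adjacent / n, 100 * extreme / n

# The pre-computer resolution rule: a third rater is needed
# whenever the first two scores differ by more than one point.
def needs_third_rater(a, b):
    return abs(a - b) > 1
```

For example, `agreement_figures([4, 3, 6, 2], [4, 4, 3, 2])` yields 50% exact agreement, 75% adjacent agreement, and 25% extreme disagreement for those four essays.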