Zing Forum

Robustness of LLM Automated Scoring Systems: An Empirical Analysis Against Construct-Irrelevant Factors

This article provides an in-depth analysis of a recent study on the robustness of large language model (LLM) automated scoring systems, exploring their performance when faced with construct-irrelevant factors such as meaningless text padding, spelling errors, changes in writing complexity, and off-topic responses. The study found that, unlike traditional scoring systems, LLM systems have a distinctive penalty mechanism for text repetition and are highly sensitive to off-topic content.

Tags: LLM automated scoring · educational assessment · robustness · construct-irrelevant factors · situational judgment tests · adversarial attacks · large language models
Published 2026-03-27 01:29 · Recent activity 2026-03-28 06:48 · Estimated read 6 min

Section 01

[Introduction] Key Findings of the Robustness Study on LLM Automated Scoring Systems

This article presents an empirical analysis of the robustness of LLM automated scoring systems, examining their performance when faced with construct-irrelevant factors such as meaningless text padding, spelling errors, changes in writing complexity, and off-topic responses. The study found that, unlike traditional systems, LLM scorers penalize text repetition; that they are highly sensitive to off-topic content; and that they are notably robust to spelling errors, adjustments in writing complexity, and certain kinds of meaningless padding (e.g., ability prompt sentences, scenario restatements, formulaic clichés).


Section 02

Research Background and Motivation

Automated scoring systems have been used in educational assessment for decades, evolving from manual feature engineering to neural network and Transformer models, with scoring performance now comparable to human raters. However, they have long been vulnerable to construct-irrelevant factors, i.e., text features unrelated to the ability being assessed. Early studies found that tactics such as text repetition and injection of specific vocabulary can distort scores; with the rise of LLMs, their own limitations (such as hallucination) have made robustness research even more pressing.


Section 03

Research Design and Methodology

A dual-architecture LLM scoring system (combining "LLM as judge" feature extraction with a transparent regression model) was used to score open-ended short-answer responses to Situational Judgment Tests (SJTs) on four key ability dimensions: intrapersonal skills, interpersonal skills, social-ethical responsibility, and critical thinking and problem-solving. The sample comprised 26,571 responses from 910 students, from which a stratified random sample of 545 responses was drawn, covering 30 questions and a range of quality levels.
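The two-stage design described above can be sketched as follows. This is a minimal, hypothetical illustration, not the study's actual implementation: the feature names, weights, and the placeholder "judge" logic are all assumptions, and in the real system an LLM would rate construct-relevant traits rather than simple text statistics.

```python
def llm_judge_features(response: str) -> dict:
    """Stand-in for the 'LLM as judge' stage. In the real system an LLM
    rates construct-relevant traits; here we derive placeholder features
    from the text so the pipeline is runnable end to end."""
    tokens = response.split()
    return {
        "relevance": 1.0,  # an LLM would rate topical relevance here
        "reasoning_depth": min(len(tokens) / 50.0, 1.0),
        "ethical_awareness": 0.5,  # placeholder constant
    }

def transparent_regression(features: dict, weights: dict, bias: float = 0.0) -> float:
    """Second stage: an interpretable linear model maps the judged
    features to a final score, keeping the scoring rule auditable."""
    return bias + sum(weights[k] * v for k, v in features.items())

# Hypothetical weights for one ability dimension (e.g. critical thinking).
WEIGHTS = {"relevance": 2.0, "reasoning_depth": 2.0, "ethical_awareness": 1.0}

score = transparent_regression(
    llm_judge_features("I would first talk to both classmates to understand the conflict."),
    WEIGHTS,
)
```

Keeping the regression stage transparent means the mapping from judged features to final score can be inspected directly, which is one motivation for this kind of dual architecture.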


Section 04

Experimental Design and Key Findings

Experiment 1: Impact of Meaningless Text

  • Repeating the original text: LLM systems penalize repeated text (the opposite of the score-inflating effect seen in earlier Transformer-based systems);
  • Ability prompt sentences, scenario restatements, formulaic clichés: scores barely change, indicating strong robustness.
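The padding conditions above can be expressed as a small perturbation generator. This is a hedged sketch modeled on the study's conditions; the function name, the sample clichés, and the restatement sentence are illustrative assumptions, not the study's materials.

```python
import random

def pad_response(response: str, kind: str, seed: int = 0) -> str:
    """Append construct-irrelevant padding to a response to probe the
    scorer. The variants loosely mirror the study's conditions."""
    rng = random.Random(seed)
    if kind == "repeat":
        # Duplicate the whole answer: the study found LLM scorers
        # penalize this, unlike earlier Transformer-based scorers.
        return response + " " + response
    if kind == "cliche":
        cliches = [  # hypothetical formulaic filler sentences
            "In conclusion, communication is very important.",
            "Everyone should work together to solve problems.",
        ]
        return response + " " + rng.choice(cliches)
    if kind == "restate":
        # Prepend a content-free restatement of the scenario.
        return "The scenario describes a difficult situation. " + response
    raise ValueError(f"unknown padding kind: {kind}")
```

Scoring each perturbed variant and comparing against the original score is then enough to reproduce the qualitative pattern reported: a drop under "repeat", near-zero change under "cliche" and "restate".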

Experiment 2: Impact of Writing Complexity

  • Spelling errors: scores remain stable even at a 50% character error rate, showing high tolerance;
  • Reading-difficulty adjustment: changes in vocabulary and sentence complexity do not affect scores.
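A character-error perturbation like the one used in this experiment can be sketched as below. This is an assumed implementation of the general idea (replace a fraction of alphabetic characters with random letters); the study's exact noise procedure is not specified here.

```python
import random
import string

def inject_spelling_errors(text: str, error_rate: float, seed: int = 0) -> str:
    """Replace roughly `error_rate` of alphabetic characters with random
    lowercase letters, mimicking a character-error perturbation
    (the study tested rates up to 50%)."""
    rng = random.Random(seed)
    chars = []
    for ch in text:
        if ch.isalpha() and rng.random() < error_rate:
            chars.append(rng.choice(string.ascii_lowercase))
        else:
            chars.append(ch)  # keep punctuation, digits, and whitespace
    return "".join(chars)
```

Running the scorer on `inject_spelling_errors(response, 0.5)` versus the clean response gives the stability comparison reported in the experiment.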

Experiment 3: Impact of Off-topic Responses

  • Highly sensitive to off-topic content, which significantly lowers scores (something traditional systems struggle to detect).

Section 05

Research Significance and Implications

  1. New finding on text length manipulation: The penalty for repeated content in LLM systems is an "anti-cheating" feature, possibly stemming from sensitivity to semantic redundancy;
  2. Construct-relevant design: Through prompt engineering and feature extraction, LLMs can focus on specific ability dimensions and ignore irrelevant factors such as language proficiency;
  3. Off-topic detection ability: LLM systems are better at evaluating content relevance than surface features.

Section 06

Limitations and Future Directions

Limitations: the results are based on one specific dual-architecture system and may not generalize to other LLM scoring architectures; the study focused on low-stakes formative assessment, so application to high-stakes exams still needs verification; and it did not cover all adversarial attacks (e.g., complex prompt injection).

Future directions: explore the robustness of additional LLM architectures; verify applicability in high-stakes scenarios; and study defenses against complex adversarial attacks.