Zing Forum

Predicting Downstream Performance of Large Language Models Using Proxy Metrics

This paper proposes a method to construct proxy metrics based on token-level statistics (entropy, top-k accuracy, expert token ranking), which consistently outperforms baseline methods based on loss and computation across three scenarios: model selection, data selection, and training-phase prediction.

Tags: proxy metrics, performance prediction, model selection, data selection, LLM training, token statistics
Published 2026-05-19 00:17 | Recent activity 2026-05-19 11:30 | Estimated read 5 min

Section 01

[Main Floor / Introduction] A New Breakthrough in Predicting LLM Downstream Performance Using Proxy Metrics

This paper proposes a method to construct proxy metrics based on token-level statistics (entropy, top-k accuracy, expert token ranking) derived from expert-written solutions. It consistently outperforms traditional baseline methods based on loss and computation across three scenarios: model selection, data selection, and training-phase prediction, providing a low-cost and efficient means of performance prediction for key decisions in LLM development.

Section 02

Background: Limitations of Traditional Performance Prediction Methods

The two traditional prediction signals each have a core weakness:

  1. Cross-entropy loss: A globally averaged metric that correlates only weakly with downstream capability (Spearman's rho of just 0.36) and cannot distinguish the tokens that are critical to the task;
  2. Direct downstream evaluation: Computationally expensive, sparse (run at only a few training checkpoints), and unable to separate models in the early stages of training.

Section 03

Method: Core Idea and Statistics of Proxy Metrics

The core idea is to use expert-written solutions as evaluation samples, extract token-level statistics from the model's next-token distribution, and aggregate them:

  1. Entropy: Measures the model's uncertainty in predicting the next token; lower values indicate higher confidence;
  2. Top-k accuracy: Whether the token chosen by the expert appears in the model's top-k predictions;
  3. Expert token ranking: The rank of the expert's token when the vocabulary is sorted by the model's predicted probability (rank 1 is the model's top choice).
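
As a rough illustration of the three statistics above, the following sketch computes them for a single expert solution from per-position logits. It is not the paper's implementation: the function names (`token_stats`, `aggregate`), the default `k=5`, the tie handling in the rank, and the use of a plain mean for aggregation are all assumptions made for this example. The source only says that the statistics are extracted and aggregated.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one position's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_stats(logits, expert_token, k=5):
    """Statistics for one position; expert_token is the expert's token id."""
    probs = softmax(logits)
    # Shannon entropy (in nats) of the next-token distribution.
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    # Rank 1 means no other token has a higher logit than the expert token.
    # Ties are resolved in the expert token's favor (an assumption).
    rank = 1 + sum(1 for x in logits if x > logits[expert_token])
    return {"entropy": entropy, "rank": rank, "top_k_hit": rank <= k}

def aggregate(per_position_logits, expert_tokens, k=5):
    """Average the per-token statistics over one expert solution.

    Mean aggregation is an assumption; other aggregations are possible.
    """
    stats = [token_stats(logits, tok, k)
             for logits, tok in zip(per_position_logits, expert_tokens)]
    n = len(stats)
    return {
        "mean_entropy": sum(s["entropy"] for s in stats) / n,
        "top_k_accuracy": sum(s["top_k_hit"] for s in stats) / n,
        "mean_rank": sum(s["rank"] for s in stats) / n,
    }
```

In practice the logits would come from a single forward pass of the model over the expert solution with teacher forcing, so no text generation is needed.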

Section 04

Evidence: Evaluation Results Across Three Scenarios

The proxy metrics perform strongly across three key scenarios:

  1. Cross-family model selection: Spearman's rho reaches 0.81, far exceeding the 0.36 achieved by cross-entropy loss;
  2. Pre-training data selection: At roughly 1/10,000 the cost of direct evaluation, the proxy metrics reliably rank candidate corpora and expand the Pareto frontier;
  3. Training-phase prediction: Downstream accuracy is extrapolated across an 18x span of training compute with half the error of existing methods.
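
The model selection result is reported as a Spearman rank correlation between a predictor and measured downstream performance. For readers unfamiliar with the measure, here is a minimal pure-Python sketch of it. The helper names are hypothetical, and any numbers passed in would be the reader's own; nothing here reproduces the paper's data.

```python
import math

def average_ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole group of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def spearman_rho(proxy_scores, downstream_scores):
    """Spearman's rho is the Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(proxy_scores), average_ranks(downstream_scores))
```

A rho near 1 means the predictor orders the candidate models almost exactly as the downstream benchmark does. For a lower-is-better statistic such as entropy, negate the values first so that a good predictor still yields a positive rho.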

Section 05

Analysis: Reasons for the Effectiveness of Proxy Metrics

The method's effectiveness stems from three factors:

  1. Expert trajectories provide high-quality signals that are closely aligned with downstream task objectives;
  2. Token-level statistics capture fine-grained information, separating the tokens a model predicts well from those it does not;
  3. High computational efficiency: Only a forward pass over a small number of expert samples is needed, so the metrics can be computed frequently to provide continuous feedback.

Section 06

Technical Details: Key Implementation Points

Implementation considerations:

  1. Expert trajectory selection: High-quality and diverse expert solutions are required;
  2. Statistic aggregation: Combining multiple statistics can improve prediction performance;
  3. Sample efficiency: A small number of samples (a few dozen) is sufficient to provide reliable predictions.
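
One simple way to combine several statistics into a single score is sketched below. The source does not say how the paper combines them, so everything here is an assumption for illustration: the `composite_scores` name, the z-score normalization across candidate models, the equal weighting, and the sign convention that treats lower entropy and lower rank as better.

```python
from statistics import mean, pstdev

# Assumed sign convention: +1 if higher is better, -1 if lower is better.
DIRECTION = {"mean_entropy": -1.0, "top_k_accuracy": 1.0, "mean_rank": -1.0}

def composite_scores(stats_by_model):
    """Map each model name to the equal-weight mean of its signed z-scores.

    stats_by_model: {model_name: {statistic_name: value}} with the keys
    listed in DIRECTION. Higher composite score means a better predicted model.
    """
    names = list(stats_by_model)
    totals = {name: 0.0 for name in names}
    for stat, sign in DIRECTION.items():
        values = [stats_by_model[name][stat] for name in names]
        mu, sigma = mean(values), pstdev(values)
        for name, value in zip(names, values):
            # A statistic that is identical across models carries no signal.
            z = 0.0 if sigma == 0 else (value - mu) / sigma
            totals[name] += sign * z / len(DIRECTION)
    return totals
```

Sorting the candidates by the returned score gives a predicted ranking. A learned weighting, for example a regression fitted on models with known downstream results, would be a natural alternative to equal weights.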

Section 07

Outlook: Limitations and Future Directions

Open limitations and directions for future work:

  1. Automated generation of expert trajectories (replacing manual writing);
  2. Exploring more token-level statistics;
  3. Cross-domain generalization (e.g., code generation, multimodal understanding).

Section 08

Conclusion: Method Value and Application Significance

Proxy metrics offer a new perspective on LLM development and perform strongly in all three key scenarios: model selection, data selection, and training-phase prediction. Because they are cheap to compute, they are a promising addition to the development toolbox, helping developers iterate on models faster and reduce development costs.