# Predicting Downstream Performance of Large Language Models Using Proxy Metrics

> This paper proposes proxy metrics built from token-level statistics (entropy, top-k accuracy, expert token ranking). The metrics consistently outperform loss-based and compute-based baselines in three scenarios: model selection, data selection, and training-phase prediction.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-18T16:17:15.000Z
- Last activity: 2026-05-19T03:30:47.392Z
- Popularity: 144.8
- Keywords: proxy metrics, performance prediction, model selection, data selection, LLM training, token statistics
- Page link: https://www.zingnex.cn/en/forum/thread/llm-arxiv-2605-18607v1
- Canonical: https://www.zingnex.cn/forum/thread/llm-arxiv-2605-18607v1
- Markdown source: floors_fallback

---

## Introduction: A New Breakthrough in Predicting LLM Downstream Performance Using Proxy Metrics

This paper proposes a method for constructing proxy metrics from token-level statistics (entropy, top-k accuracy, expert token ranking) computed on expert-written solutions. The method consistently outperforms traditional loss-based and compute-based baselines in three scenarios: model selection, data selection, and training-phase prediction. It offers a low-cost way to predict performance when making key decisions in LLM development.

## Background: Limitations of Traditional Performance Prediction Methods

Traditional prediction signals have two core problems:
1. Cross-entropy loss: A globally averaged metric that correlates only weakly with downstream capability (Spearman's ρ of just 0.36) and cannot single out the tokens critical to the task;
2. Direct downstream evaluation: Computationally expensive, sparse (run at only a few training checkpoints), and unable to separate models in the early stages of training.

## Method: Core Idea and Statistics of Proxy Metrics

The core idea is to use expert-written solutions as evaluation samples, extract token-level statistics from the model's next-token distribution, and aggregate them:
1. Entropy: Measures the model's uncertainty in predicting the next token; lower values indicate higher confidence;
2. Top-k accuracy: Whether the token chosen by the expert appears in the model's top-k predictions;
3. Expert token ranking: The position of the expert's token in the model's prediction ranking.
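The three statistics can be sketched for a single token position as follows. This is a minimal illustration, not the paper's implementation: the function names `token_stats` and `aggregate`, the plain-list `logits` input, the default `k=5`, and the use of a simple mean as the aggregation are all assumptions made for the example.

```python
import math


def token_stats(logits, expert_token, k=5):
    """Return (entropy, top-k hit, rank) for one token position.

    logits: raw model scores over the vocabulary (hypothetical input format).
    expert_token: vocabulary index of the token the expert actually wrote.
    """
    # Numerically stable softmax over the vocabulary.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]

    # Shannon entropy (in nats) of the next-token distribution.
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)

    # Rank of the expert token: 1 means it is the model's top prediction.
    # Ties are resolved in the expert token's favour (an assumption).
    rank = 1 + sum(1 for x in logits if x > logits[expert_token])

    return entropy, rank <= k, rank


def aggregate(per_token_logits, expert_tokens, k=5):
    """Average the per-token statistics over one expert solution."""
    stats = [token_stats(l, t, k) for l, t in zip(per_token_logits, expert_tokens)]
    n = len(stats)
    return {
        "mean_entropy": sum(s[0] for s in stats) / n,
        "top_k_accuracy": sum(1 for s in stats if s[1]) / n,
        "mean_rank": sum(s[2] for s in stats) / n,
    }
```

In practice the logits would come from a forward pass of the model over the expert solution with teacher forcing, so each position is scored against the token the expert wrote next.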

## Evidence: Evaluation Results Across Three Scenarios

The proxy metrics show strong results in three key scenarios:
1. Cross-family model selection: Spearman's ρ reaches 0.81, far above the 0.36 achieved by cross-entropy;
2. Pre-training data selection: At roughly 1/10000 the cost of direct evaluation, the metrics reliably rank candidate corpora and expand the Pareto frontier;
3. Training-phase prediction: The metrics extrapolate downstream accuracy across an 18x span of training compute with half the error of existing methods.
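For readers unfamiliar with the Spearman rank correlation used in the model-selection comparison, the sketch below shows how it can be computed between a proxy score and measured downstream accuracy. The helper names `average_ranks` and `spearman_rho` and the numbers in the usage note are illustrative assumptions, not data from the paper.

```python
import math


def average_ranks(values):
    """Assign 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)
```

With made-up scores for four models, `spearman_rho([0.2, 0.5, 0.4, 0.9], [0.30, 0.45, 0.50, 0.80])` gives about 0.8, because the proxy swaps the order of only the two middle models. A value near 1 means the proxy ranks models almost exactly as the downstream benchmark does.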

## Analysis: Reasons for the Effectiveness of Proxy Metrics

The effectiveness stems from three points:
1. Expert trajectories provide high-quality signals that are highly aligned with downstream task objectives;
2. Token-level statistics capture fine-grained information, distinguishing performance differences between different tokens;
3. High computational efficiency: Only forward passes over a small number of expert samples are needed, so the metrics can be computed frequently to provide continuous feedback.

## Technical Details: Key Implementation Points

Implementation considerations:
1. Expert trajectory selection: High-quality and diverse expert solutions are required;
2. Statistic aggregation: Combining multiple statistics can improve prediction performance;
3. Sample efficiency: A small number of samples (a few dozen) is sufficient for reliable predictions.
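The summary does not say how the statistics are combined, so the following is only one plausible sketch: standardize each statistic across the candidate models and average the results. The function names, the equal weighting, and the sign convention (lower entropy and lower expert-token rank treated as better) are assumptions for illustration, not the paper's procedure.

```python
import statistics


def zscores(values):
    """Standardize a list of values; returns zeros if all values are equal."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd if sd > 0 else 0.0 for v in values]


def combined_score(mean_entropy, top_k_accuracy, mean_rank):
    """Combine three per-model statistics into one proxy score per model.

    Each argument is a list with one aggregated value per candidate model.
    """
    z_ent = zscores(mean_entropy)
    z_acc = zscores(top_k_accuracy)
    z_rank = zscores(mean_rank)
    # Assumed sign convention: entropy and rank are negated so that
    # a higher combined score always means "predicted to be better".
    return [(-e + a - r) / 3 for e, a, r in zip(z_ent, z_acc, z_rank)]
```

A learned combination, such as a regression fitted on models with known downstream accuracy, is another option; equal weights are used here only to keep the example short.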

## Outlook: Limitations and Future Directions

Current limitations point to several directions for future work:
1. Automated generation of expert trajectories (replacing manual writing);
2. Exploring more token-level statistics;
3. Cross-domain generalization (e.g., code generation, multimodal understanding).

## Conclusion: Method Value and Application Significance

Proxy metrics offer a new perspective for LLM development and perform strongly in the three scenarios above: model selection, data selection, and training-phase prediction. They are a promising addition to the development toolbox, helping developers iterate on models faster and at lower cost.
