Zing Forum

CR²: Cost-Aware and Risk-Controllable LLM Inference Routing for Mobile Edge Scenarios

CR² is a two-stage device-edge routing framework for wireless edge deployments. It combines a device-side edge gate with conformal risk control calibration to trade off latency, energy consumption, and accuracy, and it reduces normalized deployment cost by 16.9% relative to the best baseline at the same accuracy level.

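Tags: Large language models, Edge computing, Model routing, Cost optimization, Mobile AI, Inference optimization, Conformal risk control, On-device AI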
Published 2026-05-12 19:50 | Recent activity 2026-05-13 11:24 | Estimated read 6 min

Section 01

CR² Framework Overview: A Cost-Risk Balancing Solution for Mobile Edge LLM Inference

CR² is a cost-aware, risk-controllable LLM inference routing framework for mobile edge scenarios. It adopts a two-stage device-edge architecture (an edge gate on the device side plus a utility selector on the edge side) and adds a conformal risk control calibration step, so that latency, energy consumption, and accuracy can be traded off flexibly. At the same accuracy level, it reduces normalized deployment cost by 16.9% compared to the best baseline.

Section 02

Practical Challenges of LLM Inference in Mobile Edge Scenarios

Large language models (LLMs) are moving from cloud data centers to the mobile edge, where resource constraints pose three distinct challenges. First, edge devices have limited compute and memory and cannot run large models directly. Second, every routing decision must weigh the quality of local processing against the latency and energy cost of calling the edge. Third, most existing routing solutions were designed for centralized cloud environments and ignore the dynamic characteristics of wireless edges, so they perform poorly in actual deployments.

Section 03

Core Two-Stage Architecture Design of CR²

CR² uses a two-stage device-edge routing architecture. The first stage is a lightweight edge gate on the device side, which takes the user's cost weights into account and predicts the best utility achievable by local execution. The second stage is an edge-side utility selector, which evaluates the benefit of routing to a stronger model and makes the final decision. With this design, most simple queries are handled quickly on the device, which avoids unnecessary network overhead.
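
The control flow of the two stages can be sketched as follows. This is a minimal illustration, not the CR² implementation: the names `CostWeights`, `route`, `toy_gate`, and `toy_edge_utilities`, the model names, and all numeric values are hypothetical, and the utility form (predicted quality minus weighted latency and energy cost) is an assumption made only for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class CostWeights:
    """Hypothetical per-user weights on latency and energy cost."""
    latency: float
    energy: float


def route(
    query: str,
    weights: CostWeights,
    gate_score: Callable[[str, CostWeights], float],
    edge_utilities: Callable[[str, CostWeights], Dict[str, float]],
    gate_threshold: float,
) -> str:
    """Return the name of the model chosen to answer `query`."""
    # Stage 1 (device side): accept local execution when the gate's
    # predicted utility clears the threshold. No network call is made.
    if gate_score(query, weights) >= gate_threshold:
        return "device"
    # Stage 2 (edge side): the utility selector compares the candidate
    # models and picks the one with the highest predicted utility.
    utilities = edge_utilities(query, weights)
    return max(utilities, key=utilities.get)


def toy_gate(query: str, weights: CostWeights) -> float:
    # Toy stand-in: shorter queries are assumed easier to answer locally.
    return 1.0 / (1.0 + len(query.split()) / 10.0)


def toy_edge_utilities(query: str, weights: CostWeights) -> Dict[str, float]:
    # Toy numbers: the larger model is assumed more accurate but costlier.
    cost_small = 0.1 * weights.latency + 0.1 * weights.energy
    cost_large = 0.4 * weights.latency + 0.3 * weights.energy
    return {"edge-small": 0.7 - cost_small, "edge-large": 0.9 - cost_large}


long_query = " ".join(["word"] * 30)
cost_sensitive = CostWeights(latency=1.0, energy=1.0)
quality_first = CostWeights(latency=0.1, energy=0.1)

print(route("hi there", cost_sensitive, toy_gate, toy_edge_utilities, 0.6))
print(route(long_query, cost_sensitive, toy_gate, toy_edge_utilities, 0.6))
print(route(long_query, quality_first, toy_gate, toy_edge_utilities, 0.6))
```

In this toy setup the short query stays on the device, while the long query goes to the edge, where the two users' weights lead to different model choices. In a real system the gate and selector would be learned predictors rather than hand-written heuristics.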

Section 04

Conformal Risk Control: CR²'s Risk Assurance Mechanism

CR² makes risk control explicit through a Conformal Risk Control (CRC) calibration mechanism. Before deployment, it uses validation data to select a threshold that meets a target risk level, so that the false acceptance risk (the device side wrongly accepting a low-quality local output) stays within that preset level. Users can adjust the risk level to suit the scenario, for example conservative for medical use and more lenient for real-time dialogue.
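
As a rough sketch of how such a calibration step might look, the function below picks the smallest gate threshold whose finite-sample-corrected false acceptance rate on validation data is at most the target level `alpha`. It follows the standard conformal risk control recipe for a loss bounded by 1 that does not increase as the threshold rises. The exact loss and threshold search used by CR² are not given here, so the loss definition, the function name `calibrate_gate_threshold`, and the threshold grid are assumptions.

```python
from typing import Optional, Sequence


def calibrate_gate_threshold(
    scores: Sequence[float],
    local_correct: Sequence[bool],
    alpha: float,
    grid: Optional[Sequence[float]] = None,
) -> float:
    """Smallest threshold whose corrected false-acceptance risk is <= alpha.

    scores        : gate scores on held-out validation queries
                    (higher = more confident that local execution suffices)
    local_correct : True where the device-side output was acceptable
    alpha         : target false acceptance risk level, in (0, 1)
    grid          : candidate thresholds (assumed default: 0.00 .. 1.00)
    """
    n = len(scores)
    if grid is None:
        grid = [i / 100 for i in range(101)]
    for lam in sorted(grid):
        # Assumed loss: 1 when the gate accepts (score >= lam) a query whose
        # local output was not acceptable. Raising lam can only lower it.
        false_accepts = sum(
            1 for s, ok in zip(scores, local_correct) if s >= lam and not ok
        )
        # Finite-sample correction: (n * empirical_risk + 1) / (n + 1).
        if (false_accepts + 1) / (n + 1) <= alpha:
            return lam
    # No candidate meets the target: never accept locally.
    return float("inf")


val_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
val_correct = [False, False, False, True, False, True, True, True, True, True]
candidates = [0.0, 0.25, 0.45, 0.65]

threshold = calibrate_gate_threshold(val_scores, val_correct, 0.2, candidates)
print(threshold)
```

A smaller `alpha` pushes the threshold up, so the device accepts fewer queries locally. The guarantee of this recipe holds in expectation and only when validation and deployment queries are exchangeable.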

Section 05

CR² Experimental Performance: Empirical Results of Cost Optimization and Risk Control

In real edge deployment scenarios, CR² dominates the accuracy-cost Pareto frontier: at the same accuracy level, its normalized deployment cost is 16.9% lower than that of the best baseline. The edge gate accurately predicts from device-side signals whether local execution is sufficient. The actual false acceptance rate after CRC calibration closely matches the target value, which supports the effectiveness of the risk control.

Section 06

Practical Deployment Considerations and Flexibility of CR²

CR² is designed with practical deployment in mind. The edge gate is lightweight and can run on a variety of edge devices. CRC calibration needs to be completed only once before deployment, which simplifies operation and maintenance. Cost weights can be set per user, so different users can express different latency-quality preferences. When CR² is combined with speculative decoding, the small device-side model can serve as both the gate and the draft model, which reduces computational overhead.

Section 07

Limitations of CR² and Future Research Directions

CR² currently has three limitations: it relies on validation data and deployment data following the same distribution; it assumes a clear capability hierarchy between the device-side and edge-side models; and estimating dynamic network conditions remains challenging. Future research can explore online adaptive calibration, support for more complex capability structures, and routing strategies that incorporate network prediction models.