Zing Forum


ThreadWeaver: Enabling Parallel Reasoning for Large Language Models Like Weaving

The ThreadWeaver framework, jointly launched by Meta and UC Berkeley, delivers an average 1.53x inference speedup without sacrificing accuracy through adaptive parallel reasoning, opening a new path for optimizing large-model inference efficiency.

Tags: ThreadWeaver · Parallel Reasoning · Large Language Models · Meta · UC Berkeley · Inference Optimization · Reinforcement Learning · P-GRPO · LLM Inference Acceleration
Published 2026-04-08 12:39 · Recent activity 2026-04-08 12:51 · Estimated read 5 min

Section 01

[Main Floor] ThreadWeaver: A New Parallel Reasoning Framework for Large Models, 1.53x Speedup with No Loss of Accuracy

Meta and UC Berkeley have jointly released the ThreadWeaver framework, which cuts inference latency by an average factor of 1.53 without sacrificing accuracy through adaptive parallel reasoning, opening a new path for optimizing large-model inference efficiency. At its core, the framework decomposes serial reasoning into parallel 'threads' that explore different solution paths for the same problem simultaneously, then merges their results.
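The decompose-explore-merge idea can be sketched in a few lines. This is a minimal toy, not the paper's implementation: `solve_path`, `weave`, and the majority-vote merge are all illustrative assumptions standing in for real model decoding.

```python
from concurrent.futures import ThreadPoolExecutor

def solve_path(problem, strategy):
    # Placeholder for one independent reasoning "thread"; a real
    # system would run model decoding here. Both toy paths happen
    # to reach the same answer.
    return "4"

def weave(problem, strategies):
    # Explore several solution paths concurrently, then merge by
    # majority vote over the returned answers.
    with ThreadPoolExecutor() as pool:
        answers = list(pool.map(lambda s: solve_path(problem, s), strategies))
    return max(set(answers), key=answers.count)

print(weave("2 + 2 = ?", ["algebraic", "numeric"]))  # 4
```

The key property the framework relies on is that each path can run without seeing the others' intermediate tokens, which is what makes the concurrent map valid.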


Section 02

[Background] Inference Latency Becomes a Bottleneck for Large Model Applications, Serial Decoding is the Key Limitation

As large language models grow more capable, their inference cost rises with them. Mainstream autoregressive generation, which decodes tokens one at a time, makes latency proportional to output length, so complex tasks can take tens of seconds. Traditional optimizations (quantization, pruning, speculative decoding) do not address the structural bottleneck of serial decoding, which makes breaking through it a research focus.
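A back-of-the-envelope model shows why parallelism attacks this bottleneck where per-token tricks cannot. The per-token time of 0.02 s is an assumed figure for illustration only:

```python
def serial_latency(n_tokens, t_per_token=0.02):
    # Autoregressive decoding: each token waits for the previous one,
    # so total latency scales linearly with output length.
    return n_tokens * t_per_token

def parallel_latency(thread_lengths, t_per_token=0.02):
    # Idealized parallel decoding: latency is set by the longest thread,
    # ignoring scheduling and merge overhead.
    return max(thread_lengths) * t_per_token

# 1500 tokens decoded serially vs. the same work split into three threads
print(serial_latency(1500))               # 30.0 (seconds)
print(parallel_latency([600, 500, 400]))  # 12.0 (seconds)
```

Quantization and speculative decoding shrink `t_per_token`; only restructuring the computation shrinks the effective token count on the critical path.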


Section 03

[Method] ThreadWeaver Core: Adaptive Parallel Reasoning and Complete Technical Architecture

The core idea of ThreadWeaver is to decompose serial reasoning into parallel 'threads'. The technical architecture includes three components: 1. a parallel trajectory format (dedicated tags organize the reasoning structure and ensure thread independence); 2. a five-stage inference state machine (compatible with existing optimization techniques such as prefix caching); 3. trie-based training with P-GRPO reinforcement learning (avoiding information leakage while stably optimizing both accuracy and speed).
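To make the trajectory format concrete, here is a sketch of what tagged parallel reasoning might look like and how threads could be extracted for independent decoding. The `<parallel>`/`<thread>`/`<merge>` tag names are illustrative assumptions, not the paper's exact markup:

```python
import re

# Hypothetical trajectory markup; tag names are invented for illustration.
trajectory = (
    "<parallel>"
    "<thread>Try factoring: x^2 - 1 = (x - 1)(x + 1)</thread>"
    "<thread>Try substitution: x = 1 gives 0, x = -1 gives 0</thread>"
    "</parallel>"
    "<merge>Both paths agree the roots are ±1.</merge>"
)

def extract_threads(text):
    # Each <thread> block must be self-contained so threads can be
    # decoded independently (no references to sibling threads).
    return re.findall(r"<thread>(.*?)</thread>", text, re.S)

for t in extract_threads(trajectory):
    print(t)
```

The structural requirement that no thread reads another's content is also what the trie-based training described above protects: branches of the trie share only the common prefix, so no cross-thread information leaks in.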


Section 04

[Evidence] Performance Verification: Consistent Accuracy, Average 1.53x Speedup

Across six mathematical-reasoning benchmarks, ThreadWeaver (on Qwen3-8B) matches the serial baseline's accuracy (e.g., 79.9% on AIME24 vs. the baseline's 78.3%). Latency drops by an average factor of 1.53, with a peak speedup of 1.92x on OlympiadBench. In a real-world test (4 GPUs, 50 MATH500 questions), the serial run took 162.34 seconds versus 142.21 seconds in parallel, a 1.14x speedup.
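The real-world figure checks out arithmetically, using the standard definition of speedup as serial time over parallel time:

```python
# Speedup = serial wall-clock time / parallel wall-clock time,
# using the times reported for the 4-GPU, 50-question MATH500 run.
serial_s, parallel_s = 162.34, 142.21
speedup = serial_s / parallel_s
print(f"{speedup:.2f}x")  # 1.14x
```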


Section 05

[Data Generation] Scaling Strategy from 1k to 17k Samples

High-quality parallel data is key. Generation proceeds in two stages: 1. rewrite existing serial reasoning chains with a strong model to eliminate cross-thread dependencies; 2. expand via self-training with format and answer filtering. The training set grows from 1k to 17k samples, format correctness rises from 56.4% to 77.0%, and accuracy reaches 79.9% after RL training.
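The two filters in the self-training stage can be sketched as simple predicates. This is a toy under stated assumptions: the `<thread>` tag name and the `\boxed{}` answer convention are illustrative, not confirmed details of the pipeline.

```python
import re

def format_ok(sample):
    # Keep only samples whose parallel markup is well-formed:
    # every thread opened is also closed, and at least one exists.
    return sample.count("<thread>") == sample.count("</thread>") > 0

def answer_ok(sample, gold):
    # Keep only samples whose final boxed answer matches the reference.
    m = re.search(r"\\boxed\{(.+?)\}", sample)
    return bool(m) and m.group(1) == gold

samples = [
    ("<thread>1+1=2</thread> so \\boxed{2}", "2"),
    ("<thread>unclosed thread \\boxed{2}", "2"),    # dropped: bad format
    ("<thread>1+1=3</thread> so \\boxed{3}", "2"),  # dropped: wrong answer
]
kept = [s for s, gold in samples if format_ok(s) and answer_ok(s, gold)]
print(len(kept))  # 1
```

Filtering on both axes is what lets the set scale from 1k to 17k without the format-correctness rate collapsing: model-generated expansions that break the markup or reach a wrong answer never enter the next training round.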


Section 06

[Limitations and Outlook] Current Limitations and Future Application Expansion Directions

Limitations: the framework may perform redundant computation (e.g., repeated verification), and it currently targets mathematical reasoning, so migrating to scenarios such as code generation and long-form text creation remains open. Future directions include combining it with chain-of-thought and tool use to increase its practical value.


Section 07

[Conclusion] ThreadWeaver Opens a New Direction for Large Model Inference Architecture

ThreadWeaver marks an important shift in LLM inference from sequential generation to structured parallel exploration. Its open-source implementation gives the community a technical foundation, and more innovation is likely to follow.