# Doc2Atom: A Compositional Parametric Memory Framework Revolutionizing Long-Document Reasoning

> This paper proposes Doc2Atom, which decomposes documents into semantically typed knowledge atoms and compiles them into independent micro-LoRA adapters to enable query-specific dynamic composition. It outperforms the Doc-to-LoRA baseline on six QA benchmarks while reducing the memory cost of document internalization.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-10T17:58:20.000Z
- Last activity: 2026-06-11T03:30:07.825Z
- Popularity: 152.5
- Keywords: context distillation, LoRA, long-document processing, knowledge atoms, parametric memory, document QA, compositional reasoning, memory optimization, LLM efficiency
- Page link: https://www.zingnex.cn/en/forum/thread/doc2atom
- Canonical: https://www.zingnex.cn/forum/thread/doc2atom
- Markdown source: floors_fallback

---

## Introduction: Core Breakthroughs of Doc2Atom in Revolutionizing Long-Document Reasoning

### Original Authors and Source
- **Original Authors/Maintainers**: Paper author team (standard arXiv authorship)
- **Source Platform**: arXiv
- **Original Title**: Doc-to-Atom: Learning to Compile and Compose Memory Atoms
- **Original Link**: http://arxiv.org/abs/2606.12400v1
- **Publication Time**: 2026-06-10

### Core Insights
This paper proposes **Doc2Atom**, a compositional parametric memory framework. It decomposes documents into semantically typed knowledge atoms and compiles each atom into an independent micro-LoRA adapter, so that adapters can be composed dynamically for each query. According to the paper, the framework outperforms the Doc-to-LoRA baseline on six QA benchmarks while significantly reducing the memory cost of document internalization.

## Background: Challenges in Long-Document Processing and Limitations of Existing Methods

### Computational Dilemma of Long-Document Processing
Large Language Models (LLMs) face a quadratic complexity bottleneck in their attention mechanism when processing long documents; as input sequences grow, computational and memory costs increase sharply.
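
As a rough illustration of the quadratic growth (a back-of-the-envelope sketch, not a measurement from the paper): standard self-attention computes one score for every pair of tokens, so doubling the input length quadruples the number of scores per head. The function name below is hypothetical.

```python
def attention_score_entries(seq_len: int, num_heads: int = 1) -> int:
    """Count the entries in the full attention score matrices of one layer.

    Each head holds a seq_len x seq_len matrix of pairwise token scores.
    """
    return num_heads * seq_len * seq_len


# Doubling the sequence length multiplies the entry count by four.
for n in (1_000, 2_000, 4_000):
    print(f"{n} tokens: {attention_score_entries(n)} score entries per head")
```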

### Rise of Context Distillation
To address this issue, "context distillation" methods compress document information into model parameters, which avoids processing long sequences at inference time. The core idea is to internalize each document into parameters ahead of time and to load only that compressed representation during inference.

### Limitations of Doc-to-LoRA
Doc-to-LoRA generates a document-specific LoRA adapter in a single forward pass, but it has three major issues:
1. **Irrelevant query interference**: A single adapter mixes information from multiple topics, leading to scattered answers or hallucinations;
2. **Limited compositional recall**: It is difficult to combine information from several parts of a document to answer complex queries;
3. **Poor scalability for long documents**: As documents grow, the volume of information exceeds the capacity of a single adapter.

## Doc2Atom Framework: Knowledge Atomization and Dynamic Composition Design

### Core Idea: Knowledge Atomization
Doc2Atom decomposes documents into **knowledge atoms**: semantically typed sub-units, each containing a coherent concept and a semantic label. Each atom can be compiled into parameters independently and combined with other atoms dynamically.
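
To make the idea concrete, a knowledge atom can be pictured as a small typed record. The sketch below is only an illustration: the class name, field names, and the example type labels (`"definition"`, `"result"`) are assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnowledgeAtom:
    """Hypothetical record for one atom; the field names are assumptions."""
    atom_id: str
    atom_type: str  # semantic label; the label set here is invented
    text: str       # the coherent source span the atom covers


atoms = [
    KnowledgeAtom("a1", "definition", "LoRA adds low-rank updates to frozen weights."),
    KnowledgeAtom("a2", "result", "The method is evaluated on six QA benchmarks."),
]


def atoms_of_type(items, atom_type):
    """Return the atoms carrying a given semantic label."""
    return [a for a in items if a.atom_type == atom_type]


print([a.atom_id for a in atoms_of_type(atoms, "result")])
```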

### System Architecture
1. **Document Decomposer**: Segments documents into atoms based on semantics, annotates types, and optimizes boundaries;
2. **Atom Compiler**: Compiles each atom into a lightweight micro-LoRA adapter, associated with a source retrieval key;
3. **Query Router**: Analyzes queries, selects relevant atoms, and assembles a composite adapter to inject into the base model.
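
The routing and composition steps above can be sketched in a few lines. This is a toy under stated assumptions, not the paper's implementation: the router here ranks adapters by word overlap with a plain-text retrieval key (the actual router is trained), and the composite adapter is formed by summing the low-rank updates `B @ A`, which is one common way to merge LoRA adapters. All names (`MicroLoRA`, `route`, `compose`) are hypothetical.

```python
from dataclasses import dataclass

Matrix = list[list[float]]


@dataclass
class MicroLoRA:
    key: str   # retrieval key; plain text here, which is an assumption
    B: Matrix  # d_out x r
    A: Matrix  # r x d_in


def matmul(X: Matrix, Y: Matrix) -> Matrix:
    """Plain-Python matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def route(query: str, adapters: list[MicroLoRA], top_k: int = 2) -> list[MicroLoRA]:
    """Toy router: keep the top_k adapters whose key shares words with the query."""
    words = set(query.lower().split())
    scored = [(len(words & set(a.key.lower().split())), a) for a in adapters]
    scored = [(s, a) for s, a in scored if s > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable sort keeps ties in order
    return [a for _, a in scored[:top_k]]


def compose(selected: list[MicroLoRA], d_out: int, d_in: int) -> Matrix:
    """Composite weight update: element-wise sum of each adapter's B @ A."""
    delta = [[0.0] * d_in for _ in range(d_out)]
    for adapter in selected:
        update = matmul(adapter.B, adapter.A)
        for i in range(d_out):
            for j in range(d_in):
                delta[i][j] += update[i][j]
    return delta


adapters = [
    MicroLoRA("pricing plans", B=[[1.0], [0.0]], A=[[1.0, 0.0]]),
    MicroLoRA("refund policy", B=[[0.0], [1.0]], A=[[0.0, 2.0]]),
    MicroLoRA("company history", B=[[1.0], [1.0]], A=[[1.0, 1.0]]),
]
selected = route("what is the refund policy for pricing plans", adapters)
delta = compose(selected, d_out=2, d_in=2)
print([a.key for a in selected], delta)
```

In this toy, the "company history" adapter is never loaded for the query, which mirrors the isolation the architecture aims for.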

### End-to-End Training
The framework is trained end to end via multi-objective distillation:
- Atom quality: Ensure each atom accurately encodes the information in its segment;
- Routing accuracy: Train the router to select relevant atoms;
- Compositional ability: Handle multi-atom composition for complex queries;
- Efficiency optimization: Minimize computational costs.

Training data is generated automatically and includes atom-question-answer pairs, complex queries, and negative samples.
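
A multi-objective setup like this is often implemented as a weighted sum of per-objective terms. The sketch below assumes that form; the function name, the argument names, and the default weights are placeholders, since the summary does not state how the paper combines its objectives.

```python
def total_loss(atom_loss, routing_loss, composition_loss, efficiency_cost,
               weights=(1.0, 1.0, 1.0, 0.1)):
    """Weighted sum of the four objectives; the weights are illustrative only."""
    terms = (atom_loss, routing_loss, composition_loss, efficiency_cost)
    return sum(w * t for w, t in zip(weights, terms))


# Example with made-up per-batch values.
print(total_loss(1.0, 2.0, 3.0, 10.0))
```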

## Experimental Validation: Performance and Efficiency Advantages of Doc2Atom

### Benchmark Datasets
Validated on six QA benchmarks: Natural Questions, HotpotQA, MS MARCO, NarrativeQA, QASPER, DocRED.

### Key Results
1. **Performance improvement**: Outperforms Doc-to-LoRA on all benchmarks, with an average increase of over 10% (e.g., HotpotQA +12.7%, NarrativeQA +15.2%);
2. **Memory efficiency**: The parameters needed to store the same information are reduced by 40-60%, and only a few micro-LoRA adapters are loaded during inference, so the advantage grows with document length.
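
To see why loading only a few adapters matters, recall that a LoRA pair for a `d_out x d_in` weight matrix at rank `r` holds `r * (d_in + d_out)` parameters. The sketch below applies that standard formula to made-up sizes; the dimensions, atom counts, and rank are hypothetical and are not the paper's settings or results.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA pair: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


# Hypothetical sizes, chosen only for illustration.
D_MODEL = 1024
NUM_ATOMS = 20       # atoms compiled from one document
ATOMS_PER_QUERY = 3  # atoms the router selects for one query
MICRO_RANK = 1

per_atom = lora_params(D_MODEL, D_MODEL, MICRO_RANK)
stored = NUM_ATOMS * per_atom
loaded = ATOMS_PER_QUERY * per_atom
print(f"per atom: {per_atom}, stored: {stored}, loaded per query: {loaded}")
```

The counts are per adapted weight matrix; a real model adapts several matrices per layer, which scales all three numbers by the same factor.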

### Ablation Studies
- Atomization alone improves performance, indicating that decomposition reduces interference;
- Dynamic routing further enhances performance;
- Semantic type annotation contributes significantly (performance drops by 15% without annotation);
- Micro-LoRA is more efficient than standard LoRA.

## In-depth Analysis: Sources of Doc2Atom's Effectiveness

### Four Key Advantages
1. **Information isolation**: Atoms physically isolate irrelevant information, eliminating interference;
2. **Compositional flexibility**: Dynamic routing combines atoms on demand to handle simple/complex queries;
3. **Parameter efficiency**: Micro-LoRA requires only hundreds of parameters, with total parameters far lower than a single adapter;
4. **Interpretability**: Selected atoms can be viewed to understand the basis for the model's answers.

## Application Scenarios: Diverse Practical Domains of Doc2Atom

### Core Application Scenarios
1. **Enterprise knowledge base QA**: Dynamically combine atoms for products, technologies, customer cases, etc.;
2. **Legal document analysis**: Adapt to structured atoms like contract clauses and precedents;
3. **Academic paper assistant**: Combine atoms for abstracts, methods, experiments, etc., on demand;
4. **Multi-document reasoning**: Unified indexing of cross-document atoms, supporting cross-document information combination.
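
For the multi-document scenario, a unified index could key every atom by its document and atom identifiers so that one lookup spans all documents. The sketch below uses a simple word-level inverted index as a stand-in; the function names, the identifier scheme, and the sample contracts are assumptions for illustration, and the paper's actual indexing method may differ.

```python
from collections import defaultdict


def build_index(docs: dict[str, list[tuple[str, str]]]) -> dict[str, set[tuple[str, str]]]:
    """Map each lowercase word to the (doc_id, atom_id) pairs whose text contains it."""
    index = defaultdict(set)
    for doc_id, atoms in docs.items():
        for atom_id, text in atoms:
            for word in text.lower().split():
                index[word].add((doc_id, atom_id))
    return index


def lookup(index, query: str) -> list[tuple[str, str]]:
    """Return, in sorted order, every atom that shares a word with the query."""
    hits = set()
    for word in query.lower().split():
        hits |= index.get(word, set())
    return sorted(hits)


docs = {
    "contract_a": [("c1", "termination clause notice period")],
    "contract_b": [("c1", "termination fee schedule"), ("c2", "governing law")],
}
index = build_index(docs)
print(lookup(index, "termination"))
```

Because each hit carries its document identifier, the list of selected atoms also serves as a trace of which sources informed an answer.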

## Limitations and Future Research Directions

### Current Limitations
1. **Decomposition quality**: Automatic decomposition may be imprecise;
2. **Type system**: Predefined/learned type systems have limited coverage;
3. **Routing errors**: The router may select wrong atoms;
4. **Training cost**: End-to-end training requires large resources.

### Future Directions
1. **Adaptive decomposition**: Learn optimal decomposition strategies;
2. **Hierarchical atoms**: Support hierarchical structures from chapters → paragraphs → sentences;
3. **Cross-document association**: Identify semantic associations between atoms from different documents;
4. **Incremental updates**: Support partial updates of documents;
5. **Multimodal extension**: Cover multimodal documents like images and tables.

## Conclusion: Implications of Doc2Atom for Long-Document Reasoning

Doc2Atom represents an important advance in context distillation, addressing the core limitations of monolithic adapters through atomization and dynamic composition. Its "LEGO brick" approach to organizing information opens up new possibilities for long-document reasoning. As LLMs are applied to more knowledge-intensive tasks, approaches like Doc2Atom could become key infrastructure for making efficient use of large document collections.
