Bangkong's core innovation consists of five key components:
Cosine-Clustered Embeddings
Traditional word-embedding initialization typically draws vectors from a random distribution. Bangkong instead groups tokens by domain (mathematics, code, reasoning, general) and initializes each group around a prototype vector on the unit sphere. Tokens from the same domain therefore start out closer together in embedding space, and this geometrically structured initialization lets the model learn domain-specific semantic relationships faster.
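The idea can be sketched in a few lines of pure Python. This is a hypothetical illustration, not Bangkong's actual initializer: `clustered_init` and its `noise` parameter are assumptions, and a real implementation would operate on framework tensors.

```python
import math
import random

def _unit(v):
    """Project a vector onto the unit sphere."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def clustered_init(domain_of_token, dim=8, noise=0.1, seed=0):
    """Hypothetical sketch: draw one unit-sphere prototype per domain,
    then initialize every token near its domain's prototype."""
    rng = random.Random(seed)
    domains = sorted(set(domain_of_token.values()))
    protos = {d: _unit([rng.gauss(0, 1) for _ in range(dim)]) for d in domains}
    emb = {}
    for tok, d in domain_of_token.items():
        # Small Gaussian jitter around the prototype, renormalized.
        jitter = [p + noise * rng.gauss(0, 1) for p in protos[d]]
        emb[tok] = _unit(jitter)
    return emb

def cos(a, b):
    return sum(x * y for x, y in zip(a, b))

emb = clustered_init({"add": "math", "mul": "math", "for": "code", "if": "code"})
```

With this initialization, same-domain pairs (e.g. `"add"`/`"mul"`) start with a much higher cosine similarity than cross-domain pairs.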
Attention Head Specialization
Different reasoning modes (causal, sequential, numerical, etc.) call for different attention patterns. Bangkong creates a fixed bias tensor for each attention head and applies it to that head's output via a forward hook. This pre-configured specialization lets the model handle specific reasoning modes early in training.
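A minimal sketch of the hook mechanism, in pure Python rather than PyTorch: `AttentionStub` and `make_head_bias_hook` are hypothetical names standing in for an attention module and its registered hook, and the per-head bias values below are illustrative only.

```python
class AttentionStub:
    """Hypothetical stand-in for a multi-head attention layer; only the
    forward-hook path is modeled here."""
    def __init__(self):
        self._hooks = []

    def register_forward_hook(self, fn):
        self._hooks.append(fn)

    def forward(self, head_outputs):
        # Each registered hook can rewrite the per-head outputs.
        out = head_outputs
        for hook in self._hooks:
            out = hook(self, out)
        return out

def make_head_bias_hook(biases):
    """Fixed per-head bias vectors, added elementwise to each head's output."""
    def hook(module, head_outputs):
        return [
            [x + b for x, b in zip(head, bias)]
            for head, bias in zip(head_outputs, biases)
        ]
    return hook

attn = AttentionStub()
attn.register_forward_hook(make_head_bias_hook([[0.5, 0, 0], [0, -0.5, 0]]))
out = attn.forward([[1.0, 1.0, 1.0], [2.0, 2.0, 2.0]])
# → [[1.5, 1.0, 1.0], [2.0, 1.5, 2.0]]
```

In a PyTorch implementation the same shape would be achieved with `Module.register_forward_hook`, with the biases stored as non-trainable tensors.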
Hierarchical Memory
Bangkong introduces a three-layer differentiable memory system that simulates different time scales of human cognition:
- Scratchpad Memory: 64 slots for immediate context computation and storing short-term working memory
- Context Memory: 128 slots for mid-term information retention at the session/topic level
- Semantic Memory: 256 slots for long-term knowledge storage and retrieval
This hierarchy lets the model distinguish between different types of information and manage each according to its time horizon, significantly improving reasoning and context-management capabilities.
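One plausible way to structure the three tiers is a cascade: when a faster tier fills, its oldest entry demotes to the next, larger tier. The sketch below is a non-differentiable structural illustration only (the actual system is described as differentiable); `TieredMemory` and its cascade policy are assumptions.

```python
from collections import OrderedDict

class TieredMemory:
    """Hypothetical sketch of the scratchpad/context/semantic hierarchy:
    a full tier evicts its oldest entry down to the next tier."""
    def __init__(self, sizes=(64, 128, 256)):
        self.tiers = [OrderedDict() for _ in sizes]
        self.sizes = sizes

    def write(self, key, value):
        self._insert(0, key, value)

    def _insert(self, level, key, value):
        tier = self.tiers[level]
        tier[key] = value
        if len(tier) > self.sizes[level]:
            old_key, old_val = tier.popitem(last=False)  # oldest entry
            if level + 1 < len(self.tiers):
                self._insert(level + 1, old_key, old_val)

    def read(self, key):
        # Search fast tiers first: scratchpad -> context -> semantic.
        for tier in self.tiers:
            if key in tier:
                return tier[key]
        return None

# Tiny capacities to show the cascade: after five writes, recent keys sit
# in the scratchpad while the oldest has demoted to semantic memory.
m = TieredMemory(sizes=(2, 2, 2))
for i in range(5):
    m.write(i, i * 10)
```

A differentiable version would replace the hard eviction with soft attention over slot contents, but the tiered read order would be the same.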
Meta-Learning Priors
Using MAML (Model-Agnostic Meta-Learning) and the Reptile algorithm, the system learns initialization weights that adapt quickly to new tasks. A prior generator maps knowledge-concept embeddings to LoRA adapter weights, so the model can adjust rapidly when it encounters a new task.
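Reptile's core update is simple enough to show on a toy scalar problem: run a few inner-loop gradient steps on a sampled task, then move the shared initialization toward the adapted weights. The task family below (quadratic losses with different targets) and the function names are illustrative assumptions; the LoRA prior generator is omitted.

```python
import random

def inner_sgd(theta, task_target, steps=20, lr=0.1):
    """Inner loop: plain gradient descent on one task's loss
    L(w) = (w - target)^2."""
    w = theta
    for _ in range(steps):
        grad = 2 * (w - task_target)
        w -= lr * grad
    return w

def reptile(tasks, meta_steps=100, eps=0.5, seed=0):
    """Reptile sketch: repeatedly interpolate the initialization
    toward each sampled task's adapted weights."""
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(meta_steps):
        target = rng.choice(tasks)
        phi = inner_sgd(theta, target)
        theta += eps * (phi - theta)  # move toward adapted weights
    return theta

theta = reptile(tasks=[2.0, 4.0])
# theta settles between the two task optima, a good start for either task
```

MAML differs in that it backpropagates through the inner loop itself; Reptile's first-order interpolation is cheaper and is often used as a drop-in approximation.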
Energy-Based Consistency
During the forward pass, an energy model scores the consistency of hidden states and regularizes them, keeping the model's outputs logically coherent across layers and time steps.
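A minimal sketch of how such a penalty could be computed, assuming (hypothetically) that the energy is the squared distance between adjacent hidden states; the real energy model is presumably learned, and `consistency_penalty` and its `weight` parameter are illustrative names.

```python
def energy(h_prev, h_next):
    """Hypothetical energy: low when consecutive hidden states agree."""
    return sum((a - b) ** 2 for a, b in zip(h_prev, h_next))

def consistency_penalty(hidden_states, weight=0.01):
    """Sum the energy over adjacent layers/time steps; this term would be
    added to the training loss to keep representations coherent."""
    total = 0.0
    for h_prev, h_next in zip(hidden_states, hidden_states[1:]):
        total += energy(h_prev, h_next)
    return weight * total

# Three consecutive hidden states that drift only slightly
states = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.1]]
penalty = consistency_penalty(states)
```

States that change abruptly between layers incur a large penalty, so minimizing the combined loss pushes adjacent representations to stay consistent.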