The Meta-Attention architecture has four key components:
Variational Posterior Network: For each token, outputs the parameters of a distribution over the three attention mechanisms. The network is lightweight and typically adds only a small number of parameters.
Dirichlet Prior Design: The prior encodes computational cost, favoring cheaper attention mechanisms (e.g., linear attention) unless task performance requires full attention.
ELBO Training Objective: The training objective balances task performance and routing efficiency; this trade-off can be controlled by adjusting hyperparameters.
Soft-to-Hard Routing Scheduling: Soft routing (probabilistic weighting) is used in the early stages of training to ensure gradient flow, and gradually transitions to hard routing (discrete selection) in later stages to maximize efficiency gains.
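To make the interaction between these components concrete, here is a minimal sketch in plain Python. It is not the Meta-Attention implementation. The mechanism names, the concentration values in PRIOR_ALPHA, the exponential temperature schedule, the hard_after threshold, and the beta weight are all assumptions chosen for illustration. As a further simplification, the regularizer compares the per-token routing weights against the mean of the Dirichlet prior using a categorical KL term, rather than computing a full Dirichlet-to-Dirichlet divergence.

```python
import math

# Hypothetical mechanism names and Dirichlet concentrations; the text names
# only full and linear attention and gives no numeric values.
MECHANISMS = ("full", "sparse", "linear")
PRIOR_ALPHA = (1.0, 2.0, 4.0)  # larger alpha = cheaper mechanism is favored


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def prior_mean(alpha):
    """Mean of a Dirichlet(alpha): alpha_k / sum(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]


def kl_categorical(q, p, eps=1e-12):
    """KL(q || p) between two categorical distributions."""
    return sum(qk * math.log((qk + eps) / (pk + eps)) for qk, pk in zip(q, p))


def temperature_at(step, total_steps, t_start=1.0, t_end=0.05):
    """Exponential anneal from t_start to t_end (an assumed schedule)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return t_start * (t_end / t_start) ** frac


def route(logits, step, total_steps, hard_after=0.8):
    """Soft weights early in training, one-hot selection late."""
    if step / total_steps >= hard_after:
        best = max(range(len(logits)), key=lambda k: logits[k])
        return [1.0 if k == best else 0.0 for k in range(len(logits))]
    return softmax(logits, temperature_at(step, total_steps))


def negative_elbo(task_loss, weights, beta=0.1):
    """Task loss plus a beta-weighted KL to the cost-aware prior mean."""
    return task_loss + beta * kl_categorical(weights, prior_mean(PRIOR_ALPHA))


if __name__ == "__main__":
    token_logits = [2.0, 0.5, 1.0]  # stand-in for the posterior network output
    for step in (0, 50, 90):
        weights = route(token_logits, step, total_steps=100)
        print(step, [round(w, 3) for w in weights], round(negative_elbo(1.0, weights), 3))
```

In this sketch, beta plays the role of the hyperparameter that trades task performance against routing efficiency: a larger value pulls the routing weights toward the cheaper mechanisms favored by the prior. Lowering the temperature sharpens the soft weights before the switch to one-hot selection, so the change to hard routing is less abrupt.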