Finsler Transformer: Replacing Quadratic-Complexity Attention Mechanism with Geodesic Flow on Finsler Manifold

This article introduces Finsler Transformer, an applied mathematics research project that replaces the O(T²) attention mechanism of traditional Transformers with a learned geodesic flow on a Finsler manifold. Context is handled through geometric deformation of the space rather than explicit pairwise computation, with the aim of building a linear-complexity autoregressive generator.

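Tags: Finsler geometry, Transformer, attention mechanism, geodesics, Riemannian geometry, linear complexity, generative models, differential geometry, natural language processing, deep learning architecture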

Published 2026-06-10 12:40 | Recent activity 2026-06-10 12:53 | Estimated read 9 min

Section 01

Introduction to the Finsler Transformer Project

Project Basic Information

Original Author/Maintainer: dledbetter123
Source Platform: GitHub
Original Title: LedbetterFinslerTransformer
Original Link: https://github.com/dledbetter123/LedbetterFinslerTransformer
Release Time: June 2026

Core Innovation

The project proposes the Finsler Transformer architecture, which replaces the O(T²) attention mechanism of traditional Transformers with a learned geodesic flow on a Finsler manifold. Context processing becomes geometric deformation rather than explicit computation, with the goal of an autoregressive generator whose cost is linear in sequence length.

Section 02

Bottlenecks and Geometric Limitations of Traditional Transformers

Complexity Issue of Traditional Attention

Standard self-attention computes a correlation score for every pair of tokens in the sequence, so its time and memory costs grow as O(T²) in the sequence length T. This limits the ability to process long sequences.
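To make the quadratic cost concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention without masking. The function name and the toy sizes are illustrative choices, not taken from the project; the point is only that the intermediate score matrix has one entry per token pair.

```python
import numpy as np

def attention(Q, K, V):
    """Standard scaled dot-product self-attention (single head, no mask)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                      # shape (T, T): one entry per token pair
    scores = scores - scores.max(axis=-1, keepdims=True)  # stabilise the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d = 8, 4                                            # toy sizes, chosen for illustration
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
out, weights = attention(Q, K, V)
print(weights.shape)                                   # (8, 8)
```

Doubling T quadruples the number of entries in `weights`, which is the bottleneck the project sets out to avoid.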

Limitations of Geometric Space

Euclidean Space: Implicitly assumes a symmetric, direction-independent distance, which conflicts with the directionality of language (e.g., "A implies B" ≠ "B implies A").
Riemannian Geometry: Allows the metric to vary with position, but a vector and its negative still have the same length, so distances remain symmetric. It is also prone to over-smoothing, where tokens lose their specific identities.

Necessity of Finsler Geometry

Finsler geometry generalizes Riemannian geometry by allowing the norm to be asymmetric: in general F(x,v) ≠ F(x,−v), so moving in a direction and moving against it can have different costs. This direction dependence matches the directional character of language.

Section 03

Core Methods: Randers Metric and Geodesic Trajectory

Randers Metric (Non-Riemannian Finsler Metric)

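The project adopts a Randers metric, the simplest non-Riemannian Finsler metric: $$F(x, y) = \sqrt{a_{ij}(x) y^i y^j} + b_i(x) y^i, \quad |b|_a < 1$$ where $|b|_a = \sqrt{a^{ij} b_i b_j}$ is the norm of $b$ measured with respect to $a$.

$a_{ij}(x)$: Learned Riemannian background metric (baseline semantic similarity)
$b_i(x)$: Learned 1-form (acts as a "wind direction" that guides the move to the next token)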
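The asymmetry of the Randers metric is easy to check numerically. The sketch below evaluates F at a single point with hand-picked values for $a_{ij}$ and $b_i$ (illustrative numbers, not learned parameters from the project) and compares the cost of a direction with the cost of its reverse.

```python
import numpy as np

def randers_norm(a, b, y):
    """F(x, y) = sqrt(a_ij y^i y^j) + b_i y^i, evaluated at a fixed point x."""
    return np.sqrt(y @ a @ y) + b @ y

def one_form_norm(a, b):
    """|b|_a = sqrt(a^{ij} b_i b_j), computed with the inverse of a."""
    return np.sqrt(b @ np.linalg.solve(a, b))

# Illustrative values for a_ij(x) and b_i(x) at one point x.
a = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([0.5, 0.0])
y = np.array([1.0, 1.0])

assert one_form_norm(a, b) < 1.0          # the Randers condition |b|_a < 1
forward = randers_norm(a, b, y)
backward = randers_norm(a, b, -y)
print(f"{forward:.3f} {backward:.3f}")    # 2.232 1.232
```

The two costs differ by exactly $2 b_i y^i$: moving with the "wind" $b$ and against it are priced differently, while the Riemannian part $\sqrt{a_{ij} y^i y^j}$ is the same in both directions.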

Core Concept: Sentence as Geodesic

A sentence is modeled as a continuous trajectory on the Finsler manifold, and each token is a point on that trajectory
Attention is replaced by the cumulative deformation of the space experienced while traveling along the geodesic
Goal: Build an autoregressive generator with O(T) complexity

Technical Implementation

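Geodesic Equation: $$\frac{d^2 x^i}{dt^2} + \Gamma^i_{jk} \frac{dx^j}{dt} \frac{dx^k}{dt} = 0$$ where $\Gamma^i_{jk}$ are the Christoffel symbols. For a Finsler metric these are the formal Christoffel symbols of the fundamental tensor $g_{ij}(x, y) = \tfrac{1}{2}\,\partial^2 F^2 / \partial y^i \partial y^j$, so they depend on the direction $dx/dt$ as well as on the position $x$
Parameter Learning: Optimize the parameters of $a_{ij}(x)$ and $b_i(x)$ via backpropagation
Numerical Integration: Solve the geodesic equation with a numerical integrator such as a Runge-Kutta or leapfrog scheme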
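As a rough illustration of the integration step, the sketch below traces a Randers geodesic with a classical RK4 scheme. It is not the project's implementation: the fields `a_metric` and `b_form` are hypothetical hand-written stand-ins for the learned ones, and instead of assembling Christoffel symbols it uses the equivalent Euler-Lagrange form for the energy $L = \tfrac{1}{2}F^2$, with derivatives taken by central finite differences.

```python
import numpy as np

def a_metric(x):
    """Hypothetical background metric: a_ij = (1 + |x|^2) * delta_ij."""
    return (1.0 + x @ x) * np.eye(x.size)

def b_form(x):
    """Hypothetical 1-form ('wind'); small enough that |b|_a < 1 everywhere."""
    return 0.3 * np.array([-x[1], x[0]]) / (1.0 + x @ x)

def F(x, y):
    return np.sqrt(y @ a_metric(x) @ y) + b_form(x) @ y

def energy(x, y):
    return 0.5 * F(x, y) ** 2

def fd_grad(f, z, h=1e-4):
    """Central-difference gradient of a scalar function f at z."""
    g = np.zeros_like(z)
    for i in range(z.size):
        e = np.zeros_like(z)
        e[i] = h
        g[i] = (f(z + e) - f(z - e)) / (2.0 * h)
    return g

def fd_jac(f, z, h=1e-4):
    """Central-difference Jacobian; J[i, j] = d f_i / d z_j."""
    cols = []
    for j in range(z.size):
        e = np.zeros_like(z)
        e[j] = h
        cols.append((f(z + e) - f(z - e)) / (2.0 * h))
    return np.stack(cols, axis=1)

def acceleration(x, y):
    """Solve L_yy * x'' = L_x - L_yx * x' (Euler-Lagrange for L = F^2 / 2)."""
    def L_y(x_, y_):
        return fd_grad(lambda v: energy(x_, v), y_)
    L_x = fd_grad(lambda x_: energy(x_, y), x)
    L_yy = fd_jac(lambda v: L_y(x, v), y)      # fundamental tensor g_ij(x, y)
    L_yx = fd_jac(lambda x_: L_y(x_, y), x)    # mixed partials d^2 L / dy^i dx^j
    return np.linalg.solve(L_yy, L_x - L_yx @ y)

def rk4_step(state, dt):
    """One classical Runge-Kutta step on the state (x, dx/dt)."""
    n = state.size // 2
    def deriv(s):
        return np.concatenate([s[n:], acceleration(s[:n], s[n:])])
    k1 = deriv(state)
    k2 = deriv(state + 0.5 * dt * k1)
    k3 = deriv(state + 0.5 * dt * k2)
    k4 = deriv(state + dt * k3)
    return state + dt / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4)

state = np.concatenate([np.array([1.0, 0.0]), np.array([0.0, 1.0])])
speeds = [F(state[:2], state[2:])]
for _ in range(100):
    state = rk4_step(state, dt=0.02)
    speeds.append(F(state[:2], state[2:]))
drift = max(speeds) - min(speeds)
print(f"speed drift along the curve: {drift:.2e}")
```

Along an exact geodesic the Finsler speed $F(x, \dot{x})$ is constant, so a small drift is a useful sanity check on the integrator. A learned model would more likely obtain these derivatives by automatic differentiation than by finite differences.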

Section 04

Comparison with Existing Linear-Complexity Models

Linear Attention (Linear Transformer/Performer)

Uses kernel feature maps or random feature approximations to reduce the cost of attention to O(T), but still operates in Euclidean space and does not exploit the directionality of language
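The kernel trick behind linear attention amounts to reordering a matrix product. The sketch below uses the feature map φ(x) = elu(x) + 1 from the Linear Transformer and shows, for the non-causal case, that the O(T²) and O(T) orderings give the same output. The function names and toy sizes are illustrative.

```python
import numpy as np

def feature_map(x):
    """phi(x) = elu(x) + 1, a strictly positive feature map."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def quadratic_form(Q, K, V):
    """Builds the full (T, T) similarity matrix first."""
    A = feature_map(Q) @ feature_map(K).T
    return (A / A.sum(axis=-1, keepdims=True)) @ V

def linear_form(Q, K, V):
    """Same result, but summarises keys and values before touching the queries."""
    phi_q, phi_k = feature_map(Q), feature_map(K)
    kv = phi_k.T @ V                 # (d, d_v): size independent of T
    z = phi_k.sum(axis=0)            # (d,): normaliser
    return (phi_q @ kv) / (phi_q @ z)[:, None]

rng = np.random.default_rng(0)
T, d, d_v = 6, 4, 3
Q, K = rng.normal(size=(T, d)), rng.normal(size=(T, d))
V = rng.normal(size=(T, d_v))
same = np.allclose(quadratic_form(Q, K, V), linear_form(Q, K, V))
print(same)                          # True
```

Nothing in this computation distinguishes a direction from its reverse, which is the gap the Finsler approach aims to fill.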

State Space Model (Mamba)

Achieves O(T) by compressing the history into a fixed-size hidden state. By contrast, Finsler Transformer aims to preserve the sequence's geometric structure, encoding context in the curvature of the space rather than in a fixed state vector
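For comparison, the hidden-state compression used by state space models can be sketched as a plain linear recurrence. This is a deliberate simplification: Mamba makes the recurrence parameters input-dependent, which is not shown here, and the matrices below are arbitrary illustrative values.

```python
import numpy as np

def ssm_scan(A, B, C, xs):
    """h_t = A h_{t-1} + B x_t,  y_t = C h_t, for scalar inputs x_t."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in xs:
        h = A @ h + B * x_t          # the whole history lives in the fixed-size h
        ys.append(C @ h)
    return np.array(ys), h

A = np.diag([0.9, 0.5, 0.1])         # illustrative stable dynamics
B = np.array([1.0, 1.0, 1.0])
C = np.array([0.2, 0.3, 0.5])
xs = np.array([1.0, -2.0, 0.5, 3.0])
ys, h_final = ssm_scan(A, B, C, xs)
print(ys.shape, h_final.shape)       # (4,) (3,)
```

Each step costs the same regardless of how many tokens came before, so the full pass is O(T); the trade-off is that everything the model remembers must fit in `h`.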

Graph Neural Networks

GNNs treat a sequence as a graph and pass messages along its edges; Finsler Transformer can be viewed as a continuous analogue in which edge weights are determined dynamically by the geometric metric

Section 05

Theoretical Advantages and Potential

Computational Efficiency

Generating the next token requires only one integration step along the geodesic, so the per-token cost is O(1) in the context length and the total cost is O(T) for a sequence of T tokens

Inductive Bias

The directionality and hierarchical structure of language are built into the geometry, which is expected to improve sample efficiency and generalization

Interpretability

Visualize the geodesic path of sentences in the semantic space to analyze the model's understanding of semantic relationships

Long-Range Dependencies

Long-range dependencies are expected to emerge from the global geometric structure of the manifold, with semantically related but distant tokens separated by shorter geodesic distances

Section 06

Current Challenges and Open Problems

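Metric Learning Stability: The learned metric must keep satisfying its mathematical constraints, such as positivity and the triangle inequality; for a Randers metric this means keeping $a_{ij}$ positive definite and $|b|_a < 1$
Numerical Computation Overhead: The cost of numerically integrating the geodesic equation may exceed that of the matrix multiplications it replaces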
Architecture Compatibility: Integration with other Transformer components (feed-forward networks, layer normalization) needs further research
Training Stability: The new geometric framework brings optimization challenges, requiring specialized training techniques
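For the metric constraints listed under Metric Learning Stability, one common approach is to reparameterize so that the constraints hold by construction rather than being enforced during training. The sketch below is one such option, not something taken from the project: `constrained_randers`, `eps`, and `max_norm` are hypothetical names and values.

```python
import numpy as np

def constrained_randers(raw_L, raw_b, eps=1e-3, max_norm=0.95):
    """Map unconstrained parameters to (a, b) with a positive definite and |b|_a < 1."""
    L = np.tril(raw_L)                                   # lower-triangular factor
    a = L @ L.T + eps * np.eye(L.shape[0])               # symmetric positive definite
    norm = np.sqrt(raw_b @ np.linalg.solve(a, raw_b))    # |raw_b|_a
    scale = max_norm * np.tanh(norm) / (norm + 1e-12)
    b = scale * raw_b                                    # |b|_a stays below max_norm
    return a, b

rng = np.random.default_rng(0)
raw_L = rng.normal(size=(3, 3))
raw_b = 10.0 * rng.normal(size=3)                        # deliberately large raw values
a, b = constrained_randers(raw_L, raw_b)
b_norm = np.sqrt(b @ np.linalg.solve(a, b))
print(np.linalg.eigvalsh(a).min() > 0, b_norm < 1.0)     # True True
```

Because `tanh` is bounded by 1, the rescaled 1-form satisfies $|b|_a < 1$ however large the raw parameters become, so gradient updates cannot push the metric outside the valid Randers family.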

Section 07

Potential Application Scenarios

Ultra-Long Context Modeling: Process sequences of millions of tokens (document understanding, code analysis, genome modeling)
Streaming Generation: Continuous generation without reprocessing historical context
Multimodal Fusion: Unified Finsler manifold representation for different modal data
Continual Learning: Integrate new knowledge by adjusting local metrics

Section 08

Summary and Future Directions

Finsler Transformer is a fundamental rethinking of the attention mechanism that recasts sequence modeling from discrete computation to continuous geometry. The project is still at the research stage, and it proposes a shift in deep learning architectures from "explicit computation" to "implicit structure". If successful, it could ease the long-sequence bottleneck and build the directional characteristics of language into the model structure, opening new paths for the next generation of generative models.