# Yeti: A Compact and Efficient Structure Tokenizer for Multimodal Protein Generation

> Yeti is a protein structure tokenizer based on Lookup-Free Quantization (LFQ), achieving reconstruction accuracy comparable to ESM3 with only 1/10 the number of parameters, and demonstrating strong generative capabilities in a from-scratch trained multimodal model.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Posted: 2026-05-11T04:49:47.000Z
- Last activity: 2026-05-12T06:19:50.171Z
- Popularity: 125.5
- Keywords: protein structure, multimodal models, tokenizers, lookup-free quantization, flow matching, protein generation, ESM3, AI for Science
- Page link: https://www.zingnex.cn/en/forum/thread/yeti
- Canonical: https://www.zingnex.cn/forum/thread/yeti
- Markdown source: floors_fallback

---

## [Introduction] Yeti: A Compact and Efficient Multimodal Protein Structure Tokenizer

Yeti is a protein structure tokenizer based on Lookup-Free Quantization (LFQ), achieving reconstruction accuracy comparable to ESM3 with only 1/10 the number of parameters, and demonstrating strong generative capabilities in a from-scratch trained multimodal model. It aims to address the core challenge of structural representation in multimodal protein AI, providing an efficient foundational component for protein design to move from prediction to creation.

## Background: Core Challenges of Multimodal Protein AI

Proteins are the basic executors of life activities, and their functions are determined by their three-dimensional structures. Achievements like AlphaFold have advanced structural prediction, but protein design requires generating novel proteins, which demands models to understand sequences, structures, and functional annotations and perform cross-modal transformations. The core challenge lies in the representation of structural information: protein structures are continuous 3D coordinates that cannot be directly input into discrete sequence models; existing structure tokenizers focus too much on reconstruction accuracy while ignoring the needs of generative tasks. An excellent tokenizer must balance reconstruction accuracy, generative fluency, and cross-modal reasoning ability.

## Core Design of Yeti: Lookup-Free Quantization and Flow Matching Objective

Yeti adopts a concise and efficient design, with core technologies including:
1. **Lookup-Free Quantization (LFQ)**: No large codebook needs to be maintained; discrete representations are learned directly through mathematical transformations, reducing parameter count and improving codebook utilization.
2. **Flow Matching Objective**: End-to-end training optimized for multimodal learning objectives; more stable and efficient than traditional diffusion models, and naturally suited to generative tasks.
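As a minimal illustration of the flow-matching idea, independent of Yeti's actual architecture, the training objective for a linear noise-to-data path can be sketched as follows. `ToyVelocityNet` and all dimensions here are hypothetical stand-ins, not the paper's model:

```python
import torch
import torch.nn as nn

class ToyVelocityNet(nn.Module):
    """Hypothetical stand-in for the multimodal denoising network."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 64), nn.SiLU(), nn.Linear(64, dim)
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Condition on time by concatenating t to the state.
        return self.net(torch.cat([x_t, t], dim=-1))

def flow_matching_loss(model: nn.Module, x1: torch.Tensor) -> torch.Tensor:
    """Conditional flow matching with a linear (rectified) path.

    x1: clean latent embeddings, shape (batch, dim).
    """
    x0 = torch.randn_like(x1)        # noise endpoint of the path
    t = torch.rand(x1.size(0), 1)    # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1      # point on the straight path
    v_target = x1 - x0               # constant velocity of that path
    v_pred = model(x_t, t)
    return ((v_pred - v_target) ** 2).mean()

model = ToyVelocityNet(dim=16)
loss = flow_matching_loss(model, torch.randn(8, 16))
```

Training then simply backpropagates this regression loss, which is why flow matching tends to be more stable than score-based diffusion objectives: there is no noise-schedule-dependent weighting to tune.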

## Performance and Generative Capability: Evidence of Small Size with Great Power

Yeti's efficiency stands out: with only 1/10 the number of parameters of ESM3, it delivers strong results across the board:
- **Codebook Utilization and Diversity**: Best codebook utilization across multiple datasets; discrete representations are compact with high information density, and generated token sequences are rich in diversity, avoiding mode collapse.
- **Reconstruction Accuracy**: Second-best accuracy in structural reconstruction tasks, balancing compression and fidelity.
- **Generative Capability Verification**: A multimodal model trained from scratch (without pre-trained weights) using Yeti as the structure encoder can generate both reasonable sequences and 3D structures, with results comparable to models 10 times larger.
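Codebook utilization and usage diversity of the kind reported above are typically quantified from the histogram of emitted token ids. A small sketch using standard metric definitions (these formulas are common practice for quantizers, not taken from the Yeti paper):

```python
import numpy as np

def codebook_stats(token_ids: np.ndarray, codebook_size: int):
    """Return (utilization, perplexity) of a token-id stream.

    utilization: fraction of codes used at least once.
    perplexity:  exp of the entropy of the usage distribution;
                 higher means more uniform usage, i.e. less mode collapse.
    """
    counts = np.bincount(token_ids.ravel(), minlength=codebook_size)
    utilization = float((counts > 0).mean())
    probs = counts / counts.sum()
    nz = probs[probs > 0]
    perplexity = float(np.exp(-(nz * np.log(nz)).sum()))
    return utilization, perplexity

# Toy example: 1000 tokens drawn uniformly from a 256-code vocabulary.
rng = np.random.default_rng(0)
util, ppl = codebook_stats(rng.integers(0, 256, size=1000), 256)
```

A near-uniform stream yields utilization close to 1 and perplexity close to the codebook size; a collapsed quantizer would show a perplexity far below it.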

## Technical Details: Yeti's Workflow and LFQ Innovations

Yeti's processing flow: input structures are encoded into continuous latent vectors, which the LFQ layer discretizes into structural tokens. During training, the generative model learns to recover the real token sequences from noise; during inference, new structures are produced by iteratively denoising random noise.
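The tokenization half of this pipeline can be sketched with a sign-based, codebook-free quantizer. This is a minimal sketch assuming each latent dimension contributes one bit of the token id, so a d-dimensional latent indexes an implicit codebook of size 2^d; the exact layer in Yeti may differ:

```python
import torch

def lfq_tokenize(z: torch.Tensor) -> torch.Tensor:
    """Map continuous latents (batch, d) to integer structure tokens.

    Each latent dimension contributes one bit via its sign, so no
    learned embedding table is needed.
    """
    bits = (z > 0).long()                      # (batch, d) in {0, 1}
    weights = 2 ** torch.arange(z.size(-1))    # binary place values
    return (bits * weights).sum(dim=-1)        # integer token ids

def lfq_detokenize(ids: torch.Tensor, dim: int) -> torch.Tensor:
    """Recover the canonical codeword (+/-1 per dimension) from an id."""
    bits = (ids.unsqueeze(-1) >> torch.arange(dim)) & 1
    return bits.float() * 2 - 1

z = torch.randn(4, 8)          # 8 latent dims -> 256 possible tokens
ids = lfq_tokenize(z)
codes = lfq_detokenize(ids, dim=8)
```

Tokenizing the recovered codewords reproduces the original ids, which is the roundtrip a decoder relies on when mapping generated tokens back to structures.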
LFQ innovations: traditional vector quantization selects the nearest codebook entry by Euclidean distance, which hampers gradient propagation and leaves much of the codebook unused. LFQ instead achieves end-to-end differentiable training through random perturbation and straight-through estimators, with regularization terms encouraging uniform use of the quantization centers.
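The straight-through trick and a uniformity regularizer of the kind described above can be sketched as follows. Details are assumed for illustration: the hard codeword is the elementwise sign, and bit-usage entropy (computed from a sigmoid of the latents) serves as the regularizer; Yeti's exact perturbation and regularization terms may differ:

```python
import torch

def lfq_forward(z: torch.Tensor):
    """One LFQ step with a straight-through estimator (STE).

    Forward pass emits the hard sign codeword; backward pass copies
    gradients through z unchanged. Returns the quantized latents and
    a regularizer that is minimized when bit usage is uniform.
    """
    q = torch.sign(z)                          # hard codeword in {-1, +1}
    q = torch.where(q == 0, torch.ones_like(q), q)
    z_q = z + (q - z).detach()                 # STE: hard forward, soft backward

    p = torch.sigmoid(z).mean(dim=0)           # avg prob of bit=1 per dim
    ent = -(p * torch.log(p + 1e-8)
            + (1 - p) * torch.log(1 - p + 1e-8))
    reg = -ent.mean()                          # minimize negative entropy
    return z_q, reg

z = torch.randn(16, 8, requires_grad=True)
z_q, reg = lfq_forward(z)
z_q.sum().backward()                           # gradients reach z via the STE
```

Because the `(q - z)` residual is detached, the quantization step is invisible to autograd, so the encoder trains end to end despite the discrete bottleneck.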

## Application Prospects: Unlocking New Possibilities for Protein Design

Yeti's compactness and efficiency make it suitable for resource-constrained scenarios (training multimodal models on a single GPU). Its application prospects include:
- **Function-Oriented Design**: Train conditional generative models combined with functional annotations to produce proteins with specific enzymatic activity, binding affinity, or stability.
- **Sequence-Structure Co-Optimization**: Support co-generation in sequence and structure spaces, breaking the limitation of traditional methods that fix one side and optimize the other.
- **Multimodal Reasoning**: In the future, it can integrate more modalities such as dynamic information and experimental data to build a more comprehensive protein understanding model.

## Comparison and Summary: Yeti's Unique Value and Future Significance

Compared with existing works such as ESM3 and FoldToken, Yeti is explicitly optimized for generative tasks (ESM3's tokenizer mainly serves reconstruction and representation learning). Its minimalist design shows that small models can achieve strong results through algorithmic innovation, helping make AI for Science more accessible. In short, Yeti offers an efficient, compact, and generation-friendly structural representation for multimodal protein AI, underscores the importance of designing tokenizers for generative tasks, and can play a key role as protein design moves from prediction to creation.
