Yeti: A Compact and Efficient Structure Tokenizer for Multimodal Protein Generation

Tags: protein structure, multimodal models, tokenizer, lookup-free quantization, flow matching, protein generation, ESM3, AI for Science

Published 2026-05-11 12:49 | Recent activity 2026-05-12 14:19 | Estimated read 8 min

Section 01

[Introduction] Yeti: A Compact and Efficient Multimodal Protein Structure Tokenizer

Yeti is a protein structure tokenizer based on Lookup-Free Quantization (LFQ). It achieves reconstruction accuracy comparable to ESM3 with only one tenth of the parameters, and it demonstrates strong generative capabilities when used in a multimodal model trained from scratch. It targets the core challenge of structural representation in multimodal protein AI, providing an efficient foundational component as protein design moves from prediction to creation.

Section 02

Background: Core Challenges of Multimodal Protein AI

Proteins are the basic executors of life activities, and their functions are determined by their three-dimensional structures. Achievements like AlphaFold have advanced structure prediction, but protein design requires generating novel proteins, which in turn requires models that understand sequences, structures, and functional annotations and can translate between these modalities. The core challenge lies in representing structural information. Protein structures are continuous 3D coordinates that cannot be fed directly into discrete sequence models, and existing structure tokenizers focus heavily on reconstruction accuracy while neglecting the needs of generative tasks. A good tokenizer must balance reconstruction accuracy, generative fluency, and cross-modal reasoning ability.

Section 03

Core Design of Yeti: Lookup-Free Quantization and Flow Matching Objective

Yeti adopts a concise and efficient design, with core technologies including:

Lookup-Free Quantization (LFQ): Instead of maintaining a large learned codebook, Yeti derives discrete representations directly from a mathematical transformation of the latent vector, which reduces the parameter count and improves codebook utilization.
Flow Matching Objective: The tokenizer is trained end to end with an objective aligned with multimodal learning. This is more stable and efficient than traditional diffusion models and is a natural fit for generative tasks.
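
To make the lookup-free idea concrete, here is a minimal sketch in plain Python. It assumes the common LFQ formulation in which each latent dimension is quantized independently to -1 or +1 by its sign, and the token index is the integer whose bits are those signs. The function name `lfq_quantize` and the 4-dimensional latent are illustrative assumptions, not Yeti's actual configuration, and the training-time pieces (straight-through gradients, regularization) are left out.

```python
from typing import List, Tuple


def lfq_quantize(latent: List[float]) -> Tuple[List[int], int]:
    """Quantize a latent vector without a learned codebook (illustrative sketch).

    Each dimension is mapped to -1 or +1 by its sign; the resulting sign
    pattern is read as a binary number to give the token index. With d
    dimensions the implicit codebook has 2**d entries, none of them stored.
    """
    codes = [1 if value > 0 else -1 for value in latent]
    # Bit i of the index is set when dimension i quantizes to +1.
    index = sum((1 << i) for i, code in enumerate(codes) if code == 1)
    return codes, index


# Hypothetical 4-dimensional latent for one residue (implicit codebook size 16).
codes, token = lfq_quantize([0.7, -1.2, 0.05, -0.3])
print(codes, token)
```

Because the codebook is implicit, no embedding table has to be stored or searched, which is where the parameter savings described above would come from in a design of this kind.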

Section 04

Performance and Generative Capability: Evidence of Small Size with Great Power

With only one tenth of ESM3's parameters, Yeti performs well on three fronts:

Codebook Utilization and Diversity: Yeti achieves the best codebook utilization across multiple datasets. Its discrete representations are compact and information-dense, and the generated token sequences are diverse, avoiding mode collapse.
Reconstruction Accuracy: Yeti ranks second in accuracy on structural reconstruction tasks, balancing compression and fidelity.
Generative Capability Verification: A multimodal model trained from scratch (without pre-trained weights), using Yeti as the structure encoder, can generate both plausible sequences and plausible 3D structures, with results comparable to models 10 times larger.
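
The article does not say exactly how codebook utilization is measured, so the following is a sketch of one common convention rather than Yeti's evaluation code: utilization as the fraction of codebook entries that appear at least once, and perplexity as the exponential of the token distribution's entropy. The helper name `codebook_stats` and the toy token list are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence, Tuple


def codebook_stats(tokens: Sequence[int], codebook_size: int) -> Tuple[float, float]:
    """Return (utilization, perplexity) for a stream of token indices.

    utilization: fraction of the codebook that is used at least once.
    perplexity:  exp(entropy) of the empirical token distribution; it equals
                 the number of used codes when they are used uniformly.
    """
    counts = Counter(tokens)
    utilization = len(counts) / codebook_size
    total = len(tokens)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return utilization, math.exp(entropy)


# Toy example: 4 of 16 possible codes, each used equally often.
utilization, perplexity = codebook_stats([0, 1, 2, 3, 0, 1, 2, 3], codebook_size=16)
print(utilization, perplexity)
```

A tokenizer with high utilization and a perplexity close to the codebook size spreads information evenly across its codes, which is the behaviour the comparison above rewards.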

Section 05

Technical Details: Yeti's Workflow and LFQ Innovations

Yeti's processing flow: an input structure is encoded into continuous latent vectors, and the LFQ layer then discretizes them into structural tokens. During training, the model learns to recover real token sequences from noise; during inference, it generates new structures by iteratively denoising random noise.

LFQ innovations: traditional vector quantization finds the nearest codebook entry by Euclidean distance, which makes gradient propagation difficult and often leaves much of the codebook unused. LFQ achieves end-to-end differentiable training through random perturbation and straight-through estimators, with regularization terms that encourage uniform use of the quantization centers.
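
The train-from-noise and generate-by-iterative-denoising loop can be illustrated with a generic flow matching sketch. It assumes the widely used linear path between a noise sample and a data sample, works on plain continuous vectors, and replaces the trained network with a hand-written `oracle` velocity field for a one-point toy dataset. All names here are hypothetical, and how Yeti couples this objective to its discrete tokens is not specified in the article.

```python
import random
from typing import Callable, List

Vector = List[float]


def interpolate(x0: Vector, x1: Vector, t: float) -> Vector:
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0: Vector, x1: Vector) -> Vector:
    """Regression target for the linear path: its constant velocity x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predicted: Vector, x0: Vector, x1: Vector) -> float:
    """Mean squared error between predicted and target velocity."""
    target = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


def sample(velocity_fn: Callable[[Vector, float], Vector], x0: Vector, steps: int) -> Vector:
    """Generate a sample by Euler-integrating the velocity field from t=0 to t=1."""
    x, dt = list(x0), 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy setup: the "dataset" is a single point, for which the ideal velocity
# field is known in closed form. A trained network would replace `oracle`.
data_point = [1.0, -1.0, 1.0]


def oracle(x: Vector, t: float) -> Vector:
    return [(d - xi) / (1.0 - t) for d, xi in zip(data_point, x)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in data_point]

# Training view: at a random time t the oracle already matches the target.
t = rng.uniform(0.0, 0.99)
x_t = interpolate(noise, data_point, t)
loss = flow_matching_loss(oracle(x_t, t), noise, data_point)

# Inference view: start from noise and integrate towards the data.
generated = sample(oracle, noise, steps=10)
print(loss, generated)
```

In a real system the velocity field is a neural network fitted by minimizing this loss over many noise and data pairs; the oracle only shows that a field with zero loss transports noise onto the data.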

Section 06

Application Prospects: Unlocking New Possibilities for Protein Design

Yeti's compactness and efficiency make it suitable for resource-constrained scenarios, such as training multimodal models on a single GPU. Its application prospects include:

Function-Oriented Design: Train conditional generative models combined with functional annotations to produce proteins with specific enzymatic activity, binding affinity, or stability.
Sequence-Structure Co-Optimization: Support co-generation in sequence and structure spaces, breaking the limitation of traditional methods that fix one side and optimize the other.
Multimodal Reasoning: Future work could integrate additional modalities, such as dynamics information and experimental data, to build a more comprehensive protein understanding model.

Section 07

Comparison and Summary: Yeti's Unique Value and Future Significance

Compared with existing work such as ESM3 and FoldToken, Yeti is explicitly optimized for generative tasks, whereas ESM3's tokenizer mainly serves reconstruction and representation learning. Its minimalist design shows that small models can achieve strong results through algorithmic innovation, which helps make AI for Science more accessible. Yeti provides an efficient, compact, and generation-friendly structural representation for multimodal protein AI, demonstrates the importance of designing tokenizers for generative tasks, and is expected to play a key role as protein design moves from prediction to creation.
