Reading

YORO Hybrid Architecture: A "Retrieve Once Only" Intelligent Routing Solution for Text-to-SQL

An innovative Text-to-SQL generation architecture that intelligently routes queries to three reasoning paths (purely parameterized, hybrid compression, or full Graph-RAG) via a lightweight router, achieving an 80% reduction in prompt tokens.

Text-to-SQL大语言模型RAG令牌优化智能路由数据库微调开源项目

Published 2026-06-17 07:06Recent activity 2026-06-17 07:20Estimated read 8 min

YORO Hybrid Architecture: A "Retrieve Once Only" Intelligent Routing Solution for Text-to-SQL

Section 01

YORO Hybrid Architecture: Introduction to the Intelligent Routing Solution for Text-to-SQL

The YORO Hybrid Architecture is an innovative solution addressing token cost issues in the Text-to-SQL domain. It intelligently routes queries to three reasoning paths (purely parameterized, hybrid compression, or full Graph-RAG) via a lightweight router, achieving an 80% reduction in prompt tokens.

Original Author/Maintainer: Dhritimannandi Source Platform: GitHub Original Link: https://github.com/Dhritimannandi/yoro-hybrid-architecture Publication Date: June 16, 2026

Section 02

Project Background: The Token Cost Dilemma of Text-to-SQL

Traditional Text-to-SQL solutions often use the RAG pattern, inputting the complete database schema as context into large models. However, database schemas usually contain a large number of tables and fields, leading to prompt tokens consuming a lot of tokens, increasing API costs and occupying context space.

Core Insight of YORO (You Only Retrieve Once) Hybrid Architecture: Not all queries require complete schema information; models can internalize schema knowledge during training, enabling 'zero schema tokens' for some queries.

Section 03

Core Innovation: Three-Path Intelligent Routing and Router Implementation

Three-Path Intelligent Routing

Path A (YORO Purely Parameterized)：Suitable for standard aggregation queries; prompts only include database ID and question, with an average of 50 tokens (model has internalized the schema).
Path B (YORO Hybrid)：Suitable for medium-complexity problems; extracts and compresses a subset of the schema; prompts include database ID, question, and compressed subset, with an average of 500-800 tokens.
Path C (Graph-RAG Fallback)：Suitable for complex queries; prompts include the complete compressed schema, with an average of 3900 tokens (fallback solution).

Router Implementation

No additional LLM calls are needed; it uses keyword complexity scoring (completed in 1 millisecond):

Complexity-increasing signals: geographic joins, statistical analysis, data reconciliation, window functions, question length >120 characters.
Complexity-decreasing signals: TOP-N queries, single aggregation, time filtering, common business vocabulary.
Threshold rules: Score <0.55 → Path A; <0.8 → Path B; else → Path C.

Section 04

Detailed Explanation of Architecture Components

The project includes five core modules:

Schema Analyzer: Reads DKL Excel to generate three schema representations: CodeS, PICARD, and YORO prompts.
Synthetic Data Generator: Three-stage process (skeleton extraction → SQL generation → NLQ generation).
Fine-tuning Formatter: Supports OpenAI/Azure and HuggingFace/PEFT formats; controls the proportion of training data via hybrid_ratio.
Hybrid Inference Router: Implements complexity scoring and path selection. 5.** Pipeline Orchestrator**: Provides CLI interface (setup/benchmark/generate modes).

Section 05

Benchmark Testing: Empirical Results of 80% Token Reduction

In tests on the Olist Brazilian e-commerce dataset with 44 questions:

Path	Number of Questions	Proportion	Average Tokens	Reduction vs. Baseline
A - YORO Pure	26	60%	50	-98.7%
B - YORO Hybrid	11	25%	~700	-82%
C - Graph-RAG	7	15%	~3900	0%
Weighted Hybrid	44	100%	~560	-85.6%

Overall, it achieves approximately 80% token reduction while maintaining SQL accuracy.

Section 06

Technical Insights: General Efficiency Optimization Ideas and Application Scenarios

General idea of YORO architecture: Adaptive resource allocation through problem complexity analysis, which can be extended to:

Document Q&A: Use lightweight models for simple questions, large models for complex ones.
Code generation: Use cached templates for common patterns, full generation for novel requirements.
Multimodal processing: Choose different pipelines based on input features.

Key: Finding appropriate 'complexity proxy metrics' (heuristic rules) enables effective resource allocation.

Section 07

Limitations and Considerations

Limitations of YORO:

Fine-tuning of expert models requires domain-specific training data; migrating to a new database requires re-synthesizing data and re-fine-tuning.
The complexity scorer is based on heuristic rules; uncovered query types may lead to routing errors (Path C serves as a fallback but requires monitoring and optimization).
The current implementation is targeted at the Olist dataset; performance on complex enterprise-level databases needs further verification.

Section 08

Conclusion: Importance and Insights of Efficiency Optimization

The YORO Hybrid Architecture brings a new idea for efficiency optimization in Text-to-SQL: treating queries differently instead of uniformly. This 'on-demand allocation' philosophy is applicable to a wider range of AI system designs. In today's era where computing costs are a concern, such efficiency innovations will become increasingly important.

Continue Reading

Keep going with more reads from the same topic.

Nornir MCP Server: An Enterprise-Grade Bridge for Integrating Large Language Models into Network Automation

Nornir MCP Server is an enterprise-level server based on the Model Context Protocol (MCP). It seamlessly integrates large language models (such as Claude) with the Nornir network automation framework, supporting natural language orchestration for multi-vendor network devices (Cisco, Arista, Juniper, etc.), and providing production-grade features like a dual-engine architecture (NAPALM + Netmiko), intelligent filtering, and a secure sandbox.

Recent activity 2026-05-06 20:51

Bibliothèque Française LLM: A French Public Domain Literature Index System Optimized for Large Language Models

Bibliothèque Française LLM is a structured indexing and annotation project for French public domain literature designed specifically for large language models (LLMs). It integrates multiple authoritative sources such as DraCor, Common Corpus, and Wikisource, providing metadata indexing categorized by genre, author, and era, as well as in-depth annotations for dramatic texts (including characters, lines, stage directions, etc.). Its aim is to enable LLMs to efficiently read and understand classic French literary works.

Recent activity 2026-05-06 20:50

Splinter: A Lock-Free Zero-Copy Shared Memory KV and Vector Storage Library That Eliminates Socket and Memcpy Overhead for LLM Inference

Splinter is a minimalist, high-performance key-value (KV) and vector storage system enabling zero-latency inter-process communication via shared memory and atomic operations. With only 766 lines of core code, it supports millions of operations per second and 768-dimensional vector storage, offering a new architectural approach for local LLM inference and data-intensive applications.

Recent activity 2026-04-03 08:49

libmlxforge: An Embedded MLX LLM Inference Engine for Apple Silicon

libmlxforge is an embeddable MLX large language model (LLM) inference engine designed specifically for Apple Silicon. It provides a unified C ABI interface, supports calls from Node.js, Swift, and Rust, and features continuous batching, streaming output, JSON-constrained structured output, and embedding vector generation.

Recent activity 2026-06-09 17:23