Zing Forum

QueryForge: An Open-Source Framework for Building Text-to-SQL Assistants End to End

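QueryForge is a complete open-source framework that helps enterprises start from a structured data model, automatically generate training data, fine-tune a lightweight large language model, and evaluate and deploy it on AWS SageMaker, with support for local inference through Ollama.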

Tags: Text-to-SQL, AWS SageMaker, QLoRA, LangGraph, Ollama, natural language query, LLM fine-tuning
Published: 2026/04/08 23:44 | Last activity: 2026/04/08 23:51 | Estimated reading time: 8 minutes
Section 01

QueryForge: An open-source framework for building Text-to-SQL assistants end to end

QueryForge is an open-source framework that helps enterprises build Text-to-SQL assistants starting from a structured data model. It generates training data automatically, fine-tunes a lightweight large language model, evaluates and deploys the model on AWS SageMaker, and supports local inference through Ollama.

Section 02

Background: Pain points of querying databases in natural language

In enterprise data analysis and business intelligence, letting non-technical staff query databases directly in natural language has long been a challenge. Traditional Text-to-SQL solutions run into four problems: data privacy (sensitive schemas are sent to third-party APIs), weak generalization (generic LLMs struggle with business-specific schemas), high deployment cost (cloud inference fees), and difficult iteration and maintenance (models are slow to adapt when the schema changes). QueryForge addresses these pain points with an end-to-end solution that runs from data generation to local deployment.

Section 03

Project Overview: Six core modules

QueryForge uses a modular architecture with six core components:

  1. Data generation module (datagen): Uses LangGraph to generate (question, SQL, result) triples from the database schema, with no manual annotation.
  2. Schema management module (schemas): Uses Pydantic to manage schemas in one place, keeping data generation, training, and evaluation consistent.
  3. Model training module (train): Uses QLoRA on AWS SageMaker for efficient fine-tuning, which lowers memory usage while preserving the base model's capabilities.
  4. Model evaluation module (evaluate): Measures Execution Accuracy in a temporary SQLite environment, judging a model by whether its SQL returns the correct results.
  5. Pipeline orchestration module (pipeline): Uses AWS SageMaker Pipelines to automate the workflow from data generation to model registration.
  6. Inference service module (inference): Supports both SageMaker cloud endpoints and local Ollama inference for flexible deployment.
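
To make the (question, SQL, result) triple from module 1 concrete, here is a minimal sketch of how such a record might be represented and checked. It is not QueryForge's code: the `Triple` class, the `validate_triple` helper, and the sample `orders` table are hypothetical, and a standard-library dataclass stands in for the Pydantic models the project uses so that the snippet has no dependencies.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One synthetic training record (hypothetical shape, not QueryForge's)."""
    question: str
    sql: str
    result: list  # rows recorded when `sql` was first executed


def validate_triple(conn: sqlite3.Connection, triple: Triple) -> bool:
    """Re-run the SQL and confirm it still yields the recorded rows."""
    try:
        rows = conn.execute(triple.sql).fetchall()
    except sqlite3.Error:
        return False  # SQL that does not execute is never a valid example
    return rows == triple.result


# Throwaway in-memory database standing in for a real business schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO orders (amount) VALUES (?)", [(10.0,), (25.5,)])

good = Triple("How many orders are there?",
              "SELECT COUNT(*) FROM orders", [(2,)])
bad = Triple("How many orders are there?",
             "SELECT COUNT(*) FROM order_items", [(2,)])  # table does not exist

print(validate_triple(conn, good), validate_triple(conn, bad))
```

A check like this is one plausible way to filter generated triples before training: a record is kept only if its SQL executes against the schema and reproduces the stored result.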

Section 04

Technical Highlights

Key technical highlights of QueryForge:

  • Synthetic data generation: A LangGraph agent workflow (schema understanding → question generation → SQL synthesis → result validation) keeps the data diverse and of good quality without manual annotation.
  • QLoRA fine-tuning: Loads the base model (e.g., Llama-3) with 4-bit quantization, injects low-rank adapters into the attention and fully connected layers, and applies optimizations such as LoRA+ for training stability.
  • Execution-level evaluation: Execution Accuracy checks whether the results of executing the SQL are correct, so it ignores purely syntactic differences and handles edge cases such as NULL values and floating-point precision.
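
The execution-level evaluation described above can be sketched with Python's built-in `sqlite3` module. This is an illustration under stated assumptions rather than QueryForge's evaluator: the helper names, the `sales` table, the six-decimal rounding used as a float tolerance, and the choice to ignore row order are all assumptions made for the example.

```python
import sqlite3


def _canonical(cell):
    """Map a cell to a sortable key so NULLs, numbers, and text compare safely."""
    if cell is None:
        return (0, 0.0)
    if isinstance(cell, (int, float)):
        return (1, round(float(cell), 6))  # assumed tolerance: 6 decimal places
    return (2, str(cell))


def run(conn, sql):
    """Return canonical rows in sorted order, or None if the SQL fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return sorted(tuple(_canonical(c) for c in row) for row in rows)


def execution_accuracy(setup_sql, pairs):
    """Fraction of (predicted, gold) SQL pairs whose result sets match."""
    conn = sqlite3.connect(":memory:")  # temporary database, discarded afterwards
    conn.executescript(setup_sql)
    hits = 0
    for predicted, gold in pairs:
        expected = run(conn, gold)
        if expected is not None and run(conn, predicted) == expected:
            hits += 1
    conn.close()
    return hits / len(pairs) if pairs else 0.0


SETUP = """
CREATE TABLE sales (region TEXT, amount REAL);
INSERT INTO sales VALUES ('north', 0.1), ('north', 0.2), ('south', NULL);
"""
PAIRS = [
    # Different syntax, same value up to float rounding: counted as correct.
    ("SELECT SUM(amount) FROM sales WHERE region = 'north'",
     "SELECT TOTAL(amount) FROM sales WHERE region IN ('north')"),
    # Both sides return a single NULL: counted as correct.
    ("SELECT amount FROM sales WHERE region = 'south'",
     "SELECT SUM(amount) FROM sales WHERE region = 'south'"),
    # COUNT(*) includes the NULL row, COUNT(amount) does not: incorrect.
    ("SELECT COUNT(*) FROM sales", "SELECT COUNT(amount) FROM sales"),
    # Prediction does not execute: incorrect.
    ("SELECT amount FROM missing_table", "SELECT amount FROM sales"),
]
print(execution_accuracy(SETUP, PAIRS))
```

Sorting the rows makes the comparison order-insensitive, which suits most analytical queries but would hide a missing `ORDER BY`; an evaluator that cares about ordering would compare the rows as returned whenever the gold query specifies one.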

Section 05

Using QueryForge

Steps to use QueryForge:

  1. Environment preparation: Set up the AWS environment (IAM roles, ECR repositories) with the provided scripts, then install dependencies with uv: uv sync --all-extras.
  2. Pipeline configuration: Copy the example config file (cp config/pipeline.yaml.example config/pipeline.yaml) and edit the key parameters (S3 bucket, SageMaker role, training instance type).
  3. Run the pipeline: Execute python scripts/run_pipeline.py to run data generation, training, and evaluation, and to register the models that pass evaluation.
  4. Deployment and inference: Deploy to the cloud with python scripts/deploy_model.py, or test locally with python scripts/run_local_inference.py.
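
For the local path in step 4, the sketch below shows roughly what a call to a model served by Ollama could look like. It does not reproduce scripts/run_local_inference.py, whose contents are not shown in this post. The request uses Ollama's `/api/generate` endpoint on its default port; the prompt template, the function names, and the model name `queryforge-demo` are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address


def build_prompt(schema_ddl: str, question: str) -> str:
    """Hypothetical prompt template: schema first, then the user's question."""
    return (
        "Given the following database schema:\n"
        f"{schema_ddl}\n\n"
        f"Write one SQL query that answers: {question}\nSQL:"
    )


def build_request(model: str, schema_ddl: str, question: str) -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {
        "model": model,
        "prompt": build_prompt(schema_ddl, question),
        "stream": False,  # ask for a single JSON object instead of a stream
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_sql(model: str, schema_ddl: str, question: str, timeout: float = 60.0) -> str:
    """Send the request and return the generated text (needs a running server)."""
    request = build_request(model, schema_ddl, question)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"].strip()


# Building the request needs no server, so it can be inspected offline.
req = build_request("queryforge-demo", "CREATE TABLE orders (id INTEGER);",
                    "How many orders are there?")
print(req.get_method(), req.full_url)
```

Calling `generate_sql` only works once an Ollama server is running locally and the fine-tuned model has been imported under the chosen name.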

Section 06

Project Status and Development Roadmap

QueryForge is under active development.

Completed features: QLoRA fine-tuning and S3 artifact storage; the Execution Accuracy evaluation framework.

Planned features: automated SageMaker Endpoint creation based on evaluation thresholds; cross-instance deployment testing; Docker-based local inference; automatic retraining when the schema version changes; PostgreSQL and MySQL evaluation backends.

Section 07

Applicable Scenarios and Value

QueryForge is suitable for:

  1. Internal enterprise data analysis platforms: Let business users query data warehouses in natural language.
  2. Intelligent query features in SaaS products: Add AI-driven report generation to B2B products.
  3. Industries with strict data governance and compliance requirements: In finance and healthcare, for example, data must stay inside the organization's own environment.
  4. Rapid prototype validation: Use synthetic data to quickly check whether Text-to-SQL is feasible in a specific domain.

Section 08

Conclusion

QueryForge represents a practical direction in the Text-to-SQL field: rather than pursuing the largest generic model, it focuses on a complete engineering solution. Enterprises can train domain-specific models on their own data and deploy them in environments they control. For teams exploring LLM applications in data analysis, QueryForge offers an architectural blueprint and a starting point for implementation.