# QueryForge: An Open-Source Framework for End-to-End Construction of Text-to-SQL Intelligent Assistants

> QueryForge is an open-source framework that takes enterprises from a structured data model to a working Text-to-SQL assistant: it generates training data automatically, fine-tunes a lightweight large language model, evaluates and deploys it on AWS SageMaker, and supports local inference with Ollama.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-08T15:44:17.000Z
- Last activity: 2026-04-08T15:51:47.430Z
- Popularity: 157.9
- Keywords: Text-to-SQL, AWS SageMaker, QLoRA, LangGraph, Ollama, natural language query, LLM fine-tuning
- Page link: https://www.zingnex.cn/en/forum/thread/queryforge-text-to-sql
- Canonical: https://www.zingnex.cn/forum/thread/queryforge-text-to-sql

---

## QueryForge: An open-source framework for end-to-end Text-to-SQL intelligent assistants

QueryForge covers the full path from a structured data model to a deployed Text-to-SQL assistant. It generates training data automatically, fine-tunes a lightweight large language model, runs evaluation and deployment on AWS SageMaker, and can serve the resulting model locally through Ollama.

## Background: Pain points of querying databases in natural language

In enterprise data analysis and business intelligence, letting non-technical staff query databases directly in natural language has long been a challenge. Traditional Text-to-SQL solutions run into four problems: data privacy (sensitive schemas are sent to third-party APIs), weak generalization (generic LLMs struggle with business-specific schemas), high deployment cost (cloud inference fees), and difficult iteration and maintenance (models are hard to adapt when schemas change). QueryForge addresses these with an end-to-end workflow that runs from data generation to local deployment.

## Project Overview: Six core modules

QueryForge uses a modular architecture with six core components:
1. Data generation module (datagen): Uses LangGraph to automatically generate (question, SQL, result) triples from the database schema, with no manual annotation required.
2. Schema management module (schemas): Uses Pydantic to manage schemas in one place, keeping data generation, training, and evaluation consistent.
3. Model training module (train): Runs QLoRA fine-tuning on AWS SageMaker, reducing memory usage while preserving the base model's capabilities.
4. Model evaluation module (evaluate): Measures Execution Accuracy in a temporary SQLite environment, judging a model by whether its SQL returns the correct results.
5. Pipeline orchestration module (pipeline): Uses SageMaker Pipelines to automate the workflow from data generation to model registration.
6. Inference service module (inference): Supports both SageMaker cloud endpoints and local Ollama inference, for flexible deployment.
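
To make the (question, SQL, result) triple concrete, here is a minimal sketch of how candidate pairs could be validated by executing them against the schema. It is not QueryForge's actual code: the `Triple` dataclass stands in for the project's Pydantic models, the `orders` schema and the candidate questions are invented for illustration, and in the real framework the candidates would come from the LangGraph workflow rather than a hard-coded list.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One synthetic training example (hypothetical stand-in for a Pydantic model)."""
    question: str
    sql: str
    result: tuple  # rows returned by the query, as a tuple of tuples


# Hypothetical schema and seed rows, used only for this illustration.
SCHEMA = """
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
INSERT INTO orders VALUES (1, 'acme', 120.0), (2, 'acme', 80.0), (3, 'globex', 50.0);
"""


def validate_candidates(schema: str, candidates: list[tuple[str, str]]) -> list[Triple]:
    """Keep only (question, sql) pairs whose SQL executes against the schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(schema)
    kept = []
    for question, sql in candidates:
        try:
            rows = tuple(conn.execute(sql).fetchall())
        except sqlite3.Error:
            continue  # discard SQL that fails to parse or execute
        kept.append(Triple(question, sql, rows))
    conn.close()
    return kept


candidates = [
    ("What is the total order amount per customer?",
     "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"),
    # Deliberately broken candidate: the table name is misspelled.
    ("How many orders are there?", "SELECT COUNT(*) FROM orderz"),
]
triples = validate_candidates(SCHEMA, candidates)
```

Only the first candidate survives, and its stored result is what a later evaluation step would compare a model's output against. Executing every candidate is a cheap filter, but it only proves that the SQL runs, not that it answers the question.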

## Technical Highlights

Key technical highlights of QueryForge:
- **Synthetic data generation**: Uses a LangGraph agent workflow (schema understanding → question generation → SQL synthesis → result validation) to ensure data quality and diversity without manual annotation.
- **QLoRA fine-tuning**: Loads the base model (e.g., Llama-3) with 4-bit quantization, injects low-rank adapters into the attention and fully connected layers, and applies optimizations such as LoRA+ for training stability.
- **Execution-level evaluation**: Uses Execution Accuracy, which checks whether the SQL returns the correct results rather than comparing query text, so syntactic differences are ignored. It also handles edge cases such as NULL values and floating-point precision.
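
The execution-level evaluation described above can be sketched with the standard library's `sqlite3` module. This is an illustrative approximation rather than the project's evaluator: the `sales` schema, the query pairs, the tolerance of `1e-6`, and the choice to ignore row order are all assumptions made for the example.

```python
import math
import sqlite3


def run(conn, sql):
    """Execute SQL and return its rows, or None if execution fails."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None


def values_equal(a, b, tol=1e-6):
    """Compare two cells: NULL matches only NULL, floats match within a tolerance."""
    if a is None or b is None:
        return a is None and b is None
    if isinstance(a, float) or isinstance(b, float):
        if isinstance(a, (int, float)) and isinstance(b, (int, float)):
            return math.isclose(a, b, rel_tol=tol, abs_tol=tol)
        return False
    return a == b


def rows_match(expected, predicted, tol=1e-6):
    """Order-insensitive comparison of two result sets (an assumed policy)."""
    if len(expected) != len(predicted):
        return False
    remaining = list(predicted)
    for row in expected:
        for i, cand in enumerate(remaining):
            if len(row) == len(cand) and all(
                values_equal(x, y, tol) for x, y in zip(row, cand)
            ):
                del remaining[i]  # each predicted row may be matched only once
                break
        else:
            return False  # no predicted row matched this expected row
    return True


def execution_accuracy(schema, pairs):
    """Fraction of (gold_sql, predicted_sql) pairs whose executed results match."""
    conn = sqlite3.connect(":memory:")  # temporary database, discarded on close
    conn.executescript(schema)
    correct = 0
    for gold, pred in pairs:
        expected, got = run(conn, gold), run(conn, pred)
        if expected is not None and got is not None and rows_match(expected, got):
            correct += 1
    conn.close()
    return correct / len(pairs) if pairs else 0.0


# Hypothetical schema and query pairs, used only for this illustration.
SCHEMA = """
CREATE TABLE sales (region TEXT, amount REAL);
INSERT INTO sales VALUES ('north', 0.1), ('north', 0.2), ('south', NULL);
"""
PAIRS = [
    # Same result, different text and row order: counts as correct.
    ("SELECT region, SUM(amount) FROM sales GROUP BY region",
     "SELECT region, SUM(amount) AS total FROM sales GROUP BY region ORDER BY region DESC"),
    # Same value up to floating-point rounding: counts as correct.
    ("SELECT SUM(amount) FROM sales WHERE region = 'north'",
     "SELECT AVG(amount) * COUNT(amount) FROM sales WHERE region = 'north'"),
    # SUM yields NULL for 'south' while TOTAL yields 0.0: counts as wrong.
    ("SELECT region, SUM(amount) FROM sales GROUP BY region",
     "SELECT region, TOTAL(amount) FROM sales GROUP BY region"),
    # Predicted SQL fails to execute: counts as wrong.
    ("SELECT COUNT(*) FROM sales", "SELECT COUNT(*) FROM sale"),
]
accuracy = execution_accuracy(SCHEMA, PAIRS)
```

Ignoring row order keeps the metric tolerant of harmless differences, but a stricter evaluator would compare rows in sequence whenever the question implies an ordering.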

## Using QueryForge

Steps to use QueryForge:
1. **Environment preparation**: Configure the AWS environment (IAM roles, ECR repositories) via scripts, then install dependencies with uv: `uv sync --all-extras`.
2. **Pipeline configuration**: Copy the example config file (`cp config/pipeline.yaml.example config/pipeline.yaml`) and adjust the key parameters (S3 bucket, SageMaker role, training instance type).
3. **Run pipeline**: Execute `python scripts/run_pipeline.py` to run data generation, training, and evaluation, and to register models that pass evaluation.
4. **Deployment & inference**: Deploy to the cloud with `python scripts/deploy_model.py`, or test locally with `python scripts/run_local_inference.py`.
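
For the local path, a fine-tuned model served by Ollama can be queried over Ollama's HTTP API. The sketch below is not the content of `scripts/run_local_inference.py`, which the post does not show. It assumes an Ollama server on its default port (11434) and a model already registered under the hypothetical name `queryforge-sql`; the prompt layout is likewise an assumption.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_prompt(schema: str, question: str) -> str:
    """Combine schema and question into one prompt (this layout is an assumption)."""
    return (
        "Given the following SQLite schema, write one SQL query "
        "that answers the question.\n\n"
        f"Schema:\n{schema.strip()}\n\n"
        f"Question: {question.strip()}\nSQL:"
    )


def build_request(model: str, schema: str, question: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {
        "model": model,
        "prompt": build_prompt(schema, question),
        "stream": False,  # ask for a single JSON object instead of a token stream
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_sql(model: str, schema: str, question: str, timeout: float = 60.0) -> str:
    """Send the request to a running Ollama server and return the generated text."""
    req = build_request(model, schema, question)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"].strip()


# "queryforge-sql" is a hypothetical model name; nothing is sent until
# generate_sql() is called with an Ollama server running.
request = build_request(
    "queryforge-sql",
    "CREATE TABLE orders (id INTEGER, amount REAL);",
    "What is the total order amount?",
)
```

Calling `generate_sql("queryforge-sql", schema, question)` would then return the model's raw text, which still needs to be executed and checked before it is shown to a user.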

## Project Status and Development Roadmap

QueryForge is under active development:
- **Completed features**: QLoRA fine-tuning and S3 artifact storage; Execution Accuracy evaluation framework.
- **Planned features**: Automated SageMaker Endpoint creation based on evaluation thresholds; cross-instance deployment testing; Docker support for local inference; automatic retraining when the schema version changes; PostgreSQL and MySQL as evaluation backends.

## Applicable Scenarios and Value

QueryForge is suitable for:
1. Internal enterprise data analysis platforms: Let business users query data warehouses in natural language.
2. Intelligent query features in SaaS products: Add AI-driven report generation to B2B products.
3. Industries with strict data governance and compliance requirements, such as finance and healthcare: Data stays within the organization's own environment.
4. Rapid prototype validation: Use synthetic data to quickly check whether Text-to-SQL is feasible in a specific domain.

## Conclusion

QueryForge represents a practical direction in the Text-to-SQL field: rather than pursuing the largest generic model, it focuses on building a complete engineering solution. It lets enterprises train domain-specific models on their own data and deploy them in controlled environments. For teams exploring LLM applications in data analysis, QueryForge offers a useful architectural blueprint and a starting point for implementation.
