Zing Forum

Synth-Forge: A Local-First Synthetic Data Generation and Privacy Protection Testing Framework

Introducing a local data generation tool designed specifically for intelligent agent workflow testing, which can generate high-quality synthetic data and perform PII desensitization without relying on cloud-based large models or leaking sensitive data.

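Tags: synthetic data generation, PII desensitization, agent workflow, local-first, privacy protection, test data, TypeScript, data anonymization, Agentic Workflow, LLM testing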
Published 2026-06-15 01:16 | Recent activity 2026-06-15 01:20 | Estimated read 8 min

Section 01

Synth-Forge: Local-First Synthetic Data Tool for Agent Workflow Testing & Privacy Protection

Synth-Forge Overview

Synth-Forge is a local-first synthetic data generation and privacy protection testing framework designed for intelligent agent workflow testing. Key highlights:

  • Local-first: Runs entirely offline, no cloud dependency, zero API cost.
  • Privacy protection: Automatically identifies and desensitizes PII (Personally Identifiable Information).
  • Agent workflow adaptation: Supports multi-round dialogue simulation, tool call scenarios, and boundary condition testing.


Section 02

Project Background & Motivation

When building and testing agentic workflows, developers face a core dilemma: obtaining diverse, realistic test data without exposing sensitive information. Traditional solutions have limitations:

  • Cloud-based models: Risk of data leakage and ongoing API costs.
  • Static test sets: Fail to cover complex real-world scenarios.

Synth-Forge addresses this by adopting a local-first architecture, enabling offline synthetic data generation with automatic PII desensitization.


Section 03

Core Features of Synth-Forge

Core Features

  1. Local Data Synthesis Engine:

    • Runs locally (zero API cost, works offline, low latency).
    • Supports structured (tables, JSON), semi-structured text, and multi-modal metadata simulation.
  2. PII Recognition & Desensitization:

    • Identifies direct identifiers (name, ID, phone, email), quasi-identifiers (location, DOB), and sensitive attributes (medical/financial records).
    • Desensitization strategies: pseudonymization (replacing values with consistent pseudonyms), partial masking (e.g., 138****8888), generalization (e.g., address to region).
  3. Agent Workflow Adaptation:

    • Multi-round dialogue simulation (context coherence, memory testing).
    • Tool call scenario simulation (external API response formats).
    • Boundary condition generation (edge cases, abnormal inputs).
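
The three desensitization strategies listed above can be sketched in a few lines of TypeScript. This is a minimal illustration, not Synth-Forge's actual API: the function names (`maskMiddle`, `pseudonymize`, `generalizeAddress`), the salt parameter, and the comma-separated address format are all assumptions made for the example, and the FNV-1a hash is a non-cryptographic stand-in chosen only to keep the sketch dependency-free.

```typescript
// Hypothetical helpers for illustration only; not Synth-Forge's real API.

// Partial masking: keep the first `head` and last `tail` characters.
function maskMiddle(value: string, head = 3, tail = 4): string {
  if (value.length <= head + tail) return "*".repeat(value.length);
  const hidden = "*".repeat(value.length - head - tail);
  return value.slice(0, head) + hidden + value.slice(value.length - tail);
}

// 32-bit FNV-1a hash. Non-cryptographic: fine for a demo, not for production.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Pseudonymization: the same input and salt always map to the same pseudonym,
// so joins across records still work after desensitization.
function pseudonymize(value: string, salt: string, prefix = "user"): string {
  const hex = ("00000000" + fnv1a(salt + value).toString(16)).slice(-8);
  return prefix + "_" + hex;
}

// Generalization: keep only the coarsest `keepLevels` parts of an address
// written as "street, district, city" (an assumed format).
function generalizeAddress(address: string, keepLevels = 1): string {
  const parts = address
    .split(",")
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
  return parts.slice(-Math.max(1, keepLevels)).join(", ");
}

const maskedPhone = maskMiddle("13800008888"); // "138****8888"
const region = generalizeAddress("12 Example Road, Haidian District, Beijing"); // "Beijing"
const alias = pseudonymize("Alice Zhang", "demo-salt");
```

A real deployment would likely swap the hash for a keyed cryptographic function, since an unsalted or weakly hashed pseudonym can be reversed by brute force over a small input space such as phone numbers.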

Section 04

Technical Architecture & Implementation

Technical Architecture

  • TypeScript Stack: Ensures type safety and cross-platform compatibility. Core modules:

    • Generator: Synthesizes various data patterns.
    • Detector: Rule/pattern-based PII identification.
    • Sanitizer: Executes desensitization strategies.
    • Test Suite: Validates data quality and desensitization effectiveness.
  • Extensible Plugin Mechanism:

    • Custom data generation rules for specific business scenarios.
    • Extend PII recognition patterns for industry-specific needs.
    • Integrate external data sources for hybrid generation.
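
One way the Detector, Sanitizer, and plugin mechanism described above might fit together is sketched below. Every name here (`PiiMatch`, `PiiDetector`, `regexDetector`, `Pipeline`, `use`) is hypothetical and chosen for illustration; the source does not document the project's real interfaces, and the two regular expressions are deliberately simple examples rather than production-grade patterns.

```typescript
// Hypothetical types; the real Synth-Forge interfaces may differ.
interface PiiMatch {
  kind: string;
  start: number;
  end: number;
  text: string;
}

interface PiiDetector {
  kind: string;
  detect(text: string): PiiMatch[];
}

type Sanitizer = (match: PiiMatch) => string;

// Rule/pattern-based detector built from a regular expression.
function regexDetector(kind: string, pattern: RegExp): PiiDetector {
  const flags = pattern.flags.includes("g") ? pattern.flags : pattern.flags + "g";
  return {
    kind,
    detect(text: string): PiiMatch[] {
      const re = new RegExp(pattern.source, flags); // fresh lastIndex per call
      const matches: PiiMatch[] = [];
      let m: RegExpExecArray | null;
      while ((m = re.exec(text)) !== null) {
        if (m[0].length === 0) {
          re.lastIndex++; // avoid looping forever on empty matches
          continue;
        }
        matches.push({ kind, start: m.index, end: m.index + m[0].length, text: m[0] });
      }
      return matches;
    },
  };
}

class Pipeline {
  private detectors: PiiDetector[] = [];
  private sanitizers = new Map<string, Sanitizer>();

  // Plugin hook: register a detector plus the strategy applied to its matches.
  use(detector: PiiDetector, sanitizer: Sanitizer): this {
    this.detectors.push(detector);
    this.sanitizers.set(detector.kind, sanitizer);
    return this;
  }

  sanitize(text: string): string {
    const matches: PiiMatch[] = [];
    for (const d of this.detectors) matches.push(...d.detect(text));
    matches.sort((a, b) => a.start - b.start);

    let out = "";
    let cursor = 0;
    for (const m of matches) {
      if (m.start < cursor) continue; // skip spans overlapping an earlier match
      const sanitizer = this.sanitizers.get(m.kind);
      out += text.slice(cursor, m.start) + (sanitizer ? sanitizer(m) : "[REDACTED]");
      cursor = m.end;
    }
    return out + text.slice(cursor);
  }
}

const pipeline = new Pipeline()
  .use(regexDetector("email", /[\w.+-]+@[\w-]+(\.[\w-]+)+/), () => "[EMAIL]")
  .use(regexDetector("phone", /\b1\d{10}\b/), (m) => m.text.slice(0, 3) + "****" + m.text.slice(-4));

const sanitized = pipeline.sanitize("Contact alice@example.com or 13800008888.");
// sanitized === "Contact [EMAIL] or 138****8888."
```

Keeping detection and sanitization as separate registrations means an industry-specific pattern can be added with one `use` call without touching the core loop, which is the kind of extension the plugin list describes.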

Section 05

Application Scenarios & Practical Value

Application Scenarios

  1. Agent Development Testing:

    • Generate realistic user queries (common intents + edge cases).
    • Simulate multi-round dialogue to test memory and reasoning.
    • End-to-end testing without real customer data.
  2. RAG System Evaluation:

    • Generate QA pairs with ground truth for retrieval accuracy assessment.
    • Simulate document libraries to test chunking strategies.
    • Adversarial queries to evaluate robustness.
  3. Privacy Compliance Pre-check:

    • Validate desensitization process effectiveness.
    • Assess re-identification risk of de-identified data.
    • Test data analyzability after anonymization.
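
For the re-identification check in the privacy pre-check scenario, one common measure is k-anonymity: the size of the smallest group of rows that share the same quasi-identifier values. The sketch below assumes this metric purely for illustration; the source does not say which risk measure Synth-Forge uses, and the names `Row`, `classSizes`, `kAnonymity`, and `uniqueShare` as well as the sample rows are hypothetical.

```typescript
type Row = Record<string, string | number>;

// Group rows by their quasi-identifier values and count each group.
function classSizes(rows: Row[], quasiIdentifiers: string[]): Map<string, number> {
  const sizes = new Map<string, number>();
  for (const row of rows) {
    const key = JSON.stringify(quasiIdentifiers.map((q) => row[q]));
    sizes.set(key, (sizes.get(key) ?? 0) + 1);
  }
  return sizes;
}

// k is the smallest group size; k = 1 means at least one row is unique
// on the quasi-identifiers and is therefore easy to single out.
function kAnonymity(rows: Row[], quasiIdentifiers: string[]): number {
  if (rows.length === 0) return 0;
  let k = Infinity;
  classSizes(rows, quasiIdentifiers).forEach((size) => {
    if (size < k) k = size;
  });
  return k;
}

// Share of rows that are alone in their group.
function uniqueShare(rows: Row[], quasiIdentifiers: string[]): number {
  if (rows.length === 0) return 0;
  let unique = 0;
  classSizes(rows, quasiIdentifiers).forEach((size) => {
    if (size === 1) unique += 1;
  });
  return unique / rows.length;
}

// Made-up sample data for the example.
const released: Row[] = [
  { region: "Beijing", birthYear: 1990, diagnosis: "A" },
  { region: "Beijing", birthYear: 1990, diagnosis: "B" },
  { region: "Shanghai", birthYear: 1985, diagnosis: "C" },
];
const extended: Row[] = [...released, { region: "Shanghai", birthYear: 1985, diagnosis: "D" }];

const kBefore = kAnonymity(released, ["region", "birthYear"]); // 1: the Shanghai row is unique
const kAfter = kAnonymity(extended, ["region", "birthYear"]); // 2: every group has two rows
```

A low k after desensitization suggests the generalization step is too fine-grained, for example keeping a full birth date where a birth year would do.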

Section 06

Comparison with Existing Solutions

| Feature | Synth-Forge | Cloud API Solutions | Static Test Sets |
|---|---|---|---|
| Data Privacy | Fully local, zero leakage risk | Data transmitted to cloud | Local storage |
| Generation Diversity | Highly configurable | Dependent on model capability | Fixed/limited |
| Cost | One-time development cost | Ongoing API fees | No running cost |
| Offline Availability | Supported | Not supported | Supported |
| Scenario Customization | Flexible extension | Limited by API parameters | Manual maintenance |

Section 07

Future Prospects & Community Contribution

Potential Directions:

  • Multi-modal data support (images, audio, video synthesis/desensitization).
  • Federated learning integration for distributed collaborative data generation.
  • Differential privacy enhancement (mathematically provable privacy protection).
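
To make the "mathematically provable" part of the differential privacy direction concrete, here is a minimal sketch of the standard Laplace mechanism applied to a count query. It is not part of Synth-Forge as described; `laplaceNoise` and `privateCount` are hypothetical names, and the injectable `rng` parameter is an assumption added so the behavior can be checked deterministically.

```typescript
// Draw Laplace(0, scale) noise by inverse-CDF sampling.
function laplaceNoise(scale: number, rng: () => number = Math.random): number {
  const u = rng() - 0.5; // uniform in [-0.5, 0.5)
  const tail = Math.max(1 - 2 * Math.abs(u), Number.MIN_VALUE); // avoid log(0)
  return -scale * Math.sign(u) * Math.log(tail);
}

// Release a count with epsilon-differential privacy.
function privateCount(
  trueCount: number,
  epsilon: number,
  rng: () => number = Math.random
): number {
  if (epsilon <= 0) throw new RangeError("epsilon must be positive");
  const sensitivity = 1; // adding or removing one record changes a count by at most 1
  return trueCount + laplaceNoise(sensitivity / epsilon, rng);
}

// Smaller epsilon means a larger noise scale and stronger privacy.
const noisyCount = privateCount(100, 0.5);
```

This sketch ignores known floating-point weaknesses of naive Laplace sampling, so it illustrates the idea rather than serving as a hardened implementation.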

How to Contribute:

  • Expand PII recognition patterns for more languages/regions.
  • Optimize data generation authenticity and diversity.
  • Develop visual configuration interfaces.

Section 08

Conclusion

Synth-Forge represents an important trend in agent development tools: balancing large model capabilities with local data processing to ensure privacy and efficiency. For enterprise teams building agent applications, it provides a secure, controllable test data solution, making it a valuable addition to the tech stack.