Zing Forum

Research on Synthetic Tabular Data Generation Based on Fine-Tuning of Large Language Models

A master's thesis project at ITMO University exploring methods and strategies for generating high-quality synthetic tabular data using fine-tuning techniques for large language models.

Tags: synthetic data, large language models, tabular data, fine-tuning, data privacy, generative models, ITMO
Published 2026-05-20 23:44 | Recent activity 2026-05-20 23:51 | Estimated read 6 min

Section 01

Main Floor | Introduction to Research on Synthetic Tabular Data Generation Based on LLM Fine-Tuning

This master's thesis project at ITMO University explores methods and strategies for generating high-quality synthetic tabular data by fine-tuning large language models (LLMs). It targets the data-scarcity bottleneck in machine learning, along with the obstacles to acquiring real data, such as privacy regulations and high annotation costs. The core idea is to serialize tabular data into text formats (e.g., JSON or CSV) so that the sequence modeling capabilities of LLMs transfer to structured data generation. The project also explores effective fine-tuning strategies and a multi-dimensional evaluation framework.

Section 02

Research Background | Necessity of Synthetic Tabular Data and Limitations of Traditional Methods

Tabular data is a core data form in fields such as finance, healthcare, and e-commerce. However, acquiring real data faces obstacles such as privacy regulations (e.g., GDPR), high annotation costs, too few samples of rare events, and barriers to cross-organizational sharing. Synthetic data generation produces artificial records that share the statistical characteristics of the real data but contain no information about real individuals. Traditional methods, such as statistical models (e.g., Gaussian mixture models) and GANs, have limitations in capturing complex cross-feature dependencies; LLMs open up new possibilities for synthetic data generation.

Section 03

Core Insight | Logic of LLM Adaptation for Tabular Data Generation

Although LLMs are designed primarily for text, tabular data can be serialized into text formats (JSON or CSV), so their sequence modeling capabilities carry over to structured data generation. The advantages of LLMs include modeling of complex cross-feature dependencies, robust handling of missing values, and broad world knowledge acquired during pre-training. These characteristics enable fine-tuned LLMs to generate semantically plausible synthetic records.
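
The serialization step can be illustrated with a minimal Python sketch. The helper names (`row_to_json`, `row_to_csv`) and the example columns are hypothetical and chosen only for illustration; the thesis's actual serialization format is not specified in this summary.

```python
import csv
import io
import json


def row_to_json(row: dict) -> str:
    """Serialize one table row as a JSON object on a single line."""
    # Missing values (None) become JSON null, so they stay explicit in the text.
    return json.dumps(row, ensure_ascii=False)


def row_to_csv(row: dict, columns: list[str]) -> str:
    """Serialize one table row as a CSV line in a fixed column order."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    # Missing values are written as empty fields.
    writer.writerow(["" if row[c] is None else row[c] for c in columns])
    return buf.getvalue().rstrip("\r\n")


def json_to_row(line: str) -> dict:
    """Parse a generated JSON line back into a row dictionary."""
    return json.loads(line)


# Hypothetical example row with one missing value.
example_row = {"age": 42, "occupation": "nurse", "income": None}
json_line = row_to_json(example_row)
csv_line = row_to_csv(example_row, ["age", "occupation", "income"])
```

JSON keeps column names next to each value, which costs more tokens per row but makes each record self-describing; CSV is more compact but depends on a fixed column order.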

Section 04

Technical Challenges and Exploration of Effective Fine-Tuning Strategies

Adapting general-purpose LLMs to tabular generation raises challenges such as format consistency (output must comply with the table schema), statistical fidelity (consistent marginal and joint distributions, together with differential privacy), conditional generation capability, and rare event generation. The fine-tuning strategies explored in the research include parameter-efficient fine-tuning (PEFT methods such as LoRA and adapters), instruction fine-tuning (instruction templates that convey semantic constraints), mixed training (real data combined with synthetic data from simple baselines), and reinforcement learning optimization (an RLHF-style framework that uses statistical similarity as the reward).
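
Two of these points, instruction templates for conditional generation and schema compliance of the output, can be sketched as follows. This is a minimal illustration under stated assumptions: the `SCHEMA` dictionary, the prompt wording, and the function names are hypothetical and are not taken from the thesis.

```python
import json

# Hypothetical schema: column name -> expected type and simple constraints.
SCHEMA = {
    "age": {"type": int, "min": 0, "max": 120},
    "occupation": {"type": str, "choices": {"nurse", "teacher", "engineer"}},
    "income": {"type": float, "min": 0.0},
}


def build_prompt(conditions: dict) -> str:
    """Build an instruction prompt that asks for one record under constraints."""
    columns = ", ".join(SCHEMA)
    constraints = ", ".join(f"{k}={v!r}" for k, v in conditions.items()) or "none"
    return (
        f"Generate one record as a JSON object with the columns: {columns}.\n"
        f"Constraints: {constraints}.\n"
        "Record: "
    )


def validate_record(text: str) -> tuple[bool, str]:
    """Check that generated text parses as JSON and complies with SCHEMA."""
    try:
        record = json.loads(text)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    if not isinstance(record, dict):
        return False, "not a JSON object"
    if set(record) != set(SCHEMA):
        return False, "columns do not match schema"
    for name, rule in SCHEMA.items():
        value = record[name]
        if value is None:
            continue  # assumption: missing values are allowed in every column
        if isinstance(value, bool):
            return False, f"{name}: wrong type"
        # JSON does not distinguish 51000 from 51000.0; accept ints for floats.
        if rule["type"] is float and isinstance(value, int):
            value = float(value)
        if not isinstance(value, rule["type"]):
            return False, f"{name}: wrong type"
        if "min" in rule and value < rule["min"]:
            return False, f"{name}: below minimum"
        if "max" in rule and value > rule["max"]:
            return False, f"{name}: above maximum"
        if "choices" in rule and value not in rule["choices"]:
            return False, f"{name}: unknown category"
    return True, "ok"
```

A validator like this could be used to filter generated samples or to measure how often a fine-tuned model produces schema-compliant rows; whether the thesis does either is not stated here.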

Section 05

Evaluation Framework | Multi-Dimensional Metrics for Synthetic Data Quality

Synthetic data is evaluated along four dimensions: statistical similarity (KL divergence between column distributions, Frobenius distance between correlation matrices), downstream task utility (models trained on synthetic data are compared on real test sets), privacy protection strength (audits using membership inference and attribute inference attacks), and diversity (how much of the real data's diversity is covered).
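
The two statistical similarity metrics can be sketched in plain Python. This is an illustrative sketch, not the thesis's evaluation code: the function names are hypothetical, the KL divergence is computed for a categorical column with a small smoothing constant `eps` (a numeric column would first need binning), and the correlation is assumed to be Pearson correlation over non-constant numeric columns.

```python
import math
from collections import Counter


def kl_divergence(real: list, synth: list, eps: float = 1e-9) -> float:
    """KL(real || synth) for one categorical column, with additive smoothing."""
    categories = set(real) | set(synth)
    real_counts, synth_counts = Counter(real), Counter(synth)
    total = 0.0
    for c in categories:
        # Smoothing avoids log(0) when a category is absent from one dataset.
        p = (real_counts[c] + eps) / (len(real) + eps * len(categories))
        q = (synth_counts[c] + eps) / (len(synth) + eps * len(categories))
        total += p * math.log(p / q)
    return total


def pearson(x: list, y: list) -> float:
    """Pearson correlation of two equal-length, non-constant numeric columns."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def correlation_matrix(columns: list) -> list:
    """Pairwise Pearson correlations for a list of numeric columns."""
    return [[pearson(a, b) for b in columns] for a in columns]


def frobenius_distance(m1: list, m2: list) -> float:
    """Frobenius norm of the element-wise difference of two matrices."""
    return math.sqrt(
        sum((a - b) ** 2 for r1, r2 in zip(m1, m2) for a, b in zip(r1, r2))
    )


# Toy data: the synthetic table reverses the relationship between two columns.
real_columns = [[1, 2, 3, 4], [2, 4, 6, 8]]
synth_columns = [[1, 2, 3, 4], [8, 6, 4, 2]]
correlation_gap = frobenius_distance(
    correlation_matrix(real_columns), correlation_matrix(synth_columns)
)
```

In this toy case each column's marginal distribution is preserved, yet the correlation gap is large, which is why per-column metrics alone are not enough to judge fidelity.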

Section 06

Application Prospects | Industry Value of Synthetic Tabular Data

Synthetic tabular data has transformative potential in multiple fields: medical research (de-identified patient records that protect privacy), financial risk control (synthesized rare fraud cases that improve detection), software testing (test data with realistic statistical characteristics that increases coverage), and data sharing (enterprises can share synthetic data with partners without exposing sensitive information).

Section 07

Research Limitations and Future Directions

Current limitations include high computational cost, difficulty handling tables with complex schemas (e.g., multi-table relational databases), and limited interpretability of the generated data. Future directions may include multi-modal synthesis (text plus tables), causality-preserving synthesis methods, and domain-specific pre-trained models.