End-to-End Automatic Data Integration Using Large Language Models: A Research Breakthrough from the University of Mannheim

This article introduces how the research team at the University of Mannheim leverages large language models (LLMs) to achieve fully automatic end-to-end data integration, covering technical innovations and practical applications in core areas such as schema matching, entity resolution, and data fusion.

Data Integration · Large Language Models · Schema Matching · Entity Resolution · Data Fusion · University of Mannheim · Data Engineering · Automation
Published 2026-05-06 22:46 · Recent activity 2026-05-06 23:20 · Estimated read 8 min

Section 01

[Introduction] Research Breakthrough: University of Mannheim Uses LLMs for End-to-End Automatic Data Integration

Data integration is a key bottleneck in data engineering: traditional methods rely heavily on manual work and are prone to errors. The research team at the University of Mannheim proposes using large language models (LLMs) to achieve end-to-end automatic data integration, covering three core stages: schema matching, entity resolution, and data fusion. Through a unified framework design and in-context learning, their approach outperforms traditional methods in experiments and has been applied in scenarios such as retail, healthcare, and scientific research, opening up new directions for the data engineering field.


Section 02

Core Challenges of Data Integration and Limitations of Traditional Methods

Data integration faces three core challenges:

  1. Schema matching: identifying semantically equivalent fields across different data sources (e.g., "customer_name" vs. "client_full_name");
  2. Entity resolution: determining whether different records refer to the same entity (e.g., "John Smith" vs. "J. Smith");
  3. Data fusion: resolving conflicting information about the same entity (e.g., age 30 vs. 32).

Traditional methods handle these steps independently, requiring specialized algorithms and large amounts of manually labeled data, which leads to error propagation and low efficiency.
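The three challenges can be illustrated with a toy pair of records. The field names and values below are hypothetical, echoing the examples above:

```python
# Hypothetical records from two sources, illustrating the three challenges.
source_a = {"customer_name": "John Smith", "age": 30}
source_b = {"client_full_name": "J. Smith", "age": 32}

# 1. Schema matching: map semantically equivalent fields across sources.
schema_mapping = {"client_full_name": "customer_name"}

# 2. Entity resolution: a naive exact-string check fails here, even though
#    both records most likely refer to the same person.
same_entity_naive = source_a["customer_name"] == source_b["client_full_name"]
print(same_entity_naive)  # False -- exact matching misses "J. Smith"

# 3. Data fusion: the ages conflict (30 vs. 32); some rule must pick one,
#    e.g. preferring the more recently updated source.
```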


Section 03

Design of an LLM-Driven End-to-End Data Integration Framework

The research team transforms data integration into a sequence-to-sequence generation task and designs a unified prompt engineering framework:

  • Schema matching prompt: Given fields from source and target tables, identify equivalent field pairs and explain the reasons;
  • Entity resolution prompt: Analyze whether two records refer to the same entity and provide confidence scores;
  • Data fusion prompt: Select the most appropriate value for conflicting attributes and explain the reasoning.

Through in-context learning (few-shot examples), LLMs can outperform traditional models without specialized training. The end-to-end advantages include reduced error propagation, cross-step knowledge sharing, and flexible adaptation to new scenarios.
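The prompt pattern described above might look like the following minimal sketch. The prompt wording, the few-shot examples, and the field names are illustrative assumptions, not the team's actual prompts:

```python
# Illustrative sketch of a schema-matching prompt with few-shot examples.
# The wording and example pairs below are assumptions for demonstration.

FEW_SHOT = (
    "Field A: cust_id | Field B: customer_number -> MATCH "
    "(both identify the customer)\n"
    "Field A: order_date | Field B: ship_addr -> NO MATCH "
    "(a date vs. an address)"
)

def schema_matching_prompt(field_a: str, field_b: str) -> str:
    """Build an in-context-learning prompt for one candidate field pair."""
    return (
        "Decide whether the two fields are semantically equivalent "
        "and explain why.\n"
        f"Examples:\n{FEW_SHOT}\n"
        f"Field A: {field_a} | Field B: {field_b} ->"
    )

prompt = schema_matching_prompt("customer_name", "client_full_name")
print(prompt)
```

The same template-plus-examples structure carries over to the entity-resolution and data-fusion prompts, with record pairs or conflicting values substituted for the field pair.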

Section 04

Experimental Evaluation and Results: Performance of the LLM Approach

In evaluations on public datasets and real-world scenarios:

  • Schema matching: F1 score reaches 0.85+, which is 15% higher than traditional methods in cross-domain scenarios;
  • Entity resolution: More robust than traditional methods in noisy/incomplete record scenarios;
  • End-to-end integration: Data quality metrics (accuracy, completeness, consistency) improve by 20-30% compared to traditional pipelines.

In terms of efficiency, LLM invocation costs are relatively high, but the approach saves manual annotation and development time. Through model distillation, inference costs can be reduced by 70% while retaining 90% of the performance.

Section 05

Real-World Business Application Cases of LLM Data Integration Technology

  1. Retail customer data integration: After a cross-border retail merger, the customer databases of 5 subsidiaries were integrated in two weeks (traditional methods take about 6 months);
  2. Healthcare data fusion: In collaboration with hospitals, patient records from different departments were integrated, with accuracy and interpretability meeting compliance requirements;
  3. Scientific research data warehouse construction: Experimental data from over 50 global institutions was integrated, successfully handling heterogeneous terminology and coding systems.

Section 06

Current Technical Limitations and Future Improvement Directions

Existing challenges and countermeasures:

  • Scalability: High costs when processing ultra-large-scale data; will explore batch processing optimization, hierarchical filtering, and active learning;
  • Privacy and security: Risks of sensitive data; will adopt local deployment, data desensitization, and federated learning;
  • Hallucination issue: LLMs may generate incorrect matches; will mitigate this through confidence calibration, consistency checks, and human-machine collaboration.
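As one hedged sketch of the consistency-check idea: query the model several times (for example, with different few-shot orderings) and accept a match only when a clear majority of answers agree, deferring to a human reviewer otherwise. The threshold and labels below are assumptions:

```python
from collections import Counter
from typing import Optional

def consistent_decision(answers: list[str], threshold: float = 0.8) -> Optional[str]:
    """Return the majority answer if its share meets the threshold,
    else None (None signals 'defer to a human reviewer')."""
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count / len(answers) >= threshold else None

print(consistent_decision(["MATCH"] * 5))             # MATCH
print(consistent_decision(["MATCH", "MATCH", "NO"]))  # None -> human review
```

Combining such a vote with calibrated confidence scores keeps uncertain matches out of the automated pipeline and routes them to human-machine collaboration instead.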

Section 07

Transformative Significance of LLM Data Integration Technology for the Data Engineering Field

  1. Lowering technical barriers: Natural language interfaces allow business staff to participate in data integration;
  2. Accelerating project delivery: Cycle times shrink from months or years to weeks or days, improving data agility;
  3. Promoting data democratization: Small and medium-sized enterprises and research institutions can also efficiently integrate multi-source data, unlocking data value.

Section 08

Conclusion: Future Outlook of LLM-Driven Data Integration

The research from the University of Mannheim demonstrates the potential of LLMs in structured data processing and redefines the data integration task. Although the approach is still at an early stage, as LLM technology advances, end-to-end automatic data integration is expected to become standard practice. Data engineers and decision-makers are advised to follow this technology closely; early adopters will gain a competitive edge in data-driven competition.