Zing Forum

DataStew: An LLM Embedding-Based Intelligent Data Harmonization Python Library

An open-source Python library that uses large language model (LLM) vector embeddings to match medical data dictionaries and harmonize terminology, with persistent storage in PostgreSQL and t-SNE visualization.

Tags: data harmonization, LLM embeddings, medical data, terminology matching, Python library, PostgreSQL, vector search, biomedical informatics
Published 2026-04-07 21:16 | Recent activity 2026-04-07 21:21 | Estimated read 6 min

Section 01

DataStew: An Open-Source Python Library for LLM Embedding-Based Intelligent Medical Data Harmonization

DataStew is an open-source Python library developed by the SCAI-BIO team. It targets a core problem of data harmonization in medical informatics: the same concepts are named and described differently across data sources. DataStew uses LLM vector embeddings to match terms by meaning rather than by exact wording. It supports matching of Excel/CSV data dictionaries, persistent storage in PostgreSQL (via pgvector), and t-SNE visualization of embeddings, which makes it suitable for scenarios such as integrating multi-center clinical studies and aligning terminology systems.

Section 02

Project Background: The Challenge of Biomedical Data Heterogeneity

Biomedical data is heterogeneous at several levels: the same concept may be described in different ways (e.g., "Diabetes mellitus" vs. "糖尿病", the Chinese term for diabetes), and variable naming conventions differ across datasets. Traditional rule-based or dictionary-based matching compares surface strings, so it struggles when two descriptions mean the same thing but share few words. DataStew's core insight is that LLM embeddings capture the meaning of a text, so semantically similar terms end up close together in vector space and can be matched by measuring that closeness.
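
To make "close together in vector space" concrete, here is a minimal sketch of cosine similarity, a common way to compare embeddings. The three-dimensional vectors are hand-made toy values chosen for illustration, not output from MPNet or any other model, and the names `cosine_similarity` and `toy_embeddings` are assumptions of this sketch rather than part of DataStew.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made toy vectors (assumption): real embeddings have hundreds of
# dimensions and come from a model, not from a hard-coded table.
toy_embeddings = {
    "Diabetes mellitus": [0.9, 0.1, 0.0],
    "糖尿病": [0.8, 0.2, 0.1],
    "Systolic blood pressure": [0.1, 0.9, 0.3],
}

query_term = "Diabetes mellitus"
query_vec = toy_embeddings[query_term]

# Score every other term against the query and pick the closest one.
scores = {
    term: cosine_similarity(query_vec, vec)
    for term, vec in toy_embeddings.items()
    if term != query_term
}
best_match = max(scores, key=scores.get)

for term, score in scores.items():
    print(f"{query_term!r} vs {term!r}: {score:.3f}")
print("closest term:", best_match)
```

With these toy values the two diabetes terms score far higher than the unrelated blood pressure term, which is the behaviour a good embedding model is expected to produce for real text.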

Section 03

Core Features: Intelligent Matching, Persistence, and Visualization

  1. Intelligent Matching: Supports Excel/CSV data dictionary matching with the workflow: load source data → perform matching → obtain mapping results. Uses the local MPNet model by default, and can also integrate with the OpenAI Embedding API.
  2. PostgreSQL Persistence: Implements vector storage and search via pgvector, supporting management and querying of terminology systems, concepts, and mappings.
  3. Visualization: Integrates t-SNE dimensionality reduction, which can project embedding vectors into a 2D space to assist in analyzing semantic clustering and outliers.
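
The load, match, and map workflow from the first item can be sketched in plain Python. This is an illustration under stated assumptions, not DataStew's API: `toy_embed`, `load_dictionary`, and `match` are hypothetical names, the CSV column names are invented, and character trigram counts stand in for a real embedding model so that the example runs without downloads. Trigram counts only capture surface overlap, which is exactly the limitation a model such as MPNet is meant to overcome.

```python
import csv
import io
import math
from collections import Counter


def toy_embed(text):
    """Stand-in embedder (assumption): sparse vector of character trigram counts."""
    normalized = f"  {text.lower()}  "
    return Counter(normalized[i:i + 3] for i in range(len(normalized) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def load_dictionary(csv_text):
    """Step 1: read a data dictionary with 'variable' and 'description' columns."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def match(source_rows, target_rows):
    """Steps 2 and 3: map each source variable to its closest target variable."""
    target_vecs = [(row["variable"], toy_embed(row["description"])) for row in target_rows]
    mapping = {}
    for row in source_rows:
        vec = toy_embed(row["description"])
        best_name, best_vec = max(target_vecs, key=lambda t: cosine(vec, t[1]))
        mapping[row["variable"]] = (best_name, round(cosine(vec, best_vec), 3))
    return mapping


# Two small, invented data dictionaries from different "sites".
SOURCE_CSV = """variable,description
AGE_BL,Age of participant at baseline
SBP,Systolic blood pressure in mmHg
"""
TARGET_CSV = """variable,description
sys_bp,Blood pressure (systolic)
age,Participant age at baseline visit
"""

mapping = match(load_dictionary(SOURCE_CSV), load_dictionary(TARGET_CSV))
for source_var, (target_var, score) in mapping.items():
    print(f"{source_var} -> {target_var} (similarity {score})")
```

In a real setup the comparison step would typically be pushed into the database; with pgvector, nearest-neighbour lookups are expressed as an `ORDER BY` on a vector distance operator instead of the Python loop shown here.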

Section 04

Design Philosophy: Clear Separation and Usability

DataStew follows the principle of separation of concerns. Its core modules are embedding (converting text to vectors), harmonization (matching algorithms), io.source (data import), repository (persistence), and visualisation (plotting). The project also ships example scripts to help users get started quickly.

Section 05

Application Scenarios: Multi-Dimensional Solutions for Biomedical Data Problems

Applicable to:

  • Multi-center clinical research: Automatically identifies equivalent variables, reducing manual mapping effort;
  • Terminology system alignment: Captures synonym/near-synonym relationships, facilitating mapping between local terminology and standard systems (e.g., SNOMED CT);
  • Data quality auditing: Discovers abnormal data points through visualization;
  • Legacy system migration: Identifies correspondences between old databases and modern terminology.

Section 06

Technology Selection: Pragmatic Engineering Trade-offs

MPNet is the default model because it performs well on semantic similarity tasks and runs locally, so no external service is required. The Vectorizer abstraction layer allows switching to commercial models. PostgreSQL with pgvector is chosen over a dedicated vector database to reduce operational complexity and to fit more easily into existing tech stacks.
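
The value of such an abstraction layer is that downstream code depends on an interface rather than on a specific model. The sketch below illustrates the pattern only; `BaseVectorizer`, `LocalHashVectorizer`, `RemoteApiVectorizer`, and `embed_terms` are hypothetical names, not DataStew's actual classes. The "local" backend returns a hash-derived placeholder vector with no semantic meaning, and the "remote" backend takes an injected callable so that no network call is made.

```python
import hashlib
from abc import ABC, abstractmethod


class BaseVectorizer(ABC):
    """Hypothetical interface: anything that turns text into a vector."""

    @abstractmethod
    def embed(self, text: str) -> list[float]:
        ...


class LocalHashVectorizer(BaseVectorizer):
    """Offline placeholder for a local model. The vector is derived from a
    SHA-256 hash, so it is deterministic but carries no semantics."""

    def __init__(self, dim: int = 8):
        if not 1 <= dim <= 32:  # a SHA-256 digest has 32 bytes
            raise ValueError("dim must be between 1 and 32")
        self.dim = dim

    def embed(self, text: str) -> list[float]:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [byte / 255 for byte in digest[: self.dim]]


class RemoteApiVectorizer(BaseVectorizer):
    """Placeholder for a hosted embedding API. The client is any callable
    mapping str -> sequence of floats, injected by the caller."""

    def __init__(self, client):
        self.client = client

    def embed(self, text: str) -> list[float]:
        return list(self.client(text))


def embed_terms(terms, vectorizer: BaseVectorizer):
    """Downstream code sees only the interface, never the backend."""
    return {term: vectorizer.embed(term) for term in terms}


terms = ["Diabetes mellitus"]
local_vectors = embed_terms(terms, LocalHashVectorizer(dim=8))
# A fake client stands in for a real API call (assumption).
remote_vectors = embed_terms(terms, RemoteApiVectorizer(lambda text: [0.0] * 8))
```

Swapping backends then changes one constructor call while the matching and storage code stays the same, which is the trade-off the paragraph above describes.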

Section 07

Community Support: Academic Background and Open-Source Practices

DataStew is maintained by SCAI-BIO, the bioinformatics group at the Fraunhofer Institute for Algorithms and Scientific Computing (SCAI). This institutional backing improves the prospects for long-term maintenance and brings domain expertise to the project. The project follows common open-source practices, including continuous integration testing, code coverage monitoring, and versioned releases.

Section 08

Conclusion: A Model of LLM Vertical Application

DataStew is a practical example of applying LLM technology to a vertical domain, focused on the pain points of medical data harmonization. For biomedical practitioners it is a working tool; for developers it is a reference for turning LLM embeddings into a usable product. Its small, modular design and pragmatic feature set keep the barrier to adoption low.