Section 01
DataStew: An Open-Source Python Library for LLM Embedding-Based Intelligent Medical Data Harmonization
DataStew is an open-source Python library developed by the SCAI-BIO team. It targets a core challenge of data harmonization in medical informatics: terminology heterogeneity, where data from different sources describe the same concept in different terms. DataStew represents terms as LLM-generated embedding vectors and matches them by semantic similarity rather than by exact wording. It supports matching data dictionaries in Excel or CSV format, persistent storage in PostgreSQL (through pgvector integration), and t-SNE visualization of embeddings, among other features. This makes it suitable for scenarios such as integrating multi-center clinical research data and aligning terminology systems.
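To make the idea of embedding-based matching concrete, the following is a minimal sketch of the general technique, not DataStew's actual API. All names in it (`toy_embed`, `cosine`, `match_variables`) are hypothetical. The embedding function is a deliberately simple stand-in built from hashed character trigrams. A real pipeline would call an LLM embedding model instead, which would also capture synonyms that share no characters.

```python
import hashlib
import math
from collections import Counter


def toy_embed(text: str, dim: int = 1024) -> list[float]:
    """ASSUMPTION: stand-in for an LLM embedding model.

    Builds an L2-normalized vector of hashed character-trigram counts.
    It only captures surface overlap, not true semantics.
    """
    normalized = " ".join(text.lower().split())
    trigrams = [normalized[i:i + 3] for i in range(max(len(normalized) - 2, 1))]
    vec = [0.0] * dim
    for gram, count in Counter(trigrams).items():
        # Use a stable hash so results are reproducible across runs.
        digest = hashlib.sha256(gram.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += count
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; the inputs are already unit length, so this is a dot product."""
    return sum(x * y for x, y in zip(a, b))


def match_variables(source: list[str], target: list[str]) -> dict[str, tuple[str, float]]:
    """For each source description, return the most similar target description and its score."""
    target_vecs = [(t, toy_embed(t)) for t in target]
    matches = {}
    for s in source:
        s_vec = toy_embed(s)
        matches[s] = max(
            ((t, cosine(s_vec, v)) for t, v in target_vecs),
            key=lambda pair: pair[1],
        )
    return matches


if __name__ == "__main__":
    # Hypothetical data dictionary entries from two studies.
    study_a = ["systolic blood pressure", "Date of birth"]
    study_b = ["Blood pressure, systolic (mmHg)", "Body mass index", "Date of birth"]
    for src, (tgt, score) in match_variables(study_a, study_b).items():
        print(f"{src!r} -> {tgt!r} (similarity {score:.2f})")
```

The structure stays the same when the toy embedding is replaced by a real model: embed both dictionaries, compare vectors by cosine similarity, and keep the top-scoring candidate for each source term. In practice, low-scoring matches would typically be flagged for manual review rather than accepted automatically.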