Zing Forum


BERT-Knowledge-Based-Systems: Ensemble Selection of Large Language Models and Text Embedding Optimization Using Fuzzy Set Methods

A complete workflow for building and optimizing domain-specific text embeddings that automatically selects the optimal subset of large language models via genetic algorithms, improving retrieval accuracy for professional scientific literature.

Text embedding · Large language models · Ensemble learning · Genetic algorithms · Fuzzy set theory · Semantic retrieval · Scientific literature · Domain adaptation
Published 2026-04-20 00:44 · Recent activity 2026-04-20 00:50 · Estimated read 5 min

Section 01

[Main Thread Guide] BERT-Knowledge-Based-Systems: An Ensemble Solution for Domain Text Embedding Optimization

This project addresses the limitations of single pre-trained models in professional scientific literature retrieval, proposing an ensemble selection scheme for large language models based on fuzzy set methods and genetic algorithms. It improves semantic retrieval accuracy by automatically selecting the optimal model subset. The core innovation lies in recasting model selection as a combinatorial optimization problem, designing a complete three-stage workflow (data processing → embedding training → ensemble optimization), and open-sourcing the code and model weights, providing a new framework for domain-adaptive text embeddings.
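The three-stage workflow can be sketched as a simple pipeline. All function names and details below are illustrative assumptions, not taken from the project's actual code; the training and optimization stages are stubbed out to show only the overall structure:

```python
import itertools

def process_data(papers):
    """Stage 1 (sketch): clean raw papers and split them into fixed-size training chunks."""
    chunks = []
    for paper in papers:
        text = " ".join(paper.split())  # collapse newlines and extra whitespace
        chunks.extend(text[i:i + 512] for i in range(0, len(text), 512))
    return chunks

def train_embeddings(chunks, model_names):
    """Stage 2 (stub): domain-adaptive pre-training + contrastive learning.
    Here we only return placeholder identifiers for the fine-tuned models."""
    return {name: f"{name}-finetuned" for name in model_names}

def optimize_ensemble(trained, eval_fn):
    """Stage 3 (stub): select the best-scoring model subset. This brute-force
    enumeration stands in for the genetic algorithm used in the real stage."""
    names = list(trained)
    return max(
        (subset for r in range(1, len(names) + 1)
         for subset in itertools.combinations(names, r)),
        key=eval_fn,
    )
```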


Section 02

Research Background: Limitations of Single Models and Opportunities in Ensemble Learning

In the field of semantic retrieval, traditional single pre-trained models struggle to cover all domain tasks. In the retrieval of professional scientific literature such as medicine and physics in particular, general-purpose models cannot accurately capture domain-specific terms and conceptual relationships. While ensemble learning can combine the strengths of multiple models, it faces challenges such as how to select the optimal subset and how to determine weights. This project was created to address these problems.


Section 03

Core Methods: Combinatorial Optimization + Fuzzy Sets + Genetic Algorithms

The project casts model ensemble selection as a combinatorial optimization problem:

1. Fuzzy set scoring mechanism: maps model similarity scores to a degree of "correct matching" via membership functions, quantifying uncertainty;
2. Genetic algorithm: encodes model subsets as binary strings and efficiently searches the exponentially large subset space through selection, crossover, and mutation operations;
3. Three-stage workflow: data processing (cleaning scientific papers into training chunks), embedding training (domain-adaptive pre-training + contrastive learning), and ensemble optimization (selecting the optimal subset via the genetic algorithm).
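The fuzzy scoring and genetic search described above can be sketched as follows. The membership thresholds, genetic operators, and hyperparameters here are illustrative assumptions, not the project's actual implementation:

```python
import random

def membership(similarity, low=0.3, high=0.8):
    """Piecewise-linear membership function: maps a similarity score to a
    degree of 'correct matching' in [0, 1] (thresholds are illustrative)."""
    if similarity <= low:
        return 0.0
    if similarity >= high:
        return 1.0
    return (similarity - low) / (high - low)

def fitness(mask, sim_matrix):
    """Score a binary model-subset mask by the average fuzzy membership of the
    ensemble's averaged similarities. sim_matrix[m][q] is model m's similarity
    score for query q."""
    chosen = [sims for bit, sims in zip(mask, sim_matrix) if bit]
    if not chosen:
        return 0.0
    n_queries = len(chosen[0])
    avg = [sum(s[q] for s in chosen) / len(chosen) for q in range(n_queries)]
    return sum(membership(a) for a in avg) / n_queries

def genetic_search(sim_matrix, pop_size=20, generations=50, p_mut=0.1, seed=0):
    """Binary-encoded GA: elitist selection of the top half, one-point
    crossover, and bit-flip mutation over model-subset masks."""
    rng = random.Random(seed)
    n = len(sim_matrix)
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda m: fitness(m, sim_matrix), reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)           # one-point crossover
            child = a[:cut] + b[cut:]
            child = [bit ^ (rng.random() < p_mut) for bit in child]  # mutation
            children.append(child)
        pop = parents + children
    return max(pop, key=lambda m: fitness(m, sim_matrix))
```

With three candidate models where one consistently scores poorly, the search converges to a mask that excludes the weak model, which is the behavior the ensemble-selection stage relies on.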


Section 04

Experimental Validation: Performance Improvement in Scientific Literature Retrieval

Experiments on multi-domain scientific literature datasets (computer science, physics, life sciences, etc.) show that:

- the optimized model ensemble significantly outperforms single models;
- the selected subset includes models of different architectures (BERT, RoBERTa, etc.), reflecting their complementarity;
- ablation experiments confirm that domain-adaptive pre-training, contrastive learning, and genetic-algorithm ensemble selection are all indispensable and jointly improve performance.
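One common way such a selected subset can be used at retrieval time is to average the L2-normalized embeddings of the chosen models and rank documents by cosine similarity. This is a minimal sketch with plain Python lists standing in for real model outputs; the project's actual combination scheme may differ:

```python
import math

def normalize(vec):
    """L2-normalize an embedding vector (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def ensemble_embed(per_model_embeddings):
    """Combine each model's embedding of the same text by averaging the
    normalized vectors: a simple unweighted ensemble."""
    normed = [normalize(e) for e in per_model_embeddings]
    dim = len(normed[0])
    return normalize([sum(v[i] for v in normed) / len(normed) for i in range(dim)])

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))
```

Documents would then be ranked by `cosine(ensemble_embed(query_embs), doc_emb)`, letting complementary models (e.g. BERT- and RoBERTa-based) each contribute to the final similarity.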


Section 05

Application Scenarios and Future Directions

Application scenarios: domain-specific search engines (law, medicine, finance), embedding model evaluation, and high-reliability NLP systems. Future directions: exploring more efficient optimization algorithms (gradient-based methods, reinforcement learning), extending to multi-modal scenarios, and studying online learning for dynamically updating the ensemble.


Section 06

Open-Source Contributions and Community Value

The project open-sources the complete code (training, evaluation, and embedding-generation modules) along with Hugging Face model weights, lowering the barrier to adoption. It offers a new perspective on model ensemble selection that can inspire related research, and the repository's clear structure and interactive examples facilitate reuse and secondary development.