# BERT-Knowledge-Based-Systems: Ensemble Selection of Large Language Models and Text Embedding Optimization Using Fuzzy Set Methods

> A complete workflow for building and optimizing domain-specific text embeddings, which automatically selects the optimal subset of large language models via genetic algorithms to improve the accuracy of professional scientific literature retrieval.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Posted: 2026-04-19T16:44:34.000Z
- Last activity: 2026-04-19T16:50:53.144Z
- Popularity: 141.9
- Keywords: text embedding, large language models, ensemble learning, genetic algorithms, fuzzy set theory, semantic retrieval, scientific literature, domain adaptation
- Page URL: https://www.zingnex.cn/en/forum/thread/bert-knowledge-based-systems
- Canonical: https://www.zingnex.cn/forum/thread/bert-knowledge-based-systems
- Markdown source: floors_fallback

---

## [Main Thread Guide] BERT-Knowledge-Based-Systems: An Ensemble Solution for Domain Text Embedding Optimization

This project addresses the limitations of single pre-trained models in professional scientific literature retrieval, proposing an ensemble selection scheme for large language models based on fuzzy set methods and genetic algorithms. It improves semantic retrieval accuracy by automatically selecting the optimal model subset. The core innovation lies in casting model selection as a combinatorial optimization problem, designing a complete three-stage workflow (data processing → embedding training → ensemble optimization), and open-sourcing the code and model weights, providing a new framework for domain-adaptive text embeddings.

## Research Background: Limitations of Single Models and Opportunities in Ensemble Learning

In the field of semantic retrieval, a single traditional pre-trained model struggles to cover all domain tasks. Especially when retrieving professional scientific literature in fields such as medicine and physics, general-purpose models cannot accurately capture domain-specific terms and conceptual relationships. While ensemble learning can combine the strengths of multiple models, it faces two open questions: how to select the optimal subset, and how to determine the weights. This project was created to address exactly these problems.

## Core Methods: Combinatorial Optimization + Fuzzy Sets + Genetic Algorithms

The project transforms model ensemble selection into a combinatorial optimization problem:

1. **Fuzzy set scoring mechanism**: maps model similarity scores to a degree of "correct matching" via membership functions, quantifying the uncertainty of retrieval judgments.
2. **Genetic algorithm**: encodes model subsets as binary strings and efficiently searches the exponentially large subset space through selection, crossover, and mutation operations.
3. **Three-stage workflow**: data processing (cleaning scientific papers into training chunks), embedding training (domain-adaptive pre-training plus contrastive learning), and ensemble optimization (selecting the optimal subset via the genetic algorithm).
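
A minimal sketch of the two core components described above, using synthetic similarity scores in place of real model outputs; every name, membership shape, and GA parameter here is illustrative, not the project's actual API:

```python
import random

random.seed(42)

# Synthetic stand-in for real data: scores[m][q] is the cosine similarity
# that candidate model m assigns to the correct document for query q.
N_MODELS, N_QUERIES = 6, 20
scores = [[random.uniform(0.3, 0.95) for _ in range(N_QUERIES)]
          for _ in range(N_MODELS)]

def membership(sim, low=0.4, high=0.9):
    """Piecewise-linear membership function: the degree to which a
    similarity score counts as a 'correct match' (0 below low, 1 above high)."""
    if sim <= low:
        return 0.0
    if sim >= high:
        return 1.0
    return (sim - low) / (high - low)

def fitness(mask):
    """Average fuzzy 'correct match' degree of an ensemble: for each query,
    average the selected models' similarities, then apply the membership
    function. mask is a binary chromosome over the candidate models."""
    chosen = [i for i, bit in enumerate(mask) if bit]
    if not chosen:
        return 0.0
    total = 0.0
    for q in range(N_QUERIES):
        avg = sum(scores[m][q] for m in chosen) / len(chosen)
        total += membership(avg)
    return total / N_QUERIES

def genetic_search(pop_size=20, generations=40, p_mut=0.1):
    """Binary-encoded GA: truncation selection, one-point crossover,
    bit-flip mutation."""
    pop = [[random.randint(0, 1) for _ in range(N_MODELS)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[:pop_size // 2]            # truncation selection
        children = []
        while len(children) < pop_size - len(survivors):
            a, b = random.sample(survivors, 2)
            cut = random.randint(1, N_MODELS - 1)  # one-point crossover
            child = a[:cut] + b[cut:]
            child = [int(bit ^ (random.random() < p_mut))  # bit-flip mutation
                     for bit in child]
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

best = genetic_search()
print("selected subset:", best, "fitness:", round(fitness(best), 3))
```

With real data, `scores` would come from evaluating each candidate embedding model on a labeled retrieval set, and the fitness function could additionally penalize subset size to trade accuracy against inference cost.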

## Experimental Validation: Performance Improvement in Scientific Literature Retrieval

Experiments on multi-domain scientific literature datasets (computer science, physics, life sciences, etc.) show that the optimized model ensemble significantly outperforms any single model; that the selected subset spans different architectures (BERT, RoBERTa, etc.), reflecting their complementarity; and that ablation studies confirm domain-adaptive pre-training, contrastive learning, and genetic-algorithm ensemble selection are each indispensable, jointly driving the performance gain.
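
The complementarity claim can be illustrated with a toy fusion step: L2-normalize each selected model's embedding, average the normalized vectors, and rank documents by cosine similarity to the fused query vector. The 4-dimensional vectors below are invented for illustration and merely stand in for outputs of BERT- and RoBERTa-style encoders:

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm (guarding against the zero vector)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def fuse(embeddings):
    """Average already-normalized vectors from each selected model,
    then re-normalize the result."""
    dim = len(embeddings[0])
    avg = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    return normalize(avg)

def cosine(a, b):
    """Cosine similarity of unit vectors reduces to a dot product."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical per-model embeddings of one query from two selected models.
query_by_model = [[0.9, 0.1, 0.0, 0.2],   # e.g. a BERT-style encoder
                  [0.7, 0.3, 0.1, 0.0]]   # e.g. a RoBERTa-style encoder

# Each document also gets one embedding per selected model.
docs = {
    "doc_a": [[0.8, 0.2, 0.0, 0.1], [0.6, 0.4, 0.0, 0.1]],
    "doc_b": [[0.1, 0.9, 0.3, 0.0], [0.0, 0.8, 0.4, 0.1]],
}

q = fuse([normalize(v) for v in query_by_model])
ranking = sorted(docs,
                 key=lambda d: cosine(q, fuse([normalize(v) for v in docs[d]])),
                 reverse=True)
print(ranking)  # → ['doc_a', 'doc_b']: doc_a points the same way as the query
```

Normalizing before averaging keeps any one model's embedding magnitude from dominating the fused vector, which matters when mixing architectures whose output scales differ.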

## Application Scenarios and Future Directions

Application scenarios: domain-specific search engines (legal, medical, financial), embedding model evaluation, and high-reliability NLP systems. Future directions: exploring more efficient optimization algorithms (gradient-based or reinforcement learning), extending to multi-modal scenarios, and studying online learning for dynamically updating ensembles.

## Open-Source Contributions and Community Value

The project open-sources the complete code (training, evaluation, and embedding-generation modules) along with Hugging Face model weights, lowering the barrier to use. It also offers a new perspective on model ensemble selection that may inspire related research, and the repository's clear structure and interactive examples make it easy to reuse and extend.
