# Naamah: Building a 100k-level Sanskrit Named Entity Recognition Corpus Using DBpedia Seeds and Hybrid Reasoning Large Models

> The research team launched the Naamah dataset, generating 103,000 high-quality Sanskrit NER sentences via DBpedia entity extraction and a 24B-parameter hybrid reasoning model, while comparing the performance of XLM-RoBERTa and IndicBERTv2.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-29T09:12:57.000Z
- Last activity: 2026-04-30T04:47:22.663Z
- Popularity: 124.4
- Keywords: Sanskrit NER, named entity recognition, DBpedia, hybrid reasoning models, low-resource languages, XLM-RoBERTa, IndicBERTv2, data augmentation, classical language digitization
- Page link: https://www.zingnex.cn/en/forum/thread/naamah-dbpedia
- Canonical: https://www.zingnex.cn/forum/thread/naamah-dbpedia
- Markdown source: floors_fallback

---

## Naamah Dataset: Construction and Value of a 100k-level Sanskrit NER Corpus

The Naamah dataset comprises 103,000 synthetic Sanskrit NER sentences, produced by seeding a 24B-parameter hybrid reasoning model with entities extracted from DBpedia and validated by fine-tuning XLM-RoBERTa and IndicBERTv2. It is currently the largest synthetic Sanskrit NER dataset and offers an innovative path for the digitization of low-resource classical languages.

## Core Bottleneck in Sanskrit Digitization: Lack of High-Quality NER Annotated Corpus

The digitization of classical Sanskrit literature has long been limited by the lack of high-quality NER-annotated corpora. Sanskrit is the core carrier of classical Indian academic, religious, and philosophical literature, so its digitization matters greatly both for humanities research and for cross-lingual knowledge graph construction. However, traditional manual annotation is extremely costly, and general-purpose large language models reason poorly about classical grammar, which makes their automatic annotations unreliable.

## Technical Solution: Innovative Combination of DBpedia Seeds and 24B Hybrid Reasoning Model

### First Phase: DBpedia Entity Seed Extraction
Sanskrit-related entities are extracted from the DBpedia knowledge base to serve as seeds; DBpedia's cross-lingual entity alignment information provides a reliable, verifiable starting point.
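The post does not publish the exact extraction query, so the sketch below only illustrates the idea: a SPARQL query against DBpedia's public endpoint for entities that carry a Sanskrit (`sa`) label. The `dbo:Person` class, the language filter, and the endpoint URL are illustrative assumptions, not the team's actual pipeline.

```python
# Sketch: build a SPARQL query that pulls candidate NER seeds (entities with
# a Sanskrit rdfs:label) from DBpedia. Endpoint and class names are assumed.
import urllib.parse

DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"  # public endpoint (assumption)

def build_seed_query(entity_class: str, limit: int = 100) -> str:
    """SPARQL for entities of `entity_class` labeled in Sanskrit ("sa")."""
    return f"""
    SELECT DISTINCT ?entity ?label WHERE {{
      ?entity a dbo:{entity_class} ;
              rdfs:label ?label .
      FILTER (lang(?label) = "sa")
    }} LIMIT {limit}
    """

def query_url(entity_class: str, limit: int = 100) -> str:
    """Full GET URL for the endpoint, requesting JSON results."""
    params = urllib.parse.urlencode({
        "query": build_seed_query(entity_class, limit),
        "format": "application/sparql-results+json",
    })
    return f"{DBPEDIA_ENDPOINT}?{params}"

# Example: candidate seeds for a PER-like entity type
url = query_url("Person", limit=500)
```

Fetching `url` with any HTTP client returns a JSON result set whose `label` bindings become the entity seeds for the generation phase.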

### Second Phase: Generation with 24B-Parameter Hybrid Reasoning Model
The second phase uses a 24B-parameter hybrid reasoning model, chosen for three key advantages:

- deep understanding of the complex grammar rules of classical Sanskrit
- rich sentence variation while preserving grammatical correctness
- a lower hallucination rate on classical-language text

The entity seeds are fed into the model, which generates syntactically natural and accurately annotated synthetic sentences.
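The post does not show the generation prompt or the model's output format. The sketch below assumes one plausible setup: the model is prompted with a seed entity and returns a sentence with inline `[TYPE ...]` markup, which is then converted to token-level BIO tags for the final corpus. Both the prompt wording and the markup convention are hypothetical.

```python
# Sketch of the seed -> synthetic sentence -> BIO-tags step. Prompt text and
# the inline [TYPE ...] output convention are assumptions for illustration.
import re

def build_prompt(seed_entity: str, entity_type: str) -> str:
    """Hypothetical generation prompt seeded with one DBpedia entity."""
    return (
        f"Write one grammatically correct classical Sanskrit sentence that "
        f"mentions the {entity_type} entity '{seed_entity}'. Mark every "
        f"entity inline as [{entity_type} entity-text]."
    )

MARKUP = re.compile(r"\[(\w+) ([^\]]+)\]")

def to_bio(marked_sentence: str) -> list:
    """Convert inline-marked text to (token, BIO-tag) pairs."""
    pairs, pos = [], 0
    for m in MARKUP.finditer(marked_sentence):
        # tokens outside any entity span get the O tag
        for tok in marked_sentence[pos:m.start()].split():
            pairs.append((tok, "O"))
        ent_tokens = m.group(2).split()
        pairs.append((ent_tokens[0], f"B-{m.group(1)}"))
        for tok in ent_tokens[1:]:
            pairs.append((tok, f"I-{m.group(1)}"))
        pos = m.end()
    for tok in marked_sentence[pos:].split():
        pairs.append((tok, "O"))
    return pairs

# to_bio("[PER rāmaḥ] vanam gacchati")
# -> [("rāmaḥ", "B-PER"), ("vanam", "O"), ("gacchati", "O")]
```

Whatever the real output convention is, some deterministic post-processing of this kind is needed to turn free-form model output into token-aligned NER labels.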

## Model Testing: Performance Comparison Between XLM-RoBERTa and IndicBERTv2

Two Transformer architectures were fine-tuned on the Naamah dataset to verify its quality:

### XLM-RoBERTa
As a benchmark model for cross-lingual transfer, it is pre-trained on 100 languages and can fully leverage the advantages of cross-lingual knowledge transfer.

### IndicBERTv2
Focused on the Indic language family, it features a parameter-efficient design: within its target language family it matches or even surpasses general multilingual models while using fewer parameters.
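One practical detail shared by both models: their subword tokenizers split Sanskrit words into pieces, so the corpus's word-level BIO labels must be realigned to subwords before fine-tuning. A minimal sketch, assuming a `word_ids` mapping of the kind HuggingFace fast tokenizers return (`None` for special tokens):

```python
# Realign word-level BIO labels to subword positions. The word_ids input
# format mirrors HuggingFace fast tokenizers; this is a generic sketch,
# not the Naamah team's published training code.

IGNORE = -100  # loss-masking value recognized by PyTorch's CrossEntropyLoss

def align_labels(word_ids: list, word_labels: list) -> list:
    """The first subword of each word keeps that word's label; continuation
    pieces and special tokens are masked out of the loss."""
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:            # [CLS]/[SEP]/padding
            aligned.append(IGNORE)
        elif wid != prev:          # first subword of a new word
            aligned.append(word_labels[wid])
        else:                      # continuation piece: mask it
            aligned.append(IGNORE)
        prev = wid
    return aligned

# "rāmaḥ vanam" split into 4 pieces plus special tokens:
#   word_ids = [None, 0, 0, 1, 1, None], labels = ["B-PER", "O"]
# -> [-100, "B-PER", -100, "O", -100, -100]
```

Masking continuation pieces keeps the evaluation word-level, so the XLM-RoBERTa and IndicBERTv2 scores remain comparable despite their different vocabularies.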

## Practical Significance and Future Directions: Providing Reference for Low-Resource Classical Language NLP

The Naamah dataset provides an important reference for low-resource classical language NLP research. Its methodology of 'knowledge base seeds + domain-specific large model generation' can be extended to the processing of other classical languages such as Pali and Tibetan. Meanwhile, the development of hybrid reasoning architectures demonstrates the potential of large models in deep understanding tasks for low-resource languages.
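Any comparison of the two fine-tuned models ultimately rests on an entity-level metric. The post does not state its exact evaluation protocol, so the following is a generic seqeval-style sketch of span extraction and entity-level F1 over BIO sequences:

```python
# Entity-level precision/recall/F1 over BIO tag sequences (seqeval-style
# sketch; the evaluation protocol used for Naamah is an assumption here).

def extract_spans(tags: list) -> set:
    """Collect (start, end, type) entity spans from a BIO sequence."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
        boundary = tag.startswith("B-") or tag == "O" or \
            (tag.startswith("I-") and tag[2:] != etype)
        if boundary and start is not None:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            start, etype = i, tag[2:]  # tolerate I- without a preceding B-
    return spans

def entity_f1(gold: list, pred: list) -> float:
    """Micro F1 over exact-match entity spans for one sentence."""
    g, p = extract_spans(gold), extract_spans(pred)
    if not g and not p:
        return 1.0
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

Exact-match span F1 is stricter than token accuracy and is the usual basis for NER model comparisons, so a cross-model claim like "matches or surpasses" should be read against this kind of metric.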

## Core Highlights of the Naamah Dataset Recap

- Currently the largest synthetic Sanskrit NER dataset (103,000 sentences)
- Innovatively combines the DBpedia knowledge base with a 24B-parameter hybrid reasoning model
- Comparative tests verify the dataset's training effect on XLM-RoBERTa and IndicBERTv2
- Provides a reusable technical path for classical language digitization
