Section 01
Naamah Dataset: Construction and Value of a 100k-Scale Sanskrit NER Corpus
The research team introduced the Naamah dataset, which contains 103,000 high-quality Sanskrit NER sentences generated by combining DBpedia entity extraction with a 24B-parameter hybrid reasoning model. The team also compared the performance of XLM-RoBERTa and IndicBERTv2. Naamah is currently the largest synthetic Sanskrit NER dataset, and it offers a new path for digitizing low-resource classical languages.