# New Practice in Endangered Language Preservation: QLoRA Fine-Tuning Scheme for Chakma Machine Translation

> This article introduces The Chakma Project, a data-science master's project at UCL. By building the first Chakma-English word-level translation dataset and using QLoRA to fine-tune LLaMA and Gemma models, it provides a technical example for the digital preservation of endangered languages.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-17T12:44:55.000Z
- Last activity: 2026-04-17T12:51:15.181Z
- Popularity: 150.9
- Keywords: endangered languages, Chakma, machine translation, QLoRA, LLaMA, Gemma, low-resource languages, NLP
- Page URL: https://www.zingnex.cn/en/forum/thread/chakmaqlora
- Canonical: https://www.zingnex.cn/forum/thread/chakmaqlora
- Markdown source: floors_fallback

---

## Introduction: Chakma Machine Translation Project Provides Technical Example for Endangered Language Preservation

Over 40% of the world's languages are at risk of extinction. Chakma, an endangered language of the Eastern Bengal region, lacks digital resources and is unsupported by mainstream machine translation systems. The Chakma Project, a data-science master's project at University College London (UCL), built the first Chakma-English word-level translation dataset and used QLoRA to fine-tune LLaMA 3.1 8B and Gemma 3 4B models, achieving Chakma machine translation for the first time and providing a reference technical path for the digital preservation of endangered languages.

## Background: Digital Dilemma of Chakma Language

Chakma is an Indo-Aryan language with approximately 300,000 speakers, listed as endangered by UNESCO. Its digitization faces three major challenges: (1) scarce training data (almost no Chakma text exists on the internet); (2) lack of standardization (multiple competing Romanization schemes); (3) difficult verification (very few experts read Chakma). Mainstream large language models (such as GPT-4 and Claude) offer only limited support for it, with extremely poor translation quality.

## Methodology: Construction of Chakma-English Translation Dataset

The project manually extracted 20,206 Chakma-English vocabulary pairs from the *Chakma Dictionary*, a paper dictionary published in 1993. The pairs were verified by three native Chakma speakers (Phonebuson Chakma, Pankaj Chakma, Soumik Chakma), forming the `ChakmaBridge Verified Version` dataset. In addition, a sentence-level parallel dataset of roughly 800 pairs, MELD, was built for evaluation only and was not used for training. The extraction process had to resolve problems such as inconsistent Romanization and heterogeneous entry formats.
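One recurring extraction problem is inconsistent Romanization: the same Chakma word may appear with different diacritics, casing, or spacing across dictionary entries. A minimal normalization sketch is shown below; the function name and the specific rules (lowercasing, diacritic stripping, whitespace collapsing) are illustrative assumptions, not the project's actual pipeline.

```python
import re
import unicodedata

def normalize_romanization(word: str) -> str:
    """Illustrative normalization for Romanized Chakma entries:
    lowercase, strip combining diacritics, collapse whitespace.
    The project's actual rules may differ."""
    # Decompose accented characters (e.g. "ā" -> "a" + combining macron)
    decomposed = unicodedata.normalize("NFD", word)
    # Drop combining marks so spelling variants map to one canonical form
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase and collapse internal whitespace
    return re.sub(r"\s+", " ", stripped.lower()).strip()

# Two spelling variants of the same (hypothetical) entry collapse to one key
assert normalize_romanization("Kādā ") == normalize_romanization("kada")
```

Canonicalizing entries this way lets variant spellings be merged or flagged for review before native-speaker verification.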

## Methodology: QLoRA Fine-Tuning Technology and Model Selection

The project chose QLoRA to fine-tune LLaMA 3.1 8B and Gemma 3 4B, for three reasons: both models are open-source and commercially usable, their parameter scale is moderate (runnable on consumer-grade hardware), and both have a good multilingual baseline. QLoRA's advantages: high memory efficiency (4-bit quantization; a single A100 can fine-tune an 8B model), parameter efficiency (only the LoRA adapters are trained), and inference-friendliness. Training configuration: LoRA rank = 16, alpha = 32, dropout = 0.05, 4-bit quantization, learning rate 1e-4, batch size 1, maximum sequence length 256 tokens, 2 training epochs.
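The parameter-efficiency claim is easy to quantify: a LoRA adapter of rank r on a weight matrix of shape (d_out, d_in) adds r·(d_in + d_out) trainable parameters while the original matrix stays frozen. The back-of-envelope check below uses a 4096×4096 projection as an illustrative LLaMA-class dimension; the exact shapes are an assumption, not taken from the article.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one weight matrix:
    a (d_out x r) up-projection plus an (r x d_in) down-projection."""
    return rank * (d_in + d_out)

# One 4096x4096 attention projection (illustrative LLaMA-class size)
full_weights = 4096 * 4096                          # 16,777,216 frozen
adapter = lora_param_count(4096, 4096, rank=16)     # 131,072 trainable
print(f"trainable fraction: {adapter / full_weights:.4%}")  # 0.7812%
```

With under 1% of the weights trainable per adapted matrix, plus the 4-bit quantized frozen base, the whole fine-tune fits on a single GPU.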

## Results: Model Performance Evaluation

Word-level translation quality was evaluated with chrF (character n-gram F-score). LLaMA-QLoRA (14.83) outperformed Gemma-QLoRA (11.10), possibly owing to LLaMA's larger parameter count and stronger multilingual pre-training. Sentence-level evaluation on the MELD dataset, which was not used in training, shows that the model has some generalization ability. Although the absolute scores are low, they are meaningful given the data scarcity, the language's complexity, and the use of an independent test set.
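chrF rewards partial character-level overlap, which suits word-level translation of a morphologically unfamiliar language better than exact-match metrics. The sketch below is a simplified implementation (the standard sacreBLEU version differs in details such as whitespace handling): character n-gram precision and recall are averaged over orders 1..6 and combined with an F-beta score, where beta = 2 weights recall twice as heavily as precision.

```python
from collections import Counter

def chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: character n-gram precision/recall averaged over
    orders 1..max_n, combined into an F-beta score (0-100 scale)."""
    def ngrams(text: str, n: int) -> Counter:
        return Counter(text[i:i + n] for i in range(len(text) - n + 1))

    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = ngrams(hypothesis, n), ngrams(reference, n)
        if not hyp or not ref:
            continue  # string too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)

assert chrf("water", "water") == 100.0
# A near-miss scores far above an unrelated word
assert chrf("stone", "water") < chrf("watery", "water")
```

This explains why even imperfect outputs earn non-zero credit: a hypothesis sharing most characters with the reference still accumulates n-gram overlap.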

## Contributions: Open-Source Resources and Methodological Value

The project's contributions include: (1) the first working Chakma-English translation system; (2) open-source resources: the training dataset Final_Chakma.csv, the validation dataset ChakmaBridge Verified Version.csv, the fine-tuning scripts, the inference script test_adapter.py, and the LoRA adapter weights. Methodological insights: prioritize data quality, apply transfer learning effectively, and involve native speakers at the verification stage.

## Limitations and Future Directions

Current limitations: limited data scale, word-level translation only, one-directional translation (Chakma to English), and reliance on Romanized text. Future directions: expanding the dataset, sentence-level training, bidirectional translation, support for the native Chakma script, and building a complete language-technology stack by incorporating speech technology.
