Celtic-LLM: Exploring Neural Network Representations of Low-Resource Languages via Parameter-Efficient Fine-Tuning

An open-source project investigating whether large language models can learn and distinguish closely related Celtic languages through LoRA fine-tuning, and whether these languages form structured clusters in the embedding space.

Tags: LoRA, QLoRA, Celtic languages, low-resource languages, machine translation, embedding space, parameter-efficient fine-tuning, multilingual models, neural network representation learning
Published 2026-04-28 09:09 · Recent activity 2026-04-28 09:18 · Estimated read: 17 min
Section 01

Celtic-LLM Project Guide: Exploring Neural Network Representations of Low-Resource Celtic Languages

Celtic-LLM is an open-source research project that uses LoRA/QLoRA parameter-efficient fine-tuning to explore whether large language models can learn and distinguish closely related Celtic languages, and whether these languages form structured clusters in the embedding space. The project's core goals are to fill the gap in LLM research on low-resource minority languages and to probe the deep connection between neural network representation learning and human linguistic knowledge, while providing a replicable technical recipe for machine translation and preservation of low-resource languages.

Section 02

Project Background and Research Motivation

As large language models sweep the field of natural language processing, a rarely discussed question emerges: can these models truly understand and distinguish historically rich but resource-scarce minority languages? The Celtic-LLM project targets exactly this research gap, focusing on the Celtic language family, a group with unique linguistic value.

The Celtic languages, an important branch of the Indo-European family, comprise six living languages: Irish, Scottish Gaelic, Manx, Welsh, Breton, and Cornish. These languages share complex kinship in grammatical structure and vocabulary, yet face extreme scarcity of digital resources. Traditional machine translation systems struggle to achieve good results on these low-resource language pairs, and large pre-trained language models rarely cover them well in their training data.

The core research hypothesis of Celtic-LLM carries real theoretical weight: can neural networks reconstruct language-family structure in the embedding space? In other words, when the model learns these languages, does it spontaneously recognize the kinship between Irish and Scottish Gaelic, and the historical distance separating both from Welsh? The answer bears not only on the quality of machine translation for Celtic languages but also on the deeper connection between neural network representation learning and human linguistic knowledge.

Section 03

Technical Solution and Training Strategy

Technical Solution and Model Architecture

To achieve this goal with limited computing resources, the project adopts a Parameter-Efficient Fine-Tuning (PEFT) strategy, specifically LoRA and QLoRA. The core advantage of this approach is that instead of updating all parameters of the pre-trained model, it trains a small number of low-rank adapter matrices, enabling targeted enhancement of the model's capabilities.

The project selects Gemma 4 E2B as the main base model, retaining Mistral 7B as an alternative. The Gemma series was chosen for the balance it strikes between a relatively compact size and strong multilingual capability, which is particularly important for training on consumer-grade GPUs. With 4-bit quantization, training fits on hardware with only 8 GB of VRAM, greatly lowering the barrier to entry for this research.
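
The loading step might look like the following sketch with Transformers and bitsandbytes; the checkpoint ID is a placeholder rather than the project's confirmed model name.

    # Minimal 4-bit (QLoRA-style) loading sketch with Hugging Face Transformers
    # and bitsandbytes. BASE_MODEL is a placeholder, not the project's exact checkpoint.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    BASE_MODEL = "google/gemma-2b"  # placeholder checkpoint ID

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # 4-bit weights to fit ~8 GB VRAM
        bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
        bnb_4bit_compute_dtype=torch.bfloat16,  # do the matmuls in bf16
        bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    )

    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    model = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL,
        quantization_config=bnb_config,
        device_map="auto",
    )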

In terms of the tech stack, the project integrates the Hugging Face Transformers library, the PEFT framework, the TRL training library, and the optional Unsloth acceleration library. This combination ensures code maintainability and community support, while Unsloth's optimizations deliver a significant training speedup. Training data is formatted as instruction-style JSONL, with each sample containing a system prompt, a user query, and the expected output; this structure helps the model understand the contextual requirements of translation tasks.
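
A single record in this format might look like the sketch below; the field names and prompt wording are assumptions, not the project's confirmed schema.

    # Append one hypothetical instruction-style record to the training JSONL file.
    import json

    record = {
        "system": "You are a translator for Celtic languages.",
        "user": "Translate from Irish to English: Tá an aimsir go maith inniu.",
        "output": "The weather is good today.",
    }

    with open("train.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")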

Training Strategy and Experimental Design

The project's training uses a progressive multi-stage strategy instead of mixing all languages at once. The first stage focuses on Irish-English translation pairs to establish basic fine-tuning weights. The second stage introduces Scottish Gaelic to test the model's transfer ability between related languages. The third stage adds Welsh and Breton, and the fourth stage includes Manx and Cornish. Finally, all languages are mixed for joint fine-tuning.

The rationale for this progressive strategy is that it lets researchers evaluate the model at each stage and observe how introducing new languages affects translation quality for the existing ones. If the model shows severe language confusion or performance degradation after a new language is added, training parameters or data ratios can be adjusted promptly.
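
One way to express this schedule in code, using hypothetical train_stage() and evaluate() helpers as stand-ins for the project's actual training and evaluation entry points:

    # Progressive schedule: each stage widens the language mix, with evaluation
    # between stages to catch regressions on previously learned languages.
    def train_stage(langs, resume_from=None):
        ...  # hypothetical: fine-tune on English pairs for `langs`, return adapter path

    def evaluate(adapter, lang):
        ...  # hypothetical: return chrF/BLEU for the given language pair

    STAGES = [
        ["ga"],                                # 1: Irish only
        ["ga", "gd"],                          # 2: + Scottish Gaelic
        ["ga", "gd", "cy", "br"],              # 3: + Welsh, Breton
        ["ga", "gd", "cy", "br", "gv", "kw"],  # 4: + Manx, Cornish
        ["ga", "gd", "cy", "br", "gv", "kw"],  # 5: joint fine-tuning on the full mix
    ]

    adapter = None
    for i, langs in enumerate(STAGES, start=1):
        adapter = train_stage(langs, resume_from=adapter)
        print(f"stage {i}:", {lang: evaluate(adapter, lang) for lang in langs})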

Hyperparameters are carefully tuned: LoRA rank 16, alpha 32, dropout 0.05, with q_proj and v_proj as the target modules. Training sequence length ranges from 512 to 1024 tokens and per-device batch size from 1 to 2, with gradient accumulation over 8 to 16 steps to simulate a larger effective batch. The learning rate is 2e-4, and training runs for 1 to 3 epochs to prevent overfitting.
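
Wired into PEFT and TRL, those numbers might look like the sketch below. Exact trainer arguments vary across TRL versions, the formatting function that turns the JSONL fields into prompt text is omitted, and the `model` and `train.jsonl` file are assumed from the earlier steps.

    # LoRA/SFT configuration using the hyperparameters quoted above (a sketch;
    # argument names can differ between TRL releases).
    from datasets import load_dataset
    from peft import LoraConfig
    from transformers import TrainingArguments
    from trl import SFTTrainer

    lora_config = LoraConfig(
        r=16,                                # LoRA rank
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],
        task_type="CAUSAL_LM",
    )

    training_args = TrainingArguments(
        output_dir="celtic-lora",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,      # effective batch of 16
        learning_rate=2e-4,
        num_train_epochs=2,                  # kept in the 1-3 range to limit overfitting
        logging_steps=10,
    )

    train_dataset = load_dataset("json", data_files="train.jsonl", split="train")
    trainer = SFTTrainer(
        model=model,                         # the 4-bit model loaded earlier
        args=training_args,
        train_dataset=train_dataset,
        peft_config=lora_config,
    )
    trainer.train()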

Section 04

Data Collection and Evaluation System

Data Collection and Preprocessing Process

High-quality training data is the cornerstone of any successful machine learning project, especially for low-resource languages. The Celtic-LLM project obtains parallel corpora from three main sources: the OPUS multilingual parallel corpus, the Tatoeba sentence database, and Wikimedia corpus dumps.

The OPUS corpus provides large-scale parallel sentence pairs between Celtic languages and English, serving as the main source of training data. Although Tatoeba is smaller in scale, its sentences are carefully proofread by community volunteers, making them high-quality material for validation and test sets. Wikimedia dumps provide rich monolingual text that can be used for language modeling and embedding analysis.

The data preprocessing pipeline follows strict, standardized steps. First, language identification and filtering ensure the accuracy of each sample's language label. Then, text cleaning removes HTML tags, special symbols, and format noise. Next, sentence alignment checks verify that the source and target sides of the parallel corpus truly correspond semantically. Finally, all samples are converted into a unified instruction format, including clear language identifiers and translation instructions; this helps the model accurately identify the target language during inference.
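
A simplified sketch of that chain; the cleaning rules and prompt wording here are stand-ins for the project's actual pipeline:

    # Preprocessing sketch: cleaning plus conversion to the unified instruction format.
    import html
    import re

    def clean(text):
        text = html.unescape(text)
        text = re.sub(r"<[^>]+>", " ", text)      # strip HTML tags
        text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace / format noise
        return text

    def to_instruction(src, tgt, src_lang, tgt_lang):
        # Unified instruction format with explicit language identifiers.
        return {
            "system": "You are a translator for Celtic languages.",
            "user": f"Translate from {src_lang} to {tgt_lang}: {clean(src)}",
            "output": clean(tgt),
        }

    print(to_instruction("Tá an aimsir go maith inniu.",
                         "The weather is good today.", "Irish", "English"))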

Evaluation System and Core Metrics

The project's evaluation system balances traditional machine translation metrics with more novel linguistic analysis. For translation quality, BLEU and chrF serve as automatic metrics. Notably, chrF was chosen because it is more sensitive for morphologically rich languages, and the Celtic languages are known for their complex inflectional morphology.
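
Both metrics are available in the sacrebleu package; a minimal example with placeholder sentences:

    # Corpus-level BLEU and chrF with sacrebleu; hypotheses and references
    # are illustrative placeholders.
    import sacrebleu

    hypotheses = ["The weather is good today."]
    references = [["The weather is fine today."]]  # one reference stream

    bleu = sacrebleu.corpus_bleu(hypotheses, references)
    chrf = sacrebleu.corpus_chrf(hypotheses, references)
    print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.1f}")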

In addition to automatic metrics, the project also designs language correctness checks to ensure that the model output is indeed the requested target language, rather than mixing other Celtic languages or English. The evaluation of instruction-following ability tests whether the model can accurately understand and execute translation instructions containing language identifiers.
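
A language-correctness check could lean on an off-the-shelf identifier such as fastText's lid.176 model, sketched below; note that its coverage of Manx and Cornish is not guaranteed, so those pairs may need a custom classifier.

    # Language-correctness sketch using fastText language identification.
    # lid.176.bin must be downloaded separately from the fastText site.
    import fasttext

    lid = fasttext.load_model("lid.176.bin")

    def is_target_language(output, expected):
        labels, probs = lid.predict(output.replace("\n", " "), k=1)
        predicted = labels[0].replace("__label__", "")
        return predicted == expected and probs[0] > 0.5

    print(is_target_language("Tá an aimsir go maith inniu.", "ga"))  # Irish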

The most innovative evaluation dimension is zero-shot cross-Celtic translation. The model never sees direct Irish-to-Breton translation pairs during training, yet researchers expect it to use English as a bridge language to achieve this cross-lingual transfer. The strength of this ability directly reflects whether the model truly understands the structural relationships between these languages rather than merely memorizing specific translation mappings.
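
Probing that ability can be as simple as prompting for an unseen pair, reusing the `model` and `tokenizer` loaded earlier; the prompt wording and the Irish example sentence are illustrative, not the project's evaluation prompt.

    # Zero-shot probe: Irish-to-Breton never appeared as a direct pair in training.
    prompt = "Translate from Irish to Breton: Tá an fharraige ciúin anocht."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=64)
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]  # strip the prompt
    print(tokenizer.decode(new_tokens, skip_special_tokens=True))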

Section 05

Embedding Space Analysis and Linguistic Findings

The most theoretically valuable part of the project lies in the in-depth analysis of the embedding space. Researchers use dimensionality reduction visualization techniques such as t-SNE and UMAP to project high-dimensional sentence embeddings onto a 2D plane and observe the distribution patterns of different languages. At the same time, the cosine similarity between translated sentence pairs is calculated to quantify the model's understanding of semantic equivalence.
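
A sketch of that analysis, assuming the fine-tuned `model` and `tokenizer` from earlier plus parallel sentence lists (`ga_sentences`, `cy_sentences` for Irish and Welsh) that are hypothetical here:

    # Mean-pool the final hidden layer into sentence vectors, reduce with UMAP,
    # and measure cosine similarity between translation pairs.
    import torch
    import umap  # umap-learn package
    from sklearn.metrics.pairwise import cosine_similarity

    def embed(sentences):
        if tokenizer.pad_token is None:
            tokenizer.pad_token = tokenizer.eos_token
        inputs = tokenizer(sentences, return_tensors="pt", padding=True).to(model.device)
        with torch.no_grad():
            hidden = model(**inputs, output_hidden_states=True).hidden_states[-1]
        mask = inputs["attention_mask"].unsqueeze(-1)
        # Average only over real (non-padding) tokens.
        return ((hidden * mask).sum(1) / mask.sum(1)).float().cpu().numpy()

    embs = embed(ga_sentences + cy_sentences)                # hypothetical parallel lists
    coords = umap.UMAP(n_components=2).fit_transform(embs)   # 2D points for plotting
    sim = cosine_similarity(embs[:1], embs[len(ga_sentences):len(ga_sentences) + 1])
    print(f"cosine similarity of first translation pair: {sim[0, 0]:.3f}")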

The core research question is: Do Celtic languages form structured clusters in the embedding space? Ideally, researchers expect to see the Goidelic branch (Irish, Scottish Gaelic, Manx) and the Brythonic branch (Welsh, Breton, Cornish) form two independent clusters, while English as a hub language is located between them or forms its own region.

The emergence of this clustering pattern would be strong evidence that neural networks can learn language-family structure from raw text without explicit injection of linguistic knowledge. Conversely, if the languages are randomly distributed in the embedding space, or cluster by topic rather than by language, the model's representation learning still has room to improve.

Section 06

Practical Significance and Future Outlook

The practical significance of the Celtic-LLM project goes far beyond Celtic languages themselves. It provides a replicable technical blueprint for thousands of low-resource languages worldwide, proving that even under conditions of data scarcity and limited computing resources, a practical machine translation system can still be built through parameter-efficient fine-tuning and carefully designed training strategies.

The project's open-source nature means that researchers from other language communities can directly reuse its data collection, preprocessing, and training pipeline; only the corpora need to be swapped in to launch their own language model projects. This transferability is of great value for protecting linguistic diversity and promoting digital equity.

Looking ahead, once basic translation capability has been verified, the team plans to explore directions such as multi-modal expansion (e.g., speech synthesis), dialect variant handling, and modern translation of historical texts. These extensions would make Celtic-LLM not just a research prototype but a practical tool serving the real needs of Celtic language communities.