Technical Solution and Model Architecture
To achieve this goal with limited computing resources, the project adopts a Parameter-Efficient Fine-Tuning (PEFT) strategy, specifically LoRA and QLoRA. The core advantage of this approach is that it does not update all of the pre-trained model's parameters; instead, it freezes them and introduces a small number of trainable low-rank adapter matrices, enabling targeted enhancement of the model's capabilities.
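Concretely, in the standard LoRA formulation (this is general notation, not anything project-specific), the frozen pre-trained weight matrix $W_0$ is left untouched and only a low-rank update $\Delta W = BA$ is learned, so a layer's forward pass becomes:

$$h = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r} B A x, \qquad B \in \mathbb{R}^{d \times r},\ A \in \mathbb{R}^{r \times k},\ r \ll \min(d, k).$$

Only $A$ and $B$ are trained, which is why the memory and compute footprint stays modest even with a multi-billion-parameter base model; the $\alpha / r$ scaling is where the rank and alpha values quoted later come into play.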
The project selects Gemma 4 E2B as the main base model, retaining Mistral 7B as an alternative. The Gemma series was chosen for its balance of relatively compact model size and strong multilingual capability, which is particularly important when training on consumer-grade GPUs. With 4-bit quantization, training can be completed on hardware with only 8 GB of VRAM, substantially lowering the hardware barrier to this research.
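As a rough sketch of what this looks like in code (the checkpoint name and compute dtype below are illustrative assumptions, not the project's exact configuration), QLoRA-style 4-bit loading with Transformers and bitsandbytes follows the usual pattern:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 quantization with higher-precision compute, the usual QLoRA recipe.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # fp16 on GPUs without bf16 support
    bnb_4bit_use_double_quant=True,
)

# Placeholder checkpoint; substitute the Gemma or Mistral model actually used.
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
)
```

The quantized base weights stay in 4-bit NF4, while the LoRA adapters and activations are computed in the higher-precision dtype.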
In terms of the tech stack, the project integrates the Hugging Face Transformers library, the PEFT framework, the TRL training library, and the optional Unsloth acceleration library. This combination ensures code maintainability and community support while also delivering a noticeable speedup in training through Unsloth's optimizations. Training data is formatted as instruction-style JSONL, with each sample containing a system prompt, a user query, and the expected output; this structure helps the model grasp the contextual requirements of the translation task.
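To make the data layout concrete, a single training record could look like the following; the field names and the example sentence are illustrative assumptions rather than the project's exact schema:

```python
import json

# One instruction-style record: system prompt, user query, expected output.
sample = {
    "system": "You are a translation assistant for Celtic languages.",
    "user": "Translate into Irish: The weather is lovely today.",
    "output": "Tá an aimsir go hálainn inniu.",
}

# Each record is written as one line of JSONL.
with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```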
Training Strategy and Experimental Design
The project's training uses a progressive multi-stage strategy instead of mixing all languages at once. The first stage focuses on Irish-English translation pairs to establish basic fine-tuning weights. The second stage introduces Scottish Gaelic to test the model's transfer ability between related languages. The third stage adds Welsh and Breton, and the fourth stage includes Manx and Cornish. Finally, all languages are mixed for joint fine-tuning.
The rationale for this progressive strategy is that it lets researchers evaluate the model at each stage and observe how introducing a new language affects translation quality for the languages already covered. If severe language confusion or performance degradation appears after a new language is added, training parameters or data ratios can be adjusted before moving on.
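The staged schedule might be outlined as below; load_pairs, train_adapter, and evaluate are hypothetical helpers standing in for the actual data-loading, TRL training, and evaluation code:

```python
# ISO 639-1 codes: ga=Irish, gd=Scottish Gaelic, cy=Welsh, br=Breton, gv=Manx, kw=Cornish.
STAGES = [
    ["ga"],                                # Stage 1: Irish-English only
    ["ga", "gd"],                          # Stage 2: add Scottish Gaelic
    ["ga", "gd", "cy", "br"],              # Stage 3: add Welsh and Breton
    ["ga", "gd", "cy", "br", "gv", "kw"],  # Stage 4: add Manx and Cornish
]

adapter = None
for stage, langs in enumerate(STAGES, start=1):
    data = load_pairs(langs)                          # translation pairs for this stage
    adapter = train_adapter(data, init_from=adapter)  # continue from the previous stage's weights
    evaluate(adapter, langs)                          # watch for regressions on earlier languages

# Final stage: joint fine-tuning on the full mixture of all six languages.
adapter = train_adapter(load_pairs(STAGES[-1]), init_from=adapter)
```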
The hyperparameters are carefully tuned: the LoRA rank is set to 16, alpha to 32, dropout to 0.05, and the target modules are q_proj and v_proj. The training sequence length ranges from 512 to 1024 tokens and the per-device batch size from 1 to 2, with gradient accumulation over 8 to 16 steps to simulate a larger effective batch. The learning rate is 2e-4, and training runs for 1 to 3 epochs to prevent overfitting.
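One way to encode these settings with PEFT and Transformers is sketched below; the hyperparameter values come from the text, while everything else (output directory, precision flags, logging cadence) is an illustrative assumption:

```python
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=16,                     # LoRA rank
    lora_alpha=32,            # scaling alpha
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="celtic-lora",        # placeholder path
    per_device_train_batch_size=1,   # 1-2 depending on available VRAM
    gradient_accumulation_steps=16,  # 8-16 steps; simulates a larger effective batch
    learning_rate=2e-4,
    num_train_epochs=2,              # kept between 1 and 3 to limit overfitting
    bf16=True,                       # fp16 on GPUs without bf16 support
    logging_steps=10,
)
# The 512-1024 token sequence length is applied at tokenization time or via the
# trainer's maximum-sequence-length setting (e.g. in TRL's SFT trainer).
```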