### Core Hypothesis
The model's thinking language should match the input language: if the user asks in Hinglish, the model should reason in Hinglish.
### Dataset Construction
The CoT format follows Matrix Language Frame (MLF) theory: Hindi serves as the matrix language (supplying grammar, verbs, etc.) and English as the embedded language (supplying mathematical entities, variables, etc.). The synthetic Hinglish-GSM8K dataset was constructed on this basis, filtering out monolingual samples so that only bilingual code-mixed instances remain. Example sample structure:
```json
{
  "instruction": "Solve the following math problem in Hinglish explicitly showing your steps.",
  "input": "If cost price is $100 and profit is 20%, what is selling price?",
  "output": "Cost Price (CP) $100 hai. Profit percentage 20% diya gaya hai. SP nikalne ke liye formula: SP = CP + Profit. Pehle profit: 20% of 100 = $20. Ab SP = 100 + 20 = 120. #### 120"
}
```
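The monolingual-filtering step can be sketched as a simple lexical heuristic: keep a sample only if it contains both romanized-Hindi function words (the matrix language) and English content words (the embedded language). The marker word lists below are illustrative assumptions, not the actual filter used for the dataset:

```python
import re

# Hypothetical marker lists: romanized-Hindi function words vs. English
# content words. A real filter would use a language-ID model or larger lists.
HINDI_MARKERS = {"hai", "hain", "ke", "ki", "ka", "liye", "diya", "gaya",
                 "nikalne", "pehle", "ab", "aur", "mein", "se"}
ENGLISH_MARKERS = {"cost", "price", "profit", "percentage", "formula",
                   "selling", "total", "answer", "step"}

def is_code_mixed(text: str) -> bool:
    """Return True only if the text mixes both languages."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    # Require at least two markers from each language to count as bilingual.
    return (len(tokens & HINDI_MARKERS) >= 2
            and len(tokens & ENGLISH_MARKERS) >= 2)

sample = ("Cost Price (CP) $100 hai. Profit percentage 20% diya gaya hai. "
          "SP nikalne ke liye formula: SP = CP + Profit.")
print(is_code_mixed(sample))                       # True: both languages present
print(is_code_mixed("The selling price is 120."))  # False: English only
```

Monolingual Hindi or English outputs fail one of the two threshold checks and are dropped.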
### Experimental Setup
The model was fine-tuned on a single T4 GPU using the Unsloth framework with QLoRA:
| Hyperparameter | Setting |
|----------------|---------|
| Base Model | unsloth/llama-3-8b-Instruct-bnb-4bit |
| Quantization | 4-bit NormalFloat (QLoRA) |
| LoRA Rank (r) | 16 |
| LoRA Alpha | 16 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Learning Rate | 2e-4 |
| Effective Batch Size | 8 |
| Max Steps | 120 |
| Trainable Parameters | 41,943,040 (0.52%) |
| Training Time | ~8 minutes |
This configuration is resource-efficient and easy to reproduce.
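The trainable-parameter figure in the table can be verified directly: a LoRA adapter on a `d_in × d_out` projection adds `r · (d_in + d_out)` weights. Using the standard Llama-3-8B dimensions (hidden size 4096, grouped-query-attention KV width 1024, MLP intermediate size 14336, 32 layers), the per-module counts reproduce the table exactly:

```python
# Standard Llama-3-8B architecture dimensions (from the public model config).
HIDDEN = 4096         # q_proj / o_proj width
KV_DIM = 1024         # k_proj / v_proj output width (grouped-query attention)
INTERMEDIATE = 14336  # gate_proj / up_proj / down_proj width
LAYERS = 32
R = 16                # LoRA rank from the table above

def lora_params(d_in: int, d_out: int, r: int = R) -> int:
    # LoRA factorizes the update as (d_in x r) @ (r x d_out),
    # adding r * (d_in + d_out) trainable weights.
    return r * (d_in + d_out)

per_layer = (
    lora_params(HIDDEN, HIDDEN)          # q_proj
    + lora_params(HIDDEN, KV_DIM)        # k_proj
    + lora_params(HIDDEN, KV_DIM)        # v_proj
    + lora_params(HIDDEN, HIDDEN)        # o_proj
    + lora_params(HIDDEN, INTERMEDIATE)  # gate_proj
    + lora_params(HIDDEN, INTERMEDIATE)  # up_proj
    + lora_params(INTERMEDIATE, HIDDEN)  # down_proj
)
total = per_layer * LAYERS
print(total)  # 41943040, matching the table
```

Against the roughly 8.03B parameters of the base model, 41,943,040 trainable weights is the 0.52% reported above.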