Zing Forum

MixCode-CoT: Breaking Translation Barriers, Enabling Small Models to Reason with Hindi-English Mixed Thinking

By constructing a synthetic Hinglish Chain-of-Thought dataset to fine-tune Llama-3-8B, we achieved an 18-percentage-point accuracy improvement and a 4x inference speedup, validating the core hypothesis that "the model's thinking language should align with the input language."

Tags: code-mixing · Hinglish · chain-of-thought · LoRA · QLoRA · multilingual models · Llama-3 · mathematical reasoning · Unsloth · language alignment
Published 2026-03-31 13:14 · Last activity 2026-03-31 13:22 · Estimated read: 9 min

Section 01

Introduction: MixCode-CoT Breaks Translation Barriers, Enabling Small Models to Reason with Hinglish Mixed Thinking

This study proposes a core hypothesis: the model's thinking language should align with the input language. By constructing a synthetic Hinglish Chain-of-Thought dataset (Hinglish-GSM8K) and fine-tuning Llama-3-8B with the Unsloth framework and QLoRA, we achieved an 18-percentage-point improvement in EM accuracy (44% → 62%) and a 4x inference speedup, validating the hypothesis and pointing to a new direction for multilingual models handling code-mixed languages.

Section 02

Research Background: Translation Barrier Issues in Multilingual Models

Current mainstream large models (e.g., the Llama and GPT series) often implicitly translate non-English input into English during internal reasoning, which causes two problems: (1) the extra translation step increases inference latency; (2) translation can introduce semantic drift, especially for mathematical symbols and technical terms. For code-mixed languages like Hinglish, forced translation additionally disrupts natural mixed expressions, making the problem even more pronounced.

Section 03

Research Methodology: Aligning Thinking Language with Input Language

Core Hypothesis

The model's thinking language should be consistent with the input language; if the user asks in Hinglish, the model reasons in Hinglish.

Dataset Construction

We designed the CoT format based on Matrix Language Frame (MLF) theory: Hindi serves as the matrix language (supplying grammar, verbs, etc.) and English as the embedded language (supplying mathematical entities, variables, etc.). We then constructed the synthetic Hinglish-GSM8K dataset, filtering out monolingual samples to retain only bilingually mixed instances. Example sample structure:

```json
{
  "instruction": "Solve the following math problem in Hinglish explicitly showing your steps.",
  "input": "If cost price is $100 and profit is 20%, what is selling price?",
  "output": "Cost Price (CP) $100 hai. Profit percentage 20% diya gaya hai. SP nikalne ke liye formula: SP = CP + Profit. Pehle profit: 20% of 100 = $20. Ab SP = 100 + 20 = 120. #### 120"
}
```
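The monolingual-filtering step described above can be sketched with a simple lexical heuristic. The romanized-Hindi marker list and the thresholds below are illustrative assumptions, not the authors' actual filtering rules:

```python
# Hypothetical code-mixing filter: keep a sample only if its CoT output
# contains both romanized-Hindi marker tokens and other (assumed English)
# alphabetic tokens. The marker list here is a small illustrative subset.
import re

HINDI_MARKERS = {"hai", "ke", "liye", "diya", "gaya", "pehle", "ab", "nikalne"}

def is_code_mixed(text, min_hindi=2, min_english=2):
    """Return True if the text mixes Hindi-marker tokens with other words."""
    tokens = re.findall(r"[a-zA-Z]+", text.lower())
    hindi = sum(t in HINDI_MARKERS for t in tokens)
    english = sum(t not in HINDI_MARKERS for t in tokens)
    return hindi >= min_hindi and english >= min_english

sample = ("Cost Price (CP) $100 hai. Profit percentage 20% diya gaya hai. "
          "SP nikalne ke liye formula: SP = CP + Profit.")
print(is_code_mixed(sample))   # → True  (bilingual, kept)
print(is_code_mixed("The selling price is simply 120."))  # → False (monolingual, dropped)
```

A real pipeline would likely replace the word list with a trained language-identification model, but this captures the filtering intent.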

Experimental Setup

Using the Unsloth framework and QLoRA, we fine-tuned the model on a single T4 GPU:
| Hyperparameters | Settings |
|--------|--------|
| Base Model | unsloth/llama-3-8b-Instruct-bnb-4bit |
| Quantization | 4-bit NormalFloat (QLoRA) |
| LoRA Rank (r) | 16 |
| LoRA Alpha | 16 |
| Target Modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Learning Rate | 2e-4 |
| Effective Batch Size | 8 |
| Max Steps | 120 |
| Trainable Parameters | 41,943,040 (0.52%) |
| Training Time | ~8 minutes |

This configuration is resource-efficient and easy to reproduce.
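The trainable-parameter figure in the table can be sanity-checked from the standard Llama-3-8B layer dimensions: a LoRA adapter of rank r on a d_in × d_out projection adds r·(d_in + d_out) parameters (the two low-rank matrices A and B). A quick check, assuming the usual 4096 hidden size, 14336 MLP size, 1024-dim grouped-query k/v projections, and 32 layers:

```python
# Sanity check of the trainable-parameter count, assuming standard
# Llama-3-8B dimensions and LoRA rank r = 16 on all seven target modules.
R = 16
LAYERS = 32

# (d_in, d_out) per target module
modules = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),   # grouped-query attention: smaller k/v
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}

# Each adapter contributes r*d_in (matrix A) + r*d_out (matrix B) parameters.
per_layer = sum(R * (d_in + d_out) for d_in, d_out in modules.values())
total = per_layer * LAYERS
print(total)  # → 41943040, matching the table
```

This reproduces the table's 41,943,040 exactly, which is about 0.52% of the roughly 8 billion base parameters.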

Section 04

Experimental Evidence: Performance Improvement and Changes in Error Patterns

Experimental Results

On 150 Hinglish math reasoning test questions, compared with the baseline:

| Metric | Baseline Llama-3-8B | MixCode-CoT | Improvement |
|--------|--------|--------|--------|
| EM Accuracy | 44.00% | 62.00% | +18.00 pts |
| Average Inference Latency | 97.22 s | 23.86 s | 4.07x speedup |
| Average CMI Score | 32.07 | 64.76 | +32.69 |
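EM scoring here presumably follows the GSM8K convention of a final numeric answer after "####", as in the dataset sample shown earlier. A minimal scorer under that assumption (the helper names are ours, not the paper's):

```python
# Minimal exact-match (EM) scorer, assuming completions end with a
# GSM8K-style "#### <number>" marker as in the Hinglish-GSM8K samples.
import re

def extract_answer(text):
    """Return the last '#### <number>' value in a completion, or None."""
    matches = re.findall(r"####\s*(-?[\d,\.]+)", text)
    if not matches:
        return None
    # Normalize thousands separators and a trailing period.
    return matches[-1].replace(",", "").rstrip(".")

def exact_match(prediction, reference):
    return extract_answer(prediction) == extract_answer(reference)

pred = "Pehle profit: 20% of 100 = $20. Ab SP = 100 + 20 = 120. #### 120"
print(exact_match(pred, "#### 120"))  # → True
```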

Error Analysis

| Error Type | Baseline | After Fine-tuning |
|--------|--------|--------|
| Type A (Calculation Errors) | 81 | 48 |
| Type B (Semantic Errors) | 3 | 2 |
| Type C (Hallucination/Looping) | 0 | 7 |

The significant reduction in calculation errors is the main driver of the accuracy improvement; a small number of hallucination errors appeared after fine-tuning.

CMI Distribution Changes

| CMI Range | Baseline | After Fine-tuning |
|--------|--------|--------|
| Low (<40) | 143 | 8 |
| Medium (40-70) | 7 | 88 |
| High (≥70) | 0 | 54 |

After fine-tuning, the model is far more inclined to retain mixed-language characteristics.
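The CMI figures are consistent with the standard Code-Mixing Index, CMI = 100 × (1 − max_i w_i / (n − u)), where w_i is the token count for language i, n the total token count, and u the count of language-independent tokens (numbers, symbols). A sketch, with an illustrative word-list tagger standing in for a real language-ID model:

```python
# Code-Mixing Index (CMI) sketch. The per-token language tagger below is
# an illustrative stand-in; a real pipeline would use a trained LID model.
import re

HINDI_MARKERS = {"hai", "ke", "liye", "diya", "gaya", "ab", "nikalne", "pehle"}

def cmi(text):
    """CMI = 100 * (1 - max_lang_count / (n - u)); 0 if no language tokens."""
    tokens = text.lower().split()
    counts = {"hi": 0, "en": 0}
    u = 0  # language-independent tokens (numbers, operators, symbols)
    for tok in tokens:
        if not re.search(r"[a-z]", tok):
            u += 1
        elif tok.strip(".,:") in HINDI_MARKERS:
            counts["hi"] += 1
        else:
            counts["en"] += 1
    n = len(tokens)
    if n == u:
        return 0.0
    return 100.0 * (1 - max(counts.values()) / (n - u))

# 4 Hindi-marker tokens, 3 other tokens -> CMI = 100 * (1 - 4/7) ≈ 42.86
print(round(cmi("SP nikalne ke liye formula simple hai"), 2))
print(cmi("the answer is twelve"))  # monolingual -> 0.0
```

Under this index, a score near 0 means effectively monolingual output, matching the baseline's collapse into the low-CMI bucket above.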

Section 05

Research Conclusions: Value of Synthetic Data and Lightweight Fine-tuning

Technical Contributions

  1. Effectiveness of Synthetic Data: Well-designed mixing rules and CoT format can improve multilingual performance without large-scale manual annotation.
  2. Potential of Lightweight Fine-tuning: Training only 0.52% of the parameters (in ~8 minutes) achieved a significant improvement, suggesting that base models already hold latent multilingual capabilities that techniques like LoRA can activate at low cost.
  3. Universality of Language Alignment: The principle may apply to other code-mixed language scenarios such as Spanglish and Taglish.

Section 06

Limitations and Future Directions

Current Limitations

  1. Small dataset size with limited scenario coverage;
  2. Hallucination errors appear after fine-tuning;
  3. Only validated on Hinglish scenarios.

Future Directions

  1. Expand to more code-mixed languages;
  2. Build larger and more diverse synthetic datasets;
  3. Integrate technologies like RAG and tool usage;
  4. Deepen research on code-mixed reasoning mechanisms from the perspective of cognitive linguistics.

Section 07

Implications for AI Democratization

  1. Reducing Language Barriers: Allows non-English users to interact with AI using their natural language thinking;
  2. Resource Efficiency: Effective customization can be achieved with consumer-grade hardware;
  3. Cultural Inclusivity: Respects linguistic diversity (including code-mixing phenomena).

Technology should adapt to users' language habits rather than enforcing a single paradigm; this study provides technical evidence for that view.