Mainstream evaluation frameworks like lm-evaluation-harness and HELM have at least three critical issues when applied to Indian languages:
First, code-switching is ignored. Real Indian text online is rarely pure Hindi or pure English; a large share of it is Hinglish, Romanized Hindi mixed with English. A sentence like "Yaar ye movie bilkul bakwaas thi" ("Dude, this movie was total garbage") requires the model to parse Romanized Hindi carrying English loanwords inside Hindi sentence structure, yet no standard benchmark covers this linguistic phenomenon.
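To make the gap concrete, here is a minimal sketch of what a code-switched test item could look like, in the JSON-lines style common to evaluation harnesses. The schema, field names, and label are hypothetical illustrations, not items from lm-evaluation-harness or HELM.

```python
# Hypothetical code-switched eval item; schema and labels are illustrative only.
import json

hinglish_item = {
    "text": "Yaar ye movie bilkul bakwaas thi",  # "Dude, this movie was total garbage"
    "task": "sentiment",
    "label": "negative",
    # Token-level language tags make the code-switching explicit: Romanized
    # Hindi (hi-Latn) interleaved with an English loanword ("movie").
    "lang_tags": ["hi-Latn", "hi-Latn", "en", "hi-Latn", "hi-Latn", "hi-Latn"],
}
print(json.dumps(hinglish_item, ensure_ascii=False, indent=2))
```

Token-level language tags like these let an evaluation measure whether a model handles the mixed segments, rather than treating the sentence as noisy English.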
Second, cultural grounding is missing. A model may translate the word "Onam" correctly yet have no idea that it names a harvest festival celebrated in Kerala, India. Cultural reasoning is a distinct, testable capability, not a byproduct of translation quality.
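The distinction is easy to operationalize. Below is a sketch contrasting a translation probe with a cultural-reasoning probe for the same term; the prompts and answers are examples written for this post, not items from any existing benchmark.

```python
# Illustrative probes; none of these items come from a real benchmark.

translation_probe = {
    "prompt": "Translate into English: 'Onam ke liye hum ghar ja rahe hain.'",
    "answer": "We are going home for Onam.",  # solvable without cultural knowledge
}

cultural_probe = {
    "prompt": (
        "Onam is traditionally celebrated in which Indian state, "
        "and what kind of festival is it?"
    ),
    "answer": "Kerala; a harvest festival",  # requires grounded cultural knowledge
}

# A model can pass the first probe while failing the second, which is why
# cultural grounding has to be measured separately from translation.
for probe in (translation_probe, cultural_probe):
    print(probe["prompt"], "->", probe["answer"])
```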
Third, the standard metrics do not transfer. BLEU matches word-level n-grams, which works reasonably for analytic languages like English but punishes morphologically rich languages like Hindi, where a single stem yields many inflected surface forms; the character-level chrF metric is a much better fit there. Yet most evaluation frameworks report BLEU by default and make no such distinction.
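A small experiment shows the gap directly. The sketch below uses the sacrebleu library (pip install sacrebleu); the Hindi sentence pair is an illustrative example I constructed, differing only in number inflection, not data from any benchmark.

```python
# Minimal sketch of the BLEU-vs-chrF gap on inflectional variants, using sacrebleu.
import sacrebleu

# Reference: "the boy plays cricket"; hypothesis: "the boys play cricket".
# The word stems match (ladk-, khel-) but the inflected surface forms do not.
references = [["लड़का क्रिकेट खेलता है"]]
hypotheses = ["लड़के क्रिकेट खेलते हैं"]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)

# Word-level BLEU finds almost no exact n-gram overlap, while character-level
# chrF credits the shared stems, so chrF typically scores far higher here.
print(f"BLEU: {bleu.score:.1f}")
print(f"chrF: {chrf.score:.1f}")
```

A framework that reported only the BLEU number would conclude the hypothesis is nearly worthless, even though it differs from the reference only in grammatical number.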