Zing Forum


Consistency Evaluation of Multilingual Large Language Models: A New Framework for Cross-Lingual Symmetry Measurement

Researchers propose a systematic evaluation framework that uses multilingual embeddings and the sliced Kolmogorov-Smirnov distance to measure how consistently large language models behave across languages, providing a quantitative tool for assessing the multilingual capabilities of models.

Tags: multilingual models, cross-lingual consistency, Kolmogorov-Smirnov distance, model evaluation, embedding space, multilingual embeddings, AI fairness, language symmetry
Published 2026-05-05 01:15 · Recent activity 2026-05-05 01:20 · Estimated read 6 min

Section 01

[Main Post/Introduction] A New Framework for Cross-Lingual Consistency Evaluation of Multilingual Large Language Models

Researchers propose a systematic evaluation framework (the multilingual-llm-symmetry project) that uses multilingual embeddings and the sliced Kolmogorov-Smirnov distance to measure how consistently large language models respond across languages. This fills a gap in existing multilingual evaluations, which lack quantitative methods for cross-lingual consistency, and provides a new quantitative tool for assessing multilingual models.


Section 02

Research Background: Consistency Challenges of Multilingual AI

With the global deployment of large language models, cross-lingual consistency has become a prominent issue: are answers to the same question in different languages logically equivalent? This bears on the fairness of the user experience and the cross-cultural reliability of AI systems. Existing evaluations focus on accuracy and fluency but lack systematic methods for measuring cross-lingual consistency (consistency of knowledge, reasoning, and values across languages). Closing this gap is the core goal of the project.


Section 03

Core Methodology: Sliced K-S Distance and Symmetry Score

The core process has three steps:

1. Collect model responses to the same prompt in the source and target languages.
2. Map the responses into a shared semantic space with a multilingual embedding model (e.g., Cohere embed-multilingual-v3.0).
3. Quantify distributional differences with the sliced K-S distance: project the high-dimensional embeddings onto random directions, compute the one-dimensional K-S statistic for each projection, and average across projections to obtain a symmetry score with confidence intervals.

The K-S statistic measures the maximum gap between the cumulative distribution curves of two samples; the smaller the value, the more similar the distributions.
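The projection-and-average step can be sketched in a few lines of NumPy/SciPy. This is a minimal illustration, not the project's actual `stats_helpers.py` code; the function name, the number of projections, and the normal-approximation confidence interval are all assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp

def sliced_ks_distance(X, Y, n_projections=200, seed=0):
    """Average the 1-D K-S statistic over random projection directions.

    X, Y: (n_samples, dim) arrays of embeddings for the two languages.
    Returns (mean statistic, 95% confidence interval on the mean).
    Illustrative sketch; not the project's stats_helpers.py implementation.
    """
    rng = np.random.default_rng(seed)
    dim = X.shape[1]
    stats = []
    for _ in range(n_projections):
        v = rng.normal(size=dim)
        v /= np.linalg.norm(v)          # random unit direction
        stats.append(ks_2samp(X @ v, Y @ v).statistic)
    stats = np.asarray(stats)
    mean = stats.mean()
    half = 1.96 * stats.std(ddof=1) / np.sqrt(n_projections)
    return mean, (mean - half, mean + half)
```

Embeddings drawn from the same distribution should score near zero, while a systematic shift between the two languages' responses pushes the score toward one.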


Section 04

Experimental Design and Benchmark Testing

Test prompts fall into two categories: factual prompts (scientific common sense, geography, astronomy, etc., with clear right/wrong answers) and open-ended prompts (everyday advice, etc., with no unique answer, used to observe consistency in creative tendencies and values). The experiments compare response distributions for language pairs such as English-French, with plans to expand to out-of-distribution languages like Inuktitut to test generalization ability.
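One plausible way to organize such bilingual prompt sets is as parallel lists keyed by language code. The structure and example prompts below are hypothetical, not taken from the project's actual data files.

```python
# Hypothetical bilingual prompt sets; contents are illustrative only.
FACTUAL_PROMPTS = {
    "en": [
        "What is the boiling point of water at sea level in Celsius?",
        "Which planet is closest to the Sun?",
    ],
    "fr": [
        "Quel est le point d'ébullition de l'eau au niveau de la mer en Celsius ?",
        "Quelle planète est la plus proche du Soleil ?",
    ],
}

OPEN_ENDED_PROMPTS = {
    "en": ["Give some advice for staying productive while working from home."],
    "fr": ["Donnez des conseils pour rester productif en télétravail."],
}

def aligned_pairs(prompts, src="en", tgt="fr"):
    """Return (source, target) prompt pairs, assuming parallel ordering."""
    return list(zip(prompts[src], prompts[tgt]))
```

Keeping the lists parallel (index i in each language is the same question) is what lets the framework compare response distributions for the same prompt across a language pair.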


Section 05

Technical Implementation and Usage

The project provides runnable code in the form of Jupyter Notebooks, with dependencies managed by Pipenv. Core components: the main Notebook (cohere-multilingual-symmetry.ipynb) contains the complete process; stats_helpers.py implements sliced K-S distance calculation; Pipfile defines dependencies. Users need to configure a Cohere API key and can modify parameters to test different language pairs, model versions, or custom prompt sets.
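A plausible setup sequence for a Pipenv-managed notebook project looks like the following. The environment-variable name is a guess (the notebook may read the key differently); consult the project README for the authoritative steps.

```shell
# Install the dependencies declared in the Pipfile
pipenv install

# Make the Cohere API key available to the notebook
# (variable name is an assumption, not confirmed by the project)
export CO_API_KEY="your-key-here"

# Launch the main notebook
pipenv run jupyter notebook cohere-multilingual-symmetry.ipynb
```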


Section 06

Research Significance and Application Value

Theoretical value: the framework complements traditional accuracy metrics, helping to explore how architecture, training-data composition, and fine-tuning strategies affect cross-lingual consistency. Practical value: it can identify language biases (asymmetry suggests data or architecture issues) and monitor degradation of multilingual capabilities across model iterations, ensuring that version updates do not harm performance in other languages.


Section 07

Limitations and Future Directions

Limitations: dependence on the Cohere API restricts the range of models that can be evaluated, and the sliced K-S distance cannot by itself judge the significance or nature of a difference (a subtle semantic variation and outright misinformation may yield similar scores), so human interpretation is still required. Future directions: support more models (including open-source local models), and expand to multimodal scenarios to evaluate the cross-lingual consistency of vision-language models.