Zing Forum


Consistency Evaluation of Multilingual Large Language Models: A New Framework for Cross-Lingual Symmetry Measurement

Researchers propose a systematic evaluation framework that uses multilingual embeddings and the sliced Kolmogorov-Smirnov distance to measure how consistently large language models behave across languages, providing a quantitative tool for assessing the multilingual capabilities of models.

Tags: multilingual models, cross-lingual consistency, Kolmogorov-Smirnov distance, model evaluation, embedding space, multilingual embeddings, AI fairness, language symmetry
Published 2026-05-05 01:15 · Recent activity 2026-05-05 01:20 · Estimated read 6 min

Section 01

[Main Post/Introduction] A New Framework for Cross-Lingual Consistency Evaluation of Multilingual Large Language Models

Researchers propose a systematic evaluation framework (the multilingual-llm-symmetry project) that uses multilingual embeddings and the sliced Kolmogorov-Smirnov distance to measure how consistently large language models respond across languages. This fills a gap in existing multilingual evaluations, which lack quantitative methods for cross-lingual consistency, and provides a new quantitative tool for assessing multilingual models.


Section 02

Research Background: Consistency Challenges of Multilingual AI

With the global deployment of large language models, cross-lingual consistency has become a prominent issue: are answers to the same question in different languages logically equivalent? This bears on the fairness of the user experience and the cross-cultural reliability of AI systems. Existing evaluations focus on accuracy and fluency but lack systematic methods for measuring cross-lingual consistency (consistency of knowledge, reasoning, and values across languages). Closing this gap is the core goal of the project.


Section 03

Core Methodology: Sliced K-S Distance and Symmetry Score

The core process has three steps:

1. Collect model responses to the same prompt in the source and target languages.
2. Map the responses into a shared semantic space with a multilingual embedding model (e.g., Cohere embed-multilingual-v3.0).
3. Quantify distributional differences with the sliced K-S distance: project the high-dimensional embeddings onto random directions, compute the one-dimensional K-S statistic for each projection, and average across projections to obtain a symmetry score with confidence intervals.

The K-S statistic measures the maximum gap between the cumulative distribution curves of two samples; the smaller the value, the more similar the distributions.
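The projection-and-average step can be sketched in a few lines of NumPy/SciPy. This is a minimal illustration, not the project's actual `stats_helpers.py` code; the function name, the number of projections, and the normal-approximation confidence interval are all assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp

def sliced_ks_distance(X, Y, n_projections=200, seed=0):
    """Average the 1-D K-S statistic over random projection directions.

    X, Y: (n_samples, dim) arrays of embeddings for the two languages.
    Returns (mean statistic, 95% confidence interval on the mean).
    Illustrative sketch; not the project's stats_helpers.py implementation.
    """
    rng = np.random.default_rng(seed)
    dim = X.shape[1]
    stats = []
    for _ in range(n_projections):
        v = rng.normal(size=dim)
        v /= np.linalg.norm(v)          # random unit direction
        stats.append(ks_2samp(X @ v, Y @ v).statistic)
    stats = np.asarray(stats)
    mean = stats.mean()
    half = 1.96 * stats.std(ddof=1) / np.sqrt(n_projections)
    return mean, (mean - half, mean + half)
```

Embeddings drawn from the same distribution should score near zero, while a systematic shift between the two languages' responses pushes the score toward one.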


Section 04

Experimental Design and Benchmark Testing

Test prompts fall into two categories: factual prompts (scientific common sense, geography, astronomy, etc., with clear right/wrong answers) and open-ended prompts (everyday advice, etc., with no unique answer, used to observe consistency in creative tendencies and values). The experiments compare response distributions for language pairs such as English-French, with plans to expand to out-of-distribution languages like Inuktitut to test generalization ability.
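One plausible way to organize such bilingual prompt sets is as parallel lists keyed by language code. The structure and example prompts below are hypothetical, not taken from the project's actual data files.

```python
# Hypothetical bilingual prompt sets; contents are illustrative only.
FACTUAL_PROMPTS = {
    "en": [
        "What is the boiling point of water at sea level in Celsius?",
        "Which planet is closest to the Sun?",
    ],
    "fr": [
        "Quel est le point d'ébullition de l'eau au niveau de la mer en Celsius ?",
        "Quelle planète est la plus proche du Soleil ?",
    ],
}

OPEN_ENDED_PROMPTS = {
    "en": ["Give some advice for staying productive while working from home."],
    "fr": ["Donnez des conseils pour rester productif en télétravail."],
}

def aligned_pairs(prompts, src="en", tgt="fr"):
    """Return (source, target) prompt pairs, assuming parallel ordering."""
    return list(zip(prompts[src], prompts[tgt]))
```

Keeping the lists parallel (index i in each language is the same question) is what lets the framework compare response distributions for the same prompt across a language pair.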


Section 05

Technical Implementation and Usage

The project provides runnable code in the form of Jupyter Notebooks, with dependencies managed by Pipenv. Core components: the main Notebook (cohere-multilingual-symmetry.ipynb) contains the complete process; stats_helpers.py implements sliced K-S distance calculation; Pipfile defines dependencies. Users need to configure a Cohere API key and can modify parameters to test different language pairs, model versions, or custom prompt sets.
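A plausible setup sequence for a Pipenv-managed notebook project looks like the following. The environment-variable name is a guess (the notebook may read the key differently); consult the project README for the authoritative steps.

```shell
# Install the dependencies declared in the Pipfile
pipenv install

# Make the Cohere API key available to the notebook
# (variable name is an assumption, not confirmed by the project)
export CO_API_KEY="your-key-here"

# Launch the main notebook
pipenv run jupyter notebook cohere-multilingual-symmetry.ipynb
```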


Section 06

Research Significance and Application Value

Theoretical value: the framework complements traditional accuracy metrics, helping to explore how architecture, training-data composition, and fine-tuning strategies affect cross-lingual consistency. Practical value: it can identify language biases (asymmetry suggests data or architecture issues) and monitor degradation of multilingual capabilities across model iterations, ensuring that version updates do not harm performance in other languages.


Section 07

Limitations and Future Directions

Limitations: dependence on the Cohere API restricts the range of models that can be evaluated, and the sliced K-S distance cannot by itself judge the significance or nature of a difference (a subtle semantic variation and outright misinformation may yield similar scores), so human interpretation is still required. Future directions: support more models (including open-source local models), and expand to multimodal scenarios to evaluate the cross-lingual consistency of vision-language models.