Zing Forum

Do Large Language Models Truly Understand Phonetic Symbolism? A Psycholinguistic Framework for LLM Evaluation

A research framework for systematically evaluating whether large language models (LLMs) exhibit psycholinguistically validated phonetic-semantic encoding patterns, distinguishing between genuine linguistic ability and training data contamination via a three-tiered stimulus hierarchy design.

Tags: LLM, psycholinguistics, sound symbolism, training data contamination, bouba-kiki, phonesthesia, semantic prosody, ideophone, cross-lingual, interpretability
Published 2026-05-26 16:46 | Recent activity 2026-05-26 16:50 | Estimated read 8 min

Section 01

Introduction: A Psycholinguistic Evaluation Framework for LLMs' Phonetic Symbolism Comprehension

This study proposes a systematic framework for evaluating whether large language models (LLMs) exhibit psycholinguistically validated phonetic-semantic encoding patterns or merely reproduce surface patterns from their training data. Its core method is a three-tiered stimulus hierarchy that separates genuine linguistic ability from the effects of training data contamination. The framework tests five classic psycholinguistic phenomena (including the bouba-kiki effect and phonesthesia) and pairs the behavioral tests with interpretability analyses of the models' internal mechanisms.

Section 02

Research Background and Motivation

Large language models perform well on natural language processing tasks, but a core question remains: do they possess human-like linguistic abilities, or do they merely imitate surface patterns in their training data? Psycholinguistics has documented a range of phonetic-semantic associations (e.g., the cross-cultural bouba-kiki effect), which are regarded as evidence of deep cognitive mechanisms in humans. This study asks whether LLMs exhibit the same patterns and, if so, whether those patterns reflect genuine ability or data memorization.

Section 03

Core Evaluation Framework: Five Psycholinguistic Theories

The framework tests five classic theories:

  1. Phonetic Symbolism: e.g., the bouba-kiki effect (rounded-sounding syllables such as "bouba" are associated with rounded shapes, sharp-sounding syllables such as "kiki" with spiky shapes);
  2. Phonesthesia: stable semantic associations of specific consonant clusters, known as phonesthemes (e.g., English words beginning with "gl-", such as "glow" and "glitter", often relate to light or vision);
  3. Vowel-Size Symbolism: high front vowels (e.g., /i/) are associated with "small", low back vowels (e.g., /a/) with "large";
  4. Semantic Prosody: neutral phrases acquire evaluative meaning through their typical collocations (e.g., "set in" is often paired with negative contexts);
  5. Onomatopoeia Compositionality: the phonological features of an onomatopoeic word combine to predict its meaning (e.g., voiced consonants are associated with "heavier" senses).

Section 04

Three-Tiered Stimulus Hierarchy Design: Distinguishing Genuine Ability from Data Contamination

To distinguish between genuine ability and data contamination, a three-tiered stimulus design is adopted:

  • Tier 1: famous stimuli from classic papers (e.g., bouba/kiki), high contamination level;
  • Tier 2: validated stimuli from less-cited studies, medium contamination level;
  • Tier 3: newly constructed, unpublished stimuli, extremely low contamination level.

Hypothesis: if model performance drops significantly from Tier 1 to Tier 3, the model is likely relying on memorized data; if Tier 3 performance remains good, that points to genuine ability.
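The tier comparison can be sketched in a few lines of Python. This is only an illustrative sketch, not the project's code: the function names `tier_accuracy` and `tier_drop` and the trial outcomes are hypothetical, and the source does not say how performance is scored or which statistical test is applied.

```python
from collections import defaultdict


def tier_accuracy(trials):
    """Return {tier: proportion correct} from (tier, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for tier, is_correct in trials:
        totals[tier] += 1
        correct[tier] += int(is_correct)
    return {tier: correct[tier] / totals[tier] for tier in totals}


def tier_drop(accuracy, high=1, low=3):
    """Accuracy gap between the most and least contaminated tiers."""
    return accuracy[high] - accuracy[low]


# Hypothetical trial outcomes, invented purely for illustration.
trials = [
    (1, True), (1, True), (1, True), (1, False),
    (2, True), (2, True), (2, False), (2, False),
    (3, True), (3, False), (3, False), (3, False),
]

acc = tier_accuracy(trials)
drop = tier_drop(acc)
print(acc, drop)
```

A large positive `drop` is the pattern the hypothesis reads as memorization. In a real analysis the gap would also need a chance baseline and a significance test, which this sketch leaves out.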

Section 05

Experimental Design and Technical Implementation

The project provides a complete experimental pipeline:

  • Stimulus Construction: construct_stimuli.py generates stimulus sets covering five theories, three tiers, and five languages;
  • Behavioral Experiments: run_behavioral.py includes three tasks: forced choice, rating, and generation;
  • Contamination Detection: run_contamination.py uses four methods to detect data contamination;
  • Cross-Language Validation: run_multilingual.py tests phonetic symbolism effects in languages like Japanese and Korean;
  • Compositionality Experiments: run_compositionality.py uses a 2×2×2×2 factorial design to test onomatopoeia compositionality.

Supported models include GPT-4o and Llama 3.3 70B, among others. The API layer supports rotating across multiple API keys and response caching.
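To make the 2×2×2×2 factorial design concrete, here is a minimal Python sketch that enumerates its 16 cells. The four factor names and their levels are assumptions chosen for illustration; the source states only the shape of the design, not which phonological features run_compositionality.py actually crosses.

```python
from itertools import product

# Hypothetical factors and levels: the source gives only the 2x2x2x2 shape.
FACTORS = {
    "voicing": ("voiceless", "voiced"),
    "vowel": ("front", "back"),
    "reduplication": ("single", "reduplicated"),
    "length": ("short", "long"),
}


def factorial_cells(factors):
    """Return one dict per cell of the full factorial design."""
    names = list(factors)
    return [dict(zip(names, levels)) for levels in product(*factors.values())]


cells = factorial_cells(FACTORS)
print(len(cells))
```

Crossing every level of every factor lets each feature's contribution to meaning be estimated independently of the others, which is what a test of compositionality requires.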

Section 06

Interpretability Analysis: Exploring LLM Internal Mechanisms

The project includes GPU-supported interpretability experiments:

  • Linear Probe Classifier: layer-by-layer analysis of internal representations;
  • Logit Lens: examining inter-layer evolution of semantic consistency;
  • Attention Analysis: studying how attention heads process phoneme clusters (e.g., "gl-");
  • Causal Tracing: ROME-style activation patching to locate key neurons;
  • Contamination Trajectory: searching for stimulus occurrence frequencies in the Pile corpus and correlating with model performance.
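The last step, correlating corpus frequency with model performance, could look roughly like the sketch below. It uses a hand-written Spearman rank correlation so that it runs on the standard library alone. The choice of Spearman, the function names, and all numbers are assumptions for illustration; the source does not specify the correlation measure.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-stimulus corpus counts and accuracies, for illustration.
corpus_counts = [12000, 800, 800, 15, 0]
accuracy = [0.95, 0.80, 0.70, 0.55, 0.50]

rho = spearman(corpus_counts, accuracy)
print(round(rho, 3))
```

A strongly positive `rho` would mean that stimuli seen more often in the corpus are also the ones the model handles best, which is consistent with contamination rather than generalization.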

Section 07

Research Significance and Implications

The study is significant in several respects:

  1. Methodological Contribution: Providing an operational framework to distinguish between "genuine understanding" and "data memorization";
  2. Theoretical Dialogue: Introducing psycholinguistic paradigms into LLM evaluation, promoting cross-disciplinary interaction between cognitive science and AI;
  3. Practical Value: Helping identify and quantify training data contamination, providing references for model development;
  4. Interdisciplinary Inspiration: The method of inferring internal mechanisms through behavioral experiments can be extended to the evaluation of other abilities.

Section 08

Conclusion

This project represents an important shift in AI evaluation: from focusing on task performance to exploring the sources and mechanisms of performance. Drawing on the experimental tradition of psycholinguistics, it provides new tools and perspectives for understanding "whether machines truly understand language". Regardless of the results, it offers valuable insights for building more reliable and interpretable AI systems.