Zing Forum

SAHM: A New Benchmark for Arabic Financial and Shari'ah Compliance Reasoning

A research team has released SAHM, a benchmark of 14,380 expert-validated instances for Arabic financial and Shari'ah compliance reasoning. Its evaluations show that fluency in Arabic does not imply evidence-based financial reasoning ability, which makes the benchmark a useful tool for Arabic financial NLP research.

Tags: Arabic NLP, Financial AI, Islamic Finance, Shari'ah Compliance, Benchmark, Large Language Model Evaluation, AAOIFI
Published 2026-04-21 13:24 | Recent activity 2026-04-22 12:39 | Estimated read 5 min

Section 01

SAHM Benchmark: A New Tool for Arabic Financial and Shari'ah Compliance Reasoning

SAHM focuses on compliance reasoning in Islamic finance and fills a gap in the evaluation of Arabic financial AI. It comprises 14,380 expert-validated instances, and its evaluations show that Arabic fluency does not equate to evidence-based financial reasoning ability.

Section 02

Background: Gaps in Arabic Financial NLP and Unique Challenges of Islamic Finance

Progress in financial AI is concentrated in English, where financial NLP already has a well-established set of benchmarks; Arabic financial NLP, by contrast, lacks high-quality evaluation benchmarks. The Arab world has a large financial market, and Islamic finance follows Shari'ah rules such as prohibiting interest, prohibiting investment in forbidden industries, and requiring risk-sharing. An AI system therefore has to reason across finance, law, and language at once, which goes far beyond simple translation or retrieval.

Section 03

SAHM Benchmark Construction: Data and Task Design

SAHM is both a document-anchored benchmark and an instruction-tuning dataset. Its sources include AAOIFI regulatory documents, real fatwa rulings, professional exam materials, and corporate documents, for a total of 14,380 expert-validated instances. It defines seven tasks to evaluate model capabilities comprehensively: AAOIFI standards Q&A, fatwa Q&A and multiple-choice questions, accounting and business exams, financial sentiment analysis, extractive summarization, and event-causal reasoning.
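To make "document-anchored" concrete, here is a minimal sketch of how one benchmark instance could be represented and sanity-checked. The field names, the `Task` identifiers, and the `validate` helper are all hypothetical; they reflect one plausible reading of the task list above and are not SAHM's actual schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Task(Enum):
    # Hypothetical identifiers paraphrasing the task list in the summary.
    AAOIFI_QA = "aaoifi_standards_qa"
    FATWA_QA = "fatwa_qa"
    FATWA_MCQ = "fatwa_mcq"
    EXAM = "accounting_business_exam"
    SENTIMENT = "financial_sentiment"
    SUMMARIZATION = "extractive_summarization"
    EVENT_CAUSE = "event_causal_reasoning"


@dataclass(frozen=True)
class Instance:
    task: Task
    source_document: str             # text the question is anchored to
    question: str
    reference_answer: str
    choices: Optional[tuple] = None  # set only for multiple-choice items

    def is_multiple_choice(self) -> bool:
        return self.choices is not None


def validate(instance: Instance) -> list:
    """Return a list of problems; an empty list means the record looks well-formed."""
    problems = []
    if not instance.source_document.strip():
        problems.append("missing source document")
    if instance.is_multiple_choice() and instance.reference_answer not in instance.choices:
        problems.append("reference answer is not one of the choices")
    return problems
```

Keeping the source document on every record is the design point: an answer can then be checked against the text it is supposed to rest on, rather than judged on fluency alone.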

Section 04

Evaluation Evidence: Fluency ≠ Reasoning Ability, Significant Differences in Task Performance

An evaluation of 19 leading LLMs found that Arabic fluency does not translate into evidence-based financial reasoning ability. Models perform well on recognition tasks such as sentiment analysis and multiple-choice questions, but their performance drops significantly on generative tasks (e.g., open-ended answers) and on event-causal reasoning, which is the weakest area.
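The recognition-versus-generation contrast can be quantified by averaging per-task scores within each group and taking the difference. The sketch below assumes scores are already normalized to [0, 1]; the task names, the grouping, and the `group_gap` function are illustrative assumptions, not the paper's evaluation framework.

```python
from statistics import mean

# Hypothetical grouping of tasks; the summary names these only as examples.
RECOGNITION = {"financial_sentiment", "fatwa_mcq"}
GENERATIVE = {"fatwa_qa", "event_causal_reasoning"}


def group_gap(scores: dict) -> dict:
    """Average per-task scores by group and report recognition minus generative.

    `scores` maps a task name to one model's score in [0, 1].
    Tasks outside both groups are ignored.
    """
    recognition = [s for task, s in scores.items() if task in RECOGNITION]
    generative = [s for task, s in scores.items() if task in GENERATIVE]
    if not recognition or not generative:
        raise ValueError("need at least one scored task in each group")
    rec_mean, gen_mean = mean(recognition), mean(generative)
    return {
        "recognition": rec_mean,
        "generative": gen_mean,
        "gap": rec_mean - gen_mean,
    }
```

A positive `gap` for a model would correspond to the pattern described above: stronger results when picking among given options than when producing and justifying an answer.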

Section 05

Conclusion: Financial AI Needs to Balance Language and Professional Competence, Emphasizing Interpretability

Evaluating financial AI on language fluency alone is not enough; specialized domain benchmarks are needed to test substantive capability. Progress in English financial NLP does not transfer automatically to Arabic, and Islamic finance requires its own data and training. Because financial decisions need traceable evidence, SAHM makes document-anchored interpretability a core requirement.
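One simple way to operationalize "traceable evidence" is to require that every passage a model quotes as support actually appears in the source document. The following is a minimal sketch under that assumption; `unsupported_quotes` is a hypothetical helper, and exact substring matching after whitespace normalization is a deliberately strict stand-in for whatever matching a real evaluation would use.

```python
def normalize(text: str) -> str:
    """Collapse runs of whitespace so line breaks do not defeat matching."""
    return " ".join(text.split())


def unsupported_quotes(source_document: str, quoted_evidence: list) -> list:
    """Return the quoted passages that cannot be found in the source document.

    Empty or whitespace-only quotes count as unsupported, since they
    carry no evidence.
    """
    document = normalize(source_document)
    missing = []
    for quote in quoted_evidence:
        cleaned = normalize(quote)
        if not cleaned or cleaned not in document:
            missing.append(quote)
    return missing
```

An answer whose `unsupported_quotes` list is empty is at least anchored to the document; whether the quoted text actually justifies the conclusion is a separate question that still needs expert or model-based judgment.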

Section 06

Open Source and Applications: Practical Directions to Promote Arabic Financial AI Development

The research team has open-sourced the SAHM benchmark data, the evaluation framework, and instruction-tuned models. Potential applications include AI assistants for Islamic financial compliance, intelligent tutoring systems for Arabic financial education, compliance review tools, and market intelligence analysis (sentiment signals and event-causal extraction).

Section 07

Limitations and Future: Paths to Improve the SAHM Benchmark

SAHM has several limitations: its geographic coverage is concentrated in the Gulf region; its content needs regular updates as regulations and Shari'ah rules change; it does not yet cover multimodal inputs such as charts and tables; and it lacks adversarial tests of robustness. Future work should expand geographic representation, refresh the data, and add multimodal and adversarial evaluation.