# TenkiBench: An Open-Source Large Language Model Evaluation Benchmark for Norwegian Small and Medium Enterprises

> This article introduces the TenkiBench project, an open-source large language model evaluation benchmark specifically designed for the actual business scenarios of Norwegian small and medium enterprises (SMBs), covering real tasks such as invoice parsing, contract analysis, and tax calculation.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-05-06T01:43:04.000Z
- Last activity: 2026-05-06T02:24:07.212Z
- Popularity: 150.3
- Keywords: large language model, evaluation benchmark, Norwegian, small and medium enterprises, invoice parsing, contract analysis, tax calculation, open-source project
- Page link: https://www.zingnex.cn/en/forum/thread/tenkibench
- Canonical: https://www.zingnex.cn/forum/thread/tenkibench

---

## TenkiBench: Guide to the Open-Source LLM Evaluation Benchmark for Norwegian Small and Medium Enterprises

TenkiBench is an open-source large language model evaluation benchmark developed and maintained by Tenki Labs, designed around the real business scenarios of Norwegian small and medium enterprises (SMBs). General-purpose benchmarks rarely cover region-specific or industry-specific work, and TenkiBench addresses that gap with eight core tasks, including invoice parsing, contract analysis, and tax calculation. The results give enterprises an objective reference when selecting AI tools and show AI developers where localization needs to improve.

## Project Background: Needs of Norwegian SMBs and Limitations of General Benchmarks

Mainstream LLM evaluation benchmarks (e.g., GLUE, MMLU) mostly focus on general knowledge or English-language scenarios, so they say little about how a model performs in a specific region or industry. Norway's business environment has its own demands: complex tax regulations, two written standards of Norwegian (Bokmål and Nynorsk), and local document sources such as the Brønnøysund Register Centre. Norwegian SMBs want to know whether an AI tool can accurately handle local invoices, tax calculations, and legal clauses. TenkiBench was created to answer that question.

## Core Tasks: Eight Categories Covering Daily Scenarios of SMBs

TenkiBench contains eight evaluation tasks:
1. **Invoice Parsing**: Extract structured information such as the total amount, MVA (value-added tax), and KID number.
2. **Contract Analysis**: Identify risk clauses in documents such as NDAs and employment contracts.
3. **Tax Calculation**: Apply tax law correctly, for example choosing the right VAT rate and judging whether a deduction is allowed.
4. **Legal Citation Recognition**: Accurately identify citations of Norwegian legal provisions.
5. **Business Registration Query**: Parse business information from the Brønnøysund Register Centre.
6. **Human Resources and Compensation**: Answer questions related to labor laws (e.g., sick pay, annual leave).
7. **Customer Service Tone Optimization**: Adjust responses to comply with Norwegian business communication norms.
8. **Bilingual Translation**: Accurate translation between Bokmål and Nynorsk.
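
The Invoice Parsing and Tax Calculation tasks both depend on amounts that must agree with one another. As a rough illustration (not TenkiBench's own code), the Python sketch below checks whether an extracted net amount, MVA amount, and gross total are consistent. The function names, the rate table, and the rounding rule are assumptions made for this example. It uses 25% as the standard rate and 15% as the food rate, and any real task should confirm rates against current Norwegian tax guidance.

```python
from decimal import Decimal, ROUND_HALF_UP

# Illustrative rate table (assumption): standard rate and reduced food rate.
MVA_RATES = {"standard": Decimal("0.25"), "food": Decimal("0.15")}


def mva_breakdown(net: Decimal, category: str = "standard") -> dict:
    """Compute MVA and gross for a net amount, rounded to whole øre."""
    rate = MVA_RATES[category]
    # Rounding half up to two decimals is an assumption of this sketch.
    mva = (net * rate).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    return {"net": net, "mva": mva, "gross": net + mva}


def invoice_is_consistent(net: Decimal, mva: Decimal, gross: Decimal,
                          category: str = "standard") -> bool:
    """Check that amounts extracted from an invoice agree with each other."""
    expected = mva_breakdown(net, category)
    return expected["mva"] == mva and expected["gross"] == gross


if __name__ == "__main__":
    print(mva_breakdown(Decimal("1000.00")))
    print(invoice_is_consistent(Decimal("1000.00"), Decimal("250.00"), Decimal("1250.00")))
```

`Decimal` is used instead of `float` so that currency arithmetic stays exact, which matters when a grader compares amounts down to the øre.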

## Technical Architecture and Evaluation Methodology: Fair, Transparent, and Reproducible

**Technical Architecture**: The frontend uses Next.js and Tailwind CSS; PostgreSQL handles data storage on the backend; models are reached through the OpenAI SDK and the Mammouth.ai platform; Caddy serves as the edge server.

**Evaluation Principles**: Fair (one shared test set, open-source code), transparent (tasks, code, and results are public), and reproducible (instructions for running locally are provided).

**Evaluation Methods**: The scoring method depends on the task. Options include numerical matching combined with regex, LLM-as-judge, JSON schema validation, and expert evaluation.
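
To make two of these methods concrete, here is a minimal Python sketch of a numerical-matching scorer and a structural check on JSON output. It is an illustration under stated assumptions, not the project's implementation: the regex, the choice to score the last number in the output, the tolerance, and the field names `total`, `mva`, and `kid` are all hypothetical. The structural check is hand-rolled with the standard library and stands in for a full JSON Schema validator.

```python
import json
import re
from decimal import Decimal, InvalidOperation

# Assumed pattern: Norwegian-style "1 234,50" (space or NBSP thousands
# separator, comma decimal) as well as plain "1234.50" or "250".
AMOUNT_RE = re.compile(r"\d{1,3}(?:[ \xa0]\d{3})+(?:[.,]\d+)?|\d+(?:[.,]\d+)?")


def extract_amount(text: str):
    """Return the last amount found in the text as a Decimal, or None."""
    matches = AMOUNT_RE.findall(text)
    if not matches:
        return None
    raw = matches[-1].replace(" ", "").replace("\xa0", "").replace(",", ".")
    try:
        return Decimal(raw)
    except InvalidOperation:
        return None


def score_numeric(output: str, expected: Decimal,
                  tolerance: Decimal = Decimal("0.01")) -> bool:
    """Numerical matching: pass if the extracted amount is within tolerance."""
    got = extract_amount(output)
    return got is not None and abs(got - expected) <= tolerance


# Hypothetical required fields for an invoice-parsing answer.
REQUIRED_FIELDS = {"total": (int, float), "mva": (int, float), "kid": str}


def validate_invoice_json(output: str) -> list[str]:
    """Return a list of structural errors; an empty list means valid."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]
    errors = []
    for field, types in REQUIRED_FIELDS.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(data[field], bool) or not isinstance(data[field], types):
            errors.append(f"wrong type for field: {field}")
    return errors
```

Taking the last number only works when the prompt asks the model to end with its final answer, so a production scorer would more likely require a fixed answer format. The pattern also does not handle a period used as a thousands separator (`1.234,50`).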

## Usage Guide: How to Use TenkiBench

1. **Check the Public Leaderboard**: Visit [bench.tenki.no](https://bench.tenki.no) to compare the performance of different models.
2. **Run Evaluations Locally**: After installing dependencies, run evaluations for a specific model or category from the command line (e.g., `pnpm bench:run --model=gpt-5`).
3. **Contribute New Tasks**: Submit proposals for new tasks related to Norwegian business scenarios by referring to the project guidelines.

## Limitations and Future Outlook

**Limitations**: Tasks cover only Norwegian-language scenarios, and some hold-out data is kept private to prevent overfitting.

**Future Directions**: Planned extensions include more languages (Nordic and other European languages), industry-specific tracks (medical, legal), dynamic evaluation (tasks updated over time), and multimodal input (images and scanned documents).

## Conclusion: The Evolutionary Significance of Regional Evaluation Benchmarks

TenkiBench reflects a shift in LLM evaluation from general capabilities toward practical, scenario-based, and region-specific capabilities. It gives Norwegian SMBs an objective reference for selecting models and helps the AI community see where localization capabilities currently end. More regional and industry-specific benchmarks of this kind should help AI serve actual business needs.

Project link: [https://github.com/tenki-labs/tenkibench](https://github.com/tenki-labs/tenkibench) | Online evaluation: [https://bench.tenki.no](https://bench.tenki.no)
