TenkiBench: An Open-Source Large Language Model Evaluation Benchmark for Norwegian Small and Medium Enterprises

This article introduces TenkiBench, an open-source large language model (LLM) evaluation benchmark built for the real-world business scenarios of Norwegian small and medium enterprises (SMBs), covering practical tasks such as invoice parsing, contract analysis, and tax calculation.

Tags: LLM evaluation benchmark · Norwegian · SMB · invoice parsing · contract analysis · tax calculation · open-source project
Published 2026-05-06 09:43 · Last activity 2026-05-06 10:24 · Estimated read: 7 min
Section 01

TenkiBench: Guide to the Open-Source LLM Evaluation Benchmark for Norwegian Small and Medium Enterprises

TenkiBench is an open-source large language model evaluation benchmark developed and maintained by Tenki Labs, built specifically for the real-world business scenarios of Norwegian small and medium enterprises (SMBs). It fills a gap left by general-purpose benchmarks, which rarely cover regional and industry-specific scenarios, and spans eight core tasks such as invoice parsing, contract analysis, and tax calculation. It gives enterprises an objective reference for selecting AI tools and points AI developers toward concrete localization improvements.


Section 02

Project Background: Needs of Norwegian SMBs and Limitations of General Benchmarks

Mainstream LLM evaluation benchmarks (e.g., GLUE, MMLU) mostly focus on general knowledge or English-language scenarios, which makes them a poor fit for specific regional industries. Norway's business environment is distinctive: complex tax regulations, two official written standards of Norwegian (Bokmål and Nynorsk), and localized document-processing needs tied to institutions such as the Brønnøysund Register Centre. Norwegian SMBs care about whether AI can accurately handle local invoices, tax calculations, legal clauses, and the like—this is the context in which TenkiBench was created.


Section 03

Core Tasks: Eight Categories Covering Daily Scenarios of SMBs

TenkiBench contains eight evaluation tasks:

  1. Invoice Parsing: Extract structured fields such as the total amount, MVA (value-added tax) amount, and KID payment reference number.
  2. Contract Analysis: Identify risk clauses in NDAs, employment contracts, etc.
  3. Tax Calculation: Assess understanding of Norwegian tax rules, such as applying the correct VAT rate and judging deductibility.
  4. Legal Citation Recognition: Accurately identify citations of Norwegian legal provisions.
  5. Business Registration Query: Parse business information from the Brønnøysund Register Centre.
  6. Human Resources and Compensation: Answer questions related to labor laws (e.g., sick pay, annual leave).
  7. Customer Service Tone Optimization: Adjust responses to comply with Norwegian business communication norms.
  8. Bilingual Translation: Accurate translation between Bokmål and Nynorsk.
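To make the first task concrete: an invoice-parsing item can pair a source document with a structured gold answer and score the model's extraction field by field. The sketch below is an illustration in TypeScript; the field names, schema, and tolerances are assumptions for this example, not TenkiBench's actual format.

```typescript
// Hypothetical shape of the structured fields an invoice-parsing item asks for.
interface InvoiceFields {
  totalAmount: number; // invoice total in NOK
  mvaAmount: number;   // value-added tax (MVA) in NOK
  kid: string;         // KID payment reference number
}

// Score a model's extraction against the gold answer. Numeric fields allow a
// small rounding tolerance; the KID must match exactly after stripping spaces.
function scoreInvoice(pred: InvoiceFields, gold: InvoiceFields): number {
  let correct = 0;
  if (Math.abs(pred.totalAmount - gold.totalAmount) < 0.01) correct++;
  if (Math.abs(pred.mvaAmount - gold.mvaAmount) < 0.01) correct++;
  if (pred.kid.replace(/\s+/g, "") === gold.kid.replace(/\s+/g, "")) correct++;
  return correct / 3; // fraction of fields extracted correctly
}
```

Per-field scoring like this gives partial credit, which is friendlier than all-or-nothing matching when a model gets the amounts right but mangles the reference number.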

Section 04

Technical Architecture and Evaluation Methodology: Fair, Transparent, and Reproducible

Technical Architecture: the frontend uses Next.js + Tailwind CSS; the backend stores data in PostgreSQL; models are accessed via the OpenAI SDK and the Mammouth.ai platform; Caddy serves as the edge server. Evaluation Principles: fair (a unified test set and open-source code), transparent (tasks, code, and results are all public), and reproducible (local running guidelines are provided). Evaluation Methods: different tasks use different scoring methods, including numerical matching with regular expressions, LLM-as-judge, JSON schema validation, and expert review.
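As a rough illustration of the "numerical matching + regex" scoring method mentioned above—useful for tasks like tax calculation, where the model answers in free text—here is a minimal TypeScript sketch. The regex, number normalization, and tolerance are assumptions for this example, not TenkiBench's actual implementation.

```typescript
// Pull the last number out of a free-text answer, accepting both Norwegian
// ("1 234,56") and English ("1,234.56") digit-grouping conventions.
function extractNumber(answer: string): number | null {
  const matches = answer.match(/-?\d[\d\s.,]*\d|\d/g);
  if (!matches) return null;
  const raw = matches[matches.length - 1].replace(/\s/g, "");
  // A trailing ",dd" is treated as a decimal comma; otherwise commas are
  // assumed to be thousands separators and dropped.
  const normalized = /,\d{1,2}$/.test(raw)
    ? raw.replace(/\./g, "").replace(",", ".")
    : raw.replace(/,/g, "");
  return Number(normalized);
}

// Mark an answer correct if its number is within a relative tolerance of the
// gold value, so formatting differences alone do not fail a model.
function numericMatch(answer: string, gold: number, relTol = 1e-4): boolean {
  const value = extractNumber(answer);
  return value !== null && Math.abs(value - gold) <= Math.abs(gold) * relTol;
}
```

Taking the last number in the answer is a common heuristic, since models often restate intermediate figures before the final result; an LLM-as-judge pass can handle the cases this misses.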


Section 05

Usage Guide: How to Use TenkiBench

  1. Check Public Leaderboard: Visit bench.tenki.no to get the performance of various models.
  2. Run Evaluation Locally: After installing dependencies, run evaluations for specific models/categories via commands (e.g., pnpm bench:run --model=gpt-5).
  3. Contribute New Tasks: Submit proposals for new tasks related to Norwegian business scenarios by referring to the project guidelines.

Section 06

Limitations and Future Outlook

Limitations: tasks are currently limited to Norwegian-language scenarios, and some held-out data is kept private to prevent overfitting. Future directions: expansion to other Nordic and European languages, industry-specific tracks (e.g., medical, legal), dynamic evaluation (tasks updated over time), and multimodality (processing images and scanned documents).


Section 07

Conclusion: The Evolutionary Significance of Regional Evaluation Benchmarks

TenkiBench reflects a broader shift in LLM evaluation from general capabilities toward scenario-based, regional, practical capabilities. It gives Norwegian SMBs an objective reference for model selection and helps the AI community map the boundaries of localization. More regional and industry-specific benchmarks of this kind will help AI genuinely serve real business needs. Project link: https://github.com/tenki-labs/tenkibench | Online evaluation: https://bench.tenki.no