TenkiBench: An Open-Source Large Language Model Evaluation Benchmark for Norwegian Small and Medium Enterprises

This article introduces TenkiBench, an open-source large language model (LLM) evaluation benchmark built for the real-world business scenarios of Norwegian small and medium enterprises (SMBs), covering practical tasks such as invoice parsing, contract analysis, and tax calculation.

Tags: LLM evaluation benchmark · Norwegian · SMB · invoice parsing · contract analysis · tax calculation · open-source project
Published 2026-05-06 09:43 · Last activity 2026-05-06 10:24 · Estimated read: 7 min
Section 01

TenkiBench: Guide to the Open-Source LLM Evaluation Benchmark for Norwegian Small and Medium Enterprises

TenkiBench is an open-source large language model evaluation benchmark developed and maintained by Tenki Labs, built specifically for the real-world business scenarios of Norwegian small and medium enterprises (SMBs). It fills a gap left by general-purpose benchmarks, which rarely cover regional and industry-specific scenarios, and spans eight core tasks such as invoice parsing, contract analysis, and tax calculation. It gives enterprises an objective reference for selecting AI tools and points AI developers toward concrete localization improvements.


Section 02

Project Background: Needs of Norwegian SMBs and Limitations of General Benchmarks

Mainstream LLM evaluation benchmarks (e.g., GLUE, MMLU) mostly focus on general knowledge or English-language scenarios, which makes them a poor fit for specific regional industries. Norway's business environment is distinctive: complex tax regulations, two official written standards of Norwegian (Bokmål and Nynorsk), and localized document-processing needs tied to institutions such as the Brønnøysund Register Centre. Norwegian SMBs care about whether AI can accurately handle local invoices, tax calculations, legal clauses, and the like—this is the context in which TenkiBench was created.


Section 03

Core Tasks: Eight Categories Covering Daily Scenarios of SMBs

TenkiBench contains eight evaluation tasks:

  1. Invoice Parsing: Extract structured fields such as the total amount, MVA (value-added tax) amount, and KID payment reference number.
  2. Contract Analysis: Identify risk clauses in NDAs, employment contracts, etc.
  3. Tax Calculation: Assess understanding of Norwegian tax rules, such as applying the correct VAT rate and judging deductibility.
  4. Legal Citation Recognition: Accurately identify citations of Norwegian legal provisions.
  5. Business Registration Query: Parse business information from the Brønnøysund Register Centre.
  6. Human Resources and Compensation: Answer questions related to labor laws (e.g., sick pay, annual leave).
  7. Customer Service Tone Optimization: Adjust responses to comply with Norwegian business communication norms.
  8. Bilingual Translation: Accurate translation between Bokmål and Nynorsk.
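To make the first task concrete: an invoice-parsing item can pair a source document with a structured gold answer and score the model's extraction field by field. The sketch below is an illustration in TypeScript; the field names, schema, and tolerances are assumptions for this example, not TenkiBench's actual format.

```typescript
// Hypothetical shape of the structured fields an invoice-parsing item asks for.
interface InvoiceFields {
  totalAmount: number; // invoice total in NOK
  mvaAmount: number;   // value-added tax (MVA) in NOK
  kid: string;         // KID payment reference number
}

// Score a model's extraction against the gold answer. Numeric fields allow a
// small rounding tolerance; the KID must match exactly after stripping spaces.
function scoreInvoice(pred: InvoiceFields, gold: InvoiceFields): number {
  let correct = 0;
  if (Math.abs(pred.totalAmount - gold.totalAmount) < 0.01) correct++;
  if (Math.abs(pred.mvaAmount - gold.mvaAmount) < 0.01) correct++;
  if (pred.kid.replace(/\s+/g, "") === gold.kid.replace(/\s+/g, "")) correct++;
  return correct / 3; // fraction of fields extracted correctly
}
```

Per-field scoring like this gives partial credit, which is friendlier than all-or-nothing matching when a model gets the amounts right but mangles the reference number.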

Section 04

Technical Architecture and Evaluation Methodology: Fair, Transparent, and Reproducible

Technical Architecture: the frontend uses Next.js + Tailwind CSS; the backend stores data in PostgreSQL; models are accessed via the OpenAI SDK and the Mammouth.ai platform; Caddy serves as the edge server. Evaluation Principles: fair (a unified test set and open-source code), transparent (tasks, code, and results are all public), and reproducible (local running guidelines are provided). Evaluation Methods: different tasks use different scoring methods, including numerical matching with regular expressions, LLM-as-judge, JSON schema validation, and expert review.
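As a rough illustration of the "numerical matching + regex" scoring method mentioned above—useful for tasks like tax calculation, where the model answers in free text—here is a minimal TypeScript sketch. The regex, number normalization, and tolerance are assumptions for this example, not TenkiBench's actual implementation.

```typescript
// Pull the last number out of a free-text answer, accepting both Norwegian
// ("1 234,56") and English ("1,234.56") digit-grouping conventions.
function extractNumber(answer: string): number | null {
  const matches = answer.match(/-?\d[\d\s.,]*\d|\d/g);
  if (!matches) return null;
  const raw = matches[matches.length - 1].replace(/\s/g, "");
  // A trailing ",dd" is treated as a decimal comma; otherwise commas are
  // assumed to be thousands separators and dropped.
  const normalized = /,\d{1,2}$/.test(raw)
    ? raw.replace(/\./g, "").replace(",", ".")
    : raw.replace(/,/g, "");
  return Number(normalized);
}

// Mark an answer correct if its number is within a relative tolerance of the
// gold value, so formatting differences alone do not fail a model.
function numericMatch(answer: string, gold: number, relTol = 1e-4): boolean {
  const value = extractNumber(answer);
  return value !== null && Math.abs(value - gold) <= Math.abs(gold) * relTol;
}
```

Taking the last number in the answer is a common heuristic, since models often restate intermediate figures before the final result; an LLM-as-judge pass can handle the cases this misses.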


Section 05

Usage Guide: How to Use TenkiBench

  1. Check Public Leaderboard: Visit bench.tenki.no to get the performance of various models.
  2. Run Evaluation Locally: After installing dependencies, run evaluations for specific models/categories via commands (e.g., pnpm bench:run --model=gpt-5).
  3. Contribute New Tasks: Submit proposals for new tasks related to Norwegian business scenarios by referring to the project guidelines.

Section 06

Limitations and Future Outlook

Limitations: tasks are currently limited to Norwegian-language scenarios, and some held-out data is kept private to prevent overfitting. Future directions: expansion to other Nordic and European languages, industry-specific tracks (e.g., medical, legal), dynamic evaluation (tasks updated over time), and multimodality (processing images and scanned documents).


Section 07

Conclusion: The Evolutionary Significance of Regional Evaluation Benchmarks

TenkiBench reflects a broader shift in LLM evaluation from general capabilities toward scenario-based, regional, practical capabilities. It gives Norwegian SMBs an objective reference for model selection and helps the AI community map the boundaries of localization. More regional and industry-specific benchmarks of this kind will help AI genuinely serve real business needs. Project link: https://github.com/tenki-labs/tenkibench | Online evaluation: https://bench.tenki.no