Zing Forum

modelscan/registry: An Open-Source Unified Metadata Registry for Large Language Models

Tags: large language model, LLM, metadata, registry, model catalog, OpenAPI, model selection, pricing strategy, open-source project, GitHub, CC BY 4.0
Published 2026-06-05 04:43 | Recent activity 2026-06-05 04:48 | Estimated read 7 min

Section 01

Introduction: modelscan/registry, an Open-Source Unified Metadata Registry for Large Language Models

modelscan/registry is an open-source, machine-readable metadata registry for large language models. It collects each model's identity, author, modalities, context limits, capabilities, and lifecycle information in a uniform format, lets commercial pricing data from multiple sources coexist, and is released under the CC BY 4.0 license.

Section 02

Original Author and Source

  • Original Author/Maintainer: modelscan team
  • Source Platform: GitHub
  • Original Title: registry - Open registry of large-language-model metadata
  • Original Link: https://github.com/modelscan/registry
  • Publication Date: June 4, 2026
  • License Agreement: Creative Commons Attribution 4.0 International (CC BY 4.0)

Section 03

Project Background and Problems

As the large language model (LLM) ecosystem grows explosively, developers and enterprises face a growing challenge: obtaining complete, accurate, and up-to-date metadata for all models in one place.

Each platform (OpenAI, Anthropic, Alibaba Cloud Tongyi Qianwen, Volcano Engine Ark, OpenRouter, and others) maintains its own model list, with inconsistent formats, differing fields, and different update frequencies. Developers end up switching between several sets of API documentation, or even writing their own crawlers, to consolidate the data.

To complicate matters further, the same model may appear under different names (e.g., gpt-4-turbo vs openai/gpt-4-turbo), in different version snapshots, and with different pricing across platforms. This fragmentation raises development costs and easily leads to configuration errors and inaccurate cost estimates.

The modelscan/registry project was created to address this pain point; it aims to build an open, unified, machine-readable metadata registry for large language models.


Section 04

Single Trusted Data Source

The core of the project is a single JSON file, models.json, hosted on GitHub and distributed via CDN. It holds the complete metadata for every listed model, from basic identity information to complex commercial pricing. As of June 2026, the registry covers more than 1197 models, spanning mainstream commercial and open-source models.
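
Because everything lives in one JSON file, consuming the registry can be as simple as parsing it and indexing the records. The sketch below is illustrative only: it assumes the top level of models.json is an array of model objects that each carry an "id" key, which the source does not confirm, and the sample data is made up.

```python
import json
from typing import Any


def index_models(raw_json: str) -> dict[str, dict[str, Any]]:
    """Parse registry JSON text and index the records by their id."""
    records = json.loads(raw_json)
    # Assumption: the top level is a list of model objects with an "id" key.
    return {record["id"]: record for record in records}


# Tiny inline sample standing in for the real file contents (hypothetical data).
sample = '[{"id": "example-model-a"}, {"id": "example-model-b"}]'
models = index_models(sample)
print(len(models), sorted(models))
```

With a downloaded copy of the file, the same function could be fed `Path("models.json").read_text(encoding="utf-8")` from `pathlib`.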

Section 05

Stable Identity Recognition

Each model in the registry has a standardized id that stays stable across platforms. When the same model appears under different names in different sources (e.g., version snapshots with date suffixes), the registry merges them under one base ID and keeps the original names in the alias_id field. As a result, a model is not split across multiple records, and developers can manage and query it in one place.
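
One way to use this design is to build a lookup that resolves any source-specific name to the base ID. The following is a minimal sketch, assuming alias_id holds a list of strings; the record and its alias values are hypothetical and only illustrate the shape.

```python
from typing import Any


def build_alias_index(models: list[dict[str, Any]]) -> dict[str, str]:
    """Map every known name (base id or alias) to the base id."""
    index: dict[str, str] = {}
    for model in models:
        base_id = model["id"]
        index[base_id] = base_id
        # Assumption: alias_id is a list of source-specific names.
        for alias in model.get("alias_id", []):
            index[alias] = base_id
    return index


# Hypothetical record: the values are illustrative, not copied from the registry.
models = [
    {"id": "gpt-4-turbo", "alias_id": ["openai/gpt-4-turbo", "gpt-4-turbo-2024-04-09"]}
]
alias_index = build_alias_index(models)
print(alias_index["openai/gpt-4-turbo"])  # prints "gpt-4-turbo"
```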

Section 06

Dual Currency Pricing Support

To reflect the reality of the global model market, the registry keeps pricing in multiple currencies side by side. USD quotes from OpenRouter and LiteLLM can coexist with CNY quotes from Alibaba Cloud Tongyi Qianwen and Volcano Engine Ark in the same model record. Developers pick the pricing source that fits their needs, and no precision is lost to exchange-rate conversion.
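
Selecting quotes by currency then becomes a plain filter, with no conversion step. In the sketch below, the "currency", "source", and "input_per_million" keys and all values are assumptions made for illustration; only the offers array itself is named in the project description.

```python
from typing import Any


def offers_in_currency(model: dict[str, Any], currency: str) -> list[dict[str, Any]]:
    """Return the offers quoted in the given currency, left unconverted."""
    return [o for o in model.get("offers", []) if o.get("currency") == currency]


# Hypothetical record with invented prices, kept as strings to avoid float rounding.
model = {
    "id": "example-model",
    "offers": [
        {"source": "openrouter", "currency": "USD", "input_per_million": "1.00"},
        {"source": "volcengine-ark", "currency": "CNY", "input_per_million": "7.00"},
    ],
}
cny_offers = offers_in_currency(model, "CNY")
print([o["source"] for o in cny_offers])  # prints ['volcengine-ark']
```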

Section 07

Separation of Facts and Quotes

The registry divides data into two layers. Top-level fields store source-agnostic facts (context length, maximum input/output token counts, supported modalities, and so on), derived by merging data from multiple sources. Commercial data (prices, endpoint paths, rate limits, and so on) lives in the offers array, and each offer names its source so the data stays traceable.
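
The two-layer split can be pictured as a record type whose own fields hold the facts and whose nested list holds the quotes. This is only a sketch of the idea: the class names and every field name other than offers are hypothetical, and the values are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Offer:
    """Commercial data; every offer records where it came from."""
    source: str    # hypothetical field name for the originating platform
    endpoint: str  # hypothetical field name for the endpoint path


@dataclass
class ModelRecord:
    """Source-agnostic facts at the top level, quotes nested under offers."""
    id: str
    context_length: int
    modalities: list[str]
    offers: list[Offer] = field(default_factory=list)

    def offers_from(self, source: str) -> list[Offer]:
        """Trace quotes back to a single source."""
        return [offer for offer in self.offers if offer.source == source]


record = ModelRecord(
    id="example-model",
    context_length=128_000,
    modalities=["text"],
    offers=[
        Offer(source="openrouter", endpoint="/v1/chat/completions"),
        Offer(source="litellm", endpoint="/chat/completions"),
    ],
)
print(record.context_length, [o.endpoint for o in record.offers_from("litellm")])
```

Keeping facts out of the offers means a changed price or a new reseller never touches the fields that describe the model itself.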

Section 08

Tiered and Conditional Pricing

In practice, model pricing is rarely a single flat price. It often has multiple tiers (e.g., by input token volume) or conditions (e.g., video resolution or audio duration). The prices field in the registry is an array in which each element represents one pricing tier and can carry condition thresholds and variant labels, so that real-world pricing schemes are represented accurately.
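
To make the tier idea concrete, here is a minimal sketch of a cost estimate over a tiered prices array. It rests on several assumptions that the source does not confirm: the threshold key is called "max_input_tokens" (None meaning unbounded), the price key is "input_per_million", tiers are ordered from smallest to largest threshold, and the whole request is billed at the rate of the tier it falls into. The prices are invented.

```python
from decimal import Decimal
from typing import Any, Optional


def pick_tier(prices: list[dict[str, Any]], input_tokens: int) -> dict[str, Any]:
    """Return the first tier whose threshold covers the request size."""
    for tier in prices:
        limit: Optional[int] = tier.get("max_input_tokens")
        if limit is None or input_tokens <= limit:
            return tier
    raise ValueError("no pricing tier matches this request")


def input_cost(prices: list[dict[str, Any]], input_tokens: int) -> Decimal:
    """Bill the whole request at the matching tier's per-million rate."""
    tier = pick_tier(prices, input_tokens)
    return Decimal(tier["input_per_million"]) * input_tokens / Decimal(1_000_000)


# Hypothetical tiers: a cheaper rate up to 128k input tokens, a higher one beyond.
prices = [
    {"max_input_tokens": 128_000, "input_per_million": "1.00"},
    {"max_input_tokens": None, "input_per_million": "2.00"},
]
print(input_cost(prices, 100_000), input_cost(prices, 200_000))
```

Decimal is used instead of float so that small per-token prices do not pick up binary rounding error when multiplied by large token counts.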