
poster2json: Extracting Structured Metadata from Academic Posters Using Large Language Models

poster2json is an open-source tool specifically designed to extract structured metadata from academic conference posters (in PDF or image format) and convert it into machine-readable JSON format. The project combines vision-language models with a specially trained JSON structuring model to achieve high-precision digitization of academic content.

Tags: academic posters, OCR, metadata extraction, Llama-3.1, Qwen2-VL, DataCite, JSON Schema, research tools

Published 2026-04-07 04:43 | Recent activity 2026-04-07 04:51 | Estimated read 11 min


Section 01

Introduction: poster2json—An Open-Source Tool for Extracting Metadata from Academic Posters

poster2json is an open-source tool that extracts structured metadata from academic conference posters (in PDF or image format) and converts it into machine-readable JSON. It pairs vision-language models with a purpose-trained JSON structuring model to digitize poster content with high precision, making that content easier to discover, cite, and analyze.

Section 02

Pain Points in Academic Poster Digitization and Limitations of Traditional Solutions

Academic conference posters are important carriers for disseminating scientific research results, but they have long faced a structural problem: posters exist in visual PDF or image formats, and key information such as titles, authors, institutions, abstracts, methods, and results cannot be directly read and processed by machines. This "visually rich but semantically closed" characteristic severely limits the discoverability, citability, and analyzability of academic content.

Traditional solutions rely on manual entry or simple OCR technology; the former is costly and difficult to scale, while the latter has limited accuracy in the face of complex academic layouts. With the rapid development of large language models and multimodal vision models, automated and high-precision poster content extraction has become possible. The poster2json project is a typical representative of this technological trend.

Section 03

Technical Route and Core Capabilities of poster2json

The core goal of poster2json is to convert scientific posters into structured JSON data that complies with the poster-json-schema standard, which is based on the widely adopted DataCite 4.7 metadata specification. The project uses a multi-model architecture, choosing the most suitable model for each type of input and each extraction task.

For JSON structuring tasks, the project uses a specially fine-tuned Llama-3.1-8B-Poster-Extraction model. This model has been specifically trained on academic poster corpora, enabling it to understand the organizational structure of academic content and organize extracted text information into compliant JSON objects.

For image-format posters, the project uses the Qwen2-VL-7B vision-language model for OCR. The model's visual understanding lets it handle the mixed text-and-image layouts typical of posters, locating text regions and extracting their content.

For PDF-format posters, the project uses the pdfalto tool for layout-aware text extraction, which preserves the document's structural information instead of simply outputting plain text. This multi-stage, multi-model processing flow ensures high-quality extraction results under various input conditions.
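The routing described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the project's actual code: the function names, the stage labels, and the set of accepted file extensions are all assumptions made for the example.

```python
from pathlib import Path

# Hypothetical stage labels and extension sets; poster2json's real
# internals and supported formats may differ.
PDF_SUFFIXES = {".pdf"}
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}
STRUCTURING_STAGE = "llama-3.1-8b-json-structuring"


def choose_text_extractor(path: str) -> str:
    """Pick the text-extraction stage from the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix in PDF_SUFFIXES:
        return "pdfalto"       # layout-aware text extraction for PDFs
    if suffix in IMAGE_SUFFIXES:
        return "qwen2-vl-ocr"  # vision-language OCR for images
    raise ValueError(f"unsupported poster format: {suffix!r}")


def plan_pipeline(path: str) -> list[str]:
    """List the stages a poster would pass through, in order."""
    return [choose_text_extractor(path), STRUCTURING_STAGE]


if __name__ == "__main__":
    print(plan_pipeline("poster.pdf"))
    print(plan_pipeline("poster.PNG"))
```

Whatever the input type, the final stage is the same: the extracted text is handed to the structuring model, so only the first stage depends on the file format.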

Section 04

Standardized Output Format and Downstream Applications

The output of poster2json strictly follows the poster-json-schema standard, a metadata schema specifically designed for academic posters. The output JSON includes the following main fields:

creators: Author information, including name, affiliated institution, etc.
titles: Poster title, supporting multiple languages
content: Poster body, organized into sections such as abstract, methods, and results
imageCaptions: Image captions
tableCaptions: Table captions

This standardized output format allows the extracted data to be seamlessly integrated into downstream applications such as academic search engines, knowledge graphs, and literature management systems. Researchers can perform advanced analysis tasks such as citation analysis, topic clustering, and trend tracking based on these structured data.
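To make the field list concrete, here is a minimal sketch of what a record might look like and how a downstream consumer could check that the top-level fields are present. Only the five top-level field names come from the description above; the nested keys and all sample values are invented for illustration, and the real poster-json-schema should be consulted for the actual structure.

```python
import json

# Top-level fields named in this article; the full schema may define more.
EXPECTED_FIELDS = ("creators", "titles", "content", "imageCaptions", "tableCaptions")

# Invented sample record. The nested key names are illustrative guesses,
# not the real poster-json-schema layout.
sample = {
    "creators": [{"name": "Doe, Jane", "affiliation": ["Example University"]}],
    "titles": [{"title": "An Example Poster", "lang": "en"}],
    "content": {"sections": [{"heading": "Abstract", "text": "Sample text."}]},
    "imageCaptions": [],
    "tableCaptions": [],
}


def missing_fields(record: dict) -> list[str]:
    """Return the expected top-level fields that the record lacks."""
    return [field for field in EXPECTED_FIELDS if field not in record]


if __name__ == "__main__":
    # Round-trip through JSON text, as a downstream consumer would receive it.
    loaded = json.loads(json.dumps(sample))
    print(missing_fields(loaded))
```

A presence check like this is much weaker than validating against the published schema, but it shows the kind of cheap sanity test an indexing pipeline could run before ingesting a record.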

Section 05

Performance Evaluation Results and Accuracy Verification

The project team evaluated poster2json on a test set of 10 manually annotated academic posters. The evaluation metrics include word capture rate, ROUGE-L score, number capture rate, and field proportion.

The test results show that poster2json meets or exceeds the preset thresholds in all metrics: Word Capture rate reaches 0.96 (threshold 0.75), ROUGE-L score is 0.89 (threshold 0.75), and number capture rate is 0.93 (threshold 0.75). In terms of overall pass rate, all 10 test posters passed the verification, achieving a 100% pass rate.
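For readers unfamiliar with these metrics, the sketch below shows one common way to compute two of them. It is an assumption-laden illustration rather than the project's evaluation code: word capture is taken here to mean the fraction of reference words that also appear in the extracted text, ROUGE-L is computed as an F1 score over the longest common subsequence of tokens, and the tokenizer is a simple lowercase word split. Only the 0.75 thresholds come from the reported results.

```python
import re
from collections import Counter

# Thresholds as reported in the article; the metric definitions below are assumed.
THRESHOLDS = {"word_capture": 0.75, "rouge_l": 0.75}


def tokens(text: str) -> list[str]:
    """Lowercase word tokens (a simplifying assumption)."""
    return re.findall(r"\w+", text.lower())


def word_capture(reference: str, extracted: str) -> float:
    """Fraction of reference words, counted with multiplicity, found in the extraction."""
    ref, ext = Counter(tokens(reference)), Counter(tokens(extracted))
    total = sum(ref.values())
    if total == 0:
        return 1.0
    return sum(min(count, ext[word]) for word, count in ref.items()) / total


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, extracted: str) -> float:
    """ROUGE-L as the F1 of LCS-based precision and recall."""
    ref, ext = tokens(reference), tokens(extracted)
    lcs = lcs_length(ref, ext)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(ext), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def passes(reference: str, extracted: str) -> bool:
    """True when both sketched metrics meet their thresholds."""
    return (word_capture(reference, extracted) >= THRESHOLDS["word_capture"]
            and rouge_l_f1(reference, extracted) >= THRESHOLDS["rouge_l"])
```

For example, comparing the reference "the cat sat on the mat" with the extraction "the cat sat on mat" gives a word capture of 5/6 and a ROUGE-L F1 of 10/11 under these definitions, so that pair would pass both thresholds.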

These results suggest that, at least on this 10-poster test set, poster2json is accurate enough for production use on real-world academic posters.

Section 06

Application Scenarios and Value Extension

poster2json has a wide range of application scenarios. For academic conference organizers, it can batch process submitted posters to build searchable digital archives. For research institutions, it can integrate historical poster resources to establish internal knowledge management systems. For academic search engines, it can expand the indexing scope to include poster content in search results.

Beyond these direct uses, poster2json offers a reusable pattern for automated processing of academic content. Its multi-model architecture, purpose-built JSON schema, and evaluation method can all be carried over to other kinds of academic document processing.

Section 07

System Requirements and Deployment Methods

Because it runs large language model inference, poster2json has certain hardware requirements. The officially recommended configuration is an NVIDIA CUDA-compatible graphics card with at least 16GB of VRAM, at least 32GB of system memory, and Python 3.10 or higher. Supported operating systems are Linux, macOS, and Windows (via WSL2).

The project uses Poetry for dependency management, and the installation process is relatively simple. Users can directly install the released version via pip, or clone the source code and use Poetry to install development dependencies. The project also provides a convenient command-line interface that supports functions such as single-file extraction, batch processing, and result verification.

Section 08

Open-Source Ecosystem and Sustainable Development

poster2json is developed and maintained by the fairdataihub team and uses the MIT open-source license. The project code is hosted on GitHub and accepts community contributions. The development team also released the supporting poster-json-schema standard to promote the standardization of academic poster metadata.

The project has received funding from The Navigation Fund, demonstrating the academic community's attention to such infrastructure tools. With the continuous progress of open-source models and the advancement of the academic open data movement, tools like poster2json will play an increasingly important role in the dissemination of academic knowledge.
