Interpreting History with AI: How Large Language Models Classify 19th-Century Swedish Patent Documents

A research project combining the KB-BERT model and generative large language models has successfully automated the classification of 19th-century Swedish historical patents, demonstrating the potential of AI in the digitization and knowledge mining of historical documents.

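Tags: large language models, historical documents, patent classification, BERT, digital humanities, Swedish, NLP, text classification, KB-BERT, pre-trained models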

Published: 2026-05-27 16:44
Recent activity: 2026-05-27 16:51
Estimated read: 7 min

Section 01

Introduction: Core Achievements and Significance of AI Classification for 19th-Century Swedish Patent Documents

Basic Information

Original Author/Maintainer: yuntingxie
Source Platform: GitHub
Original Title: patent_classification
Original Link: https://github.com/yuntingxie/patent_classification
Publication Date: May 27, 2026
Related Paper: "You have no class! Large Language Model Classification of Nineteenth Century Patents in Sweden, 1852-1914"

Section 02

Project Background and Research Significance

Digitizing and automatically analyzing historical documents is a central topic in digital humanities. The Swedish Historical Patent Infrastructure Project preserves a large body of patent documents from 1852 to 1914, which record the course of technological innovation during the Industrial Revolution. Classifying these documents by hand is slow, labor-intensive, and requires specialist knowledge. This project explores whether large language models can automate that classification, and evaluates how well they do so.

Section 03

Technical Solution and Implementation Methods

Core Models

KB-BERT Fine-tuning Scheme: Starting from the KB-BERT model trained by the National Library of Sweden, the model takes patent titles as input and is fine-tuned with supervision to predict classes of the DPK classification system.
Generative Large Language Model Scheme: Guiding the generative model to output classification results through prompt engineering.

Data Processing

Data Source: Swedish Historical Patent Infrastructure (https://svenskahistoriskapatent.se/) patent documents from 1852 to 1914
Classification System: DPK (Deutsche Patentklassifikation, the German patent classification), the historical patent classification standard used for these documents
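Before titles and DPK labels can be fed to a classifier, the labels have to be turned into integer ids. The following is a small sketch of that step using only the standard library; the record layout, the label strings, and the example titles are invented here and do not come from the project's data.

```python
from collections import Counter


def normalize_title(title: str) -> str:
    """Collapse runs of whitespace and trim the ends; casing is left untouched."""
    return " ".join(title.split())


def build_label_maps(labels: list[str]) -> tuple[dict[str, int], dict[int, str]]:
    """Map each distinct class label to a stable integer id, and back."""
    unique = sorted(set(labels))
    label2id = {label: i for i, label in enumerate(unique)}
    id2label = {i: label for label, i in label2id.items()}
    return label2id, id2label


# Invented records: (title, class label) pairs, not taken from the real data.
records = [
    ("  Ångmaskin   med kondensor ", "class_b"),
    ("Symaskin", "class_a"),
    ("Ångpanna", "class_b"),
]

titles = [normalize_title(t) for t, _ in records]
label2id, id2label = build_label_maps([c for _, c in records])
targets = [label2id[c] for _, c in records]

# Class frequencies are worth inspecting before training, since historical
# classification schemes are often heavily imbalanced.
class_counts = Counter(c for _, c in records)
```

Casing is deliberately preserved in this sketch because the KB-BERT checkpoint referenced in this article is a cased model.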

Technical Details

Environment Requirements: Python 3.10+, dependencies include pandas, numpy, torch, transformers, scikit-learn, tqdm
Hardware Support: NVIDIA T4 GPU or CPU
KB-BERT Acquisition: https://huggingface.co/KB/bert-base-swedish-cased
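Under the environment listed above, loading KB-BERT for classification might look like the sketch below. It uses the standard Hugging Face transformers auto classes with the checkpoint name from the link above; the helper names pick_device and load_kb_bert are assumptions, and the project's own loading code may differ.

```python
def pick_device(cuda_available: bool) -> str:
    """Prefer the GPU when one is present, otherwise fall back to the CPU."""
    return "cuda" if cuda_available else "cpu"


def load_kb_bert(num_labels: int):
    """Load the tokenizer and a sequence-classification head on top of KB-BERT.

    Imports are local so that this module can be imported without torch or
    transformers installed; calling the function downloads the checkpoint.
    """
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    name = "KB/bert-base-swedish-cased"
    tokenizer = AutoTokenizer.from_pretrained(name)
    # The classification head is newly initialized and must be fine-tuned.
    model = AutoModelForSequenceClassification.from_pretrained(
        name, num_labels=num_labels
    )
    device = pick_device(torch.cuda.is_available())
    return tokenizer, model.to(device), device
```

Here num_labels would be the number of distinct DPK classes in the training data. A call such as load_kb_bert(num_labels=len(label2id)), with label2id being a hypothetical label-to-id mapping, returns a model ready for fine-tuning on either device.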

Section 04

Research Results and Academic Value

Key Findings

According to the project, the fine-tuned KB-BERT model classified 19th-century Swedish patents into technical categories effectively, which supports the use of pre-trained models for historical document processing.
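Classification quality of this kind is typically summarized with accuracy and macro-averaged F1. The sketch below computes both in plain Python so that the definitions are explicit. The metric choice and the toy labels are assumptions, and nothing here reproduces results from the project.

```python
def accuracy(y_true: list[str], y_pred: list[str]) -> float:
    """Fraction of predictions that match the reference label (non-empty input)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for cls in sorted(set(y_true)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); the guard is defensive, since denom
        # is always positive for a class that occurs in y_true.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, invented for illustration only.
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
```

Macro-F1 weights every class equally, so it shows weak performance on rare classes that overall accuracy can hide. Applying the same two functions to the predictions of the fine-tuned and the generative model gives a like-for-like comparison.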

Academic Contributions

Methodological Innovation: Applying modern NLP technology to historical document research
Dataset Construction: Providing reusable technical solutions
Interdisciplinary Integration: Connecting computer science and history

Data Openness

The team commits to releasing the complete dataset along with a data paper to facilitate subsequent research.

Section 05

Application Prospects and Implications

Digitization of Historical Documents

It can be extended to large-scale historical document digitization tasks such as ancient book classification, archive organization, and topic modeling of historical newspapers.

Digital Humanities Paradigm

AI technology greatly improves the efficiency of document organization, allowing researchers to focus on in-depth analysis and knowledge discovery.

Low-Resource Language Processing

KB-BERT's results offer a reference point for medium-resource languages such as Swedish: fine-tuning a language-specific pre-trained model on domain data can give practically useful results.

Section 06

Summary of Technical Highlights

Domain Adaptation: Optimizing the model for the special language style of 19th-century Swedish patents
Multi-Model Comparison: Systematically comparing the performance differences between discriminative (KB-BERT) and generative models
Reproducibility: The published code and the planned data release support reproduction of the research
Cross-Language Application: Demonstrating the effectiveness of pre-trained models in historical low-resource language processing