Zing Forum

End-to-End Pipeline for Knowledge Graph Construction from Unstructured Documents Using Large Language Models

A master's thesis research project that explores how to use large language models to extract structured knowledge from unstructured documents and build a complete data pipeline for large-scale knowledge graphs.

Tags: Knowledge Graph · Large Language Models · Information Extraction · Unstructured Data · NLP · Entity Recognition · Relation Extraction · Data Pipeline · Graph Database
Published 2026-05-16 06:42 · Last activity 2026-05-16 06:49 · Estimated read: 10 min

Section 01

Research Guide to End-to-End Pipeline for Knowledge Graph Construction from Unstructured Documents Using Large Language Models

This master's thesis project explores how to use large language models (LLMs) to extract structured knowledge from unstructured documents and build a complete data pipeline for large-scale knowledge graphs. It aims to address issues in traditional knowledge graph construction such as high manual annotation costs, poor generalization of rule-based systems, and difficulty in maintenance and updates. By integrating LLM capabilities, it achieves end-to-end automated conversion from unstructured documents to structured knowledge bases, with application value in multiple scenarios like enterprise knowledge management and scientific literature analysis.

Section 02

Research Background and Problem Definition

Research Background

In the era of information explosion, enterprises and research institutions accumulate massive volumes of unstructured documents (PDFs, Word files, scanned documents, etc.). These contain valuable knowledge but lack structured representation, making them difficult for machines to retrieve and reason over effectively.

Traditional Method Challenges

  • High manual annotation cost: Requires domain experts to annotate entity relationships for each document one by one
  • Poor generalization of rule-based systems: Regular expression/template extraction struggles to handle diverse document formats
  • Difficulty in maintenance and updates: Knowledge bases become outdated easily, and maintenance costs accumulate continuously

The emergence of LLMs provides new possibilities to solve these problems, and this project explores integrating LLMs into the knowledge graph construction pipeline.

Section 03

End-to-End Pipeline Architecture

The pipeline designed in the project includes five core phases:

Phase 1: Document Ingestion and Preprocessing

  • Format recognition and unified conversion (PDF to text, OCR, scanned document processing)
  • Document structure parsing (chapter identification, table extraction)
  • Noise cleaning (header/footer removal, encoding repair)
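One way to sketch the noise-cleaning step is to detect lines that repeat across most pages (typical of headers and footers) and strip them. This is a minimal illustration of the idea, not the thesis's actual preprocessing code:

```python
from collections import Counter

def strip_repeated_lines(pages, threshold=0.6):
    """Remove lines (e.g. headers/footers) that repeat across pages.

    `pages` is a list of per-page text strings; a line appearing on
    more than `threshold` of the pages is treated as boilerplate.
    """
    counts = Counter()
    for page in pages:
        # Count each distinct line at most once per page.
        for line in set(page.splitlines()):
            counts[line.strip()] += 1
    cutoff = threshold * len(pages)
    cleaned = []
    for page in pages:
        kept = [ln for ln in page.splitlines()
                if counts[ln.strip()] <= cutoff or not ln.strip()]
        cleaned.append("\n".join(kept))
    return cleaned
```

A frequency threshold like this is robust to headers that vary slightly (page numbers change, so they fall below the cutoff), while a fixed template would not be.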

Phase 2: Document Chunking and Semantic Segmentation

This phase balances chunk granularity, semantic integrity, and overlap strategy, and compares chunking approaches based on fixed length, semantic similarity, and document structure.
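The fixed-length baseline with overlap can be sketched in a few lines; the semantic and structural variants would replace the split points, but the overlap bookkeeping stays the same (illustrative sketch, not the thesis code):

```python
def chunk_text(text, size=200, overlap=50):
    """Fixed-length chunking with overlap between adjacent chunks.

    Overlap preserves context that would otherwise be cut at chunk
    boundaries, at the cost of some duplicated tokens.
    """
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```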

Phase 3: Entity and Relationship Extraction

This is the core stage of the pipeline, leveraging LLM reasoning capabilities:

  • Entity extraction: Identify key concepts such as names of people and organizations
  • Relationship extraction: Discover associations between entities (e.g., "belongs to", "cooperates with")
  • Attribute extraction: Extract entity features (e.g., establishment date)

A few-shot prompting strategy guides the model through these tasks.
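A few-shot extraction prompt can be assembled by prepending worked examples to the target text. The triple schema and wording below are illustrative assumptions, not the thesis's actual prompts:

```python
# Hypothetical few-shot examples; real ones would come from the target domain.
FEW_SHOT_EXAMPLES = [
    ("Microsoft was founded by Bill Gates in 1975.",
     '[{"head": "Bill Gates", "relation": "founded", "tail": "Microsoft"}]'),
]

def build_extraction_prompt(text, examples=FEW_SHOT_EXAMPLES):
    """Assemble a few-shot prompt for entity/relation extraction."""
    parts = ["Extract (head, relation, tail) triples as JSON.\n"]
    for src, out in examples:
        parts.append(f"Text: {src}\nTriples: {out}\n")
    parts.append(f"Text: {text}\nTriples:")
    return "\n".join(parts)
```

Requesting JSON output makes the model's answer machine-parseable, which matters downstream when triples are loaded into the graph.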

Phase 4: Knowledge Fusion and Deduplication

  • Entity alignment: Identify multiple mentions of the same object (e.g., "微软" [Microsoft], "Microsoft")
  • Relationship disambiguation: Handle semantic differences in different contexts
  • Conflict resolution: Evaluate credibility to resolve factual conflicts
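The entity-alignment step can be approximated by combining surface normalization with an explicit alias table for cases normalization cannot catch, such as the cross-lingual "微软" → "Microsoft" example above. This is a minimal sketch; a production aligner would also use embedding similarity:

```python
def align_entities(mentions, alias_map=None):
    """Map surface mentions to canonical entity names.

    Lowercasing/whitespace stripping merges trivial variants; the
    alias table handles aliases and cross-lingual mentions.
    """
    alias_map = {k.lower(): v for k, v in (alias_map or {}).items()}
    canonical = {}
    for m in mentions:
        key = m.strip().lower()
        canonical[m] = alias_map.get(key, key.title())
    return canonical
```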

Phase 5: Graph Storage and Query

Stored in graph databases (e.g., Neo4j), supporting complex queries, reasoning, and visualization.
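Loading extracted triples into Neo4j typically uses idempotent `MERGE` statements, so re-running the pipeline does not duplicate nodes. The helper below only renders the Cypher; a real loader would execute it with parameters via the Neo4j driver rather than interpolate values into strings:

```python
def triple_to_cypher(head, relation, tail):
    """Render one extracted triple as an idempotent Cypher MERGE.

    Relationship types in Cypher cannot be parameterized, so the
    relation name is normalized into the statement itself; head and
    tail are left as $head/$tail query parameters.
    """
    rel = relation.upper().replace(" ", "_")
    return (
        f"MERGE (h:Entity {{name: $head}}) "
        f"MERGE (t:Entity {{name: $tail}}) "
        f"MERGE (h)-[:{rel}]->(t)"
    )
```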

Section 04

Role of Large Language Models in the Pipeline

Compared to traditional NLP pipelines, LLMs bring a paradigm shift:

From "Training Task-Specific Models" to "Calling General Capabilities"

Traditionally, task-specific models need to be trained for each task; LLMs adapt to multiple tasks through prompt engineering, reducing development costs.
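Concretely, "calling general capabilities" can mean keeping one model fixed and swapping only the instruction per task. The task wording below is an illustrative assumption:

```python
# One general model, many tasks: only the instruction changes.
TASK_INSTRUCTIONS = {
    "ner": "List the named entities (people, organizations, places).",
    "re": "List relations between entities as (head, relation, tail).",
}

def make_prompt(task, text):
    """Build a prompt for the given task over the same underlying model."""
    return f"{TASK_INSTRUCTIONS[task]}\n\nText: {text}\nAnswer:"
```

Adding a task becomes a dictionary entry rather than a model training run, which is the cost reduction the paragraph describes.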

From "Closed Label Set" to "Open-Domain Extraction"

Pretrained models only recognize predefined labels; LLMs can understand natural language instructions, support open-domain entity/relationship definitions, and improve flexibility.

From "Local Context" to "Global Understanding"

LLMs' large context windows support cross-paragraph/chapter reasoning, extracting implicit relationships that are hard to find with traditional methods.

Section 05

Technical Challenges and Countermeasures

Challenge 1: Hallucinations and Factual Accuracy

  • Citation tracing: Require the model to label the source location of information
  • Confidence scoring: Evaluate the credibility of extracted results
  • Manual review: High-confidence results are automatically stored, while low-confidence ones undergo manual review
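The confidence-based routing in the last bullet reduces to a threshold check; the 0.8 cutoff here is an illustrative assumption, and a real system would tune it against the review workload:

```python
def route_extraction(triple, confidence, threshold=0.8):
    """Route an extracted triple by confidence score.

    Results at or above the threshold are stored automatically;
    the rest are queued for manual review.
    """
    if confidence >= threshold:
        return ("store", triple)
    return ("review", triple)
```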

Challenge 2: Cost and Efficiency Trade-off

  • Layered processing: Lightweight rules filter irrelevant content, and complex segments call LLMs
  • Batch processing and caching: Merge similar requests and cache repeated queries
  • Model selection: Use lightweight models for simple tasks and powerful models for complex tasks
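Caching repeated queries, as the second bullet suggests, can be as simple as memoizing the extraction call so identical chunks never hit the API twice. `call_llm` below is a stand-in, not a real client:

```python
import functools

@functools.lru_cache(maxsize=4096)
def cached_extract(chunk):
    """Memoized extraction: repeated chunks reuse the cached result."""
    return call_llm(chunk)

def call_llm(chunk):
    # Placeholder for an LLM API call; an assumption for illustration.
    return f"triples for: {chunk}"
```

In practice documents often contain boilerplate paragraphs repeated across files, so even this naive cache can save a meaningful share of calls.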

Challenge 3: Maintainability of Prompt Engineering

  • Prompt version control system
  • Benchmark test set for extraction quality evaluation
  • A/B testing framework to compare prompt effects
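A prompt version control system can start as a registry keyed by name and version, so every change is explicit and each version can be benchmarked separately. The layout and prompt texts below are a sketch, not the thesis's system:

```python
# Versioned prompt registry; entries are illustrative.
PROMPTS = {
    ("extract_triples", "v1"): "Extract triples from: {text}",
    ("extract_triples", "v2"): "List all (head, relation, tail) triples in: {text}",
}

def get_prompt(name, version):
    """Look up a versioned prompt template for a given task."""
    return PROMPTS[(name, version)]
```

Keeping versions side by side is what makes the A/B testing in the third bullet possible: both variants can be run against the same benchmark set.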
Section 06

Application Scenarios and Value

This pipeline can be applied to knowledge-intensive scenarios:

  • Enterprise Knowledge Management: Convert tacit knowledge in department reports and emails into queryable graphs
  • Scientific Literature Analysis: Extract research trends, author collaborations, and technology evolution from papers
  • Compliance and Audit: Organize key clauses and their associations in contracts and regulations
  • Intelligence Analysis: Integrate open-source intelligence to build association networks of people, organizations, and events

Section 07

Limitations and Future Directions

Current Limitations

  • Multilingual document support needs to be enhanced
  • Real-time incremental update mechanism is not yet perfect
  • Linking and fusion with external knowledge bases (e.g., Wikidata) can be deepened

Future Research Directions

  • Explore multimodal LLMs for processing documents with mixed text and images
  • Introduce agent architecture to achieve active knowledge verification
  • Develop domain-adaptive few-shot learning strategies
Section 08

Summary

This master's thesis project demonstrates the strong potential of LLMs in knowledge engineering. By embedding LLMs into an end-to-end pipeline, it automates much of the knowledge graph construction work that traditionally required extensive manual effort. Although fully autonomous "machine reading" remains some way off, the project takes an important step in that direction.