Zing Forum

Research on Data-Constrained File Fragment Classification of Heterogeneous File Types Using Large Language Models

A research team from Hong Kong has open-sourced the complete dataset and experimental evaluation results for file fragment classification using large language models, providing a new technical path for the fields of digital forensics and file recovery.

Tags: Large Language Models · File Fragment Classification · Digital Forensics · Data Recovery · Heterogeneous File Types · Machine Learning · Deep Learning
Published 2026-04-14 09:13 · Recent activity 2026-04-14 09:21 · Estimated read: 7 min

Section 01

Using Large Language Models to Solve File Fragment Classification Challenges: Open-Source Achievements from Hong Kong Team Empower Digital Forensics

This article introduces a study by a Hong Kong research team that applies large language models to data-constrained classification of file fragments from heterogeneous file types. The team has open-sourced the complete dataset and experimental evaluation results, offering a new technical path for digital forensics and file recovery. The study analyzes the limitations of traditional methods in file fragment classification, explores how large language models can address them, verifies their effectiveness through experiments, and proposes future research directions.

Section 02

Research Background: Technical Challenges in File Fragment Classification

In digital forensics and data recovery, file fragment classification is a major challenge. When storage media are damaged or metadata is lost, only scattered fragments can be recovered, and traditional methods that rely on file-header magic numbers or signatures fail, because most fragments come from arbitrary offsets inside a file. Heterogeneous file types (documents, images, videos, etc.) also differ greatly in internal structure, which limits the effectiveness of traditional machine learning methods. In data-constrained scenarios, where labeled samples are scarce, the task becomes harder still.
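To make the failure mode concrete, here is a minimal sketch (not code from the study) of signature-based identification. The magic numbers are real format signatures, but they only appear at the start of a file, so a fragment cut from the middle of the same file is unidentifiable:

```python
# Minimal sketch: why signature-based identification fails on fragments.
MAGIC = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
    b"%PDF-": "pdf",
    b"PK\x03\x04": "zip/docx",
}

def identify_by_magic(fragment: bytes) -> str:
    """Return a type guess only if the fragment starts with a known signature."""
    for sig, ftype in MAGIC.items():
        if fragment.startswith(sig):
            return ftype
    return "unknown"

# A fragment taken from the start of a PNG is identified...
header_fragment = b"\x89PNG\r\n\x1a\n" + b"\x00" * 504
print(identify_by_magic(header_fragment))  # png

# ...but a 512-byte fragment from the middle of the same file is not.
middle_fragment = b"\x00" * 512
print(identify_by_magic(middle_fragment))  # unknown
```

This is exactly the gap the study targets: classifying the "unknown" fragments that carry no signature at all.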

Section 03

Advantages of Large Language Models: Breaking Through Limitations of Traditional Methods

Large Language Models (LLMs) have demonstrated strong context understanding and pattern recognition in natural language processing, and they can learn the intrinsic structure of data beyond text. Compared to traditional methods, LLMs offer three major advantages:

1. Pre-training on massive data gives them strong generalization, so they adapt quickly to new tasks from a small number of samples.
2. The attention mechanism captures long-distance dependencies, extracting key features regardless of where they occur in a fragment.
3. Semantic understanding lets them identify the generation logic and patterns behind file types, rather than only surface features.
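For an LLM to see a binary fragment at all, the bytes must first be serialized into tokens. The study does not specify its encoding, so the sketch below shows one common convention (an assumption on our part): rendering each byte as a two-character hex "word" that any text tokenizer can consume:

```python
def fragment_to_tokens(fragment: bytes, max_len: int = 512) -> list[str]:
    """Serialize a byte fragment into hex tokens a text-based LLM can consume.

    max_len caps the sequence length so long fragments fit the model's
    context window (the 512 default mirrors the shortest fragment size
    discussed later in the article).
    """
    return [f"{b:02x}" for b in fragment[:max_len]]

tokens = fragment_to_tokens(b"%PDF-1.7\n")
print(" ".join(tokens))  # 25 50 44 46 2d 31 2e 37 0a
```

With this representation, fragment classification becomes an ordinary sequence classification task, and the attention mechanism can relate bytes that sit far apart in the fragment.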

Section 04

Dataset and Experimental Design: Simulating Real Data-Constrained Scenarios

The study provides a complete dataset and experimental pipeline, with backup data hosted on the Figshare platform to ease reproduction. The dataset covers a range of heterogeneous file types (PDF, DOCX, JPEG, PNG, MP4, MP3, executable files, etc.). The experiments use strict data-constrained settings to simulate scenarios with scarce labeled data, controlling the number and diversity of training samples. Evaluation metrics include accuracy, macro-averaged F1 score, and precision-recall curves, which together reflect performance on imbalanced data.
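Macro-averaged F1 matters here because file-type classes are imbalanced: it averages per-class F1 with equal weight, so a rare type counts as much as a common one. A minimal stdlib-only implementation (ours, not the study's evaluation code) makes the definition explicit:

```python
def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Macro-averaged F1: compute F1 per class, then average with equal
    class weight, so rare file types count as much as common ones."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# Toy example: one "pdf" fragment misclassified as "jpeg".
y_true = ["pdf", "pdf", "jpeg", "mp4", "jpeg", "pdf"]
y_pred = ["pdf", "jpeg", "jpeg", "mp4", "jpeg", "pdf"]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(accuracy, 3), round(macro_f1(y_true, y_pred), 3))  # 0.833 0.867
```

Note that accuracy and macro-F1 diverge as soon as errors concentrate in one class, which is why the study reports both.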

Section 05

Key Findings: Small Models Can Also Perform Well, with Significant Cross-Type Transfer Effects

Experimental results show that relatively small LLMs, after appropriate fine-tuning, perform strongly on file fragment classification. The models' semantic understanding exceeded expectations: for example, they could distinguish JPEG quantization tables from pixel data regions, and PDF text streams from binary object boundaries. Cross-file-type transfer learning was also effective, suggesting deep structural commonalities between types that LLMs capture as abstract patterns. A moderate fragment length (512 bytes to 4 KB) balances information completeness and computational cost.
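The fragment length is a preprocessing choice, so it is worth seeing how samples are cut. The study's exact slicing is not specified; the sketch below uses one common convention (an assumption): fixed-length, non-overlapping fragments, with a short tail discarded so every sample has the same length:

```python
def split_into_fragments(data: bytes, frag_len: int = 512) -> list[bytes]:
    """Cut a file's contents into fixed-length, non-overlapping fragments.

    A tail shorter than frag_len is discarded so every sample is the same
    size. frag_len in the 512..4096 range is the trade-off discussed in
    the article: longer fragments carry more structure per sample but
    cost more to encode and classify.
    """
    return [data[i:i + frag_len]
            for i in range(0, len(data) - frag_len + 1, frag_len)]

fragments = split_into_fragments(b"\x00" * 2048, frag_len=512)
print(len(fragments))  # 4
```

Doubling frag_len halves the number of samples extracted from the same corpus, which is one reason a moderate length also suits the data-constrained setting.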

Section 06

Application Prospects: Benefiting Multiple Fields Including Digital Forensics and Cybersecurity

This research is directly applicable to digital forensics: it can quickly filter and classify recovered file fragments without complete header information, improving triage efficiency. In cybersecurity, it can help detect obfuscated or encrypted malicious files by recognizing type patterns beyond the header. Cloud storage providers can also optimize deduplication and compression by selecting algorithms suited to the detected file type.
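In a forensic workflow, the classifier's role is triage: routing recovered fragments into per-type buckets before deeper analysis. The sketch below is hypothetical (classify_fragment is a stand-in for a trained model, not part of the released artifacts); the toy classifier just separates printable-text-heavy fragments from binary ones:

```python
from collections import defaultdict

def triage(fragments, classify_fragment):
    """Route each recovered fragment into a bucket keyed by predicted type."""
    buckets = defaultdict(list)
    for frag in fragments:
        buckets[classify_fragment(frag)].append(frag)
    return buckets

# Toy stand-in classifier: fragments dominated by printable ASCII are
# treated as text-like; a real deployment would call the trained model here.
def classify_fragment(frag: bytes) -> str:
    printable = sum(32 <= b < 127 for b in frag)
    return "text-like" if printable / max(len(frag), 1) > 0.8 else "binary-like"

buckets = triage([b"hello world", b"\x00\xff" * 8], classify_fragment)
print(sorted(buckets))  # ['binary-like', 'text-like']
```

The same loop works unchanged whether the classifier is a heuristic like this or a fine-tuned LLM, which is what makes the approach easy to slot into existing carving pipelines.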

Section 07

Open Source and Future: Next Steps to Advance the Field

The research team has open-sourced the preprocessed dataset and experimental evaluation tables (GitHub repository), and will open-source the model code after the paper is published. Future directions include: expanding to more file types (especially emerging proprietary formats); exploring multimodal large models to process files with mixed content; developing efficient inference solutions to enable real-time operation on resource-constrained devices. The evolution of LLM technology will bring more innovative solutions to file fragment classification.