Robust Semantic Steganography Based on Large Language Models: Maintaining Information Hiding Under Extreme Rewriting Attacks

This project proposes a secure and robust semantic steganography scheme that leverages the semantic channels of natural language generation tasks to maintain reliable information hiding and recovery even under extreme global rewriting attacks.

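Tags: Steganography, Large Language Models, Robustness, Semantic Encoding, Privacy Protection, Text Security, RAG, Rewriting Attacks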
Published 2026-06-05 00:16 | Recent activity 2026-06-05 00:22 | Estimated read 7 min

Section 01

Introduction: Core Overview of Robust Semantic Steganography Based on Large Language Models

This project proposes a secure and robust semantic steganography scheme that uses the semantic channels of Large Language Models (LLMs) to keep information hiding and recovery reliable even under extreme global rewriting attacks. The repository is the official code for the paper "Robust Semantic Steganography with Large Language Models". It is maintained by ChihshengJ and was released on GitHub (https://github.com/ChihshengJ/robust-steganography) on June 4, 2026. Its core advantage is that it overcomes a key limitation of traditional steganography, which rewriting easily destroys: by relying on the semantic generation capability of LLMs, it allows the hidden information to be recovered after an attack.

Section 02

Research Background: Challenges of Traditional Steganography and New Opportunities from LLMs

Challenges of traditional steganography:
1. Vulnerability: methods based on statistical features or word replacement are easily destroyed by rewriting.
2. Detectability: modified texts tend to show abnormal statistical features.
3. Capacity limitation: the amount of hidden information is limited if the text must stay natural.
4. Semantic preservation: rewriting attacks may leave the hidden information unrecoverable.

Opportunities from LLMs:
1. Semantic understanding: LLMs can generate coherent texts.
2. Controllable generation: specific semantic content can be controlled through conditioning.
3. Diversity: the same semantics can be expressed in multiple ways.

Section 03

Technical Scheme: Dynamic Semantic Unit Encoding and Multi-Steganography Systems

The core technique is dynamic semantic unit encoding, which rests on four principles:
1. Semantic channel selection: use the semantic space of a task such as question answering or story generation.
2. Semantic unit mapping: map the secret information to combinations of semantic units.
3. Dynamic generation: have the LLM generate text with the required semantic structure.
4. Rejection sampling: keep only texts that are natural and that encode the information correctly.
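
The mapping and rejection-sampling steps can be illustrated with a minimal Python sketch. It is not the repository's implementation: the topic list, the 2-bit mapping, and the helpers `toy_generate` and `toy_decode` are hypothetical stand-ins for an LLM call and a semantic classifier, chosen so the example runs without any API.

```python
import random
from typing import List, Optional

# Hypothetical semantic units: each topic stands for one 2-bit symbol.
TOPICS = ["weather", "cooking", "travel", "music"]
BITS_PER_UNIT = 2

# Toy stand-in for LLM output: one sentence and one keyword per topic.
SENTENCES = {
    "weather": "The rain finally stopped this morning.",
    "cooking": "I tried a new soup recipe last night.",
    "travel": "Our flight to the coast leaves at noon.",
    "music": "The band played my favourite song.",
}
KEYWORDS = {"rain": "weather", "recipe": "cooking",
            "flight": "travel", "song": "music"}


def bits_to_units(bits: str) -> List[str]:
    """Map each 2-bit group of the secret to one topic."""
    if len(bits) % BITS_PER_UNIT != 0:
        raise ValueError("bit string length must be a multiple of 2")
    return [TOPICS[int(bits[i:i + BITS_PER_UNIT], 2)]
            for i in range(0, len(bits), BITS_PER_UNIT)]


def toy_generate(topic: str, rng: random.Random) -> str:
    """Stand-in for an LLM call; drifts to a random topic 30% of the time."""
    if rng.random() < 0.3:
        topic = rng.choice(TOPICS)
    return SENTENCES[topic]


def toy_decode(text: str) -> Optional[str]:
    """Stand-in for the receiver's semantic classifier (keyword lookup)."""
    for word in text.lower().rstrip(".").split():
        if word in KEYWORDS:
            return KEYWORDS[word]
    return None


def encode(bits: str, rng: random.Random, max_tries: int = 50) -> List[str]:
    """Rejection sampling: keep a candidate only if it decodes to the intended unit."""
    texts = []
    for unit in bits_to_units(bits):
        for _ in range(max_tries):
            candidate = toy_generate(unit, rng)
            if toy_decode(candidate) == unit:
                texts.append(candidate)
                break
        else:
            raise RuntimeError("no acceptable candidate for unit " + unit)
    return texts


def decode(texts: List[str]) -> str:
    """Recover the bit string from the decoded topics."""
    return "".join(format(TOPICS.index(toy_decode(t)), "02b") for t in texts)
```

Calling `encode("0110", random.Random(0))` yields one cooking sentence and one travel sentence, and `decode` maps them back to `"0110"`. In this toy setting the bits depend only on the topic of each sentence, which is the intuition behind surviving a rewrite that preserves meaning.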

Supported systems: TopicQA (semantic encoding of question-answer pairs), Story (narrative structure encoding), LitReview (literature review structure encoding) and baseline systems.

Section 04

Attack Models and Robustness Verification Scheme

The project implements several attacks to test robustness:
1. N-gram shuffling attack: randomly shuffles the segmented units.
2. Synonym replacement attack: replaces words using WordNet while keeping the structure.
3. LLM rewriting attack: a complete rewrite by GPT-4 that keeps the semantics.
4. Round-trip translation attack: introduces cross-language semantic drift.
Of these, the LLM rewriting attack is the strongest and is expected to defeat traditional watermarking schemes.
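
As an illustration of the simplest attack in this list, the following Python sketch shows one plausible reading of n-gram shuffling: split the text into consecutive n-word chunks and permute the chunks. The function name `ngram_shuffle` and its parameters are assumptions made for this example; the repository's own attack may segment or shuffle differently.

```python
import random


def ngram_shuffle(text: str, n: int = 3, seed: int = 0) -> str:
    """Split text into consecutive n-word chunks and shuffle the chunk order.

    Word order inside each chunk is kept; only the chunks are permuted.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    words = text.split()
    chunks = [words[i:i + n] for i in range(0, len(words), n)]
    random.Random(seed).shuffle(chunks)  # seeded so the attack is reproducible
    return " ".join(word for chunk in chunks for word in chunk)


original = "the band played my favourite song before the rain finally stopped"
attacked = ngram_shuffle(original, n=3, seed=1)
print(attacked)
```

The attacked text contains exactly the same words as the original, so a scheme that depends on word order or token positions is disturbed while the overall vocabulary is unchanged.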

Section 05

Experimental Design and Evaluation Metric System

The experiments run in four stages:
1. Text generation: produce steganographic texts and cover texts.
2. Metric calculation and steganalysis: compute perplexity, BERTScore, and similar metrics, then run classifier detection and LLM judgment.
3. Attack application: generate datasets of attacked texts.
4. Decoding and scoring: measure recovery accuracy and plot attack curves.

Evaluation metrics: Undetectability (classifier detection, embedding similarity, etc.); Robustness (recovery accuracy, attack curves, capacity analysis).
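
A rough sketch of how recovery accuracy and an attack curve could be computed is given below. The bit-level definition of accuracy, the notion of a numeric "attack strength", and the names `bit_accuracy` and `attack_curve` are assumptions made for illustration; the project may score recovery differently.

```python
from typing import Dict, List, Tuple


def bit_accuracy(sent: str, recovered: str) -> float:
    """Fraction of sent bits recovered correctly; missing bits count as errors."""
    if not sent:
        raise ValueError("sent bit string must not be empty")
    matches = sum(a == b for a, b in zip(sent, recovered))
    return matches / len(sent)


def attack_curve(
    results: Dict[float, List[Tuple[str, str]]]
) -> List[Tuple[float, float]]:
    """Mean recovery accuracy per attack strength, sorted by strength.

    `results` maps a hypothetical attack strength (for example, the fraction
    of the text that was rewritten) to (sent, recovered) bit-string pairs.
    """
    curve = []
    for strength in sorted(results):
        pairs = results[strength]
        mean = sum(bit_accuracy(s, r) for s, r in pairs) / len(pairs)
        curve.append((strength, mean))
    return curve


# Made-up inputs, only to show the shape of the data.
demo = {0.0: [("1010", "1010")], 0.5: [("1010", "1000")]}
print(attack_curve(demo))
```

Plotting the returned pairs gives the attack curve: a robust scheme keeps the mean accuracy close to 1.0 as the attack strength grows.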

Section 06

Application Scenarios: Privacy Protection and Anti-Censorship Practices

1. Anti-censorship communication: secure communication in monitored environments, where the information can still be recovered even if the text is modified.
2. Covert file storage: hiding binary data in natural texts such as poems or diaries.
3. Cloud storage privacy: storing sensitive data in the form of creative writing to avoid drawing attention to encrypted files.
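
To hide binary data, the payload first has to be cut into symbols that a text can carry. The Python sketch below assumes, purely for illustration, that each generated text carries a fixed number of bits; the helper names and the zero-padding convention are hypothetical and not taken from the repository.

```python
from typing import List


def bytes_to_symbols(data: bytes, bits_per_symbol: int) -> List[int]:
    """Split a byte payload into fixed-width symbols, zero-padding the tail."""
    bits = "".join(format(byte, "08b") for byte in data)
    bits += "0" * ((-len(bits)) % bits_per_symbol)
    return [int(bits[i:i + bits_per_symbol], 2)
            for i in range(0, len(bits), bits_per_symbol)]


def symbols_to_bytes(symbols: List[int], bits_per_symbol: int, n_bytes: int) -> bytes:
    """Inverse of bytes_to_symbols; n_bytes tells the decoder where padding starts."""
    bits = "".join(format(s, "0{}b".format(bits_per_symbol)) for s in symbols)
    bits = bits[:n_bytes * 8]
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


payload = b"hi"
symbols = bytes_to_symbols(payload, bits_per_symbol=3)
print(symbols)  # [3, 2, 0, 6, 4, 4]
print(symbols_to_bytes(symbols, 3, len(payload)))  # b'hi'
```

With 3 bits per text, the 16-bit payload needs 6 texts. This kind of arithmetic shows why per-text capacity determines how much cover text a file requires.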

Section 07

Innovation Value and Technical Limitations

Innovation value:
1. A new paradigm of semantic steganography, elevated from the lexical level to the semantic level.
2. Robustness theory: proving information integrity under extreme attacks.
3. An evaluation framework with a complete methodology.

Technical limitations:
1. Capacity limitation: the native capacity of each system is limited.
2. API dependency: the OpenAI API is required for generation and attacks.
3. High computational cost: a large number of API calls and GPU resources are needed.

Ethically, the tool must be used in compliance with laws and regulations and for legitimate privacy protection.

Section 08

Summary and Future Development Directions

This project is an important advance in text steganography: it achieves information recovery under extreme attacks by relying on the semantic capabilities of LLMs. Future directions include improving steganographic capacity, covering more languages and text types, developing stronger defense mechanisms, and combining the scheme with other privacy technologies. The project provides code and an experimental framework for research in steganography, privacy protection, and AI security.