Zing Forum

Fine-tuning Large Language Models with Reinforcement Learning: A Comparative Study of PPO and GRPO in User Behavior Analysis

An in-depth study on fine-tuning large language models using reinforcement learning methods, comparing the performance of PPO and GRPO in insider threat detection scenarios, covering key dimensions such as training efficiency, memory usage, and output quality.

Tags: reinforcement learning, large language models, PPO, GRPO, user behavior analysis, UEBA, insider threat detection, Qwen, LoRA, security AI
Published 2026-06-01 03:15 | Recent activity 2026-06-01 03:18 | Estimated read 6 min

Section 01

[Introduction] Fine-tuning LLMs with Reinforcement Learning: A Comparative Study of PPO and GRPO in Insider Threat Detection

This study compares two reinforcement learning methods, PPO and GRPO, for fine-tuning large language models on insider threat detection, along three dimensions: training efficiency, memory usage, and output quality. It builds on the CERT Insider Threat Dataset R4.2, takes a pragmatic approach to model selection (for example, the Qwen series), and emphasizes engineering implementation. The study reports that GRPO holds an advantage in resource-constrained environments and offers a reference for LLM applications in the security domain.

Section 02

Research Background and Motivation

As LLMs see wider use in the security domain, domain-specific fine-tuning with reinforcement learning (RL) has become a focus. User and entity behavior analytics (UEBA) requires a model to interpret complex sequences of security events and output structured judgments. Supervised fine-tuning (SFT) can teach a model a specific output format, but it is less flexible on open-ended reasoning. RL instead optimizes against reward signals while balancing exploration and exploitation, which the study argues yields more insightful analysis.

Section 03

Technical Methods and Architecture

Project Overview: The task is insider threat detection. For each case, the model must output a risk level (normal, suspicious, or malicious), two to four risk features, and the basis for its judgment. Candidate models include Qwen3-4B-Instruct and Qwen2.5-3B, among others; the final choice is based on development-set performance.
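
The output contract described above can be checked mechanically. The following is a minimal sketch, assuming the model emits JSON; the field names `risk_level`, `risk_features`, and `basis` and the helper `parse_ueba_output` are hypothetical, since the study does not specify a schema.

```python
import json

# Risk levels named in the text.
RISK_LEVELS = {"normal", "suspicious", "malicious"}


def parse_ueba_output(text):
    """Return the parsed output if it satisfies the contract, else None.

    Assumed (hypothetical) schema: a JSON object with keys
    "risk_level", "risk_features", and "basis".
    """
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None

    level = data.get("risk_level")
    features = data.get("risk_features")
    basis = data.get("basis")

    # The risk level must be one of the three allowed labels.
    if not isinstance(level, str) or level not in RISK_LEVELS:
        return None
    # Between two and four risk features, each a non-empty string.
    if not isinstance(features, list) or not 2 <= len(features) <= 4:
        return None
    if not all(isinstance(f, str) and f.strip() for f in features):
        return None
    # The judgment basis must be a non-empty string.
    if not isinstance(basis, str) or not basis.strip():
        return None
    return {"risk_level": level, "risk_features": features, "basis": basis}
```

A validator of this kind could serve both as a format check during evaluation and as the input to a format-oriented reward, though how the project wires it in is not stated.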

Technology Stack: pandas and datasets for data processing; scikit-learn for baseline models; transformers and vLLM for inference; Unsloth, TRL, and PEFT for training; wandb for experiment tracking.

RL Methods Comparison: PPO trains a policy network alongside a value network and relies on a separate reward model, so its memory usage is high. GRPO needs no value network: it estimates advantages by comparing rewards within a group of completions sampled for the same prompt, and it is trained end-to-end, which makes it a better fit for small models. The reward functions evolved from generic ones (accuracy, reasoning, and so on) to UEBA-specific ones (ueba_accuracy, format, evidence).
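
To make the contrast concrete, here is a minimal sketch of the group-relative advantage that lets GRPO work without a value network. The component names come from the text, but the weights, the helper names, and the example scores are assumptions for illustration, and this version normalizes with the population standard deviation, which real implementations may handle differently.

```python
from statistics import mean, pstdev

# Hypothetical weights for the three UEBA reward components named in the text.
REWARD_WEIGHTS = {"ueba_accuracy": 1.0, "format": 0.5, "evidence": 0.5}


def combine_rewards(components, weights=REWARD_WEIGHTS):
    """Weighted sum of the reward components for one completion."""
    return sum(weights[name] * components.get(name, 0.0) for name in weights)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled completions.

    Each completion is scored relative to its siblings, so no learned
    value network is needed to provide a baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four completions sampled for the same prompt, with made-up component scores.
group = [
    {"ueba_accuracy": 1.0, "format": 1.0, "evidence": 1.0},
    {"ueba_accuracy": 0.0, "format": 1.0, "evidence": 0.0},
    {"ueba_accuracy": 1.0, "format": 0.0, "evidence": 0.5},
    {"ueba_accuracy": 0.0, "format": 0.0, "evidence": 0.0},
]
rewards = [combine_rewards(c) for c in group]
advantages = group_relative_advantages(rewards)
```

Completions that score above the group mean receive positive advantages and are reinforced; those below it are discouraged. When every completion in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.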

Section 04

Experimental Design and Evaluation System

Data Split: Training and test sets are split by user, so that no user's activity appears in both sets and leaks into evaluation.
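
One way to split by user is sketched below. It is an illustration rather than the project's actual procedure: the `user` field, the helper names, and the hash-based assignment are all assumptions. Hashing the user ID keeps the assignment deterministic, so a user lands in the same set on every run.

```python
import hashlib


def user_bucket(user_id, test_fraction=0.2):
    """Deterministically assign a user to "train" or "test" by hashing the ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a fraction in [0, 1).
    fraction = int(digest[:8], 16) / 0x100000000
    return "test" if fraction < test_fraction else "train"


def split_by_user(records, test_fraction=0.2):
    """Split records so that all of a user's records land in the same set."""
    train, test = [], []
    for record in records:
        if user_bucket(record["user"], test_fraction) == "test":
            test.append(record)
        else:
            train.append(record)
    return train, test
```

Because the split is made per user rather than per record, the realized test share only approximates `test_fraction` when the number of users is small.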

Evaluation Metrics: Classification performance (accuracy, macro_f1, etc.), output quality (valid format rate, evidence hit rate), resource efficiency (training time, peak memory usage).
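
Two of these metrics can be sketched in a few lines. The definitions below are assumptions: macro F1 is the unweighted mean of per-class F1 scores, with a class that has no true or predicted instances scored as 0, and the valid format rate is taken to be the share of outputs that parse successfully. The project's exact definitions are not given.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


def valid_format_rate(parsed_outputs):
    """Share of model outputs that parsed into a valid structure (not None)."""
    if not parsed_outputs:
        return 0.0
    return sum(1 for out in parsed_outputs if out is not None) / len(parsed_outputs)


labels = ["normal", "suspicious", "malicious"]
y_true = ["normal", "suspicious", "malicious", "normal"]
y_pred = ["normal", "suspicious", "normal", "normal"]
score = macro_f1(y_true, y_pred, labels)  # approximately 0.6
```

In this example the missed malicious case pulls the macro average down to about 0.6 even though accuracy is 0.75, which is why macro F1 is a useful complement to accuracy when the classes are imbalanced.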

Selection Criteria: Models are judged not only on F1 score but also on their ability to reliably produce structured, interpretable output.

Section 05

Highlights of Engineering Practice

Environment Management: The project provides dedicated setup configurations for uv, pip, and GPU environments.

Experiment Reproducibility: The workflow covers data preparation (synthetic or real data), baseline comparison, model selection, RL fine-tuning, and comprehensive evaluation.

Automated Pipeline: nightly_ueba_pipeline.sh automates the workflow end to end, including environment configuration, multi-backend support, and result packaging.

Section 06

Research Insights and Practical Value

Methodology: The study demonstrates that RL fine-tuning is feasible in resource-constrained settings, with quality driven by reward function design and model selection.

Security Applications: The UEBA output format (risk level, features, and basis) reflects a human-machine collaboration design and is a pattern worth reusing.

Engineering Practice: The Unsloth+TRL+PEFT combination balances memory efficiency and training effectiveness, providing a reusable template.

Section 07

Conclusion and Outlook

This study systematically explores the application of RL in LLM fine-tuning, verifying the advantages of GRPO in resource-constrained scenarios. The project's value lies in its pragmatic engineering implementation and understanding of business requirements, providing references for LLM applications in the security domain. Future directions include exploring multi-modal input fusion, online learning mechanisms, and fine-grained risk explanation generation.