Zing Forum

CAFUNE: A Brazilian Portuguese Large Language Model Based on Discrete Masked Diffusion

CAFUNE is a fully locally trained bidirectional Transformer model that generates Brazilian Portuguese text using LLaDA-style discrete masked diffusion. The project demonstrates how to build a language model of roughly 5 million parameters from scratch, with no external APIs or per-call costs, and ships with a complete RLAIF teacher system and an ethics-monitoring mechanism.

Tags: diffusion models · DLLM · Brazilian Portuguese · Julia · RLAIF · local training · BitNet · Flair NLP · ethics monitoring · discrete masked diffusion
Published 2026-04-18 12:12 · Recent activity 2026-04-18 12:24 · Estimated read 4 min

Section 01

CAFUNE Model Guide: A Locally Trained Brazilian Portuguese Discrete Masked Diffusion Language Model

CAFUNE is a fully locally trained bidirectional Transformer model optimized for Brazilian Portuguese, using LLaDA-style discrete masked diffusion to generate text. The project demonstrates the feasibility of building a roughly 5-million-parameter model with no external API call costs and no data-privacy exposure, equipped with a complete RLAIF teacher system and an ethics-monitoring mechanism.


Section 02

Project Background and Motivation: Exploration of Local Models Under Resource Constraints

Most large language models rely on massive computing resources and expensive API calls. CAFUNE takes the local-training path instead, focusing on Brazilian Portuguese language and culture. The core idea is to prove that a fully functional model can be built without an enterprise-level budget: 100% local training guarantees zero call costs and full data privacy.


Section 03

Core Technical Architecture: Discrete Masked Diffusion Engine Implemented in Julia

The core of the model is a 5-million-parameter bidirectional Transformer implemented in Julia (d_model=256, 8 attention heads, 6 encoder layers), generating text with discrete masked diffusion (20 denoising steps, temperature 0.5). Training uses the Adam optimizer (learning rate 5e-6) on a dataset of 6,000 Brazilian Portuguese sentence pairs.
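
To make the denoising loop concrete, here is a minimal sketch of LLaDA-style discrete masked diffusion sampling: start from a fully masked sequence and, over 20 steps, commit the model's most confident predictions while remasking the rest on a linear schedule. The `toy_model`, the mask-token id, and the linear remasking schedule are illustrative assumptions, not CAFUNE's actual implementation (the article's temperature of 0.5 would shape the real model's softmax, which the stand-in omits).

```python
import math
import random

MASK = 0    # hypothetical mask-token id (not from the article)
STEPS = 20  # denoising steps, per the article

def toy_model(tokens):
    """Stand-in for the 5M-parameter Transformer: returns a
    (token, confidence) guess for every masked position."""
    return {i: (random.randint(1, 499), random.random())
            for i, t in enumerate(tokens) if t == MASK}

def diffusion_sample(length, model=toy_model, steps=STEPS):
    # Start from a fully masked sequence.
    tokens = [MASK] * length
    for step in range(steps):
        preds = model(tokens)
        if not preds:
            break
        # Fraction of positions that should remain masked after this
        # step (linear schedule: 1 -> 0 over the trajectory).
        keep_masked = 1.0 - (step + 1) / steps
        # Commit the most confident predictions; leave the rest masked.
        ranked = sorted(preds.items(), key=lambda kv: kv[1][1], reverse=True)
        n_commit = len(preds) - math.floor(keep_masked * length)
        for i, (tok, _conf) in ranked[:max(n_commit, 0)]:
            tokens[i] = tok
    return tokens
```

Because the final step's schedule value is zero, every remaining masked position is committed, so the sampler always terminates with a fully denoised sequence.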


Section 04

RLAIF Teacher System and Ethical Monitoring Mechanism

The RLAIF teacher system produces MNS scores through a hybrid evaluation (60% BitNet semantic coherence, 40% Flair sentiment/part-of-speech/vocabulary coverage). Sentinel monitoring comprises Raegis ethical flattery detection and Guardian anomaly detection, which apply penalty scores and flag bits respectively.
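
A minimal sketch of how such a hybrid reward could be combined, under stated assumptions: the 60/40 weighting comes from the article, but averaging the three Flair signals equally, subtracting the Raegis penalty, and zeroing the score on a Guardian flag are illustrative choices, not CAFUNE's documented formula.

```python
def mns_score(coherence, sentiment, pos_accuracy, vocab_coverage,
              flattery_penalty=0.0, anomaly_flag=False):
    """Hybrid teacher reward: 60% semantic coherence (BitNet side),
    40% linguistic quality (Flair side, here an equal average of its
    three signals). All inputs are assumed normalized to [0, 1]."""
    flair = (sentiment + pos_accuracy + vocab_coverage) / 3
    score = 0.6 * coherence + 0.4 * flair
    # Raegis-style flattery detection subtracts a penalty score;
    # a Guardian anomaly flag (assumed here) vetoes the reward.
    score = max(score - flattery_penalty, 0.0)
    return 0.0 if anomaly_flag else score
```

Keeping the penalty and the flag as separate mechanisms mirrors the article's split: one adjusts the score continuously, the other is a hard bit that downstream training can treat as a veto.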


Section 05

Memory-Mapped Communication and Tokenization Strategy Design

A 2048-byte memory-mapped file provides efficient communication between the Julia and Python components, partitioned into fields such as a handshake area, a loss area, and a text buffer. Tokenization uses a BPE tokenizer with a 500-token vocabulary (including 38 Portuguese accented characters) and a sequence-length limit of 128 tokens.
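
The Python side of such a bridge might look like the sketch below. The 2048-byte size matches the article; the concrete field offsets, the handshake encoding, and the file name are assumptions chosen for illustration, since the article only names the areas.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical layout of the 2048-byte shared file (offsets assumed):
#   [0:4]    handshake flag (uint32: 0 = idle, 1 = request, 2 = response)
#   [4:8]    latest training loss (float32)
#   [8:2048] UTF-8 text buffer, NUL-padded
FILE_SIZE = 2048

path = os.path.join(tempfile.mkdtemp(), "cafune_bridge.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * FILE_SIZE)  # pre-size the file for mmap

with open(path, "r+b") as f:
    mm = mmap.mmap(f.fileno(), FILE_SIZE)
    # Writer side (the Julia trainer would do the equivalent):
    struct.pack_into("<If", mm, 0, 1, 0.042)  # handshake flag + loss
    text = "Olá, mundo!".encode("utf-8")
    mm[8:8 + len(text)] = text
    # Reader side (Python component):
    handshake, loss = struct.unpack_from("<If", mm, 0)
    message = mm[8:FILE_SIZE].split(b"\x00", 1)[0].decode("utf-8")
    mm.close()
```

Fixed offsets plus a handshake flag keep the protocol language-agnostic: both processes only need to agree on byte layout and endianness, not on any serialization library.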


Section 06

Training and Deployment Process and Technical Highlights

Deployment involves configuring environment variables, installing dependencies, and starting all components via start_all_services.bat. Technical highlights include the application of discrete diffusion, the performance advantages of the Julia language, the local-first paradigm, and the built-in ethics-monitoring mechanism.


Section 07

Limitations, Future Directions, and Summary

Limitations: the model is Portuguese-only, with a limited vocabulary and data scale, a 5-million-parameter capacity ceiling, and a 128-token length constraint. Future directions: more languages and parameters, advanced diffusion scheduling, and a more sophisticated teacher system. Summary: CAFUNE proves that building a complete pipeline under tight resource constraints is feasible, making it an excellent platform for learning and experimentation.