CAFUNE: A Brazilian Portuguese Large Language Model Based on Discrete Masked Diffusion

CAFUNE is a bidirectional Transformer, trained entirely on local hardware, that generates Brazilian Portuguese text with LLaDA-style discrete masked diffusion. The project shows how to build a language model of approximately 5 million parameters from scratch, with no external APIs and no per-call costs, and pairs it with a complete RLAIF teacher system and an ethical monitoring mechanism.

Tags: diffusion model, DLLM, Brazilian Portuguese, Julia, RLAIF, local training, BitNet, Flair NLP, ethical monitoring, discrete masked diffusion

Published 2026-04-18 12:12 | Recent activity 2026-04-18 12:24 | Estimated read 4 min

Section 01

CAFUNE Model Guide: A Locally Trained Brazilian Portuguese Discrete Masked Diffusion Large Model

CAFUNE is a bidirectional Transformer optimized for Brazilian Portuguese and trained entirely locally. It generates text with LLaDA-style discrete masked diffusion. The project demonstrates that a model of approximately 5 million parameters can be built without external API call costs and without the data privacy risks of sending data to an external service, and it comes with a complete RLAIF teacher system and an ethical monitoring mechanism.

Section 02

Project Background and Motivation: Exploration of Local Models Under Resource Constraints

Most large language models rely on massive computing resources and expensive API calls. CAFUNE instead takes the local training path and focuses on Brazilian Portuguese language and culture. The core idea is to show that a functional end-to-end model can be built without an enterprise-level budget. Because training runs 100% locally, there are no call costs and the data stays on the local machine.

Section 03

Core Technical Architecture: Discrete Masked Diffusion Engine Implemented in Julia

The core of the model is a bidirectional Transformer of approximately 5 million parameters implemented in Julia (d_model=256, 8 attention heads, 6 encoder layers). Text is generated by discrete masked diffusion with 20 denoising steps at temperature=0.5. Training uses the Adam optimizer (learning rate=5e-6) on a dataset of 6000 Brazilian Portuguese sentence pairs.
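
To make the generation loop concrete, here is a minimal Python sketch of masked-diffusion sampling using the stated 20 denoising steps and temperature of 0.5. It is not the project's Julia code. `toy_logits` is a random stand-in for the Transformer, and `MASK_ID`, the linear unmasking schedule, and the rule of committing the most confident proposals first are all assumptions made for illustration.

```python
import math
import random

VOCAB_SIZE = 500      # from the article: BPE vocabulary of 500 tokens
MASK_ID = VOCAB_SIZE  # hypothetical: mask token ID placed just past the vocabulary
SEQ_LEN = 128         # from the article: sequence length limit
NUM_STEPS = 20        # from the article: denoising steps
TEMPERATURE = 0.5     # from the article: sampling temperature


def toy_logits(tokens, rng):
    """Stand-in for the bidirectional Transformer: random logits per position."""
    return [[rng.gauss(0.0, 1.0) for _ in range(VOCAB_SIZE)] for _ in tokens]


def sample_with_confidence(logits, temperature, rng):
    """Sample a token from softmax(logits / temperature); return (token, probability)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    token = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return token, weights[token] / total


def generate(model=toy_logits, seq_len=SEQ_LEN, num_steps=NUM_STEPS,
             temperature=TEMPERATURE, seed=0):
    """Start fully masked and reveal tokens over num_steps denoising steps."""
    rng = random.Random(seed)
    tokens = [MASK_ID] * seq_len
    for step in range(1, num_steps + 1):
        logits = model(tokens, rng)
        masked = [i for i, t in enumerate(tokens) if t == MASK_ID]
        # Propose a token (with its probability) for every masked position.
        proposals = {i: sample_with_confidence(logits[i], temperature, rng)
                     for i in masked}
        # Assumed linear schedule: this many positions stay masked after the step.
        keep_masked = seq_len * (num_steps - step) // num_steps
        num_to_reveal = len(masked) - keep_masked
        # Commit the most confident proposals; the rest stay masked for later steps.
        by_confidence = sorted(masked, key=lambda i: proposals[i][1], reverse=True)
        for i in by_confidence[:num_to_reveal]:
            tokens[i] = proposals[i][0]
    return tokens
```

Calling `generate(seq_len=16, num_steps=4)` returns 16 token IDs with no mask tokens left. With the random stand-in the output is meaningless; a trained model would be passed in as `model`.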

Section 04

RLAIF Teacher System and Ethical Monitoring Mechanism

The RLAIF teacher system produces an MNS score through hybrid evaluation: 60% of the score comes from BitNet semantic coherence and 40% from Flair analysis (sentiment, part of speech, and vocabulary coverage). Sentinel monitoring has two parts: Raegis, which performs ethical flattery detection and applies a penalty score, and Guardian, which performs anomaly detection and sets a flag bit.
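
The 60/40 weighting can be illustrated with a short Python sketch. Only the two weights come from the article. The function name `score_sample`, the `TeacherVerdict` type, the assumption that both inputs are normalized to [0, 1], and the choice to subtract the penalty and clamp at zero are hypothetical, since the article does not say how the penalty or the flag are combined with the score.

```python
from dataclasses import dataclass

BITNET_WEIGHT = 0.6  # from the article: share of BitNet semantic coherence
FLAIR_WEIGHT = 0.4   # from the article: share of the Flair-based score


@dataclass
class TeacherVerdict:
    mns: float          # blended score after any penalty
    anomaly_flag: bool  # hypothetical: flag bit raised by anomaly detection


def score_sample(bitnet_coherence, flair_score, flattery_penalty=0.0, anomaly=False):
    """Blend the two teacher signals, then apply an assumed subtractive penalty."""
    for name, value in (("bitnet_coherence", bitnet_coherence),
                        ("flair_score", flair_score)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    blended = BITNET_WEIGHT * bitnet_coherence + FLAIR_WEIGHT * flair_score
    mns = max(0.0, blended - flattery_penalty)  # assumption: clamp at zero
    return TeacherVerdict(mns=mns, anomaly_flag=anomaly)
```

For example, a coherence of 0.5 and a Flair score of 1.0 blend to 0.6 × 0.5 + 0.4 × 1.0 = 0.7 before any penalty is applied.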

Section 05

Memory-Mapped Communication and Tokenization Strategy Design

A 2048-byte memory-mapped file provides efficient communication between the Julia and Python components. It is divided into regions that include a handshake area, a loss area, and a text buffer. Tokenization uses a BPE tokenizer with a 500-token vocabulary, including 38 Portuguese accented characters, and sequences are limited to 128 tokens.
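
The Python side of such a channel might look like the sketch below, which uses the standard `mmap` and `struct` modules. Only the 2048-byte size and the three region names come from the article. The byte offsets, the 32-bit field widths, the explicit text-length field, and the handshake values (1 for "message ready", 0 for "consumed") are assumptions for illustration, not the project's actual layout.

```python
import mmap
import os
import struct
import tempfile

FILE_SIZE = 2048       # from the article
HANDSHAKE_OFFSET = 0   # hypothetical: uint32 ready flag
LOSS_OFFSET = 4        # hypothetical: float32 training loss
TEXT_LEN_OFFSET = 8    # hypothetical: uint32 byte length of the text
TEXT_OFFSET = 12       # hypothetical: start of the UTF-8 text buffer
TEXT_CAPACITY = FILE_SIZE - TEXT_OFFSET


def create_channel(path):
    """Create the backing file, zero-filled to the fixed size."""
    with open(path, "wb") as f:
        f.write(b"\x00" * FILE_SIZE)


def write_message(path, loss, text):
    """Write loss and text, then raise the handshake flag last."""
    payload = text.encode("utf-8")
    if len(payload) > TEXT_CAPACITY:
        raise ValueError("text does not fit in the text buffer")
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), FILE_SIZE) as buf:
        struct.pack_into("<f", buf, LOSS_OFFSET, loss)
        struct.pack_into("<I", buf, TEXT_LEN_OFFSET, len(payload))
        buf[TEXT_OFFSET:TEXT_OFFSET + len(payload)] = payload
        struct.pack_into("<I", buf, HANDSHAKE_OFFSET, 1)
        buf.flush()


def read_message(path):
    """Return (loss, text) if a message is ready, else None; clears the flag."""
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), FILE_SIZE) as buf:
        (ready,) = struct.unpack_from("<I", buf, HANDSHAKE_OFFSET)
        if ready != 1:
            return None
        (loss,) = struct.unpack_from("<f", buf, LOSS_OFFSET)
        (length,) = struct.unpack_from("<I", buf, TEXT_LEN_OFFSET)
        text = bytes(buf[TEXT_OFFSET:TEXT_OFFSET + length]).decode("utf-8")
        struct.pack_into("<I", buf, HANDSHAKE_OFFSET, 0)
        return loss, text


def demo():
    """Round-trip one message through a temporary channel file."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "channel.bin")
        create_channel(path)
        write_message(path, 0.25, "olá, coração")
        return read_message(path)
```

Writing the handshake flag last means a reader that sees the flag set can assume the loss and text fields are already in place. A real cross-process setup would also need to consider memory ordering and concurrent writers, which this sketch ignores.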

Section 06

Training and Deployment Process and Technical Highlights

Deployment involves configuring environment variables, installing dependencies, and starting all components with start_all_services.bat. Technical highlights include the application of discrete diffusion to text generation, the performance advantages of the Julia language, a local-first paradigm, and a built-in ethical monitoring mechanism.

Section 07

Limitations, Future Directions, and Summary

Limitations: the model focuses on Portuguese, its vocabulary and training data are small, its capacity is about 5 million parameters, and sequences are limited to 128 tokens. Future directions: support for more languages and larger parameter counts, more advanced diffusion scheduling, and a more sophisticated teacher system. Summary: CAFUNE shows that building a complete pipeline under limited resources is feasible, which makes it a useful platform for learning and experimentation.
