Zing Forum

Onubad.ai: An English-Bangla PDF Translation System Based on Fine-Tuned Large Language Models

An introduction to Onubad.ai, an open-source tool that uses fine-tuned large language models to translate English PDF documents into Bangla automatically, with a focus on balancing format preservation and translation quality.

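PDF translation | Bangla | large language models | fine-tuning | machine translation | low-resource languages | document processing | NLP
Published 2026-05-09 11:38 | Recent activity 2026-05-09 12:40 | Estimated read 5 min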

Section 01

Onubad.ai: Open-Source English-Bangla PDF Translation Tool with Fine-Tuned LLMs

Onubad.ai is an open-source tool for automatic English-to-Bangla PDF document translation using fine-tuned large language models (LLMs). It focuses on balancing document format preservation and translation quality, addressing the "language gap" faced by Bangla speakers. Key features include format retention (tables, charts, layouts), accurate technical term translation, and context-aware understanding. As an open-source project, it aims to promote language equality and knowledge sharing for low-resource languages like Bangla.

Section 02

Language Gap & Challenges in Bangla Translation

Bangla is spoken by over 265 million people, which makes it the seventh most spoken language in the world. It nevertheless faces a digital "language gap": many English technical and academic documents remain inaccessible to Bangla readers. General-purpose tools such as Google Translate tend to lose PDF formatting and to mistranslate technical terms. Onubad.ai targets both pain points with a solution built specifically for English-to-Bangla PDF translation.

Section 03

Core Design Goals & Technical Pipeline

Onubad.ai's design centers on three goals: 1) preserve PDF formatting (tables, charts, multi-column layouts); 2) translate technical terms accurately; 3) translate with awareness of context by using LLMs. Its pipeline has three stages: 1) PDF parsing, which extracts text, images, and tables and handles encoding, fonts, and layout; 2) translation with a fine-tuned LLM that is domain-adapted and optimized for the English-Bangla language pair; 3) post-processing, which reconstructs the format and takes care of font rendering and alignment.
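
The hand-off between the three stages can be pictured with a minimal Python sketch. This is not Onubad.ai's actual code: `TextBlock`, `translate_blocks`, and `fake_translate` are hypothetical names, the parsing and rendering stages are omitted, and the translator is a stub standing in for the fine-tuned model. The only point illustrated is that each extracted block keeps its page and bounding box while its text is replaced, so a later stage has what it needs to rebuild the layout.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class TextBlock:
    """One unit of extracted text plus the layout data needed to place it again."""
    page: int
    bbox: Tuple[float, float, float, float]  # x0, y0, x1, y1 in PDF points
    text: str
    kind: str = "paragraph"  # hypothetical labels: "paragraph", "table_cell", ...


def translate_blocks(
    blocks: List[TextBlock], translate: Callable[[str], str]
) -> List[TextBlock]:
    """Translate each block's text; page, bbox, and kind are left untouched."""
    result = []
    for block in blocks:
        if block.text.strip():
            result.append(replace(block, text=translate(block.text)))
        else:
            # Whitespace-only blocks are passed through unchanged.
            result.append(block)
    return result


def fake_translate(text: str) -> str:
    # Stub only: a real system would call the fine-tuned LLM here.
    return f"[bn] {text}"


blocks = [
    TextBlock(page=0, bbox=(72.0, 72.0, 300.0, 90.0), text="Introduction"),
    TextBlock(page=0, bbox=(72.0, 100.0, 300.0, 120.0), text="   "),
    TextBlock(page=1, bbox=(72.0, 72.0, 200.0, 90.0), text="Result", kind="table_cell"),
]
translated = translate_blocks(blocks, fake_translate)
for block in translated:
    print(block.page, block.kind, block.bbox, repr(block.text))
```

Keeping geometry and text in one immutable record is one possible design; it makes it hard for the translation step to disturb layout information by accident.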

Section 04

Model Selection & Fine-Tuning Strategy

Onubad.ai builds on open-source LLMs (the LLaMA series, Mistral, and Aya, which suits low-resource languages). Fine-tuning relies on high-quality training data from three sources: public corpora (CC100-Bangla and OSCAR, both monolingual web crawls), domain-specific technical and academic documents, and synthetic parallel pairs generated by back-translation. The techniques used are parameter-efficient fine-tuning (LoRA/QLoRA), instruction tuning, and, optionally, reinforcement learning from human feedback (RLHF) to further improve output quality.
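
To see why parameter-efficient fine-tuning matters here, recall how LoRA works in general: the original weight matrix W (d_out x d_in) is frozen, and only a low-rank update B·A is trained, with B of shape d_out x r and A of shape r x d_in. The short calculation below compares the trainable parameter counts for a single matrix. The dimensions and rank are assumed values chosen for illustration, not settings reported for Onubad.ai.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a LoRA update B @ A with B: d_out x r, A: r x d_in."""
    return rank * (d_out + d_in)


# Assumed, illustrative values: one square 4096 x 4096 projection, rank 8.
d_model = 4096
rank = 8

full = full_finetune_params(d_model, d_model)
lora = lora_params(d_model, d_model, rank)
ratio = lora / full

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (r={rank}):    {lora:,} trainable parameters")
print(f"LoRA trains {ratio:.2%} of the matrix")
```

Under these assumptions LoRA trains 65,536 parameters instead of 16,777,216 for that matrix, about 0.39%. QLoRA applies the same low-rank idea on top of a quantized frozen base model, which lowers memory use further.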

Section 05

Application Scenarios & User Value

Onubad.ai benefits several groups of users: 1) academics, who gain access to international research; 2) tech companies, which can localize product documentation; 3) governments, which can translate public service information and improve transparency; 4) businesses expanding into Bangla-speaking markets, which can translate marketing material and contracts.

Section 06

Current Limitations & Future Directions

Current limitations: complex PDF layouts (multi-column pages, text wrapped around images) can be difficult to handle; running the model locally requires GPU resources; text involving dialect variation still needs manual proofreading. Future plans: support for Word, HTML, and EPUB; real-time collaboration; custom term libraries; text-to-speech (TTS) integration for audio output.
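
As one way to picture the planned custom term libraries, the sketch below checks a translation against a small glossary and reports source terms whose required Bangla rendering is missing. It is a hypothetical illustration, not a feature that exists in the project: `missing_terms` is an invented name, the glossary entries are sample values, and the Bangla sentences are illustrative strings that have not been reviewed by a native speaker.

```python
from typing import Dict, List


def missing_terms(source: str, translation: str, glossary: Dict[str, str]) -> List[str]:
    """Return English glossary terms found in `source` whose required
    Bangla rendering does not appear in `translation`."""
    source_lower = source.lower()
    return [
        english
        for english, bangla in glossary.items()
        if english.lower() in source_lower and bangla not in translation
    ]


# Sample glossary (assumed entries; a real term library would be user-supplied).
glossary = {"algorithm": "অ্যালগরিদম", "database": "ডেটাবেস"}

source = "The algorithm searches the database."

# Sentences are assembled from the glossary values so the required terms
# appear exactly as stored; the surrounding words are illustrative only.
consistent = f"{glossary['algorithm']}টি {glossary['database']}ে অনুসন্ধান করে।"
inconsistent = f"{glossary['algorithm']}টি তথ্যভাণ্ডারে অনুসন্ধান করে।"

print(missing_terms(source, consistent, glossary))
print(missing_terms(source, inconsistent, glossary))
```

A plain substring check like this is deliberately simple. A production version would probably need Unicode normalization and handling of inflected forms, both of which are left out here.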

Section 07

Open Source Significance & Community Contribution

As an open-source project, Onubad.ai promotes: 1) language equality, by lowering access barriers for non-English speakers; 2) knowledge democratization, by enabling content to be shared across languages; 3) local innovation, by serving as a base tool for the Bangla NLP community. It also paves the way for similar tools in other low-resource languages.