Zing Forum


Indonesian Political Fake News Detection: A Comparative Study of Naive Bayes and SVM Methods Based on Text Mining

This project addresses the issue of fake news detection in Indonesian political news, using text mining techniques to compare the performance of two classic machine learning methods—Naive Bayes and SVM—providing practical references for automatic fake news identification in low-resource language environments.

Tags: fake news detection, text mining, Naive Bayes, SVM, Indonesian, NLP, political news, machine learning, text classification
Published 2026-06-13 12:45 | Recent activity 2026-06-13 12:57 | Estimated read 8 min

Section 01

Introduction to Indonesian Political Fake News Detection Research

This project tackles fake news detection in Indonesian political news. It uses text mining to compare two classic machine learning methods, Naive Bayes (NB) and Support Vector Machine (SVM), and offers a practical reference for automatic fake news identification in low-resource language settings. The original project is maintained by maranathagresya and hosted on GitHub at https://github.com/maranathagresya/Project-Machine-Learning- (published 2026-06-13).

Section 02

Research Background: Global Challenges of Fake News and Special Issues in Indonesian

Fake news has become a serious social problem in the digital age, especially in politics, where its spread can mislead the public, inflame divisions, and even threaten democratic processes. As a populous country with a large social media user base, Indonesia faces a prominent political fake news problem. Compared with English, Indonesian fake news detection faces three main challenges: 1. Scarce language resources (few pre-trained models and little labeled data); 2. Dialect diversity; 3. Grammatical flexibility (agglutinative morphology with rich affixation).

Section 03

Technical Solution: Selection of Classic Machine Learning Methods and Preprocessing Pipeline

The project selected NB and SVM for comparative experiments:

  • NB advantages: fast training, low data requirements, strong interpretability, and robustness to noise, which suits Indonesian scenarios with scarce data.
  • SVM advantages: handles high-dimensional data well, generalizes strongly, supports flexible kernel tricks, and rests on a solid theoretical foundation.

Preprocessing pipeline: Original text → Cleaning (remove HTML/special characters) → Tokenization → Stopword removal → Stemming → TF-IDF vectorization → Classifier training/prediction.

Key steps:

  • Tokenization: split on whitespace, taking care with agglutinative morphemes.
  • Stopword filtering: drop high-frequency function words such as "dan" and "yang".
  • Stemming: reduce words to their roots, e.g., "berjalan" → "jalan".
  • TF-IDF vectorization: weight terms so that document-specific vocabulary stands out.
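
The cleaning, tokenization, stopword removal, and TF-IDF steps described above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the stopword list, the sample sentences, and the function names `clean`, `tokenize`, and `tfidf` are all assumptions made for the example, and stemming is left out (it would sit between stopword removal and vectorization). The TF-IDF variant used here is tf = count / document length and idf = ln(N / df); libraries such as scikit-learn use smoothed variants, so their numbers would differ.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); a real list is far longer.
STOPWORDS = {"dan", "yang", "di", "ke", "dari", "itu", "ini"}


def clean(text: str) -> str:
    """Strip HTML tags and non-letter characters, then lowercase."""
    text = re.sub(r"<[^>]+>", " ", text)      # remove HTML tags
    text = re.sub(r"[^a-zA-Z\s]", " ", text)  # remove digits, punctuation, symbols
    return text.lower()


def tokenize(text: str) -> list[str]:
    """Whitespace tokenization followed by stopword removal."""
    return [tok for tok in clean(text).split() if tok not in STOPWORDS]


def tfidf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Compute TF-IDF with tf = count / doc length and idf = ln(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Invented sample sentences, not taken from the project's dataset.
raw_docs = [
    "<p>Presiden dan menteri membahas anggaran!</p>",
    "Menteri itu membantah kabar yang beredar di media",
    "Kabar anggaran 2024 beredar di media sosial",
]
tokens = [tokenize(doc) for doc in raw_docs]
vectors = tfidf(tokens)
print(tokens[0])
print(vectors[0])
```

In the first document, "presiden" appears in no other document and so receives a higher weight than "menteri", which appears in two of the three. That is the "highlight document-specific vocabulary" effect in miniature.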

Section 04

Experimental Design and Evaluation Metrics

Evaluation uses standard binary classification metrics, with fake news as the positive class: Accuracy (share of all articles classified correctly), Precision (share of articles predicted as fake that really are fake), Recall (share of actual fake articles that are detected), and F1-score (harmonic mean of precision and recall). Cross-validation strategy: K-fold cross-validation splits the data into K subsets, uses each subset once as the test set while training on the rest, and averages the results to give a stable estimate of generalization ability.
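
As a rough sketch of how these quantities are computed, the snippet below derives the four metrics from confusion counts and generates K-fold index splits. The function names `binary_metrics` and `k_fold_indices` and the label vectors are illustrative assumptions, not part of the original project. The folds here are contiguous; in practice the data would normally be shuffled or stratified first.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 with `positive` as the fake-news class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop


# Invented labels: 1 = fake, 0 = real.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
folds = list(k_fold_indices(10, 3))
print(metrics)
print([len(test) for _, test in folds])
```

With these toy labels there are 3 true positives, 1 false positive, 1 false negative, and 3 true negatives, so all four metrics come out to 0.75. Averaging such per-fold metrics over the K splits gives the cross-validated score.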

Section 05

Trade-off Analysis Between Classic Methods and Deep Learning

Comparison between classic methods (NB/SVM) and deep learning:

Dimension                 | Naive Bayes/SVM                     | Deep Learning
--------------------------|-------------------------------------|------------------------------
Training data requirement | Small number of samples             | Large amount of labeled data
Training time             | Seconds to minutes                  | Hours to days
Inference speed           | Extremely fast                      | Fast (depends on model size)
Interpretability          | High (view important feature words) | Low (black box)
Hardware requirements     | Ordinary CPU                        | GPU acceleration required
Deployment cost           | Extremely low                       | Relatively high

Classic methods are suitable for: resource-constrained environments, rapid prototyping, difficult data labeling, high interpretability requirements, high real-time requirements.
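
To make the interpretability point concrete, here is a small multinomial Naive Bayes written from scratch with Laplace smoothing. It is a sketch under stated assumptions: the class name `TinyMultinomialNB`, its methods, and the four toy documents are invented for illustration and are not the project's model or data. Because the model is just per-class word probabilities, the words that most favor the "fake" class can be read off directly as log-odds.

```python
import math
from collections import Counter


class TinyMultinomialNB:
    """Multinomial Naive Bayes over token lists, with Laplace smoothing."""

    def fit(self, docs, labels, alpha=1.0):
        self.classes = sorted(set(labels))
        self.vocab = sorted({tok for doc in docs for tok in doc})
        self.log_prior = {}
        self.log_likelihood = {}
        for c in self.classes:
            class_docs = [d for d, y in zip(docs, labels) if y == c]
            self.log_prior[c] = math.log(len(class_docs) / len(docs))
            counts = Counter(tok for d in class_docs for tok in d)
            total = sum(counts.values()) + alpha * len(self.vocab)
            self.log_likelihood[c] = {
                tok: math.log((counts[tok] + alpha) / total)
                for tok in self.vocab
            }
        return self

    def predict(self, doc):
        """Return the class with the highest log posterior; unseen words are skipped."""
        scores = {
            c: self.log_prior[c]
            + sum(self.log_likelihood[c][t] for t in doc if t in self.log_likelihood[c])
            for c in self.classes
        }
        return max(scores, key=scores.get)

    def top_words(self, target, other, n=3):
        """Words ranked by log-odds of `target` versus `other`."""
        odds = {
            t: self.log_likelihood[target][t] - self.log_likelihood[other][t]
            for t in self.vocab
        }
        return sorted(odds, key=odds.get, reverse=True)[:n]


# Invented, already-tokenized toy documents.
docs = [
    ["heboh", "viral", "bocor", "rahasia"],
    ["viral", "heboh", "sebarkan"],
    ["menteri", "umumkan", "anggaran"],
    ["presiden", "resmikan", "anggaran", "baru"],
]
labels = ["fake", "fake", "real", "real"]

model = TinyMultinomialNB().fit(docs, labels)
print(model.predict(["viral", "heboh", "anggaran"]))
print(model.predict(["presiden", "umumkan", "anggaran"]))
print(model.top_words("fake", "real", n=2))
```

Training amounts to counting words, which is why it finishes in seconds on an ordinary CPU. A linear SVM offers a similar view through its per-feature weights, whereas a deep model has no comparably direct readout.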

Section 06

Special Considerations for Indonesian NLP

Challenges of Indonesian language characteristics:

  • Agglutinative morphology: affixes change word meaning (e.g., "baca" (read) → "membaca" (to read), "pembaca" (reader)), so affixes must be handled correctly.
  • No tense or person inflection: verbs do not change with tense or person, which simplifies processing but leaves some temporal meaning to context.
  • Rich loanwords: vocabulary is influenced by Dutch, Arabic, and English.

Available tools: Sastrawi (stemming), NLTK/spaCy (tokenization), Indonesian Stopwords (stopword list).
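
The affix handling mentioned above can be illustrated with a toy dictionary-checked stemmer. This is a deliberately simplified sketch and not Sastrawi's algorithm: the root dictionary `ROOTS`, the affix lists, and the function `toy_stem` are assumptions made for the example. A production stemmer also has to handle sound changes at the prefix boundary (for example "menulis" from the root "tulis"), which this sketch ignores.

```python
# Hypothetical mini root dictionary; real stemmers ship a large one.
ROOTS = {"baca", "jalan"}
PREFIXES = ("mem", "pem", "ber", "di")
SUFFIXES = ("kan", "an", "i")


def toy_stem(word: str) -> str:
    """Strip one prefix and/or one suffix, accepting only known roots."""
    word = word.lower()
    if word in ROOTS:
        return word
    candidates = [word]
    # Try removing a single prefix.
    for prefix in PREFIXES:
        if word.startswith(prefix):
            candidates.append(word[len(prefix):])
    # Try removing a single suffix from every candidate collected so far.
    for cand in list(candidates):
        for suffix in SUFFIXES:
            if cand.endswith(suffix):
                candidates.append(cand[: -len(suffix)])
    # Only accept a stripped form that is a known root.
    for cand in candidates:
        if cand in ROOTS:
            return cand
    return word  # leave unknown words untouched


for w in ["membaca", "pembaca", "bacaan", "dibacakan", "berjalan", "ekonomi"]:
    print(w, "->", toy_stem(w))
```

The dictionary check is the important design choice: "ekonomi" ends in "i" but is returned unchanged because "ekonom" is not a known root, whereas blind suffix stripping would damage it.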

Section 07

Potential Improvement Directions

Optimization directions:

  1. Feature engineering: Add N-grams (2-3 word phrases), word embeddings (Word2Vec/FastText), stylistic features (sentence length, density of emotion words), and metadata (publication time, source).
  2. Model fusion: Voting ensemble, stacking ensemble, weighted fusion.
  3. Deep learning exploration: CNN (captures local n-gram patterns), LSTM/GRU (captures long-distance dependencies), and transfer learning with pre-trained IndoBERT models.
  4. Data augmentation: Back translation, synonym replacement, self-training, active learning.
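
Two of the directions above, N-gram features and a voting ensemble, are simple enough to sketch directly. The helper names `word_ngrams` and `majority_vote`, the sample tokens, and the prediction vectors are illustrative assumptions. The third model in the vote is hypothetical, since the project itself compares only NB and SVM.

```python
from collections import Counter


def word_ngrams(tokens, n_min=1, n_max=2):
    """Return all word n-grams with n_min <= n <= n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def majority_vote(predictions):
    """Combine per-model label lists into one label per sample by majority."""
    voted = []
    for sample_preds in zip(*predictions):
        voted.append(Counter(sample_preds).most_common(1)[0][0])
    return voted


grams = word_ngrams(["kabar", "bohong", "beredar"], 1, 2)
print(grams)

# Invented predictions (1 = fake, 0 = real) from three models.
nb_pred = [1, 0, 1, 0]
svm_pred = [1, 1, 0, 0]
third_pred = [0, 1, 1, 0]  # hypothetical third classifier
ensemble = majority_vote([nb_pred, svm_pred, third_pred])
print(ensemble)
```

Bigrams such as "kabar bohong" keep phrase-level signal that single tokens lose. Using an odd number of models avoids ties in a binary vote.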

Section 08

Ethical Considerations and Research Summary

Ethical issues:

  • Misjudgment risk: False positives (flagging real news as fake) can restrict freedom of speech, while false negatives (missing fake news) let harm spread, so the decision threshold must balance the two.
  • Bias and fairness: Bias in the training data leads to systematic errors, so regular audits are required.
  • Transparency and accountability: Users have the right to know the basis for a decision, so a manual review mechanism is needed.
  • Adversarial attacks: Fake news can be deliberately modified to bypass detection, so models need continuous updates.

Summary: This project demonstrates the value of classic methods for low-resource languages and provides a reference for multilingual fake news detection. We look forward to more work that helps different language communities address fake news challenges.