Zing Forum

Cross-Domain Sentiment Analysis: When DistilBERT Meets TF-IDF, Are Large Models Always Better?

A comparative study reports a counterintuitive finding: in a cross-domain scenario, a simple TF-IDF + Logistic Regression model performs almost on par with DistilBERT, while the more expensive Transformer model loses 2.4 times as much accuracy under domain transfer as the traditional method.

Sentiment Analysis, Cross-Domain, DistilBERT, TF-IDF, Logistic Regression, Domain Shift, Transformer, Machine Learning, NLP
Published 2026-05-30 11:14 | Recent activity 2026-05-30 11:24 | Estimated read 6 min

Section 01

Cross-Domain Sentiment Analysis: DistilBERT vs TF-IDF+LR, Are Large Models Always Better?

This article compares a classic method with a modern Transformer model on cross-domain sentiment analysis. Key finding: in the Twitter→IMDB transfer, TF-IDF + Logistic Regression performs almost on par with DistilBERT, while DistilBERT's accuracy drop is 2.4 times that of the traditional method. Original author: aarogyaojha. Source: GitHub (https://github.com/aarogyaojha/sentiment_analysis). Publication date: May 30, 2026.

Section 02

Research Background and Motivation

The research stems from a practical problem: how should a model be chosen when it must run on data outside its training distribution? Two approaches are compared: a classic one (TF-IDF + Logistic Regression) and a modern one (DistilBERT). Core question: does the advantage of the larger model persist under domain transfer?

Section 03

Experimental Design Details

Dataset Configuration: Both models are trained on Sentiment140 tweets (1.6 million for DistilBERT, 160,000 for TF-IDF + LR) and tested on 25,000 IMDB movie reviews, forming the Twitter→IMDB cross-domain scenario.

Evaluation Metrics: Accuracy, precision, recall, and F1 are reported. McNemar's test and the chi-square test are used to check whether differences between the models are statistically significant.
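As a rough illustration of the significance testing mentioned here, McNemar's test can be sketched with the standard library alone. The function name `mcnemar_test`, the use of a continuity correction, and the example counts are assumptions made for this sketch; the source does not say which variant of the test was used.

```python
import math

def mcnemar_test(b, c):
    """McNemar's test (with continuity correction) on paired predictions.

    b: examples model A got right and model B got wrong
    c: examples model B got right and model A got wrong
    Returns (chi-square statistic, p-value) with 1 degree of freedom.
    """
    if b + c == 0:
        return 0.0, 1.0  # the two models never disagree
    statistic = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # Survival function of chi-square with 1 dof: P(X > x) = erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical discordant counts, not taken from the study
stat, p = mcnemar_test(b=120, c=80)
print(f"chi2 = {stat:.3f}, p = {p:.4f}")
```

Only the discordant pairs enter the statistic, which is why the test suits two models evaluated on the same test examples.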

Section 04

Intra-Domain vs Cross-Domain Performance Comparison

Intra-Domain Performance: On the Twitter dataset, DistilBERT achieves an accuracy of 85.0% against 77.7% for TF-IDF + LR, a lead of 7.3 percentage points (p < 0.001).

Cross-Domain Performance: After transfer to IMDB, TF-IDF + LR reaches 72.3% accuracy and DistilBERT 71.9%, a difference that is not statistically significant (chi-square = 1.056, p = 0.304).

Performance Decay: DistilBERT's accuracy drops by 13.1 percentage points and TF-IDF + LR's by 5.4, so DistilBERT decays about 2.4 times as much (13.1 / 5.4 ≈ 2.4).
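The decay figures follow from simple arithmetic on the reported accuracies, which the short sketch below reproduces. The accuracies and the chi-square statistic come from the text; treating the statistic as having 1 degree of freedom is an assumption, since the source does not state it.

```python
import math

# Accuracies reported in the text, in percent
in_domain = {"DistilBERT": 85.0, "TF-IDF + LR": 77.7}
cross_domain = {"DistilBERT": 71.9, "TF-IDF + LR": 72.3}

# Absolute drop in percentage points when moving from Twitter to IMDB
drop = {name: in_domain[name] - cross_domain[name] for name in in_domain}
decay_ratio = drop["DistilBERT"] / drop["TF-IDF + LR"]

# p-value for the reported chi-square statistic, ASSUMING 1 degree of freedom
chi2 = 1.056
p_value = math.erfc(math.sqrt(chi2 / 2))

for name, points in drop.items():
    print(f"{name}: -{points:.1f} points")
print(f"decay ratio: {decay_ratio:.1f}x")
print(f"chi-square p-value: {p_value:.3f}")
```

Under the 1-degree-of-freedom assumption, the computed p-value agrees with the reported p = 0.304 to three decimals.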

Section 05

Reasons for Faster Decay of Large Models

The decay is mainly "precision-dominated": the loss shows up mostly in precision. It is speculated that DistilBERT over-relies on local positive markers in tweets (emojis, slang) that carry different meanings in movie reviews, which produces more false positives and lowers precision. In contrast, TF-IDF + LR has a more conservative decision boundary, relies less on such local features, and is more robust across domains.
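To make the link between false positives and precision concrete, here is a minimal sketch using made-up confusion counts. The helper `precision_recall` and all numbers are hypothetical and chosen only for illustration; the source reports no confusion matrices.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) for the positive class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical in-domain counts
in_p, in_r = precision_recall(tp=800, fp=150, fn=200)
# Same true positives and false negatives, but more false positives after the shift
out_p, out_r = precision_recall(tp=800, fp=400, fn=200)

print(f"in-domain:    precision={in_p:.3f}, recall={in_r:.3f}")
print(f"cross-domain: precision={out_p:.3f}, recall={out_r:.3f}")
```

Because false positives appear only in the precision denominator, adding them lowers precision while leaving recall unchanged, which is the pattern described as precision-dominated.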

Section 06

Practical Implications

  1. Intra-domain accuracy alone is not sufficient to guide model selection.
  2. The "precision-recall decay ratio" is recommended as a cross-domain diagnostic indicator.
  3. If cross-domain performance is statistically indistinguishable, the expensive model may not be the best choice.
  4. In critical applications, robustness takes priority over peak performance.
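The source names the "precision-recall decay ratio" without defining it. One plausible reading, assumed here purely for illustration, is the drop in precision divided by the drop in recall between the intra-domain and cross-domain evaluations. The function name and the metric values below are hypothetical.

```python
import math

def precision_recall_decay_ratio(in_domain, cross_domain):
    """Precision drop divided by recall drop between two evaluations.

    Each argument is a (precision, recall) pair. This definition is an
    ASSUMPTION; the source does not define the indicator.
    """
    precision_drop = in_domain[0] - cross_domain[0]
    recall_drop = in_domain[1] - cross_domain[1]
    if recall_drop == 0:
        return math.inf  # recall did not move, so the ratio is unbounded
    return precision_drop / recall_drop

# Hypothetical (precision, recall) pairs, not figures from the study
ratio = precision_recall_decay_ratio((0.86, 0.84), (0.66, 0.80))
print(f"decay ratio: {ratio:.1f}")
```

Under this reading, a ratio well above 1 would flag a model whose cross-domain loss is concentrated in precision.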

Section 07

Limitations and Future Directions

Limitations: The study covers only sentiment analysis and the Twitter→IMDB transfer; results may differ for other tasks or for more drastic domain shifts.

Future Directions: Explore whether domain adaptation techniques can narrow the gap, whether larger models (GPT-4, Llama) show similar patterns, and whether multi-task learning can improve cross-domain robustness.

Section 08

Research Summary

This study challenges the "bigger is better" intuition: in cross-domain deployment or resource-constrained scenarios, TF-IDF + Logistic Regression may be underestimated. Model selection should weigh the characteristics of the deployment environment rather than blindly pursue new architectures. A model with a significant intra-domain advantage may lose that advantage, or even fall behind a simple baseline, in cross-domain scenarios.