# Comparative Study of CNN, ViT, and CCT in Data-Scarce Scenarios: An Empirical Analysis with Parameter Matching

> This article introduces a machine learning course project from Bocconi University. With parameter counts held roughly equal, the project systematically compares Convolutional Neural Networks (CNN), Vision Transformers (ViT), and Compact Convolutional Transformers (CCT) on the CIFAR-10 dataset, focusing on how data volume (10%-100%) and augmentation strategy affect model performance.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-05-28T16:43:26.000Z
- Last activity: 2026-05-28T16:48:55.025Z
- Popularity: 154.9
- Keywords: CNN, ViT, CCT, computer vision, data scarcity, parameter matching, CIFAR-10, data augmentation, Transformer, deep learning
- Page link: https://www.zingnex.cn/en/forum/thread/cnnvitcct
- Canonical: https://www.zingnex.cn/forum/thread/cnnvitcct

---

## Introduction: Core Overview of the Parameter-Matched Comparison of CNN, ViT, and CCT in Data-Scarce Scenarios

This study was completed by a student team from Bocconi University. It systematically compares three architectures (CNN, ViT, and CCT) on the CIFAR-10 dataset across data volumes from 10% to 100%, with and without augmentation, while matching parameter counts at two scales (~0.75M and ~5M). The key findings are: CNN has a clear advantage in low-data scenarios; ViT needs sufficient data to perform well; CCT performs stably across settings; and a "low-data augmentation crossover" appears, in which CNN benefits more from augmentation at low data volumes while ViT benefits more at high data volumes. The study offers practical guidelines for choosing an architecture under different data scenarios.

## Research Background and Motivation

In computer vision, CNNs have long been dominant. ViT has gained ground with its global attention mechanism, yet it performs poorly in data-scarce scenarios. CCT, a hybrid architecture, attempts to combine the advantages of both. Existing comparisons often do not control for parameter count, which makes them unfair. This project addresses that gap by matching parameter counts so that the three architectures can be evaluated fairly in data-scarce scenarios.

## Experimental Design Details

### Dataset and Division
We use CIFAR-10 (60,000 32×32 color images in 10 classes, split into 50,000 training and 10,000 test images) and train on 10%/25%/50%/75%/100% of the data to simulate data-scarce scenarios.
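
The subsampling step can be sketched as follows. This is a minimal illustration and not the project's actual code. It assumes the ratios apply to the training split and that sampling is stratified per class; the source states neither. The function name `stratified_subset` and the `seed` parameter are hypothetical.

```python
import random
from collections import defaultdict

def stratified_subset(labels, fraction, seed=0):
    """Return sorted indices covering `fraction` of every class.

    Assumption: per-class (stratified) sampling; the project may have
    sampled differently.
    """
    rng = random.Random(seed)  # local RNG so the subset is reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    chosen = []
    for label in sorted(by_class):
        idxs = by_class[label]
        k = round(len(idxs) * fraction)  # samples to keep for this class
        chosen.extend(rng.sample(idxs, k))
    return sorted(chosen)

# Stand-in labels shaped like the CIFAR-10 training split:
# 50,000 items, 10 balanced classes.
fake_labels = [i % 10 for i in range(50_000)]
subset = stratified_subset(fake_labels, 0.10)
print(len(subset))  # 5000
```

Using a fixed seed keeps the same subset across architectures, so each model sees identical training images at a given ratio.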

### Model Architectures
- CNN: ResNet-style; small version (~0.76M parameters), large version (~4.90M)
- ViT: Classic architecture; small version (~0.76M), large version (~4.98M)
- CCT: Convolutional tokenizer; small version (~0.73M), large version (~5.26M)
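
How tightly the reported counts are matched can be checked with a few lines. The sketch below uses only the parameter counts listed above; the helper name `relative_spread` is hypothetical, and the source does not say how the project counted parameters.

```python
# Parameter counts in millions, as reported in the list above.
REPORTED_PARAMS_M = {
    "small": {"CNN": 0.76, "ViT": 0.76, "CCT": 0.73},
    "large": {"CNN": 4.90, "ViT": 4.98, "CCT": 5.26},
}

def relative_spread(counts):
    """Return (max - min) / min for a dict of parameter counts."""
    lo, hi = min(counts.values()), max(counts.values())
    return (hi - lo) / lo

for scale, counts in REPORTED_PARAMS_M.items():
    print(f"{scale}: spread = {relative_spread(counts):.1%}")
# small: spread = 4.1%
# large: spread = 7.3%
```

In PyTorch, trainable parameters are commonly counted with `sum(p.numel() for p in model.parameters() if p.requires_grad)`; whether the project used that exact expression is an assumption.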

### Training Configuration
All models share the same training configuration: the AdamW optimizer, a learning-rate schedule of linear warm-up followed by cosine annealing, batch size 256, and 150 epochs, among other shared settings. Each configuration is trained both with and without augmentation (random cropping + horizontal flipping).
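
The learning-rate schedule described above (linear warm-up followed by cosine annealing) can be sketched per epoch as below. Only the 150-epoch length comes from the source. The warm-up length of 10 epochs, the base rate of 1e-3, and the floor of 0 are placeholder assumptions, and `lr_at` is a hypothetical helper rather than the project's scheduler.

```python
import math

def lr_at(epoch, total_epochs=150, warmup_epochs=10, base_lr=1e-3, min_lr=0.0):
    """Learning rate for a zero-based epoch index.

    warmup_epochs, base_lr and min_lr are illustrative assumptions.
    """
    if epoch < warmup_epochs:
        # Linear warm-up: reaches base_lr on the last warm-up epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    # Cosine annealing from base_lr down towards min_lr.
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for epoch in (0, 9, 80, 149):
    print(epoch, f"{lr_at(epoch):.6f}")
```

In a PyTorch training loop the same shape is usually obtained by chaining built-in schedulers or by passing a function like this to `torch.optim.lr_scheduler.LambdaLR` as a multiplier on the base rate.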

## Key Findings and Empirical Evidence

### "Low-Data Augmentation Crossover" Phenomenon
- Low data (10%-25%): CNN benefits more from augmentation (e.g., the large model reaches 79.8% accuracy with augmentation on 10% of the data)
- High data (75%-100%): ViT benefits more from augmentation (e.g., the large model reaches 85.8% accuracy with augmentation on 100% of the data)
- CCT: augmentation has limited impact; performance is stable

#### Large Model Configuration (~5M Parameters)
| Data Ratio | CNN_large | ViT_large | CCT_large |
|---------|-----------|-----------|-----------|
|10%|79.80%|54.40%|70.47%|
|25%|88.57%|68.28%|81.13%|
|50%|92.18%|77.53%|86.55%|
|75%|93.94%|82.43%|89.36%|
|100%|94.91%|85.80%|90.84%|

#### Small Model Configuration (~0.75M Parameters)
| Data Ratio | CNN_small | ViT_small | CCT_small |
|---------|-----------|-----------|-----------|
|10%|77.40%|53.73%|68.11%|
|25%|83.41%|65.52%|79.19%|
|50%|87.07%|74.08%|84.90%|
|75%|88.28%|79.58%|87.62%|
|100%|89.27%|82.60%|89.65%|

### In-Depth Analysis
- CKA (Centered Kernel Alignment): CNN and CCT learn similar representations in their shallow layers, while ViT's differ significantly
- Linear probing: CNN features transfer better in low-data scenarios
- Average attention distance: CCT's attention is more localized, while ViT's global attention tends to spread too widely in low-data scenarios
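
For readers unfamiliar with CKA, the linear variant can be written in a few lines. The sketch below is a dependency-free illustration of linear CKA on two activation matrices (rows are examples, columns are features). It is not the project's analysis code, and the source does not say which CKA variant or which layers were used.

```python
import math

def _center(m):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]

def _cross_sq_frobenius(a, b):
    """Squared Frobenius norm of a^T b, computed column pair by column pair."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total

def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data.

    x and y must have the same number of rows (examples).
    Constant (zero-variance) inputs make the denominator zero.
    """
    xc, yc = _center(x), _center(y)
    num = _cross_sq_frobenius(xc, yc)
    den = math.sqrt(_cross_sq_frobenius(xc, xc) * _cross_sq_frobenius(yc, yc))
    return num / den

acts = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
scaled = [[3.0 * v for v in row] for row in acts]
print(linear_cka(acts, scaled))  # close to 1: CKA ignores isotropic scaling
```

A score near 1 means two layers encode the same structure up to rotation and scaling; a score near 0 means the representations are unrelated. In practice the computation is done with matrix libraries on batches of real activations.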

## Practical Implications and Recommendations

### Architecture Selection Guide
1. **Low data (<25%)**: Prioritize CNN/CCT; avoid pure ViT
2. **Medium data (25%-75%)**: CCT is a balanced choice with low dependency on augmentation
3. **High data (>75%)**: Choose based on inference speed/interpretability

### Reflection on Data Augmentation
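The effect of augmentation depends on architectural characteristics. Architectures with strong built-in inductive biases behave differently from ViT: CCT gains little from augmentation at any data volume, and CNN gains most when data is scarce. ViT, which has much weaker built-in inductive biases, benefits more in high-data scenarios.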

### Value of Hybrid Architectures
CCT's results support the value of combining CNN-style local feature extraction with Transformer-style global modeling, and they offer insights for future architecture design.

## Research Limitations and Future Directions

### Limitations
1. Only uses CIFAR-10 (low resolution) and does not cover high-resolution datasets
2. Only focuses on image classification tasks and does not involve complex tasks like detection/segmentation
3. Single augmentation strategy (only random cropping + flipping)

### Future Directions
1. Extend to large-scale datasets like ImageNet
2. Include more hybrid architectures (e.g., CoAtNet)
3. Explore self-supervised pre-training scenarios
4. Theoretically analyze the differences between architectures and their responses to augmentation

## Conclusion and Resources

### Conclusion
Through parameter-matched experiments, this study reveals the performance differences of the three architectures in data-scarce scenarios and the "low-data augmentation crossover" phenomenon, emphasizing the importance of fair experimental design.

### Reproducibility Resources
The project is open-source:
- Full report: [ML#13_report.pdf](https://github.com/trozki213/cnn-vit-cct-comparison/blob/main/ML%2313_report.pdf)
- Code/scripts: Training scripts, analysis notebooks
- Pre-trained weights: 60 PyTorch checkpoints
- Training logs: Recorded in JSON format
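
The count of 60 checkpoints is consistent with a full grid over the settings described in this article: 3 architectures × 2 parameter scales × 5 data ratios × 2 augmentation settings. The sketch below enumerates such a grid. The run-name format is hypothetical and is not the repository's actual file naming; it also assumes one checkpoint per configuration.

```python
from itertools import product

ARCHS = ["cnn", "vit", "cct"]
SIZES = ["small", "large"]
RATIO_PERCENTS = [10, 25, 50, 75, 100]
AUGMENT = [False, True]

def run_grid():
    """Enumerate every (architecture, size, ratio, augmentation) combination."""
    return [
        {"arch": a, "size": s, "ratio_percent": r, "augment": g}
        for a, s, r, g in product(ARCHS, SIZES, RATIO_PERCENTS, AUGMENT)
    ]

def run_name(cfg):
    """Hypothetical naming scheme, for illustration only."""
    aug = "aug" if cfg["augment"] else "noaug"
    return f"{cfg['arch']}_{cfg['size']}_r{cfg['ratio_percent']}_{aug}"

grid = run_grid()
print(len(grid))          # 60
print(run_name(grid[0]))  # cnn_small_r10_noaug
```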
