Comparative Study of CNN, ViT, and CCT in Data-Scarce Scenarios: An Empirical Analysis with Parameter Matching

This article introduces a machine learning course project from Bocconi University. Under the premise of controlling the number of parameters, the project systematically compares the performance of Convolutional Neural Networks (CNN), Vision Transformers (ViT), and Compact Convolutional Transformers (CCT) on the CIFAR-10 dataset, with a special focus on the impact of different data volumes (10%-100%) and augmentation strategies on model performance.

Tags: CNN, ViT, CCT, Computer Vision, Data Scarcity, Parameter Matching, CIFAR-10, Data Augmentation, Transformer, Deep Learning

Published 2026-05-29 00:43 | Recent activity 2026-05-29 00:48 | Estimated read 8 min

Section 01

[Introduction] Core Overview of Parameter-Matched Comparative Study of CNN, ViT, and CCT in Data-Scarce Scenarios

This study was completed by a student team from Bocconi University. It systematically compares three architectures (CNN, ViT, and CCT) on the CIFAR-10 dataset across data volumes from 10% to 100%, with and without augmentation, while holding the number of parameters fixed at two scales (~0.75M and ~5M). Key findings: CNN has a significant advantage in low-data scenarios; ViT needs sufficient data to perform well; CCT is stable and comparatively insensitive to augmentation; and a "low-data augmentation crossover" appears, in which CNN benefits more from augmentation in low-data scenarios while ViT benefits more in high-data scenarios. The study turns these results into practical guidelines for choosing an architecture at a given data volume.

Section 02

Research Background and Motivation

In computer vision, CNNs have long been dominant. ViT has risen on the strength of its global attention mechanism, yet it performs poorly in data-scarce scenarios. CCT, a hybrid architecture, attempts to combine the advantages of both. Existing studies often do not control the number of parameters, which makes comparisons unfair: a model may win simply because it is larger. This project addresses that gap by matching parameter counts, so that the three architectures can be evaluated fairly in data-scarce scenarios.

Section 03

Experimental Design Details

Dataset and Division

We use CIFAR-10 (60,000 32×32 color images in 10 classes, split by default into 50,000 training and 10,000 test images), with data ratios of 10%, 25%, 50%, 75%, and 100% to simulate data-scarce scenarios.
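The data ratios above can be produced by drawing a fixed-seed subset of the training indices. The sketch below is only one plausible way to do it: the source does not say whether the subsets were stratified by class, so the per-class sampling, the function name `stratified_subset`, the seed, and the toy labels are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_subset(labels, fraction, seed=0):
    """Return sorted indices that keep `fraction` of the examples in each class.

    Hypothetical helper: class-balanced sampling is an assumption,
    not something the project description confirms.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        k = round(len(indices) * fraction)
        chosen.extend(rng.sample(indices, k))
    return sorted(chosen)


# Toy stand-in for CIFAR-10 training labels: 10 classes, 100 examples each.
toy_labels = [i % 10 for i in range(1000)]
subset = stratified_subset(toy_labels, 0.25)
print(len(subset))  # 250 indices, 25 per class
```

Fixing the seed keeps the subset identical across architectures, so every model at a given ratio sees the same images.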

Model Architectures

CNN: ResNet-style; small version (0.76M parameters), large version (4.90M)
ViT: Classic architecture; small version (0.76M), large version (4.98M)
CCT: Convolutional tokenizer; small version (0.73M), large version (5.26M)
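How tight the parameter matching is can be checked directly from the counts listed above. The short sketch below computes the relative spread between the smallest and largest model at each scale; the spread metric and the helper name `relative_spread` are illustrative choices, not something defined by the project.

```python
# Parameter counts in millions, copied from the list above.
param_counts_m = {
    "small": {"CNN": 0.76, "ViT": 0.76, "CCT": 0.73},
    "large": {"CNN": 4.90, "ViT": 4.98, "CCT": 5.26},
}


def relative_spread(counts):
    """Return (max - min) / min as a percentage (illustrative metric)."""
    lo, hi = min(counts.values()), max(counts.values())
    return 100.0 * (hi - lo) / lo


spreads = {scale: round(relative_spread(c), 1) for scale, c in param_counts_m.items()}
print(spreads)  # {'small': 4.1, 'large': 7.3}
```

By this measure the three models sit within roughly 4% of each other at the small scale and within roughly 7% at the large scale. In PyTorch, such counts are usually obtained with `sum(p.numel() for p in model.parameters())`.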

Training Configuration

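All models share the same training recipe: the AdamW optimizer, a learning-rate schedule with linear warm-up followed by cosine annealing, batch size 256, and 150 epochs, among other common settings. Each configuration is trained both with and without augmentation (random cropping plus horizontal flipping).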
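A linear warm-up followed by cosine annealing can be written as a single function of the training step. The sketch below is a generic version of that schedule, not the project's code: the peak learning rate (1e-3), the warm-up length (10 epochs), the zero floor, and the per-epoch stepping are assumptions chosen only to make the example concrete. Only the 150-epoch length comes from the text.

```python
import math


def lr_at_step(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Learning rate at `step` (0-indexed) for linear warm-up plus cosine decay."""
    if step < warmup_steps:
        # Ramp linearly up to base_lr over the warm-up phase.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down towards min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


EPOCHS = 150         # from the training configuration above
WARMUP_EPOCHS = 10   # assumed, not given in the source
PEAK_LR = 1e-3       # assumed, not given in the source

schedule = [lr_at_step(e, EPOCHS, WARMUP_EPOCHS, PEAK_LR) for e in range(EPOCHS)]
print(schedule[0], schedule[WARMUP_EPOCHS - 1], schedule[-1])
```

The rate climbs to the peak by the end of warm-up and then decays smoothly towards the floor, which avoids large early updates that can destabilize Transformer training.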

Section 04

Key Findings and Empirical Evidence

"Low-Data Augmentation Crossover" Phenomenon

Low data (10%-25%): CNN benefits more from augmentation (e.g., large model achieves 79.8% accuracy after augmentation with 10% data)
High data (75%-100%): ViT benefits more significantly from augmentation (large model achieves 85.8% accuracy after augmentation with 100% data)
CCT: Augmentation has limited impact; performance is stable

Large Model Configuration (~5M Parameters): Accuracy by Data Ratio

Data Ratio	CNN_large	ViT_large	CCT_large
10%	79.80%	54.40%	70.47%
25%	88.57%	68.28%	81.13%
50%	92.18%	77.53%	86.55%
75%	93.94%	82.43%	89.36%
100%	94.91%	85.80%	90.84%

Small Model Configuration (~0.75M Parameters): Accuracy by Data Ratio

Data Ratio	CNN_small	ViT_small	CCT_small
10%	77.40%	53.73%	68.11%
25%	83.41%	65.52%	79.19%
50%	87.07%	74.08%	84.90%
75%	88.28%	79.58%	87.62%
100%	89.27%	82.60%	89.65%
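The trend in the two tables is easier to see as the CNN-minus-ViT accuracy gap at each data ratio. The sketch below recomputes that gap from the reported numbers; the dictionary layout and the helper name `gap` are illustrative.

```python
ratios = [10, 25, 50, 75, 100]

# Accuracies (%) copied from the two tables above.
accuracy = {
    "CNN_large": [79.80, 88.57, 92.18, 93.94, 94.91],
    "ViT_large": [54.40, 68.28, 77.53, 82.43, 85.80],
    "CCT_large": [70.47, 81.13, 86.55, 89.36, 90.84],
    "CNN_small": [77.40, 83.41, 87.07, 88.28, 89.27],
    "ViT_small": [53.73, 65.52, 74.08, 79.58, 82.60],
    "CCT_small": [68.11, 79.19, 84.90, 87.62, 89.65],
}


def gap(model_a, model_b):
    """Accuracy difference in percentage points, model_a minus model_b, per ratio."""
    return [round(a - b, 2) for a, b in zip(accuracy[model_a], accuracy[model_b])]


large_gap = gap("CNN_large", "ViT_large")
small_gap = gap("CNN_small", "ViT_small")
for ratio, lg, sg in zip(ratios, large_gap, small_gap):
    print(f"{ratio:>3}% data: CNN-ViT gap large={lg:.2f} small={sg:.2f}")
```

The gap shrinks steadily as data grows, from about 25 points at 10% to about 9 points at 100% for the large models, and from about 24 to about 7 points for the small ones. The only cell where CNN is not the best model is the small configuration at 100% data, where CCT_small (89.65%) edges out CNN_small (89.27%).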

In-Depth Analysis

CKA (centered kernel alignment): The shallow-layer representations of CNN and CCT are similar, while ViT's differ significantly
Linear probing: CNN features transfer better in low-data scenarios
Average attention distance: CCT's attention is more localized; ViT's global attention tends to spread too widely in low-data scenarios
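For readers unfamiliar with CKA, the sketch below implements the common linear variant, which compares two sets of activations recorded on the same inputs. It is a dependency-free illustration: the source does not state which CKA variant or kernel the project used, and the toy matrices stand in for real layer activations.

```python
import math


def center_columns(X):
    """Subtract each column's mean; X is a list of rows (examples x features)."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X]


def cross_frobenius_sq(A, B):
    """Squared Frobenius norm of A^T B for matrices with the same number of rows."""
    total = 0.0
    for col_a in zip(*A):
        for col_b in zip(*B):
            dot = sum(a * b for a, b in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(X, Y):
    """Linear CKA: ||Yc^T Xc||_F^2 / (||Xc^T Xc||_F * ||Yc^T Yc||_F)."""
    Xc, Yc = center_columns(X), center_columns(Y)
    numerator = cross_frobenius_sq(Xc, Yc)
    denominator = math.sqrt(cross_frobenius_sq(Xc, Xc) * cross_frobenius_sq(Yc, Yc))
    return numerator / denominator


# Toy activations: 4 examples, 2 features each.
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]]
# Same features, rescaled and with columns swapped: CKA should be close to 1.
Y = [[3.0 * row[1], 3.0 * row[0]] for row in X]
print(linear_cka(X, Y))
```

Because linear CKA is invariant to isotropic scaling and to orthogonal transforms of the feature axes, it can compare layers of different architectures whose individual units have no correspondence.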

Section 05

Practical Implications and Recommendations

Architecture Selection Guide

Low data (up to 25%): Prioritize CNN/CCT; avoid pure ViT
Medium data (between 25% and 75%): CCT is a balanced choice with low dependency on augmentation
High data (75% and above): Choose based on inference speed/interpretability

Reflection on Data Augmentation

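The effect of augmentation depends on both architectural characteristics and data volume. CNN, with strong built-in inductive biases, benefits more from augmentation in low-data scenarios; ViT, with much weaker built-in inductive biases, benefits more in high-data scenarios; CCT is the least sensitive to augmentation.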

Value of Hybrid Architectures

CCT's results support the case for fusing the local feature extraction of CNNs with the global modeling of Transformers, and they offer insights for future architecture design.

Section 06

Research Limitations and Future Directions

Limitations

Only uses CIFAR-10 (low resolution) and does not cover high-resolution datasets
Only focuses on image classification tasks and does not involve complex tasks like detection/segmentation
Single augmentation strategy (only random cropping + flipping)

Future Directions

Extend to large-scale datasets like ImageNet
Include more hybrid architectures (e.g., CoAtNet)
Explore self-supervised pre-training scenarios
Theoretically analyze the differences between architectures and their responses to augmentation

Section 07

Conclusion and Resources

Conclusion

Through parameter-matched experiments, this study reveals the performance differences of the three architectures in data-scarce scenarios and the "low-data augmentation crossover" phenomenon, emphasizing the importance of fair experimental design.

Reproducibility Resources

The project is open-source:

Full report: ML#13_report.pdf
Code/scripts: Training scripts, analysis notebooks
Pre-trained weights: 60 PyTorch checkpoints
Training logs: Recorded in JSON format
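The count of 60 checkpoints is consistent with a full grid of 3 architectures × 2 model scales × 5 data ratios × 2 augmentation settings. That factorization is an inference from the experimental design, not something the source states, and the naming scheme in `run_name` is hypothetical. A sketch of enumerating such a grid:

```python
from itertools import product

architectures = ["CNN", "ViT", "CCT"]
scales = ["small", "large"]
data_ratios = [10, 25, 50, 75, 100]   # percent of the data
augmentation = [False, True]

# One dictionary per training run in the assumed full grid.
runs = [
    {"arch": arch, "scale": scale, "ratio": ratio, "augment": aug}
    for arch, scale, ratio, aug in product(architectures, scales, data_ratios, augmentation)
]


def run_name(run):
    """Hypothetical checkpoint name; the project's real file names are not given."""
    suffix = "aug" if run["augment"] else "noaug"
    return f"{run['arch']}_{run['scale']}_{run['ratio']}pct_{suffix}"


print(len(runs))          # 60
print(run_name(runs[0]))  # CNN_small_10pct_noaug
```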