
Fusion of Convolution and Attention Mechanism: Analysis of the Convolutional Nearest Neighbors (ConvNN) Unified Framework

This article introduces a new neural network architecture called Convolutional Nearest Neighbors (ConvNN), which unifies convolution and self-attention mechanisms through a k-nearest neighbor aggregation framework, providing a new theoretical perspective for computer vision model design.

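Tags: convolutional neural networks, attention mechanism, Transformer, computer vision, k-nearest neighbors, deep learning, model architecture, CIFAR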

Published 2026-05-27 11:15 | Recent activity 2026-05-27 11:18 | Estimated read 6 min


Section 01

Introduction: Convolutional Nearest Neighbors (ConvNN), a New Framework Unifying Convolution and Attention Mechanisms

The core innovation of Convolutional Nearest Neighbors (ConvNN) is to unify convolution and self-attention under a single k-nearest neighbor aggregation framework, which offers a new theoretical perspective for computer vision model design. ConvNN treats both operations as special cases of neighbor selection and aggregation: convolution selects neighbors by spatial proximity, attention selects them by feature similarity, and a continuous spectrum lies between the two. The reported experiments show ConvNN outperforming pure-convolution and pure-attention schemes on the CIFAR dataset, and the layer can be integrated into existing architectures as a plug-and-play module.

Section 02

Background: The Divide Between Convolution and Attention and the Opportunity for Unification

In computer vision, CNNs and Transformers represent two feature extraction paradigms: convolution captures local features through fixed spatial neighborhoods, while self-attention dynamically models global dependencies through feature similarity. The two have long been regarded as independent approaches, but a team at Bowdoin College found that both are special cases of neighbor selection and aggregation, which opens the door to a unified framework.

Section 03

Core Ideas and Technical Implementation of ConvNN

Core Ideas

ConvNN unifies convolution and self-attention as two extremes of k-nearest neighbor aggregation:

Convolution: selects neighbors based on spatial proximity
Self-attention: selects neighbors based on feature similarity

There is a continuous spectrum between the two, allowing smooth interpolation, and ConvNN can be used as a plug-and-play module.
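
The two selection rules can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the toy 1-D token sequence, the dot-product similarity, the plain averaging step (standing in for learned weights), and the `alpha` blending parameter are all assumptions chosen for illustration.

```python
from typing import Callable, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def select_neighbors(n: int, i: int, k: int,
                     score: Callable[[int, int], float]) -> List[int]:
    """Indices of the k tokens with the highest score relative to token i."""
    order = sorted(range(n), key=lambda j: (-score(i, j), j))  # ties: lower index first
    return sorted(order[:k])


def aggregate(features: List[Vector], neighbors: List[int]) -> Vector:
    """Average the selected neighbors (a stand-in for learned weights)."""
    dim = len(features[0])
    return [sum(features[j][d] for j in neighbors) / len(neighbors)
            for d in range(dim)]


# Toy 1-D "image": 6 tokens, each with a 2-D feature vector.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0],
            [0.1, 0.9], [1.0, 0.1], [0.0, 1.0]]
n = len(features)


def spatial_score(i: int, j: int) -> float:
    return -abs(i - j)                       # convolution-like: closer is better


def feature_score(i: int, j: int) -> float:
    return dot(features[i], features[j])     # attention-like: similar is better


def blended_score(alpha: float) -> Callable[[int, int], float]:
    """Hypothetical interpolation: alpha=1 is purely spatial, alpha=0 purely feature-based."""
    return lambda i, j: alpha * spatial_score(i, j) + (1 - alpha) * feature_score(i, j)


spatial_nbrs = select_neighbors(n, 0, 3, spatial_score)   # adjacent tokens
feature_nbrs = select_neighbors(n, 0, 3, feature_score)   # look-alike tokens
spatial_out = aggregate(features, spatial_nbrs)
feature_out = aggregate(features, feature_nbrs)
```

For token 0, the spatial rule picks its immediate neighbors, while the similarity rule reaches across the sequence to token 4, whose features resemble its own. Only the scoring function changes between the two cases, which is the sense in which both fit a single aggregation framework.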

Technical Implementation

Hybrid Branch Architecture: In a VGG-style architecture, a spatial convolution branch and a feature similarity aggregation branch run side by side to fuse local and global information, which is reported to improve accuracy on the CIFAR dataset.
ViT Replacement Experiment: Replacing the self-attention layers of ViT with ConvNN yields performance that surpasses the original attention and its variants, balancing local details and global context.
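
A rough sketch of the hybrid-branch idea follows, in plain Python on a toy 1-D sequence. The fusion by channel concatenation, the averaging step, and the names `hybrid_layer` and `top_k` are assumptions made for illustration. The source does not specify how the two branches are combined.

```python
from typing import List

Vector = List[float]


def top_k(scores: List[float], k: int) -> List[int]:
    """Indices of the k largest scores (ties broken by lower index)."""
    ranked = sorted(range(len(scores)), key=lambda j: (-scores[j], j))
    return sorted(ranked[:k])


def mean(vectors: List[Vector]) -> Vector:
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def hybrid_layer(features: List[Vector], k: int) -> List[Vector]:
    """Per token, concatenate a spatial-branch and a similarity-branch output."""
    n = len(features)
    fused = []
    for i in range(n):
        # Branch 1: k spatially nearest tokens (local context).
        spatial = top_k([-abs(i - j) for j in range(n)], k)
        # Branch 2: k most similar tokens by dot product (global context).
        similar = top_k(
            [sum(a * b for a, b in zip(features[i], features[j])) for j in range(n)],
            k,
        )
        local_out = mean([features[j] for j in spatial])
        global_out = mean([features[j] for j in similar])
        fused.append(local_out + global_out)  # list concatenation doubles the channels
    return fused


features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
fused = hybrid_layer(features, k=2)
```

In a real network, each branch would apply learned weights and the fused output would typically pass through a further projection. The sketch only shows the two neighbor sets being computed from the same input and combined per token.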

Section 04

Ablation Experiments: Key Findings and Regularization Effects

The research team obtained the following findings through ablation experiments:

Impact of k Value: A small k leans toward local features (similar to CNN), a large k leans toward global features (similar to Transformer), and a medium k achieves the best performance.
Regularization Effect: The interpolation strategy can avoid over-focusing on distant noise, preserve local details, improve generalization ability, and reduce overfitting.
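
The large-k end of this spectrum can be checked with a small sketch. Under the assumptions used here (dot-product scores, softmax weights, and no learned query, key, or value projections), similarity-based k-nearest neighbor aggregation with k equal to the number of tokens coincides with full softmax attention, while a smaller k restricts each token to fewer neighbors. The function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(xs: List[float]) -> List[float]:
    m = max(xs)                                   # subtract max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def knn_attention(features: List[Vector], i: int, k: int) -> Vector:
    """Softmax-weighted average over the k tokens most similar to token i."""
    scores = [dot(features[i], f) for f in features]
    keep = sorted(range(len(features)), key=lambda j: (-scores[j], j))[:k]
    keep.sort()
    weights = softmax([scores[j] for j in keep])
    dim = len(features[0])
    return [sum(w * features[j][d] for w, j in zip(weights, keep))
            for d in range(dim)]


def full_attention(features: List[Vector], i: int) -> Vector:
    """Unprojected softmax attention of token i over every token."""
    weights = softmax([dot(features[i], f) for f in features])
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
n = len(features)

narrow = knn_attention(features, 0, 1)   # k=1: only the best-matching token
wide = knn_attention(features, 0, n)     # k=n: every token contributes
reference = full_attention(features, 0)
```

Sweeping k between 1 and n in a sketch like this gives a simple way to see how much dissimilar context each token absorbs, which is the knob the ablation varies.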

Section 05

Research Significance: Theoretical and Practical Value

Theoretical Contribution

ConvNN looks past the surface differences between convolution and attention, showing that both are instances of the same neighbor selection and aggregation operation, and so provides a unified perspective for architecture design.

Practical Value

ConvNN can be integrated into existing CNN and Transformer architectures, offers an accuracy-efficiency trade-off for resource-constrained scenarios, and helps explore strategies for local-global feature interaction.

Open Source Ecosystem

The project has been open-sourced, providing a ConvNN-Attention implementation repository and an undergraduate thesis that explains the mathematical foundations and experimental details.

Section 06

Summary and Insights: From Binary Opposition to Continuous Optimization

ConvNN represents a shift in architecture design thinking: from the binary opposition of "convolution vs. attention" to choosing the best operating point on a continuous spectrum. For practitioners, it provides a tool for improving visual models; for researchers, it opens new paths for exploring hybrid architectures. Mechanisms that flexibly balance local and global information are likely to become more important.