Multimodal Influencer Profiling System: An Attention Neural Network Classification Method Fusing BERT Text and InceptionV3 Visual Features

This project presents a multimodal influencer classification system that combines BERT text embeddings with InceptionV3 image embeddings. An attention-based neural network reaches 85% classification accuracy, giving brands an automated way to screen influencers for precision marketing.

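Multimodal learning, influencer profiling, BERT, InceptionV3, attention mechanism, social media analysis, influencer marketing, deep learning, image classification, text embedding
Published 2026-05-19 22:05 | Recent activity 2026-05-19 22:20 | Estimated read 6 min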

Section 01

Introduction: Overview of the Multimodal Influencer Profiling System

This study proposes a multimodal system for classifying influencer profiles. It fuses BERT text embeddings with InceptionV3 visual embeddings and reaches 85% classification accuracy with an attention-based neural network. The goal is to replace manual influencer screening, which is slow and hard to scale, with an automated screening step that brands can use for precision marketing.

Section 02

Research Background: Screening Challenges in Influencer Marketing

Influencer marketing has become a core channel for brand promotion on social media, but with millions of creators to choose from, brands struggle to find suitable influencers quickly. Traditional manual screening relies on subjective judgment, is slow, and does not scale. This project builds an automated multimodal framework that analyzes the text and images in influencers' posts, helping brands identify suitable influencers accurately, reduce costs, and target placements more precisely.

Section 03

Methodology: Dataset and Multimodal Feature Extraction

Dataset Construction

Starting from an Instagram influencer dataset (33,000 influencers, 1.6 million posts), we drew a stratified sample of 1,500 influencers to keep the classes balanced and took 20 posts from each influencer.
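The sampling step can be sketched roughly as follows. This is an illustration only: the source does not state how many classes there are or how the 20 posts were chosen, so the record layout (`id`, `label`, `posts`), the random post selection, and the rule of skipping accounts with fewer than 20 posts are all assumptions.

```python
import random
from collections import defaultdict

def stratified_sample(influencers, per_class, posts_per_influencer=20, seed=0):
    """Pick `per_class` influencers from every class, then a fixed number of posts from each.

    `influencers` is a list of dicts with hypothetical keys "id", "label", "posts".
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for inf in influencers:
        # Assumption: accounts that cannot fill the post quota are skipped.
        if len(inf["posts"]) >= posts_per_influencer:
            by_class[inf["label"]].append(inf)

    sample = []
    for label in sorted(by_class):
        # rng.sample raises ValueError if a class has fewer than `per_class` members.
        for inf in rng.sample(by_class[label], per_class):
            posts = rng.sample(inf["posts"], posts_per_influencer)
            sample.append({"id": inf["id"], "label": label, "posts": posts})
    return sample
```

With equal `per_class` for every label, the result is balanced by construction; reaching 1,500 influencers would mean setting `per_class` to 1,500 divided by the number of classes.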

Multimodal Feature Extraction

  • Text Features: Post captions are encoded with BERT-base-multilingual-cased after preprocessing (URL removal, emoji-to-text conversion, etc.), yielding a 768-dimensional vector.
  • Visual Features: Images are passed through a pre-trained InceptionV3 after preprocessing (resizing, normalization, etc.), yielding a 1024-dimensional vector.
  • Fusion Layer: The text and image vectors are concatenated into a 1792-dimensional multimodal feature (768 + 1024).
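A minimal sketch of the caption cleaning and the fusion step, using only the standard library. The emoji table is a tiny hypothetical stand-in for a real emoji-to-text mapping, and the embedding vectors are taken as plain lists rather than produced by BERT or InceptionV3. The 1024-dimensional image size is used as reported; the stock InceptionV3 pooling layer emits 2048 values, so some reduction step is presumably involved that the source does not describe.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

# Hypothetical two-entry table for illustration; a real pipeline needs a full mapping.
EMOJI_TO_TEXT = {"\U0001F525": " fire ", "\u2764\ufe0f": " red heart "}

TEXT_DIM, IMAGE_DIM = 768, 1024  # sizes reported in the text

def clean_caption(caption):
    """Remove URLs, replace known emoji with words, and collapse whitespace."""
    text = URL_PATTERN.sub(" ", caption)
    for emoji_char, name in EMOJI_TO_TEXT.items():
        text = text.replace(emoji_char, name)
    return " ".join(text.split())

def fuse(text_vec, image_vec):
    """Concatenate one text embedding and one image embedding into a 1792-d vector."""
    if len(text_vec) != TEXT_DIM or len(image_vec) != IMAGE_DIM:
        raise ValueError("unexpected embedding size")
    return list(text_vec) + list(image_vec)
```

Plain concatenation keeps both modalities intact and leaves it to the downstream classifier to decide how much weight each part gets.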

Model Comparison Design

We compare traditional machine-learning classifiers (Random Forest, KNN, SVM, Gaussian Naive Bayes) with a deep-learning model (the attention neural network) under three input conditions: text-only, image-only, and multimodal.
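One way to picture the three input conditions is as slices of the fused vector, crossed with the list of models. The sketch below assumes the text part comes first in the concatenation (the order given in the fusion step); the function and constant names are illustrative, not taken from the project.

```python
TEXT_DIM, IMAGE_DIM = 768, 1024  # sizes reported in the text

MODELS = ["Random Forest", "KNN", "SVM", "Gaussian Naive Bayes", "Attention Neural Network"]
CONDITIONS = ["text", "image", "multimodal"]

def select_modality(fused, condition):
    """Return the part of a fused 1792-d vector that a given input condition uses."""
    if len(fused) != TEXT_DIM + IMAGE_DIM:
        raise ValueError("expected a 1792-dimensional fused vector")
    if condition == "text":
        return fused[:TEXT_DIM]
    if condition == "image":
        return fused[TEXT_DIM:]
    if condition == "multimodal":
        return list(fused)
    raise ValueError(f"unknown condition: {condition}")

# Every model is evaluated under every condition: 5 models x 3 conditions.
experiment_grid = [(model, condition) for model in MODELS for condition in CONDITIONS]
```

Running each model on identical splits under all three conditions is what makes the per-modality comparison meaningful.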

Section 04

Experimental Results and Performance Analysis

Experimental results show:

Model                      Text-only   Image-only   Multimodal
Random Forest              45%         73.33%       75%
KNN                        39%         58%          74%
SVM                        51%         78%          83%
Gaussian Naive Bayes       27.67%      65%          76.33%
Attention Neural Network   56%         79%          85%

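Key Findings: Visual features are more discriminative than text, with image-only accuracy exceeding text-only accuracy for every model. Multimodal fusion improves on the better single modality in every case. The attention neural network performs best overall (85% accuracy), with SVM the strongest traditional model (83%). Gaussian Naive Bayes is the weakest model on text alone (27.67%).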
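The size of the fusion benefit can be read off the reported accuracies with a little arithmetic. The sketch below copies the figures from the results above and defines "gain" as multimodal accuracy minus the better single-modality accuracy; that definition is a choice made for this illustration.

```python
# Accuracy in percent: (text-only, image-only, multimodal), as reported above.
ACCURACY = {
    "Random Forest": (45.0, 73.33, 75.0),
    "KNN": (39.0, 58.0, 74.0),
    "SVM": (51.0, 78.0, 83.0),
    "Gaussian Naive Bayes": (27.67, 65.0, 76.33),
    "Attention Neural Network": (56.0, 79.0, 85.0),
}

def fusion_gain(text, image, multimodal):
    """Percentage points gained by fusion over the better single modality."""
    return multimodal - max(text, image)

for model, scores in ACCURACY.items():
    print(f"{model}: +{fusion_gain(*scores):.2f} points")
```

On these figures the gain ranges from 1.67 points (Random Forest) to 16 points (KNN), with 6 points for the attention network.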

Section 05

Working Principle of the Attention Mechanism

  1. Each post is encoded into a text/image feature pair by BERT and InceptionV3;
  2. The model learns an importance weight for each post;
  3. The 20 per-post features are combined by a weighted sum into a single representation of the influencer;
  4. A fully connected layer followed by Softmax outputs the class probabilities.

This lets the model focus on an influencer's most representative posts and down-weight noisy ones.
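The four steps above can be sketched as attention pooling over per-post features. The source does not specify how post scores are computed, so this sketch assumes the simplest form: a single scoring vector whose dot product with each post feature gives that post's score. All weights are passed in as arguments here; in a trained model they would be learned.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(post_features, score_weights):
    """Weighted sum of per-post vectors; returns (pooled vector, attention weights)."""
    # Step 2: one scalar score per post (assumed linear scoring), normalized by softmax.
    scores = [sum(w * x for w, x in zip(score_weights, feats)) for feats in post_features]
    alphas = softmax(scores)
    # Step 3: weighted aggregation across posts, dimension by dimension.
    dim = len(post_features[0])
    pooled = [sum(a * feats[d] for a, feats in zip(alphas, post_features)) for d in range(dim)]
    return pooled, alphas

def classify(pooled, class_weights, class_biases):
    """Step 4: fully connected layer followed by softmax over classes."""
    logits = [sum(w * x for w, x in zip(row, pooled)) + b
              for row, b in zip(class_weights, class_biases)]
    return softmax(logits)
```

In the setting described, `post_features` would hold 20 vectors of 1792 values each. With an all-zero scoring vector every post gets the same weight and the pooling reduces to a plain average, which is the baseline the learned weights are meant to improve on.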

Section 06

Application Scenarios and Commercial Value

  • Brand-Influencer Matching: Given a target audience and theme, automatically recommend matching influencers;
  • Automated Annotation: Tag influencers for marketing platforms, reducing labor costs;
  • Precision Placement: Select influencers in specific vertical niches to improve conversion rates;
  • Competitor Monitoring: Track the types of influencers that competitors work with, providing strategic intelligence.

Section 07

Technical Limitations and Future Directions

Technical Limitations

  • Uses only text and static images; video, audio, and other modalities are not integrated;
  • Does not use engagement data such as likes and comments;
  • Model decisions are not transparent enough for non-technical users.

Future Directions

  • Introduce advanced multimodal models such as CLIP/ViLT;
  • Build a real-time influencer recommendation system;
  • Develop an interpretable AI module;
  • Extend coverage to more languages and platforms (TikTok, YouTube).