Zing Forum

Multimodal Product Classification System: Machine Learning Practice Integrating Image and Text Embeddings

A multimodal machine learning system built on pre-trained deep learning models, using ResNet50 and ConvNextV2 to extract image features, combined with MiniLM text embeddings, to achieve accurate multi-category classification of products.

Tags: multimodal learning · product classification · ResNet50 · ConvNextV2 · MiniLM · transfer learning · embedding extraction · ML engineering
Published 2026-04-16 01:44 · Last activity 2026-04-16 01:48 · Estimated read: 8 min

Section 01

[Introduction] A Multimodal Product Classification System in Practice: Improving Accuracy by Integrating Images and Text

This project addresses product classification needs in e-commerce retail. To overcome the limitations of traditional single-modal classification (images or text alone), we build a multimodal machine learning system that integrates image and text embeddings: ResNet50 and ConvNextV2 extract image features, which are combined with MiniLM text embeddings. The goal is ≥85% accuracy and an ≥80% F1 score for the multimodal model, providing more accurate classification support for scenarios such as inventory management and recommendation systems.


Section 02

Project Background and Objectives

Project Background

In e-commerce retail, product classification is the foundation of inventory management, recommendation systems, and SEO. Traditional classification relies on single-modal information, whereas human decisions usually combine appearance with text descriptions, so single-modal methods hit an accuracy ceiling.

Project Tasks and Objectives

Task: Classify BestBuy platform products into predefined categories. Inputs are 224×224 product images plus text descriptions; the output is a category label. Performance goals:

  • Multimodal model: ≥85% accuracy, ≥80% F1 score
  • Pure text model: ≥85% accuracy
  • Pure image model: ≥75% accuracy

Section 03

Technical Architecture: Multimodal Embedding and Classifier Design

Image Embedding Extraction

Two pre-trained visual models are used:

  1. ResNet50: A classic CNN pre-trained on ImageNet, offering strong general-purpose visual feature extraction;
  2. ConvNextV2: A more recent convolutional network, available through the Hugging Face ecosystem, that modernizes the ConvNet design with ideas borrowed from Vision Transformers and performs strongly on visual tasks.
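As a concrete illustration of this step, the sketch below extracts pooled image embeddings with a pre-trained ResNet50 from `tf.keras.applications`; the random batch exists only to show the shapes, and the 224×224 input size matches the dataset described above.

```python
# Sketch: pooled image embeddings from a pre-trained ResNet50.
import numpy as np
from tensorflow.keras.applications.resnet50 import ResNet50, preprocess_input

# include_top=False + pooling="avg" yields a 2048-dim vector per image.
backbone = ResNet50(weights="imagenet", include_top=False, pooling="avg")

def embed_images(batch_uint8: np.ndarray) -> np.ndarray:
    """batch_uint8: (N, 224, 224, 3) RGB images with values in [0, 255]."""
    x = preprocess_input(batch_uint8.astype("float32"))
    return backbone.predict(x, verbose=0)

# Random data just to demonstrate shapes:
fake_batch = np.random.randint(0, 256, size=(2, 224, 224, 3), dtype=np.uint8)
emb = embed_images(fake_batch)
print(emb.shape)  # (2, 2048)
```

ConvNextV2 embeddings would follow the same pattern via the Hugging Face `transformers` image models, swapping the backbone while keeping the downstream pipeline unchanged.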

Text Embedding Extraction

Use MiniLM (a compact Transformer distilled from larger models, balancing performance and efficiency) from the Hugging Face Transformers library, with extension points reserved for BERT/OpenAI embeddings.
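A minimal sketch of the text-embedding step with Hugging Face Transformers follows. The `sentence-transformers/all-MiniLM-L6-v2` checkpoint is an assumption (the article does not name one), and mean pooling over non-padding tokens is one common way to turn token states into a sentence vector.

```python
# Sketch: MiniLM sentence embeddings via Hugging Face Transformers.
# The checkpoint name is an assumed choice of MiniLM variant.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)

def embed_texts(texts: list[str]) -> torch.Tensor:
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc).last_hidden_state       # (N, T, 384)
    mask = enc["attention_mask"].unsqueeze(-1).float()
    # Mean-pool over non-padding tokens -> one vector per text.
    return (out * mask).sum(dim=1) / mask.sum(dim=1)

emb = embed_texts(["Wireless noise-cancelling headphones",
                   "55-inch 4K OLED TV"])
print(emb.shape)  # torch.Size([2, 384])
```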

Classifier Design

  • Traditional ML: Random Forest, Logistic Regression, SVM;
  • Deep Learning: Multilayer Perceptron (MLP), using an early fusion strategy to concatenate image and text embeddings as input.
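The early-fusion strategy above can be sketched as follows; the embedding dimensions (2048 for ResNet50, 384 for MiniLM), the synthetic data, the five-category label set, and the hyperparameters are all illustrative assumptions.

```python
# Sketch: early fusion -- concatenate image and text embeddings along the
# feature axis, then fit classifiers on the joint vector.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n = 200
img_emb = rng.normal(size=(n, 2048))   # stand-in for ResNet50 features
txt_emb = rng.normal(size=(n, 384))    # stand-in for MiniLM features
y = rng.integers(0, 5, size=n)         # 5 hypothetical product categories

X = np.concatenate([img_emb, txt_emb], axis=1)  # early fusion: (n, 2432)

clf = LogisticRegression(max_iter=1000).fit(X, y)          # traditional ML
mlp = MLPClassifier(hidden_layer_sizes=(256,), max_iter=50).fit(X, y)  # MLP
print(X.shape)  # (200, 2432)
```

Because fusion happens at the feature level, any of the classifiers listed above (Random Forest, SVM, MLP) can consume the same concatenated matrix unchanged.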

Section 04

Development Environment and Toolchain

Development Environment

Based on Python 3.9+, with the following core toolchain:

  • Deep learning: TensorFlow (image tasks), Hugging Face Transformers (text/visual Transformers);
  • Traditional ML: Scikit-learn (algorithms/preprocessing);
  • Data operations: Pandas, NumPy;
  • Visualization: Matplotlib, Seaborn;
  • Development process: Jupyter Notebook (experiments), Pytest (testing), Black (code style), Docker (containerized deployment).

Dependency Configuration

Three versions of dependency files are provided:

  • requirements.txt: CPU environment;
  • requirements_mac.txt: Apple Silicon GPU optimization;
  • requirements_gpu.txt: NVIDIA GPU CUDA acceleration.

Section 05

Data Preparation and Project Structure

Data Preparation

Core dataset: processed_products_with_images.csv plus 224×224 product images. Processing flow: place the CSV in the data/ directory, download the image archive from Google Drive, and extract it to data/images/ to ensure reproducibility.
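The placement check implied by this flow can be sketched as below. The `image_path` column name is an assumption about the CSV schema, and the demo runs against a throwaway directory rather than the real data/ layout.

```python
# Sketch: verify every image referenced by the CSV exists under data/images/.
import tempfile
from pathlib import Path
import pandas as pd

def missing_images(data_dir: Path,
                   csv_name: str = "processed_products_with_images.csv"):
    """Return the rows whose referenced image file was not extracted."""
    df = pd.read_csv(data_dir / csv_name)
    img_dir = data_dir / "images"
    return df[~df["image_path"].map(lambda p: (img_dir / p).exists())]

# Demo against a throwaway data/ layout:
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp)
    (data / "images").mkdir()
    (data / "images" / "a.jpg").touch()
    pd.DataFrame({"image_path": ["a.jpg", "b.jpg"]}).to_csv(
        data / "processed_products_with_images.csv", index=False)
    missing = missing_images(data)
    print(missing["image_path"].tolist())  # ['b.jpg']
```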

Project Structure

  • src/: Core modules (vision_embeddings_tf.py, nlp_models.py, classifiers_classic_ml.py, classifiers_mlp.py, utils.py);
  • tests/: Unit tests;
  • results/: Model evaluation outputs;
  • Embeddings/: Store embedding vectors (added to .gitignore to avoid repository bloat).

Section 06

Model Evaluation and Practical Value

Model Evaluation Metrics

Classification accuracy, F1 score, and confusion matrix are used as core metrics. By comparing the performance of multimodal and single-modal models, the fusion gain is quantified.
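These three metrics map directly onto scikit-learn calls; here is a minimal sketch on toy labels (the category names are illustrative). Macro-averaged F1 is one reasonable choice for multi-category products; the article does not specify the averaging mode.

```python
# Sketch: the three core evaluation metrics via scikit-learn.
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

y_true = ["tv", "tv", "laptop", "phone", "phone", "laptop"]
y_pred = ["tv", "laptop", "laptop", "phone", "tv", "laptop"]

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred, average="macro")  # macro-F1 across categories
cm = confusion_matrix(y_true, y_pred, labels=["tv", "laptop", "phone"])

print(f"accuracy={acc:.2f}, macro-F1={f1:.2f}")
print(cm)  # rows = true label, columns = predicted label
```

Running the same snippet on multimodal and single-modal predictions makes the fusion gain a direct subtraction of the two accuracy/F1 figures.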

Practical Value and Learning Points

The project covers key topics in modern ML engineering:

  • Transfer learning (application of pre-trained models);
  • Multimodal learning (heterogeneous data fusion);
  • Embedding technology (converting unstructured data to numerical representations);
  • Feature engineering (embedding preprocessing and fusion);
  • Model evaluation (comprehensive metric analysis).

It is a complete, well-documented practical case, well suited to developers who want a deep understanding of these concepts.

Section 07

Summary: Practical Significance of Multimodal Learning

Multimodal learning is an important direction in AI development. Through the product classification scenario, this project demonstrates the effective integration of visual and language information. From pre-trained model selection, classifier design, to data pipeline construction and evaluation system establishment, it forms a complete closed loop from research to engineering. For developers learning ML or building similar systems, it is a reference implementation worth in-depth study.