Automatic Classification of Skin Lesions: A Multi-Model Deep Learning Ensemble Approach

This article introduces a deep learning-based dermoscopy image classification system. It combines a weighted ensemble of three backbone networks (ResNet50, DenseNet121, and EfficientNet-B3) with Test-Time Augmentation (TTA) and clinical threshold calibration. The system is evaluated on the ISIC 2018 Challenge dataset and is tuned in particular for sensitivity to malignant lesions.

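Tags: Deep Learning, Medical Imaging, Skin Lesion Classification, Convolutional Neural Networks, Model Ensemble, Test-Time Augmentation, ISIC, Dermoscopy, PyTorch, Computer-Aided Diagnosis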
Published 2026-06-01 16:14 | Recent activity 2026-06-01 16:22 | Estimated read 9 min

Section 01

Introduction

The system classifies dermoscopy images with a weighted ensemble of three backbone networks (ResNet50, DenseNet121, and EfficientNet-B3), combined with Test-Time Augmentation (TTA) and clinical threshold calibration. On the ISIC 2018 Challenge data, the TTA ensemble reaches a balanced accuracy (BACC) of 0.846 ± 0.009 across three random seeds, and the clinical thresholds bring melanoma sensitivity up to 0.877.

Section 02

Original Author and Source

  • Original Author/Maintainer: daorre1202 (Daniel Ortiz Requena)
  • Source Platform: GitHub
  • Original Project Title: skin-lesion-classifier
  • Original Link: https://github.com/daorre1202/skin-lesion-classifier
  • Open Source License: The project uses an open-source license (see the repository's LICENSE file for details)
  • Release Date: June 1, 2026

Section 03

Project Background and Clinical Significance

Dermoscopy is a non-invasive skin imaging technique that magnifies lesion areas so that doctors can observe the fine structures of the epidermis and superficial dermis. However, interpreting dermoscopy images requires extensive clinical experience, and diagnoses of the same image can differ from one doctor to another.

Task 3 of the 2018 ISIC (International Skin Imaging Collaboration) Challenge is a seven-class diagnosis of dermoscopy images. The classes are Melanoma (MEL), Melanocytic Nevus (NV), Basal Cell Carcinoma (BCC), Actinic Keratosis / Intraepithelial Carcinoma (AKIEC), Benign Keratosis-like Lesions (BKL), Dermatofibroma (DF), and Vascular Lesion (VASC). The article groups MEL, BCC, and AKIEC as the malignant categories, and identifying these early is particularly important.

Section 04

Multi-Backbone Network Ensemble Strategy

The project uses three ImageNet-pre-trained convolutional neural networks as backbones:

  1. ResNet50: A residual network whose skip connections mitigate the vanishing-gradient problem in deep networks
  2. DenseNet121: A densely connected network that reuses features across layers and has relatively few parameters
  3. EfficientNet-B3: A compound-scaling network that balances accuracy against computational cost

The three networks are trained independently. Their predictions are then fused by a weighted average, with each model's weight derived from its balanced accuracy (BACC) on the validation set, to form the final ensemble prediction. This strategy exploits the complementarity of the different architectures and makes the result less dependent on the errors of any single model.
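The BACC-weighted fusion described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the article does not say how the BACC scores are turned into weights, so simple normalization is assumed here, and the function names, the validation scores, and the three-class toy probabilities are all hypothetical.

```python
def bacc_weights(val_bacc):
    """Normalize per-model validation BACC scores so the weights sum to 1.

    Assumption: weights are proportional to BACC; the project may use
    a different mapping.
    """
    total = sum(val_bacc.values())
    return {name: score / total for name, score in val_bacc.items()}


def weighted_ensemble(probs_by_model, weights):
    """Weighted average of per-model class probabilities for one image."""
    n_classes = len(next(iter(probs_by_model.values())))
    fused = [0.0] * n_classes
    for name, probs in probs_by_model.items():
        for k, p in enumerate(probs):
            fused[k] += weights[name] * p
    return fused


# Hypothetical validation BACC values, for illustration only.
val_bacc = {"resnet50": 0.80, "densenet121": 0.82, "efficientnet_b3": 0.84}
weights = bacc_weights(val_bacc)

# Toy softmax outputs over three classes (the real task has seven).
probs = {
    "resnet50": [0.6, 0.3, 0.1],
    "densenet121": [0.2, 0.7, 0.1],
    "efficientnet_b3": [0.3, 0.6, 0.1],
}
fused = weighted_ensemble(probs, weights)
predicted = max(range(len(fused)), key=fused.__getitem__)
print(weights, fused, predicted)
```

Because the weights sum to 1 and each input is a probability vector, the fused vector is again a probability vector, so it can be passed on to thresholding or argmax unchanged.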

Section 05

Test-Time Augmentation (TTA) Technique

A deep learning model normally performs a single forward pass on each input image at inference time. This project adds TTA: during inference, 10 different geometric transformations (such as rotation, flipping, and scaling) are applied to each image, and the predictions for all transformed copies are averaged. Averaging over views makes the prediction less sensitive to the orientation and framing of a particular image, which improves robustness. TTA does not change training, so it does not address overfitting itself.
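A minimal sketch of the averaging step, in plain Python on a nested-list "image", may make the idea concrete. It is not the project's implementation: it uses only five flip and rotation views rather than the 10 transformations mentioned above, and `toy_model` is a hypothetical stand-in for a trained network that returns class probabilities.

```python
def identity(img):
    return img


def hflip(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def vflip(img):
    """Mirror the image top to bottom."""
    return img[::-1]


def rot90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def rot180(img):
    return rot90(rot90(img))


# Illustrative subset of views; the article reports 10 transformations.
TTA_TRANSFORMS = [identity, hflip, vflip, rot90, rot180]


def predict_with_tta(model, img, transforms=TTA_TRANSFORMS):
    """Run the model on every transformed view and average the probabilities."""
    all_probs = [model(t(img)) for t in transforms]
    n_views = len(all_probs)
    return [sum(per_class) / n_views for per_class in zip(*all_probs)]


def toy_model(img):
    """Hypothetical orientation-sensitive classifier with two classes."""
    p = sum(img[0]) / sum(sum(row) for row in img)
    return [p, 1.0 - p]


image = [[1, 2], [3, 4]]
single_pass = toy_model(image)
tta_probs = predict_with_tta(toy_model, image)
print(single_pass, tta_probs)
```

The toy model's output changes with orientation, so the single-pass and TTA predictions differ; the averaged output is the same whichever of these views the input arrives in.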

Section 06

Clinical Threshold Calibration Mechanism

This is arguably the project's most clinically relevant feature. Classifiers usually apply a default decision rule (a 0.5 threshold, or argmax over the class probabilities), but this is often a poor fit for medical diagnosis. For malignant lesions such as melanoma (MEL), a missed diagnosis (false negative) costs far more than a false alarm (false positive).

The project authors designed a class-specific threshold calibration strategy:

  • Melanoma (MEL): Require sensitivity ≥0.85 and specificity ≥0.85
  • Actinic Keratosis (AKIEC): Require sensitivity ≥0.75 and specificity ≥0.70

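This two-level fallback mechanism (strict threshold → relaxed threshold → standard argmax) is intended to keep the detection rate of malignant lesions high while preserving overall classification performance: if no threshold satisfies the strict targets, relaxed targets are tried, and if those also fail, the prediction falls back to standard argmax. According to the reported experiments, this strategy increases the balanced accuracy of the malignant categories by 0.02 to 0.05, with negligible impact on the global BACC.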
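One way to read the fallback chain is sketched below in plain Python. Several details are assumptions, because the article does not spell them out: thresholds are searched on validation scores for one class at a time, the highest threshold that meets both targets is chosen, and the relaxed targets (0.75, 0.75) are placeholders. Only the strict MEL targets (0.85, 0.85) come from the text. All function names and the toy data are hypothetical.

```python
def sens_spec(scores, labels, threshold):
    """Sensitivity and specificity of the rule `score >= threshold` (labels are 0/1)."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens, spec


def find_threshold(scores, labels, min_sens, min_spec):
    """Highest candidate threshold meeting both targets, or None if none does."""
    for t in sorted(set(scores), reverse=True):
        sens, spec = sens_spec(scores, labels, t)
        if sens >= min_sens and spec >= min_spec:
            return t
    return None


def calibrate(scores, labels, strict, relaxed):
    """Try the strict targets, then the relaxed ones; None means 'use argmax'."""
    for min_sens, min_spec in (strict, relaxed):
        t = find_threshold(scores, labels, min_sens, min_spec)
        if t is not None:
            return t
    return None


def predict(probs, class_index, threshold):
    """Predict the calibrated class if its score clears the threshold, else argmax."""
    if threshold is not None and probs[class_index] >= threshold:
        return class_index
    return max(range(len(probs)), key=probs.__getitem__)


# Toy validation scores for one class (1 = lesion belongs to that class).
val_scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2, 0.1, 0.05]
val_labels = [1, 1, 1, 1, 0, 0, 0, 0]

mel_threshold = calibrate(val_scores, val_labels,
                          strict=(0.85, 0.85),    # MEL targets from the article
                          relaxed=(0.75, 0.75))   # placeholder relaxed targets
print(mel_threshold)
```

On this toy data no threshold reaches the strict targets, so the relaxed level supplies the threshold. If both levels failed, `calibrate` would return `None` and `predict` would behave exactly like argmax.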

Section 07

Dataset and Experimental Setup

The project uses the HAM10000 dataset, which contains 10015 dermoscopy images covering 7 diagnostic categories. The data is divided with a stratified split into 60% training, 20% validation, and 20% test sets, so that each class keeps the same proportion in all three subsets.
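A class-proportional (stratified) 60/20/20 split of this kind can be sketched with the standard library alone. The function name and the toy label list are hypothetical, and the project may well rely on a library routine instead. Seed 42 is used here only because it is one of the seeds the article mentions.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=42):
    """Split sample indices into train/val/test, preserving class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Toy, imbalanced label list (not the real HAM10000 counts).
labels = ["NV"] * 50 + ["DF"] * 10
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting each class separately is what keeps a rare class from ending up almost entirely in one subset, which a purely random split can do by chance.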

Notably, the dataset is severely class-imbalanced: Melanocytic Nevus (NV) has over 4000 samples, while Dermatofibroma (DF) has only 96. The project addresses this with a tiered augmentation strategy that applies stronger data augmentation to minority classes.

In addition, the project uses Focal Loss as the loss function. Focal Loss down-weights samples the model already classifies confidently, so hard-to-classify samples contribute more to the gradient, which further improves recognition of rare categories.
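For reference, the focal loss for a sample whose true class has predicted probability p_t is FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t). The sketch below computes it for a single sample in plain Python. The defaults gamma = 2 and alpha = 1 are assumptions, since the article does not report the values the project uses, and the function name is hypothetical.

```python
import math


def focal_loss(probs, target, gamma=2.0, alpha=1.0, eps=1e-12):
    """Focal loss for one sample.

    probs  -- predicted class probabilities (softmax output)
    target -- index of the true class
    gamma  -- focusing parameter; gamma = 0 reduces to weighted cross-entropy
    alpha  -- scalar weight (assumed constant here; it can also be per class)
    """
    p_t = min(max(probs[target], eps), 1.0)  # clamp to avoid log(0)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss([0.9, 0.1], target=0)   # confidently correct sample
hard = focal_loss([0.3, 0.7], target=0)   # misclassified sample
cross_entropy_easy = focal_loss([0.9, 0.1], target=0, gamma=0.0)
print(easy, hard, cross_entropy_easy)
```

With gamma = 2 the confidently correct sample keeps only (1 - 0.9)^2, about 1%, of its cross-entropy loss, while the misclassified sample keeps (1 - 0.3)^2, about 49%, so training focuses on the hard cases.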

Section 08

Experimental Results and Performance Analysis

Experiments with three independent random seeds (42, 7, 123) show that the results are stable across runs:

Metric                                           Value
TTA Ensemble BACC (Mean ± Std)                   0.846 ± 0.009
Best Single TTA BACC                             0.8607
BACC for Malignant Categories (MEL+BCC+AKIEC)    Up to 0.839
Melanoma Sensitivity (Clinical Threshold)        Up to 0.877

A comparison with other ISIC 2018 Challenge entries shows that, although the winning solution (MetaOptima) used external datasets and a larger model ensemble, this project comes close to the top results while using only the HAM10000 dataset. The standard deviation of 0.009 across the three seeds also suggests that the results do not depend strongly on random initialization.
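Since every headline number above is a balanced accuracy, a short sketch of the metric may help. BACC is the unweighted mean of per-class recall, so each class counts equally however few samples it has. The function name and the toy labels below are illustrative only and are not results from the project.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy example: 8 common-class samples, 2 rare-class samples.
y_true = ["NV"] * 8 + ["MEL"] * 2
y_pred = ["NV"] * 8 + ["NV", "MEL"]   # one of the two MEL cases is missed

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bacc = balanced_accuracy(y_true, y_pred)
print(plain_accuracy, bacc)
```

In the toy example plain accuracy is 0.9, but BACC is 0.75 because half of the rare class is missed. This is why BACC is the more informative metric on an imbalanced dataset such as HAM10000.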