Zing Forum

Plant Disease Image Classification Based on Spark: Application of Distributed Machine Learning in Agricultural Detection

This project demonstrates how to use the Apache Spark distributed computing framework to process a large-scale plant image dataset of 19.47GB, implement binary classification (healthy/diseased) and multi-class classification (plant species) image recognition tasks, and provide a scalable technical solution for agricultural disease detection.

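Tags: Apache Spark, image classification, plant disease detection, distributed machine learning, deep learning, agricultural AI, class imbalance, data preprocessing, precision agriculture, computer vision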
Published 2026-06-02 10:15 | Recent activity 2026-06-02 10:22 | Estimated read 5 min

Section 01

[Introduction] Core Overview of the Spark-Based Plant Disease Image Classification Project

The project 'Plant Disease Image Classification Based on Spark: Application of Distributed Machine Learning in Agricultural Detection' was published by dessiejohnson on GitHub on June 2, 2026 (project link: https://github.com/dessiejohnson/Spark-232-Diseased-Plants). Its core goal is to use the Apache Spark distributed computing framework to process a large-scale plant image dataset of 19.47 GB (52,134 images in 62 categories across 17 plant species), implement binary classification (healthy/diseased) and multi-class classification (plant species), and provide a scalable technical solution for agricultural disease detection.

Section 02

Project Background and Motivation

Intensifying global climate change is accelerating the spread of crop diseases, while traditional manual inspection is inefficient and struggles to cover large-scale farmland. The success of machine learning in image recognition points to a way forward for intelligent agriculture, but the 19.47 GB dataset used here is a challenge in itself: single-machine processing is slow and hard to scale, so the project uses the Apache Spark distributed framework.

Section 03

Dataset Characteristics and Challenges

The dataset contains 52,134 images covering 62 categories (healthy and various diseases) across 17 plant species. Image sizes vary widely (up to 4740×6000 pixels), so the images need to be standardized. Class imbalance is severe: the Tomato Yellow Leaf Curl Virus category dominates, while categories such as pepper are underrepresented, which may affect model fairness.
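
One simple way to put a number on the imbalance described above is the ratio between the largest and smallest class. The sketch below is only an illustration: the helper name imbalance_report and the toy label list are hypothetical, and the counts are not the real dataset's figures.

```python
from collections import Counter


def imbalance_report(labels):
    """Return (counts, ratio), where ratio = largest class size / smallest class size."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    return counts, max(counts.values()) / min(counts.values())


# Toy labels for illustration only; these are NOT the real dataset counts.
toy_labels = (
    ["Tomato___Yellow_Leaf_Curl_Virus"] * 50
    + ["Tomato___healthy"] * 15
    + ["Pepper___healthy"] * 5
)

counts, ratio = imbalance_report(toy_labels)
print(counts.most_common())           # classes sorted from most to least frequent
print(f"imbalance ratio: {ratio:.1f}")
```

A ratio far above 1 suggests that plain accuracy will mostly reflect the majority class, which is one reason to resample before training.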

Section 04

Technical Architecture and Preprocessing Strategy

Technical architecture: Spark was chosen because it supports distributed storage and parallel computing, provides fault tolerance, reduces execution time, and integrates MLlib for building end-to-end ML workflows. Data is loaded with the binaryFile data source of the Spark DataFrame API, which reads files recursively; metadata such as category labels and plant species is extracted from the file paths.

Preprocessing:
1. Label construction: for binary classification (healthy/diseased), the categories are merged into two labels by regular-expression keyword matching; multi-class classification (plant species) uses the plant column.
2. Stratified random sampling to mitigate class imbalance.
3. Image normalization: every image is resized to 224×224 pixels.
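
The loading, label construction, and sampling steps above could look roughly like the following. This is a minimal sketch and not the project's actual code: the folder layout `<root>/<Plant>___<Condition>/<file>` is an assumption, and the names parse_path, binary_label, sampling_fractions, and load_labeled_images are hypothetical. The plain-Python helpers show the logic on its own; load_labeled_images mirrors it with PySpark and is only defined here, not called, since it needs a Spark session.

```python
import re

# Assumed (hypothetical) layout: <root>/<Plant>___<Condition>/<file>
HEALTHY_RE = re.compile(r"healthy", re.IGNORECASE)


def parse_path(path):
    """Split the parent folder name into (plant, category)."""
    category = path.rstrip("/").split("/")[-2]
    plant = category.split("___")[0]
    return plant, category


def binary_label(category):
    """0 = healthy, 1 = diseased, decided by a keyword match on the category name."""
    return 0 if HEALTHY_RE.search(category) else 1


def sampling_fractions(counts, target):
    """Per-class keep fraction that caps each class at roughly `target` rows."""
    return {label: min(1.0, target / n) for label, n in counts.items()}


def load_labeled_images(spark, root):
    """PySpark version of the same steps; requires pyspark, not invoked below."""
    from pyspark.sql import functions as F

    df = (
        spark.read.format("binaryFile")
        .option("recursiveFileLookup", "true")  # descend into subfolders
        .load(root)
    )
    # Parent folder name of each file becomes the category.
    category = F.regexp_extract(F.col("path"), r"([^/]+)/[^/]+$", 1)
    return (
        df.withColumn("category", category)
        .withColumn("plant", F.split(F.col("category"), "___").getItem(0))
        .withColumn(
            "label",
            F.when(F.col("category").rlike("(?i)healthy"), 0).otherwise(1),
        )
    )


plant, category = parse_path("data/Tomato___Yellow_Leaf_Curl_Virus/0001.jpg")
print(plant, category, binary_label(category))
fractions = sampling_fractions({"a": 1000, "b": 100}, target=200)
print(fractions)
```

With a labeled DataFrame in hand, the fractions can be passed to `df.stat.sampleBy("category", fractions, seed=42)`, which samples each stratum without replacement and so downsamples the dominant classes. The target of 200 rows per class and the seed are placeholder values.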

Section 05

Model Design and Training

The project designs two models:
1. Binary classification model: distinguishes healthy from diseased plants, covering the basic agricultural need.
2. Multi-class classification model: identifies plant species to support precision agricultural management.
Both models share the same preprocessing pipeline and use training/validation/test splits so that evaluation is reliable.
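
The source does not state the split ratios or how the split is made, so the following is only a sketch under assumed values (70% / 15% / 15%). It assigns each image to a split by hashing a stable key such as its file path, so the same image always lands in the same split across runs. The function name assign_split and the synthetic paths are hypothetical.

```python
import hashlib


def assign_split(key, train=0.7, val=0.15):
    """Map a stable key (e.g. a file path) to 'train', 'val', or 'test'."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    # First 32 bits of the hash, scaled to a value in [0, 1).
    u = int(digest[:8], 16) / 0x100000000
    if u < train:
        return "train"
    if u < train + val:
        return "val"
    return "test"


# Synthetic paths for illustration only.
paths = [f"img_{i}.jpg" for i in range(10000)]
splits = [assign_split(p) for p in paths]
print({name: splits.count(name) for name in ("train", "val", "test")})
```

In Spark itself, `df.randomSplit([0.7, 0.15, 0.15], seed=42)` gives a comparable three-way split in one call; the hash-based variant is mainly useful when the assignment has to stay fixed as the data is reloaded or repartitioned.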

Section 06

Project Significance and Application Prospects

This project demonstrates that deep learning combined with distributed computing is effective for agricultural image analysis, and it shows in practice how to process large-scale image data and handle class imbalance. Possible applications include deployment to edge devices or the cloud for automated farmland disease monitoring, providing technical support for the food security challenges posed by climate change.