Zing Forum

Machine Learning Prediction of Flaky Tests in CI/CD: An Intelligent Solution to Improve Software Testing Stability

This article introduces an open-source project that uses machine learning to predict and detect flaky tests in CI/CD pipelines, helping development teams identify unstable test cases, reduce false failure reports, and improve the reliability of continuous integration.

Tags: flaky tests, CI/CD, machine learning, software testing, continuous integration, test stability, automated testing, MLOps
Published 2026-06-15 00:15 | Recent activity 2026-06-15 00:24 | Estimated read 12 min

Section 01

Introduction: Machine Learning Prediction of Flaky Tests in CI/CD

The open-source project introduced in this article is maintained by Gogeta767, hosted on GitHub with the project name flaky-test-prediction-ml, and the link is https://github.com/Gogeta767/flaky-test-prediction-ml. It was released on June 14, 2026.

In continuous integration/continuous deployment (CI/CD) practice, testing is a key step in ensuring code quality. Flaky tests undermine that step: they produce inconsistent results against the same code and environment, so a failure no longer reliably signals a real defect. This project demonstrates how machine learning can predict and detect flaky tests, helping teams identify unstable test cases, reduce false failure reports, and improve the reliability of continuous integration.


Section 02

Background: Definition, Harms, and Causes of Flaky Tests

What are flaky tests?

Flaky tests are test cases that produce inconsistent results against the same code and environment: they sometimes pass and sometimes fail, with no code change to explain the difference. They are particularly common in large-scale projects.

Harms of flaky tests

  1. Trust crisis: Frequent false reports make developers lose confidence in the test suite, leading to ignoring real bugs;
  2. Low efficiency: Developers spend a lot of time investigating failures not caused by code changes;
  3. Deployment blocking: In strict CI/CD processes, test failures prevent code merging or deployment;
  4. Resource waste: Re-running tests consumes computing resources and increases CI/CD costs.

Common causes

  • Asynchronous waiting issues: Race conditions caused by not properly waiting for asynchronous operations to complete;
  • External dependencies: Dependencies on unstable resources like networks or databases;
  • Time-sensitive logic: Dependencies on current time or timeout settings;
  • Concurrency issues: Resource competition in multi-threaded/process environments;
  • Environment differences: Behavioral differences between local and CI environments;
  • Random data: Randomly generated inputs that only occasionally hit a failing boundary case;
  • Order dependencies: Interdependencies between tests leading to failures when run in batches.

Section 03

Method: Technical Path of Machine Learning for Identifying Flaky Tests

Core idea

Flaky tests have identifiable characteristic patterns. By analyzing historical test run data, code features, dependency relationships, etc., machine learning models can predict which tests may be flaky.

Technical route

  1. Data collection: Collect historical test run data from CI/CD systems;
  2. Feature engineering: Extract historical run features (failure rate, volatility, retry success rate, etc.), code features (complexity, asynchronous operations, external dependencies, etc.), dependency features (shared resources, execution order, etc.), and environment features (differences in runtime environments, resource usage, etc.);
  3. Model training: Train prediction models using classification algorithms (logistic regression, random forest, XGBoost, etc.);
  4. Real-time prediction: Predict test flakiness when new code is submitted;
  5. Feedback optimization: Continuously optimize the model based on actual results.
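
The historical run features named in step 2 can be sketched as follows. This is a minimal illustration rather than the project's actual code: the `history_features` function, its feature names, and the input format (a chronological list of pass/fail results plus a list of retry results) are assumptions made for the example. "Volatility" is approximated here by the share of consecutive runs whose result changed.

```python
from typing import Dict, List


def history_features(outcomes: List[bool], retry_outcomes: List[bool]) -> Dict[str, float]:
    """Summarize one test's run history (hypothetical feature names).

    outcomes: chronological results of first attempts (True = pass).
    retry_outcomes: results of retries that followed a failure (True = pass).
    """
    runs = len(outcomes)
    failures = outcomes.count(False)
    # A "flip" is a change of result between two consecutive runs.
    flips = sum(1 for prev, cur in zip(outcomes, outcomes[1:]) if prev != cur)
    return {
        "failure_rate": failures / runs if runs else 0.0,
        "flip_rate": flips / (runs - 1) if runs > 1 else 0.0,
        "retry_success_rate": (
            sum(retry_outcomes) / len(retry_outcomes) if retry_outcomes else 0.0
        ),
    }


features = history_features(
    outcomes=[True, False, True, True, False],
    retry_outcomes=[True, True],
)
print(features)
```

A test that fails consistently has a high failure rate but a low flip rate, which is one reason to keep the two features separate: the flip rate is what distinguishes intermittent failures from a plainly broken test.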

Key points for model training

  • Class imbalance: Flaky tests account for a low proportion (5-20%), which needs to be handled through resampling, class weights, and threshold adjustment;
  • Time series: Split training/test sets by time to avoid data leakage, and retrain regularly to address concept drift.
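
Both points can be illustrated with a short sketch, assuming each record is a `(timestamp, label)` pair with label 1 for flaky and 0 for stable. The function names and the toy data are hypothetical. The weighting formula, n_samples / (n_classes * class_count), is the common "balanced" heuristic; the source does not say which weighting scheme the project uses.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Each record is (timestamp, label); label 1 = flaky, 0 = stable.
Record = Tuple[int, int]


def time_split(records: List[Record], train_fraction: float = 0.8) -> Tuple[List[Record], List[Record]]:
    """Train on the oldest records and hold out the newest ones."""
    ordered = sorted(records, key=lambda r: r[0])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def balanced_class_weights(labels: List[int]) -> Dict[int, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n = len(labels)
    return {cls: n / (len(counts) * count) for cls, count in counts.items()}


# Toy data: ten records, flaky at timestamps 5 and 10.
records = [(t, 1 if t % 5 == 0 else 0) for t in range(1, 11)]
train, holdout = time_split(records)
weights = balanced_class_weights([label for _, label in train])
print(weights)
```

Splitting by time rather than at random keeps every held-out record newer than every training record, so the evaluation cannot benefit from information about the future.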

Section 04

Application Scenarios: Implementation Practices in CI/CD

Scenario 1: Test priority sorting

Prioritize running stable tests, postpone or run potentially flaky tests separately to quickly get reliable feedback and isolate unstable factors.
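
A minimal sketch of this ordering, assuming the model outputs a flakiness probability per test. The `plan_test_order` function, the 0.5 threshold, and the test names are illustrative assumptions, not part of the project.

```python
from typing import Dict, List, Tuple


def plan_test_order(scores: Dict[str, float], threshold: float = 0.5) -> Tuple[List[str], List[str]]:
    """Split tests into a stable stage and a separate likely-flaky stage.

    scores maps a test name to a predicted flakiness probability.
    Within each stage, the least flaky tests run first.
    """
    ranked = sorted(scores, key=lambda name: (scores[name], name))
    stable = [name for name in ranked if scores[name] < threshold]
    quarantined = [name for name in ranked if scores[name] >= threshold]
    return stable, quarantined


stable, quarantined = plan_test_order(
    {"test_login": 0.05, "test_upload": 0.72, "test_search": 0.31, "test_sync": 0.90}
)
print(stable)
print(quarantined)
```

The first list can gate the merge while the second runs in a separate, non-blocking job.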

Scenario 2: Flaky test early warning

When new code affects tests predicted to be flaky, warn developers so they can address the problem before it disrupts the pipeline.

Scenario 3: Test suite optimization

After identifying flaky tests, teams can fix, isolate, automatically retry, or monitor them.
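
As one illustration of the automatic retry option, the sketch below reruns a failing test and reports a "flaky" verdict when a later attempt passes. The `run_with_retries` helper and the three-attempt limit are assumptions for the example; real CI systems and test runners provide their own retry mechanisms.

```python
from typing import Callable


def run_with_retries(test: Callable[[], bool], max_attempts: int = 3) -> str:
    """Return "pass", "fail", or "flaky" for a single test callable.

    "flaky" means at least one attempt failed before a later one passed.
    """
    for attempt in range(1, max_attempts + 1):
        if test():
            return "pass" if attempt == 1 else "flaky"
    return "fail"


# Simulated test that fails once and then passes.
outcomes = iter([False, True])
result = run_with_retries(lambda: next(outcomes))
print(result)
```

Recording the "flaky" verdict instead of silently passing keeps the retry from hiding the instability, and the verdict can feed back into the training data.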

Scenario 4: Code review assistance

When changes affect flaky tests, prompt reviewers to pay extra attention and find potential issues before merging.


Section 05

Challenges and Solutions

Challenge 1: Data quality

Problem: CI/CD historical data is often incomplete, and test renames break the link between a test and its history.
Solution: Establish unique test identifiers (e.g., code fingerprints), standardize data cleaning, and handle renaming and refactoring explicitly.
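
One possible form of a code fingerprint is a hash of the normalized test body, so that a pure rename keeps the same identifier. The sketch below assumes Python test functions; `fingerprint_test` and its normalization rules are hypothetical, and the comment stripping is deliberately naive (it would also cut a `#` that appears inside a string literal).

```python
import hashlib
import re


def fingerprint_test(source: str) -> str:
    """Hash a test body while ignoring its name, comments, and whitespace."""
    # Replace the function name so a pure rename keeps the same fingerprint.
    body = re.sub(r"^\s*def\s+\w+", "def _", source, count=1)
    # Drop comments (naive: does not account for '#' inside strings).
    body = re.sub(r"#.*", "", body)
    # Collapse all runs of whitespace into single spaces.
    body = " ".join(body.split())
    return hashlib.sha256(body.encode("utf-8")).hexdigest()[:16]


old = "def test_login():\n    assert login('a') is True  # happy path\n"
renamed = "def test_user_login():\n\n    assert login('a') is True\n"
changed = "def test_login():\n    assert login('b') is True\n"

print(fingerprint_test(old) == fingerprint_test(renamed))
print(fingerprint_test(old) == fingerprint_test(changed))
```

A fingerprint like this survives renames but not edits to the test body, so a real pipeline would still need a rule for linking history across refactorings.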

Challenge 2: Feature calculation cost

Problem: Static code analysis is time-consuming.
Solution: Incremental calculation, caching mechanisms, and asynchronous processing.

Challenge 3: Model interpretability

Problem: Developers need to understand why a test is marked as flaky.
Solution: Use interpretable models (decision trees, linear models), provide SHAP values or feature importance, and add visual indicators.
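
For a linear model, an explanation can be read directly from the model: each feature contributes weight × value to the log-odds. The sketch below shows that simple case only and is not a SHAP computation. The weights, bias, and feature names are invented for illustration.

```python
import math
from typing import Dict, List, Tuple


def explain(
    weights: Dict[str, float], bias: float, features: Dict[str, float]
) -> Tuple[float, List[Tuple[str, float]]]:
    """Return the predicted probability and per-feature log-odds contributions."""
    contributions = {name: weights[name] * features.get(name, 0.0) for name in weights}
    logit = bias + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-logit))
    # Largest absolute contribution first.
    ranked = sorted(contributions.items(), key=lambda item: abs(item[1]), reverse=True)
    return probability, ranked


# Hypothetical trained weights and one test's feature values.
probability, ranked = explain(
    weights={"flip_rate": 4.0, "uses_network": 1.5, "failure_rate": 2.0},
    bias=-3.0,
    features={"flip_rate": 0.5, "uses_network": 1.0, "failure_rate": 0.1},
)
print(round(probability, 3))
for name, contribution in ranked:
    print(name, round(contribution, 2))
```

Showing the ranked contributions next to the prediction gives a developer a concrete reason, such as a high flip rate, instead of a bare score.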

Challenge 4: False positive control

Problem: Stable tests are misjudged as flaky.
Solution: Adjust classification thresholds, use multi-model voting, and mark high-confidence predictions.
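
Threshold adjustment can be sketched as choosing the lowest threshold that keeps the false positive rate on validation data within a budget. The `threshold_for_max_fpr` function, the scores, and the labels below are assumptions for illustration; the source does not specify how the project selects its threshold.

```python
from typing import List


def threshold_for_max_fpr(scores: List[float], labels: List[int], max_fpr: float) -> float:
    """Pick the lowest threshold whose false positive rate stays within max_fpr.

    A test is flagged as flaky when score >= threshold.
    labels: 1 = flaky, 0 = stable.
    """
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not negatives:
        raise ValueError("need at least one stable test to estimate the false positive rate")
    # Raising the threshold can only lower the false positive rate,
    # so the first candidate that fits the budget is the lowest one.
    for candidate in sorted(set(scores)):
        false_positives = sum(1 for s in negatives if s >= candidate)
        if false_positives / len(negatives) <= max_fpr:
            return candidate
    return float("inf")  # no candidate fits the budget: flag nothing


scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1]
threshold = threshold_for_max_fpr(scores, labels, max_fpr=0.25)
print(threshold)
```

Choosing the lowest admissible threshold keeps as many true flaky tests flagged as the false positive budget allows.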


Section 06

Industry Practices: Case References from Tech Companies

Google

Google uses machine learning to identify flaky tests. Google has reported that about 1.5% of all test runs produce a flaky result, which is a small share but enough to generate a lot of noise at its scale. The model helps prioritize handling the most problematic tests.

Meta

Meta developed an intelligent test selection system that combines historical data and code changes to predict which tests need to be run, reducing CI time and improving reliability.

Microsoft

Microsoft Research has published multiple papers on flaky test detection, exploring methods from heuristics to deep learning. Hybrid models (code features + historical data) yield the best results.


Section 07

Project Value and Future Directions

Technical value of the project

  1. Reproducible benchmark: Provides a unified evaluation benchmark for flaky test prediction research;
  2. Best practice reference: Demonstrates methods to handle practical issues like class imbalance and time series;
  3. Integration examples: Shows integration with CI/CD systems like Jenkins and GitHub Actions;
  4. Extension foundation: Developers can add new features or try new models.

Future development directions

  1. Deep learning application: Use AST (Abstract Syntax Tree) and graph neural networks to learn code structure features;
  2. Causal inference: Identify the root causes of flakiness to help targeted fixes;
  3. Active learning: Select the most valuable tests to run for labeling to efficiently improve the model;
  4. Cross-project transfer: Develop pre-trained models to reduce data requirements for new projects;
  5. Real-time adaptation: Online learning mechanisms to adapt to codebase changes in real time.

Section 08

Conclusion: Data-Driven Improvement of Software Testing Quality

Flaky tests are a long-standing problem in software engineering, directly affecting development efficiency and team confidence. This machine learning prediction project demonstrates the potential of data-driven methods in solving software quality problems.

By intelligently identifying flaky tests, teams can focus resources on tests that need attention, gradually fix and optimize the test suite, and improve CI/CD reliability and developer experience.

For teams hoping to improve the quality of their test infrastructure, this project provides a good starting point. Whether it is used directly, studied as a reference, or extended, it can bring practical value to software quality assurance.