Zing Forum


GenAI-GreenML: A Curated Dataset for Generative AI and Green Machine Learning

A curated dataset containing 50 small open-source machine learning repositories, specifically designed for research on generative AI-assisted code generation and energy-efficient machine learning development.

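Generative AI, Green Machine Learning, Dataset, Code Generation, Energy Efficiency Optimization, Sustainable Software Engineering, LLM, Carbon Footprint, Benchmarking
Published 2026-06-09 18:13 | Recent activity 2026-06-09 18:27 | Estimated read 6 min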
1

Section 01

Introduction / Main Floor: GenAI-GreenML: A Curated Dataset for Generative AI and Green Machine Learning

A curated dataset containing 50 small open-source machine learning repositories, specifically designed for research on generative AI-assisted code generation and energy-efficient machine learning development.

2

Section 02

Original Author and Source

3

Section 03

Research Background and Problem Definition

Generative AI is reshaping all aspects of software development, from code completion to automated testing, from document generation to architecture design. However, behind this convenience lies an increasingly serious issue: the environmental cost of AI-assisted programming.

Large Language Models (LLMs) consume substantial energy during both training and inference, which translates into significant carbon emissions. This raises two questions: Is LLM-generated code more energy-efficient than manually written code? Do generated machine learning models take energy efficiency into account? Little systematic empirical data currently exists to answer either question.

The GenAI-GreenML dataset was created to fill this research gap. It provides a curated benchmark for evaluating the environmental impact and energy efficiency of generative AI in code generation tasks.

4

Section 04

Dataset Overview

GenAI-GreenML is a curated collection of 50 small open-source machine learning repositories, each under 500 MB in size, covering two domains: tabular data and Natural Language Processing (NLP).

5

Section 05

Design Principles

  1. Small-Scale Priority: Select repositories smaller than 500 MB to reduce the computational resources needed for experiments, so that more researchers can reproduce and extend studies.

  2. Domain Representativeness: Cover the two core ML domains of tabular data processing and NLP to ensure the generalizability of research conclusions.

  3. Open-Source License: All included projects use open-source licenses, supporting academic and commercial research use.

  4. Practicality Orientation: Select projects with real-world application scenarios rather than purely academic research code.
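
The mechanically checkable part of these principles (size, domain, license) can be sketched as a simple filter. This is only an illustration: the `RepoRecord` fields, the `meets_criteria` and `select_candidates` names, and the domain labels are hypothetical, since the post does not describe the dataset's actual schema or selection tooling. Principle 4 (practicality) needs human judgment and is not modeled here.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

MAX_SIZE_MB = 500                    # size ceiling stated in the post
ALLOWED_DOMAINS = {"tabular", "nlp"}  # the two domains the post names


@dataclass(frozen=True)
class RepoRecord:
    # Hypothetical metadata fields, not the dataset's real schema.
    name: str
    size_mb: float
    domain: str
    license: Optional[str]  # None means no open-source license was found


def meets_criteria(repo: RepoRecord) -> bool:
    """Check principles 1-3: small size, covered domain, open-source license."""
    return (
        repo.size_mb < MAX_SIZE_MB
        and repo.domain.lower() in ALLOWED_DOMAINS
        and bool(repo.license)
    )


def select_candidates(repos: Iterable[RepoRecord]) -> List[RepoRecord]:
    """Keep only the repositories that pass every automatic check."""
    return [repo for repo in repos if meets_criteria(repo)]
```

A repository that passes this filter would still need a manual review against the practicality criterion before inclusion.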

6

Section 06

Value 1: Benchmark Testing for LLM-Assisted Code Generation

This dataset provides a standardized testing platform for evaluating the code generation capabilities of different LLMs (GPT-4, Claude, Llama, etc.):

  • Functional Correctness: Does the generated code correctly implement the intended function?
  • Code Quality: How readable and maintainable is the generated code, and how complete are its comments?
  • Security Vulnerabilities: Does the generated code contain common security vulnerabilities?
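
Of the three criteria above, functional correctness is the easiest to automate. One common approach, assumed here rather than taken from the post, is to run a generated function against input/output test cases and report the fraction that pass. The `pass_rate` helper and the case format are hypothetical.

```python
from typing import Any, Callable, Sequence, Tuple

# Each case is (positional arguments, expected return value).
Case = Tuple[Tuple[Any, ...], Any]


def pass_rate(candidate: Callable[..., Any], cases: Sequence[Case]) -> float:
    """Return the fraction of test cases the candidate function passes."""
    if not cases:
        return 0.0
    passed = 0
    for args, expected in cases:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            # A crash counts as a failed case, not as a harness error.
            pass
    return passed / len(cases)
```

Code quality and security would need separate tooling, such as linters and static analyzers, and are not covered by this sketch.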
7

Section 07

Value 2: Energy-Efficient Machine Learning Development

By comparing the energy efficiency of manually written code and LLM-generated code, researchers can:

  • Identify the advantages and limitations of LLMs in energy efficiency optimization
  • Develop prompt engineering strategies to guide LLMs to generate more energy-efficient code
  • Establish best practice guidelines for green AI coding
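
A comparison like the one described above could be summarized as a relative change in energy use. The sketch below assumes energy readings in joules have already been collected over repeated runs of each variant; the post does not say how energy is measured, and the `relative_energy_change` name is hypothetical. Medians are used to dampen the effect of noisy runs.

```python
from statistics import median
from typing import Sequence


def relative_energy_change(
    baseline_joules: Sequence[float],
    candidate_joules: Sequence[float],
) -> float:
    """Return (candidate - baseline) / baseline, using the median of each sample.

    A negative result means the candidate (for example, LLM-generated code)
    used less energy than the baseline (for example, manually written code).
    """
    if not baseline_joules or not candidate_joules:
        raise ValueError("both samples need at least one measurement")
    base = median(baseline_joules)
    cand = median(candidate_joules)
    if base <= 0:
        raise ValueError("baseline energy must be positive")
    return (cand - base) / base
```

For example, medians of 11.0 J for the baseline and 8.8 J for the candidate give a change of about -0.2, that is, roughly 20% less energy.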
8

Section 08

Value 3: Sustainable Software Engineering Research

The dataset provides empirical data for software engineering researchers to explore:

  • The long-term impact of AI-assisted development on software carbon footprint
  • Environmental cost-benefit analysis of code generation tools
  • Evolution trends of green programming paradigms
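
Carbon footprint estimates of this kind usually multiply the energy consumed by the carbon intensity of the electricity grid. The sketch below shows that arithmetic. The function name is hypothetical, and the intensity figure in the usage note is a placeholder, not a value from the dataset or the post.

```python
JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 million joules


def energy_to_emissions_g(energy_joules: float, grid_g_co2e_per_kwh: float) -> float:
    """Estimate emissions in grams of CO2-equivalent.

    emissions = energy in kWh * grid carbon intensity in gCO2e per kWh
    """
    if energy_joules < 0 or grid_g_co2e_per_kwh < 0:
        raise ValueError("energy and carbon intensity must be non-negative")
    return (energy_joules / JOULES_PER_KWH) * grid_g_co2e_per_kwh
```

With a placeholder intensity of 400 gCO2e/kWh, a workload that consumes 3.6 MJ (1 kWh) corresponds to about 400 g CO2e. Real grid intensity varies by region and time, so any study would need to state which value it used.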