GenAI-GreenML: A Curated Dataset for Generative AI and Green Machine Learning

A curated dataset containing 50 small open-source machine learning repositories, specifically designed for research on generative AI-assisted code generation and energy-efficient machine learning development.

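Tags: Generative AI, Green Machine Learning, Dataset, Code Generation, Energy Efficiency Optimization, Sustainable Software Engineering, LLM, Carbon Footprint, Benchmarking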

Published: 2026-06-09 18:13
Recent activity: 2026-06-09 18:27
Estimated read: 6 min

Section 01

Introduction: GenAI-GreenML: A Curated Dataset for Generative AI and Green Machine Learning

Section 02

Original Author and Source

Original Author/Maintainer: Bearwick
Source Platform: GitHub
Original Title: GenAI-GreenML
Original Link: https://github.com/Bearwick/GenAI-GreenML
Publication Date: 2026-06-09

Section 03

Research Background and Problem Definition

Generative AI is reshaping software development, from code completion and automated testing to documentation generation and architecture design. Behind this convenience, however, lies an increasingly serious issue: the environmental cost of AI-assisted programming.

Large Language Models (LLMs) consume large amounts of energy during training and inference, which results in significant carbon emissions. This raises two further questions. Is code generated by LLMs more energy-efficient than manually written code? Do LLM-generated machine learning models take energy efficiency into account? There is currently little systematic research data to answer either question.

The GenAI-GreenML dataset was created to fill this research gap. It provides a carefully selected benchmark dataset for evaluating the environmental impact and energy efficiency of generative AI in code generation tasks.

Section 04

Dataset Overview

GenAI-GreenML is a curated collection of 50 small open-source machine learning repositories, each under 500 MB in size, covering two major domains: tabular data and Natural Language Processing (NLP).

Section 05

Design Principles

Small-Scale Priority: Select repositories smaller than 500 MB to lower the computational barrier to running experiments, enabling more researchers to reproduce and extend studies.
Domain Representativeness: Cover the two core ML domains of tabular data processing and NLP to ensure the generalizability of research conclusions.
Open-Source License: All included projects use open-source licenses, supporting academic and commercial research use.
Practicality Orientation: Select projects with real-world application scenarios rather than purely academic research code.
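
The measurable principles above can be sketched as a simple filter. This is a minimal illustration, not the maintainer's actual selection script: the `RepoCandidate` record, its field names, and the example entries are all hypothetical. Only the 500 MB ceiling and the two domains come from the description above. A non-empty license string is used as a crude stand-in for "open-source license", and practicality is left to manual review because it cannot be checked mechanically.

```python
from dataclasses import dataclass
from typing import Optional

MAX_SIZE_MB = 500             # size ceiling stated in the dataset description
DOMAINS = {"tabular", "nlp"}  # the two domains the dataset covers


@dataclass
class RepoCandidate:
    """Hypothetical record for a repository under consideration."""
    name: str
    size_mb: float
    domain: str
    license: Optional[str]  # None when no license was detected


def meets_design_principles(repo: RepoCandidate) -> bool:
    """Check the size, domain, and license principles.

    Assumption: any non-empty license string counts as open source.
    Practicality is not checked here; it needs human judgment.
    """
    return (
        repo.size_mb < MAX_SIZE_MB
        and repo.domain.lower() in DOMAINS
        and bool(repo.license)
    )


# Made-up candidates for illustration only; none are real dataset entries.
candidates = [
    RepoCandidate("example-tabular", 120.0, "tabular", "MIT"),
    RepoCandidate("example-large-nlp", 900.0, "nlp", "Apache-2.0"),
    RepoCandidate("example-vision", 80.0, "vision", "MIT"),
    RepoCandidate("example-unlicensed", 40.0, "nlp", None),
]

selected = [repo.name for repo in candidates if meets_design_principles(repo)]
print(selected)
```

Keeping the thresholds as named constants makes it easy to rerun the same filter with a different size limit or an extended set of domains.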

Section 06

Value 1: Benchmark Testing for LLM-Assisted Code Generation

This dataset provides a standardized test bed for evaluating the code generation capabilities of different LLMs (GPT-4, Claude, Llama, etc.):

Functional Correctness: Does the generated code correctly implement the intended function?
Code Quality: How readable, maintainable, and well-commented is the generated code?
Security Vulnerabilities: Does the generated code contain common security vulnerabilities?

Section 07

Value 2: Energy-Efficient Machine Learning Development

By comparing the energy efficiency of manually written code and LLM-generated code, researchers can:

Identify the advantages and limitations of LLMs in energy efficiency optimization
Develop prompt engineering strategies to guide LLMs to generate more energy-efficient code
Establish best practice guidelines for green AI coding
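
One way to express such a comparison, as a minimal sketch: given repeated energy readings (in joules) for a manually written implementation and an LLM-generated one, compute the relative change in mean energy. The function name `relative_energy_change` is hypothetical, and the numbers are made-up placeholders, not measurements from the dataset. How the readings are collected (hardware counters, an external power meter, or a tracking library) is outside the scope of this sketch.

```python
from statistics import mean
from typing import Sequence


def relative_energy_change(
    baseline_joules: Sequence[float],
    candidate_joules: Sequence[float],
) -> float:
    """Return (candidate - baseline) / baseline, using the mean of each run set.

    A negative result means the candidate used less energy on average.
    """
    if not baseline_joules or not candidate_joules:
        raise ValueError("both run sets need at least one reading")
    baseline_mean = mean(baseline_joules)
    if baseline_mean <= 0:
        raise ValueError("baseline energy must be positive")
    return (mean(candidate_joules) - baseline_mean) / baseline_mean


# Placeholder readings for illustration only; these are not real results.
manual_runs = [100.0, 102.0, 98.0]
llm_runs = [90.0, 91.0, 89.0]

change = relative_energy_change(manual_runs, llm_runs)
print(f"Relative energy change: {change:+.1%}")
```

Averaging over several runs matters because single energy readings are noisy. A fuller analysis would also report variance or a significance test before drawing conclusions.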

Section 08

Value 3: Sustainable Software Engineering Research

The dataset provides software engineering researchers with empirical data to explore:

The long-term impact of AI-assisted development on software carbon footprint
Environmental cost-benefit analysis of code generation tools
Evolution trends of green programming paradigms
