# GenAI-GreenML: A Curated Dataset for Generative AI and Green Machine Learning

> A curated dataset containing 50 small open-source machine learning repositories, specifically designed for research on generative AI-assisted code generation and energy-efficient machine learning development.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-06-09T10:13:45.000Z
- Last activity: 2026-06-09T10:27:26.940Z
- Popularity: 161.8
- Keywords: generative AI, green machine learning, dataset, code generation, energy efficiency optimization, sustainable software engineering, LLM, carbon footprint, benchmarking
- Page link: https://www.zingnex.cn/en/forum/thread/genai-greenml-ai
- Canonical: https://www.zingnex.cn/forum/thread/genai-greenml-ai
- Markdown source: floors_fallback

---


## Original Author and Source

- **Original Author/Maintainer**: Bearwick
- **Source Platform**: GitHub
- **Original Title**: GenAI-GreenML
- **Original Link**: https://github.com/Bearwick/GenAI-GreenML
- **Publication Date**: 2026-06-09

## Research Background and Problem Definition

Generative AI is reshaping software development, from code completion and automated testing to documentation generation and architecture design. Behind this convenience, however, lies a growing concern: **the environmental cost of AI-assisted programming**.

Training and running Large Language Models (LLMs) consumes substantial energy and produces significant carbon emissions. This raises two questions: is LLM-generated code more energy-efficient than manually written code, and do LLM-generated machine learning models take energy efficiency into account? Systematic research data to answer these questions is currently lacking.

The **GenAI-GreenML** dataset was created to fill this research gap. It provides a curated benchmark for evaluating the environmental impact and energy efficiency of generative AI in code generation tasks.

## Dataset Overview

GenAI-GreenML is a curated collection of **50 small open-source machine learning repositories**, each under **500 MB** in size, covering two domains: tabular data and Natural Language Processing (NLP).

## Design Principles

1. **Small-Scale Priority**: Select repositories smaller than 500 MB to lower the computational cost of experiments, so that more researchers can reproduce and extend studies.

2. **Domain Representativeness**: Cover two core ML domains, tabular data processing and NLP, to improve the generalizability of research conclusions.

3. **Open-Source License**: All included projects use open-source licenses, supporting academic and commercial research use.

4. **Practicality Orientation**: Select projects with real-world application scenarios rather than purely academic research code.
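
The size criterion in the first principle can be checked mechanically. The following is a minimal sketch, not the dataset's actual selection tooling: the function names are hypothetical, and it assumes that "500 MB" means 500 × 1024 × 1024 bytes measured over the files in a local checkout.

```python
import os

# Assumption: 500 MB is interpreted as binary megabytes.
SIZE_LIMIT_BYTES = 500 * 1024 * 1024


def directory_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files under root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def is_small_repository(root: str, limit: int = SIZE_LIMIT_BYTES) -> bool:
    """Return True if the checkout at root is strictly below the size limit."""
    return directory_size_bytes(root) < limit
```

Calling `is_small_repository("path/to/checkout")` on a cloned repository would report whether it meets the threshold. Note that a local checkout includes the `.git` directory, so the measured size may differ from the size reported by the hosting platform.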

## Value 1: Benchmark Testing for LLM-Assisted Code Generation

This dataset provides a standardized testing platform for evaluating the code generation capabilities of different LLMs (GPT-4, Claude, Llama, etc.):

- **Functional Correctness**: Does the generated code correctly implement the intended function?
- **Code Quality**: How readable and maintainable is the generated code, and how complete are its comments?
- **Security Vulnerabilities**: Does the generated code contain common security vulnerabilities?
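
As an illustration of the functional correctness dimension, one simple approach is to run a generated function against a set of input/expected-output pairs and report the fraction that pass. This is a hedged sketch only: the source does not describe an evaluation harness, and `pass_rate`, `generated_mean`, and `CASES` are hypothetical names invented for the example.

```python
from typing import Any, Callable, Sequence, Tuple

# Each test case pairs a tuple of positional arguments with the expected result.
TestCase = Tuple[tuple, Any]


def pass_rate(candidate: Callable[..., Any], cases: Sequence[TestCase]) -> float:
    """Return the fraction of cases for which candidate returns the expected value."""
    if not cases:
        raise ValueError("at least one test case is required")
    passed = 0
    for args, expected in cases:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            # A crash counts as a failed case rather than aborting the run.
            pass
    return passed / len(cases)


def generated_mean(values):
    # Hypothetical stand-in for LLM-generated code; it has no guard for empty input.
    return sum(values) / len(values)


# Hypothetical specification: the mean of an empty list is defined as 0.0.
CASES = [
    (([1, 2, 3],), 2.0),
    (([10],), 10.0),
    (([],), 0.0),
]

score = pass_rate(generated_mean, CASES)
```

Here the empty-list case raises `ZeroDivisionError`, so two of the three cases pass. Code quality and security would need separate tooling, such as linters or static analyzers, which this sketch does not cover.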

## Value 2: Energy-Efficient Machine Learning Development

By comparing the energy efficiency of manually written and LLM-generated code, researchers can:

- Identify the advantages and limitations of LLMs in energy efficiency optimization
- Develop prompt engineering strategies to guide LLMs to generate more energy-efficient code
- Establish best practice guidelines for green AI coding
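
A rough way to make such a comparison concrete is to estimate energy as power multiplied by CPU time. The sketch below is an approximation under stated assumptions, not the dataset's measurement method: the constant `ASSUMED_CPU_POWER_WATTS` is a made-up placeholder, the two workload functions are hypothetical stand-ins, and `time.process_time()` counts only CPU time of the current process, so GPU and memory energy are ignored.

```python
import time
from typing import Callable, Dict

# Assumption: a constant average CPU power draw. Replace with a measured value
# (for example from a hardware power meter) before drawing any conclusions.
ASSUMED_CPU_POWER_WATTS = 15.0


def estimate_energy_joules(
    func: Callable[[], object],
    repeats: int = 5,
    power_watts: float = ASSUMED_CPU_POWER_WATTS,
) -> float:
    """Estimate energy per call as assumed power times mean CPU seconds."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    start = time.process_time()
    for _ in range(repeats):
        func()
    cpu_seconds = (time.process_time() - start) / repeats
    return power_watts * cpu_seconds


def manual_sum_of_squares() -> int:
    # Hypothetical stand-in for a manually written implementation.
    return sum(i * i for i in range(200_000))


def generated_sum_of_squares() -> int:
    # Hypothetical stand-in for an LLM-generated implementation.
    total = 0
    for i in range(200_000):
        total += i ** 2
    return total


def compare_energy(repeats: int = 5) -> Dict[str, float]:
    """Return estimated joules per call for both implementations."""
    return {
        "manual": estimate_energy_joules(manual_sum_of_squares, repeats),
        "generated": estimate_energy_joules(generated_sum_of_squares, repeats),
    }
```

Calling `compare_energy()` returns one estimate per implementation. Because the power figure is assumed constant, the ratio of the two estimates reflects only the difference in CPU time; a dedicated energy profiler would be needed for absolute figures.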

## Value 3: Sustainable Software Engineering Research

The dataset provides empirical data for software engineering researchers to explore:

- The long-term impact of AI-assisted development on software carbon footprint
- Environmental cost-benefit analysis of code generation tools
- Evolution trends of green programming paradigms
