Zing Forum

SIS-LLM: A Unified Framework for Evaluating the Sustainability of Large Language Model Inference

SIS-LLM is a unified framework for evaluating the sustainability of large language model (LLM) inference. It integrates performance, efficiency, and environmental metrics to generate a single interpretable Sustainability Index Score (SIS).

Tags: LLM, sustainability, energy efficiency, carbon emissions, inference optimization, green AI, SIS, Qwen, Mistral, LLaMA
Published: 2026-06-16 06:46
Recent activity: 2026-06-16 06:49
Estimated read: 7 min

Section 01

SIS-LLM: A Unified Framework for LLM Inference Sustainability Evaluation

SIS-LLM is a unified framework for evaluating the sustainability of large language model (LLM) inference, developed by Urooj Asgher (Technological University Dublin) and released on GitHub (project name: SIS-LLM-InferenceTool) on June 15, 2026. It integrates performance, efficiency, and environmental metrics into a single interpretable Sustainability Index Score (SIS), helping developers and enterprises make informed decisions in model selection.

Section 02

Background & Motivation

As LLMs are deployed across industries, the energy consumed and the carbon emitted during inference are a growing concern. Most current evaluations report accuracy and speed but leave out sustainability metrics such as energy efficiency and carbon emissions. Such evaluations neither reflect real deployment costs nor guide green AI development. SIS-LLM addresses this gap by unifying multiple metrics into a single SIS score.

Section 03

Core Concept: SIS Score & Key Metrics

SIS Score Definition

SIS (Sustainability Index Score) ranges from 0 to 1; lower values indicate lower impact and therefore better sustainability.

SIS Rating Levels

SIS Range    Sustainability Level
0.0-0.3      Low Impact
0.3-0.7      Medium Impact
0.7-1.0      High Impact
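
The rating table can be read as a simple lookup. The sketch below is illustrative and not taken from the SIS-LLM code: the function name `sis_level` is hypothetical, and because the table lists 0.3 and 0.7 in two adjacent ranges, it assumes that a boundary value belongs to the lower-impact band.

```python
def sis_level(score: float) -> str:
    """Map a SIS value in [0, 1] to its rating level (lower is better)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"SIS must lie in [0, 1], got {score}")
    # Assumption: the shared boundaries 0.3 and 0.7 are assigned to the
    # lower-impact band; the source table does not say which band owns them.
    if score <= 0.3:
        return "Low Impact"
    if score <= 0.7:
        return "Medium Impact"
    return "High Impact"


if __name__ == "__main__":
    for value in (0.12, 0.45, 0.91):
        print(value, sis_level(value))
```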

Key Metrics

  • Energy & Environment: Energy consumption (J/query), carbon emissions (g CO₂eq/query), token energy efficiency (tokens/J)
  • Performance: Execution time (s/query), throughput (tokens/s), accuracy (benchmark performance)
  • Resource Efficiency: Model efficiency (accuracy/energy), hardware efficiency (accuracy/CPU hours), memory usage (GB), FLOPs (operations/inference), model size (MB)
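
Several of the metrics above are ratios of directly measured quantities, so they can be derived from the units given in the list. The following is a minimal sketch under stated assumptions, not the framework's own implementation: the `QueryMeasurement` type, the function names, and the grid carbon intensity parameter are all hypothetical, and the article does not say how SIS-LLM converts energy into emissions.

```python
from dataclasses import dataclass

JOULES_PER_KWH = 3.6e6  # 1 kWh = 3.6 MJ


@dataclass
class QueryMeasurement:
    """Raw per-query measurements (hypothetical container for this sketch)."""
    energy_j: float    # measured energy for the query, in joules
    duration_s: float  # wall-clock execution time, in seconds
    tokens: int        # number of tokens generated


def tokens_per_joule(m: QueryMeasurement) -> float:
    """Token energy efficiency (tokens/J)."""
    return m.tokens / m.energy_j


def throughput(m: QueryMeasurement) -> float:
    """Throughput (tokens/s)."""
    return m.tokens / m.duration_s


def carbon_g_per_query(m: QueryMeasurement, grid_g_per_kwh: float) -> float:
    """Carbon emissions (g CO2eq/query).

    Assumption: emissions = energy in kWh times a grid carbon intensity
    supplied by the caller; the value to use depends on the local grid.
    """
    return m.energy_j / JOULES_PER_KWH * grid_g_per_kwh


def model_efficiency(accuracy: float, mean_energy_j: float) -> float:
    """Model efficiency as accuracy per joule (accuracy/energy)."""
    return accuracy / mean_energy_j


if __name__ == "__main__":
    m = QueryMeasurement(energy_j=3600.0, duration_s=10.0, tokens=200)
    print(tokens_per_joule(m), throughput(m), carbon_g_per_query(m, 300.0))
```

The numbers in the usage example are made up to keep the arithmetic easy to check; they are not results from the evaluation.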

Section 04

Evaluation Setup

Evaluated Models

Model Name                    Parameters   Quantization
Qwen2.5-7B-Instruct           7B           GGUF Q4_K_M
Mistral-7B-Instruct-v0.3      7B           GGUF Q4_K_M
Meta-Llama-3.1-8B-Instruct    8B           GGUF Q4_K_M
Phi-3.5-mini-Instruct         3.8B         GGUF Q4_K_M

Datasets

  • GSM8K (500 samples, math reasoning)
  • MMLU (500 samples, multi-disciplinary knowledge)
  • TruthfulQA (500 samples, factual accuracy)

All tests use seed=42 for reproducibility.
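
A fixed seed makes the 500-sample subsets repeatable across runs. The article does not show how build_eval_dataset.py draws its samples, so the sketch below only illustrates the general idea with the standard library; `sample_subset` is a hypothetical helper, not a function from the repository.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_subset(items: Sequence[T], n: int, seed: int = 42) -> List[T]:
    """Draw n items without replacement, reproducibly for a given seed."""
    if n > len(items):
        raise ValueError(f"requested {n} samples from only {len(items)} items")
    # A private Random instance avoids touching the global random state,
    # so other code in the same process cannot change the selection.
    rng = random.Random(seed)
    return rng.sample(list(items), n)


if __name__ == "__main__":
    question_ids = list(range(10_000))  # stand-in for a full benchmark split
    first = sample_subset(question_ids, 500)
    second = sample_subset(question_ids, 500)
    print(first == second)  # same seed, same subset
```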

Hardware & Software

  • Hardware: 2× Intel Xeon Gold 6430 (64 cores/128 threads), CPU-only (GPU disabled), Adcewatt power meter for real energy measurement.
  • Software: llama.cpp framework, core scripts (main runner, dataset builder, power monitoring, metric collection).
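
A power meter reports instantaneous power, while the framework's energy metric is in joules per query, so power readings have to be integrated over the query's duration. The article does not describe the meter's interface or the power monitoring script, so this is a sketch under assumptions: readings are taken to be (timestamp in seconds, power in watts) pairs, integration uses the trapezoidal rule, and the optional idle baseline subtraction is an illustrative choice rather than a documented step.

```python
from typing import Sequence, Tuple


def energy_joules(
    samples: Sequence[Tuple[float, float]],
    idle_watts: float = 0.0,
) -> float:
    """Integrate (timestamp_s, power_w) readings into energy in joules.

    samples must be ordered by timestamp. idle_watts, if given, is a
    constant baseline subtracted over the whole measurement window.
    """
    if len(samples) < 2:
        raise ValueError("need at least two power samples to integrate")
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        if t1 < t0:
            raise ValueError("samples must be sorted by timestamp")
        # Trapezoidal rule: mean power over the interval times its length.
        total += 0.5 * (p0 + p1) * (t1 - t0)
    window_s = samples[-1][0] - samples[0][0]
    return total - idle_watts * window_s


if __name__ == "__main__":
    readings = [(0.0, 100.0), (1.0, 100.0), (2.0, 200.0)]
    print(energy_joules(readings))                   # 250.0
    print(energy_joules(readings, idle_watts=50.0))  # 150.0
```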

Section 05

Practical Application Value

  • Developers: Objective model selection tool (consider sustainability alongside performance), especially useful for edge/resource-limited environments.
  • Enterprises: Reduce operational costs (lower energy use), fulfill ESG responsibilities (quantify carbon footprint), optimize resource allocation.
  • Research: Standardized evaluation framework, open-source toolchain, and benchmark dataset for reproducible sustainability research.

Section 06

Usage & Deployment Guide

  1. Clone Repository: git clone https://github.com/urooj88/SIS-LLM-InferenceTool.git && cd SIS-LLM-InferenceTool
  2. Install Dependencies: pip install -r requirements.txt
  3. Build Dataset: python3 build_eval_dataset.py --reason 500 --mcq 500 --truth 500 --force-rebuild
  4. Run Evaluation: python3 main_sustainability_runner_LLM_CPU.py

Required Models

Download GGUF models from HuggingFace: Qwen2.5-7B-Instruct-GGUF, Mistral-7B-Instruct-v0.3-GGUF, Meta-Llama-3.1-8B-Instruct-GGUF, Phi-3.5-mini-instruct-GGUF.

Section 07

Limitations & Future Work

Limitations

  • Hardware dependency: Requires Adcewatt power meter for real energy measurement.
  • CPU-only: GPU inference evaluation is under development.
  • Limited model coverage: Only four models were evaluated, all in the 3.8B-8B parameter range and all using Q4_K_M quantization.

Future Directions

  • Extend to GPU inference evaluation.
  • Support more model architectures and quantization schemes.
  • Develop cloud deployment energy estimation models.
  • Establish industry-standard SIS benchmark database.

Section 08

Conclusion & Insights

SIS-LLM pioneers a unified approach to LLM inference sustainability evaluation. By integrating performance, efficiency, and environmental metrics into an interpretable score, it helps balance model performance with sustainability. This framework emphasizes that sustainability should be a core consideration in model design and selection, paving the way for greener AI systems.