Zing Forum

Can Large Language Models Predict Electricity Demand? A Comprehensive Comparison of 14 Models on Belgian Grid Data

A systematic study compares the performance of statistical models, machine learning, deep learning, and large language models (LLMs) on electricity load forecasting tasks, covering 14 configurations from ARIMA to GPT-4o, revealing the true capability boundaries of LLMs in time-series prediction.

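Tags: Large Language Models, Time-Series Forecasting, Electricity Load Forecasting, Time-LLM, GPT-4o, XGBoost, LSTM, Energy, Machine Learning, Deep Learning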
Published 2026-06-08 07:15 | Recent activity 2026-06-08 07:18 | Estimated read 6 min

Section 01

[Introduction] Can Large Language Models Predict Electricity Demand? A Comparison of 14 Models Reveals True Capability Boundaries

This study systematically compares statistical models, machine learning, deep learning, and large language models (14 configurations in total) on electricity load forecasting for the Belgian grid, with the aim of establishing what LLMs can and cannot do in time-series prediction. Using nearly 10 years of Belgian grid data, the key findings are: Time-LLM (an architecture that adapts GPT-2 through a reprogramming layer) is competitive with XGBoost, slightly ahead at the 24-hour horizon and roughly level at 48 hours, and is reported to outperform LSTM; directly prompting GPT-4o yields poor forecasts; and the ensemble (XGB+LSTM+Time-LLM) achieves the best performance.

Section 02

Research Background and Motivation

Electricity load forecasting is a core problem in the energy industry; accurate short-term forecasts are crucial for grid dispatch, trading, and renewable energy integration. Established methods include statistical models (ARIMA, Prophet) and machine learning and deep learning models (XGBoost, LSTM). With the rise of LLMs, a new question arises: can these text models be applied directly to numerical time-series prediction? The study comes from a master's project at the University of Hull and uses over 395,000 load readings at 15-minute intervals from the Belgian grid between 2015 and 2025 to compare 14 model configurations.

Section 03

Dataset Preprocessing and Model Lineup

Dataset and Preprocessing

The data comes from the public portal of Elia, the Belgian grid operator, and is aggregated to hourly resolution, giving approximately 99,000 records. Preprocessing steps: linear interpolation fills the 0.19% of values that are missing; calendar, lag (t-1, t-24, t-168), and rolling-statistic features are constructed for XGBoost; and inputs to LSTM and Time-LLM are standardized with StandardScaler, fitted on the training set only.
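The preprocessing described above can be sketched without any third-party libraries. This is a minimal illustration, not the study's code: the function names, the 24-hour rolling window, and the synthetic series are assumptions, and only the three lags and the train-only fitting rule come from the text.

```python
from datetime import datetime, timedelta
from statistics import mean, pstdev

LAGS = (1, 24, 168)   # t-1, t-24, t-168 hours, as listed in the text
ROLL_WINDOW = 24      # assumption: the source does not give the window length


def build_features(timestamps, load):
    """Return (rows, targets) for every hour that has a full 168-hour history."""
    rows, targets = [], []
    for t in range(max(LAGS), len(load)):
        window = load[t - ROLL_WINDOW:t]  # strictly before t, so the target never leaks in
        row = {
            "hour": timestamps[t].hour,
            "dayofweek": timestamps[t].weekday(),
            "month": timestamps[t].month,
            "roll_mean": mean(window),
            "roll_std": pstdev(window),
        }
        for lag in LAGS:
            row[f"lag_{lag}"] = load[t - lag]
        rows.append(row)
        targets.append(load[t])
    return rows, targets


def fit_scaler(train_values):
    """Mean and population standard deviation from the training portion only."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # guard against a constant series
    return mu, sigma


def transform(values, mu, sigma):
    """Apply the training-set statistics to any portion of the series."""
    return [(v - mu) / sigma for v in values]


# Synthetic hourly series for illustration only (not Elia data).
start = datetime(2015, 1, 1)
timestamps = [start + timedelta(hours=h) for h in range(400)]
load = [9000.0 + 1500.0 * ((h % 24) / 23.0) + 10.0 * (h // 24) for h in range(400)]

rows, targets = build_features(timestamps, load)
train_end = int(0.7 * len(load))
mu, sigma = fit_scaler(load[:train_end])
scaled = transform(load, mu, sigma)
```

Fitting the scaler on the training portion and reusing its statistics for validation and test data keeps information from later periods out of the model inputs, which is the point of the "fitted only on the training set" rule.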

Model Lineup

  • Statistical Baselines: Naive Persistence, ETS, ARIMA, Prophet
  • Machine Learning: XGBoost
  • Deep Learning: Two-layer LSTM (128 units)
  • LLM Methods: Time-LLM (frozen GPT-2 + reprogramming layer), GPT-4o zero-shot/few-shot
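To make the GPT-4o zero-shot configuration more concrete, the sketch below shows one way a numeric history could be serialized into a text prompt and a reply parsed back into numbers. The prompt wording and both function names are hypothetical; the source does not publish its prompts, and no API call is made here.

```python
import re


def build_zero_shot_prompt(history_mw, horizon):
    """Serialize recent hourly load into a plain-text forecasting prompt.

    The wording is hypothetical and only illustrates the idea.
    """
    series = ", ".join(f"{v:.0f}" for v in history_mw)
    return (
        "The following values are hourly electricity load in MW, oldest first:\n"
        f"{series}\n"
        f"Predict the next {horizon} hourly values. "
        "Reply with comma-separated numbers only."
    )


def parse_forecast(reply, horizon):
    """Extract exactly `horizon` numbers from a model reply, or raise ValueError."""
    values = [float(x) for x in re.findall(r"-?\d+(?:\.\d+)?", reply)]
    if len(values) != horizon:
        raise ValueError(f"expected {horizon} values, got {len(values)}")
    return values


prompt = build_zero_shot_prompt([9120.4, 9055.0, 8990.7], horizon=2)
forecast = parse_forecast("8950, 8925.5", horizon=2)
```

A few-shot variant would prepend worked history/answer pairs to the same prompt. Strict parsing matters in practice because a text model may return too few values, too many, or extra commentary.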

Section 04

Evaluation Methods and Key Findings

Evaluation Protocol

The data is split chronologically into 70% training, 15% validation, and 15% test. Metrics are MAE, RMSE, sMAPE, and MASE.
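The protocol above can be written down in a few lines. This is a sketch under stated assumptions: the function names are illustrative, sMAPE is given in its common 0 to 200 percent form, and the seasonal period used to scale MASE (`season=24`) is a guess, since the source does not say which naive baseline it used.

```python
from math import sqrt


def chronological_split(series, train_frac=0.70, val_frac=0.15):
    """Split in time order, never shuffled, so the test set is strictly later."""
    n = len(series)
    train_end = round(n * train_frac)
    val_end = round(n * (train_frac + val_frac))
    return series[:train_end], series[train_end:val_end], series[val_end:]


def mae(y_true, y_pred):
    """Mean absolute error, in the units of the series (MW here)."""
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error; penalizes large misses more than MAE."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true))


def smape(y_true, y_pred):
    """Symmetric MAPE in percent; terms with a zero denominator count as 0."""
    total = 0.0
    for a, p in zip(y_true, y_pred):
        denom = (abs(a) + abs(p)) / 2.0
        total += abs(a - p) / denom if denom else 0.0
    return 100.0 * total / len(y_true)


def mase(y_true, y_pred, train, season=24):
    """MAE divided by the in-sample MAE of a seasonal-naive forecast.

    season=24 is an assumption; the source does not state its scaling baseline.
    """
    scale = mae(train[season:], train[:-season])
    return mae(y_true, y_pred) / scale


# Toy numbers for illustration only.
series = [float(i) for i in range(100)]
train, val, test = chronological_split(series)
y_true, y_pred = [100.0, 200.0], [110.0, 190.0]
```

A MASE below 1 means the model beats the naive baseline on average, which is why it is a useful companion to MAE when comparing forecast horizons.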

24-hour Forecasting Results (MAE in MW)

Model               MAE (MW)   MASE
Ensemble Model      263        0.49
Time-LLM            271        0.50
XGBoost             277        0.51
GPT-4o Zero-shot    481        0.89

48-hour Forecasting Results (MAE in MW)

Model               MAE (MW)   MASE
Ensemble Model      299        0.55
Time-LLM            317        0.59
XGBoost             315        0.59
GPT-4o Zero-shot    535        0.99
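To put the reported MAE values in proportion, the short script below computes how far each model sits above the best entry at each horizon. The numbers are copied from the two result tables; the helper name `gap_vs_best` is illustrative.

```python
# MAE values in MW, copied from the 24-hour and 48-hour result tables.
mae_24h = {"Ensemble": 263, "Time-LLM": 271, "XGBoost": 277, "GPT-4o zero-shot": 481}
mae_48h = {"Ensemble": 299, "Time-LLM": 317, "XGBoost": 315, "GPT-4o zero-shot": 535}


def gap_vs_best(mae_by_model):
    """Percent by which each model's MAE exceeds the lowest MAE in the table."""
    best = min(mae_by_model.values())
    return {name: 100.0 * (value - best) / best for name, value in mae_by_model.items()}


for label, table in (("24h", mae_24h), ("48h", mae_48h)):
    for name, gap in gap_vs_best(table).items():
        print(f"{label} {name}: +{gap:.1f}%")
```

The gaps among the ensemble, Time-LLM, and XGBoost are a few percent at both horizons, while zero-shot GPT-4o sits roughly 80% above the best MAE.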

Key Insights

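  • Time-LLM is the best single model at 24 hours (MAE 271 MW vs. 277 MW for XGBoost); at 48 hours the two are effectively tied (317 vs. 315 MW, both MASE 0.59). Directly prompting GPT-4o is far behind, and at 48 hours its MASE of 0.99 is barely better than the naive baseline that MASE is scaled against;
  • XGBoost remains strong, highlighting the value of feature engineering;
  • The ensemble has the lowest MAE at both horizons, reflecting the value of model diversity.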
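The source does not say how the three models are combined in the ensemble. One common and simple choice is an hour-by-hour (optionally weighted) average, sketched below. The function name, the equal default weights, and the sample forecasts are all assumptions for illustration.

```python
def average_ensemble(forecasts, weights=None):
    """Combine per-model forecasts hour by hour with a (weighted) mean."""
    names = list(forecasts)
    horizon = len(forecasts[names[0]])
    if any(len(forecasts[n]) != horizon for n in names):
        raise ValueError("all forecasts must cover the same horizon")
    if weights is None:
        weights = {n: 1.0 for n in names}  # assumption: equal weights by default
    total = sum(weights[n] for n in names)
    return [
        sum(weights[n] * forecasts[n][h] for n in names) / total
        for h in range(horizon)
    ]


# Made-up three-hour forecasts in MW, for illustration only.
forecasts = {
    "xgboost": [9000.0, 9100.0, 9300.0],
    "lstm": [9060.0, 9040.0, 9240.0],
    "time_llm": [8970.0, 9130.0, 9360.0],
}
combined = average_ensemble(forecasts)
```

Averaging helps when the models make partly uncorrelated errors, so that over- and under-predictions cancel. If weights are used, they would normally be tuned on the validation split rather than the test split.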

Section 05

Practical Significance and Application Implications

  1. Hybrid Strategy is Optimal: The ensemble of XGBoost, LSTM, and Time-LLM achieves the best results;
  2. LLMs Require Adaptation: Directly prompting GPT-4o is impractical, whereas Time-LLM-style architectures are feasible;
  3. Feature Engineering Remains Important: XGBoost's performance demonstrates the value of domain knowledge;
  4. Statistical Models as Baselines: Prophet and others are still useful in scenarios with limited data or where interpretability is needed.

Section 06

Research Limitations and Future Directions

Limitations

  • The study uses only Belgian grid data; generalizability to other grids remains to be verified.

Future Directions

  • Explore the impact of different LLM backbone networks on time-series adaptation;
  • Optimize the design of few-shot prompts for GPT-4o;
  • Validate conclusions on more datasets.