# Can Large Language Models Predict Electricity Demand? A Comprehensive Comparison of 14 Models on Belgian Grid Data

> A systematic study compares statistical models, machine learning, deep learning, and large language models (LLMs) on electricity load forecasting, covering 14 configurations from ARIMA to GPT-4o, and shows where LLMs do and do not work for time-series prediction.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-06-07T23:15:20.000Z
- Last activity: 2026-06-07T23:18:18.538Z
- Popularity: 145.9
- Keywords: large language models, time-series forecasting, electricity load forecasting, Time-LLM, GPT-4o, XGBoost, LSTM, energy, machine learning, deep learning
- Page link: https://www.zingnex.cn/en/forum/thread/14
- Canonical: https://www.zingnex.cn/forum/thread/14
- Markdown source: floors_fallback

---

## [Introduction] Can Large Language Models Predict Electricity Demand? A Comparison of 14 Models Reveals True Capability Boundaries

This study compares statistical models, machine learning, deep learning, and large language models (14 configurations in total) on electricity load forecasting for the Belgian grid, with the aim of establishing what LLMs can and cannot do in time-series prediction. Using nearly 10 years of Belgian grid data, it reports three key findings. First, Time-LLM (an architecture that adapts GPT-2 through a reprogramming layer) outperforms LSTM and is ahead of XGBoost at the 24-hour horizon, although Time-LLM and XGBoost are roughly level at 48 hours. Second, prompting GPT-4o directly for forecasts gives poor results. Third, an ensemble of XGBoost, LSTM, and Time-LLM achieves the best performance.

## Research Background and Motivation

Electricity load forecasting is a core problem in the energy industry: accurate short-term forecasts are crucial for grid dispatch, trading, and renewable energy integration. Traditional methods include statistical models (ARIMA, Prophet) and machine learning models (XGBoost, LSTM). With the rise of LLMs, a new question needs answering: can models built for text be applied directly to numerical time-series prediction? The study comes from a master's project at the University of Hull and uses more than 395,000 load readings at 15-minute intervals from the Belgian grid, recorded between 2015 and 2025, to compare 14 model configurations.

## Dataset Preprocessing and Model Lineup

### Dataset and Preprocessing
The data comes from the public portal of Elia, the Belgian grid operator, and is aggregated to hourly resolution, giving approximately 99,000 records. Preprocessing steps include: linear interpolation to fill the 0.19% of values that are missing; construction of calendar, lag (t-1, t-24, t-168), and rolling statistical features for XGBoost; and standardization with StandardScaler for LSTM and Time-LLM, fitted only on the training set so that no information from the validation or test periods leaks into training.
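The preprocessing steps above can be sketched in plain Python. This is a minimal illustration, not the study's code: the function names (`interpolate_linear`, `make_features`, `fit_scaler`, `scale`), the 24-hour rolling window, and the `hour` stand-in for calendar features are all assumptions, and the standard library is used in place of pandas and scikit-learn only to keep the sketch dependency-free.

```python
from statistics import mean, pstdev


def interpolate_linear(values):
    """Fill interior None gaps by linear interpolation between known neighbours.

    Leading or trailing None values are left untouched.
    """
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


def make_features(load, lags=(1, 24, 168), window=24):
    """Build one row per hour t from lagged loads and a rolling mean of past values."""
    rows = []
    start = max(max(lags), window)  # first hour with a full history available
    for t in range(start, len(load)):
        row = {f"lag_{k}": load[t - k] for k in lags}
        row["roll_mean"] = mean(load[t - window:t])  # only values before t
        row["hour"] = t % 24  # assumption: the series starts at 00:00
        row["target"] = load[t]
        rows.append(row)
    return rows


def fit_scaler(train):
    """Return mean and population standard deviation of the training data only."""
    return mean(train), pstdev(train)


def scale(values, mu, sigma):
    """Apply training-set statistics to any split (train, validation, or test)."""
    return [(v - mu) / sigma for v in values]


if __name__ == "__main__":
    print(interpolate_linear([100.0, None, 120.0, None, None, 150.0]))
    rows = make_features([float(i) for i in range(200)])
    print(len(rows), rows[0])
    mu, sigma = fit_scaler([1.0, 2.0, 3.0])
    print(scale([2.0, 4.0], mu, sigma))
```

The scaler statistics come from the training split and are then reused unchanged on later data, which mirrors the "fitted only on the training set" rule in the text.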
### Model Lineup
- **Statistical Baselines**: Naive Persistence, ETS, ARIMA, Prophet
- **Machine Learning**: XGBoost
- **Deep Learning**: Two-layer LSTM (128 units)
- **LLM Methods**: Time-LLM (frozen GPT-2 + reprogramming layer), GPT-4o zero-shot/few-shot

## Evaluation Methods and Key Findings

### Evaluation Protocol
The data is split chronologically into 70% training, 15% validation, and 15% test. The metrics are MAE, RMSE, sMAPE, and MASE.
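The split and the four metrics can be sketched as follows. The source does not give its exact formulas, so the sMAPE variant and the MASE scaling (in-sample MAE of a naive forecast, with a hypothetical `season` parameter) are assumptions based on the most common definitions.

```python
from math import sqrt


def chronological_split(series, train_pct=70, val_pct=15):
    """Split in time order (no shuffling) so the test set is the most recent data."""
    n = len(series)
    train_end = n * train_pct // 100
    val_end = n * (train_pct + val_pct) // 100
    return series[:train_end], series[train_end:val_end], series[val_end:]


def mae(actual, forecast):
    """Mean absolute error, in the units of the series (MW here)."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def rmse(actual, forecast):
    """Root mean squared error; penalizes large misses more than MAE."""
    return sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))


def smape(actual, forecast):
    """Symmetric MAPE in percent, assuming the 2|a-f| / (|a|+|f|) variant."""
    terms = (2 * abs(a - f) / (abs(a) + abs(f)) for a, f in zip(actual, forecast))
    return 100 * sum(terms) / len(actual)


def mase(actual, forecast, train, season=1):
    """MAE divided by the in-sample MAE of a naive forecast.

    The naive forecast repeats the value observed `season` steps earlier.
    """
    naive_errors = [abs(train[t] - train[t - season]) for t in range(season, len(train))]
    return mae(actual, forecast) / (sum(naive_errors) / len(naive_errors))


if __name__ == "__main__":
    train, val, test = chronological_split(list(range(100)))
    print(len(train), len(val), len(test))
    actual, forecast = [100.0, 200.0], [110.0, 180.0]
    history = [100.0, 110.0, 130.0, 120.0]
    print(mae(actual, forecast), rmse(actual, forecast))
    print(smape(actual, forecast), mase(actual, forecast, history))
```

Under this definition a MASE below 1 means the model beats the naive reference on average. Dividing the MAE values in the results tables by their MASE (for example 481 / 0.89 or 263 / 0.49) gives a scaling error of roughly 540 MW at both horizons.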
### 24-hour Forecasting Results
|Model|MAE (MW)|MASE|
|---|---|---|
|Ensemble Model|263|0.49|
|Time-LLM|271|0.50|
|XGBoost|277|0.51|
|GPT-4o Zero-shot|481|0.89|
### 48-hour Forecasting Results
|Model|MAE (MW)|MASE|
|---|---|---|
|Ensemble Model|299|0.55|
|Time-LLM|317|0.59|
|XGBoost|315|0.59|
|GPT-4o Zero-shot|535|0.99|
### Key Insights
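- Time-LLM is the best single model at 24 hours (MAE 271 MW vs. 277 MW for XGBoost), while at 48 hours XGBoost is marginally ahead (315 MW vs. 317 MW). Prompting GPT-4o directly is far worse at both horizons, with a MASE of 0.89 and 0.99, close to the value of 1 that marks no improvement over a naive forecast;
- XGBoost remains strong, which underlines the value of feature engineering;
- The ensemble has the lowest error at both horizons, which suggests that the three models make partly complementary errors.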
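The source does not say how the three forecasts are combined. The sketch below assumes two common options, a simple average and weights inversely proportional to validation MAE; neither is confirmed by the study, the function names are hypothetical, and all numbers in the demo are made up for illustration.

```python
def average_ensemble(predictions):
    """Average the forecasts of all models, step by step.

    `predictions` maps a model name to a list of forecasts of equal length.
    """
    return [sum(step) / len(step) for step in zip(*predictions.values())]


def inverse_mae_weights(val_mae):
    """Turn validation MAEs into weights that sum to 1 (lower error, higher weight)."""
    inverse = {name: 1.0 / err for name, err in val_mae.items()}
    total = sum(inverse.values())
    return {name: w / total for name, w in inverse.items()}


def weighted_ensemble(predictions, weights):
    """Combine forecasts with per-model weights."""
    names = list(predictions)
    horizon = len(predictions[names[0]])
    return [sum(weights[n] * predictions[n][t] for n in names) for t in range(horizon)]


if __name__ == "__main__":
    # Illustrative values only, not results from the study.
    forecasts = {
        "xgboost": [1020.0, 1080.0],
        "lstm": [990.0, 1130.0],
        "time_llm": [1005.0, 1090.0],
    }
    print(average_ensemble(forecasts))
    weights = inverse_mae_weights({"xgboost": 20.0, "lstm": 20.0, "time_llm": 10.0})
    print(weights)
    print(weighted_ensemble(forecasts, weights))
```

In the demo the true values are taken to be 1000 and 1100: the individual models miss by 5 to 30 MW per step, but their errors partly point in opposite directions, so the plain average lands closer than any single model. That cancellation is the sense in which diversity helps an ensemble.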

## Practical Significance and Application Implications

1. **Hybrid Strategy is Optimal**: The ensemble of XGBoost, LSTM, and Time-LLM achieves the best results;
2. **LLMs Require Adaptation**: Prompting GPT-4o directly is not competitive, whereas architectures like Time-LLM that adapt an LLM to numerical input are viable;
3. **Feature Engineering Remains Important**: XGBoost's performance demonstrates the value of domain knowledge;
4. **Statistical Models as Baselines**: Prophet and similar models remain useful when data is limited or interpretability is required.

## Research Limitations and Future Directions

### Limitations
- Only Belgian grid data is used, so generalizability to other grids remains to be verified.
### Future Directions
- Explore the impact of different LLM backbone networks on time-series adaptation;
- Optimize the design of few-shot prompts for GPT-4o;
- Validate conclusions on more datasets.
