# Predicting Corporate ESG Scores Using Public Data: A Transparent and Interpretable Machine Learning Solution

> This article introduces a machine learning project that predicts corporate ESG (Environmental, Social, Governance) scores using public data. The project uses only free SEC 10-K report texts and financial data, extracts financial sentiment features via FinBERT, and combines them with the ElasticNet regression model to provide retail investors with a low-cost, interpretable alternative for ESG assessment.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-05-23T18:45:57.000Z
- Last activity: 2026-05-23T18:48:05.776Z
- Popularity: 158.0
- Keywords: ESG, machine learning, natural language processing, FinBERT, corporate financial reports, ElasticNet, sustainable investing
- Page link: https://www.zingnex.cn/en/forum/thread/esg-e34e83a8
- Canonical: https://www.zingnex.cn/forum/thread/esg-e34e83a8

---

## Introduction: Predicting Corporate ESG Scores Using Public Data

This article introduces a machine learning project that predicts corporate ESG scores from free public data. It was published on GitHub by authors including Caden Lippie (project link: https://github.com/clippie/ESG_Prediction_Public). The project uses only SEC 10-K report text and financial data: it extracts financial sentiment features with FinBERT and combines them with an ElasticNet regression model. The goal is to give retail investors a low-cost, interpretable alternative for ESG assessment that addresses the cost barriers and methodological opacity of commercial ratings.

## Background: Two Core Dilemmas of ESG Scores

Commercial ratings from mainstream ESG rating agencies (e.g., MSCI, Sustainalytics) have two major problems:

1. Cost barriers: annual fees range from $5,000 to $30,000, which puts them out of reach for retail investors.
2. Methodological opacity: the ratings come from proprietary black-box algorithms, and the correlation between ratings from different agencies is below 50%, far lower than the 94% seen for credit ratings.

As a result, most participants in the global $30 trillion ESG asset market cannot access key information.

## Methodology: Data Sources and Feature Extraction

The project's data comes from the SEC EDGAR database:

1. Text data: sections of 10-K reports are extracted, including risk factor disclosures, MD&A (Management's Discussion and Analysis), and financial statement footnotes.
2. Feature engineering: FinBERT (a BERT model fine-tuned for the financial domain) is used to extract financial sentiment scores and ESG keyword features.
3. Structured financial indicators: fundamental data such as profitability and leverage ratios are obtained via the XBRL API.
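The text side of this pipeline can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the project's code: the keyword lists, the heading pattern, and the function names are assumptions, and the sentiment classifier is passed in as a hypothetical callable so that a FinBERT model (which labels text as positive, negative, or neutral) could be plugged in without changing the rest.

```python
import re
from collections import Counter

# Hypothetical keyword lists for illustration; the project's vocabulary is not given.
ESG_KEYWORDS = {
    "environmental": ["emissions", "climate", "renewable"],
    "social": ["diversity", "safety", "community"],
    "governance": ["board", "compensation", "audit"],
}

# Matches headings such as "Item 1A. Risk Factors" or "ITEM 7. Management's Discussion".
ITEM_HEADING = re.compile(r"^\s*item\s+(\d+[a-z]?)\s*[.:]", re.IGNORECASE | re.MULTILINE)


def split_items(filing_text):
    """Return {item_id: section_text}, keyed by upper-cased ids such as '1A'.

    If an id appears more than once (e.g. in a table of contents), the last
    occurrence wins.
    """
    matches = list(ITEM_HEADING.finditer(filing_text))
    sections = {}
    for current, nxt in zip(matches, matches[1:] + [None]):
        end = nxt.start() if nxt else len(filing_text)
        sections[current.group(1).upper()] = filing_text[current.end():end].strip()
    return sections


def keyword_features(text):
    """Count keyword hits per ESG pillar, normalised per 1,000 words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    scale = 1000 / max(len(words), 1)
    return {
        pillar: sum(counts[k] for k in keywords) * scale
        for pillar, keywords in ESG_KEYWORDS.items()
    }


def net_sentiment(sentences, classify):
    """Average P(positive) - P(negative) over sentences.

    `classify` is a hypothetical callable returning a dict with 'positive',
    'negative' and 'neutral' probabilities for one sentence.
    """
    if not sentences:
        return 0.0
    scores = [classify(s) for s in sentences]
    return sum(p["positive"] - p["negative"] for p in scores) / len(scores)
```

Each 10-K section then yields a small feature vector (keyword densities plus a net sentiment score) that can sit alongside the structured financial ratios as model input. Averaging sentence-level probabilities is only one possible aggregation choice.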

## Methodology: Selection of the ElasticNet Regression Model

The project uses the ElasticNet regression model for the following reasons:

1. Interpretability: coefficients directly show each feature's contribution.
2. Regularization: it combines L1 and L2 penalties to avoid overfitting and handle multicollinearity.
3. Computational efficiency: training is fast and inference is cheap, which suits retail investors.
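To make the two penalties concrete, here is a minimal pure-Python sketch of ElasticNet fitted by coordinate descent. It is not the project's code: the function names, toy data, and parameter values are illustrative assumptions. The objective follows the parameterization used by scikit-learn's `ElasticNet`, where `alpha` sets the overall penalty strength and `l1_ratio` sets the L1/L2 mix.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; this is where L1 creates exact zeros."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_elastic_net(X, y, alpha=0.2, l1_ratio=0.5, n_iter=100):
    """Minimise (1/2n)||y - Xw - b||^2 + alpha*l1_ratio*||w||_1
    + 0.5*alpha*(1 - l1_ratio)*||w||_2^2 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    # Centre features and target so the intercept can be recovered at the end.
    col_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    cols = [[row[j] - col_means[j] for row in X] for j in range(p)]
    w = [0.0] * p
    residual = [v - y_mean for v in y]  # equals y - Xw while w is all zeros
    for _ in range(n_iter):
        for j, col in enumerate(cols):
            z = sum(c * c for c in col) / n
            if z == 0.0:
                continue  # constant feature carries no information
            # Covariance of feature j with the residual that excludes its own term.
            rho = sum(c * (r + c * w[j]) for c, r in zip(col, residual)) / n
            new_w = soft_threshold(rho, alpha * l1_ratio) / (z + alpha * (1 - l1_ratio))
            delta = new_w - w[j]
            if delta != 0.0:
                residual = [r - c * delta for c, r in zip(col, residual)]
                w[j] = new_w
    intercept = y_mean - sum(wj * m for wj, m in zip(w, col_means))
    return w, intercept


# Toy data (hypothetical): the target depends on the first two features only.
X = [[a, b, c] for a in (-1.0, 1.0) for b in (-1.0, 1.0) for c in (-1.0, 1.0)]
y = [2.0 * a + 0.5 * b + 10.0 for a, b, c in X]
weights, bias = fit_elastic_net(X, y)
```

On this toy data the third coefficient is shrunk to exactly zero while the first two stay nonzero but smaller than their true values, which is the behavior that makes the fitted coefficients readable as feature contributions.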

## Evidence: Model Evaluation Results and Findings

The model's best result is an R² of 0.215 in the Social dimension, meaning it explains about 21.5% of the variance in Social scores. It cannot predict Governance scores at all. This suggests that Governance scores rely on external data not included in 10-K reports (e.g., board composition, executive compensation), which points the way for future research.
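For readers unfamiliar with the metric, R² compares the model's squared error with that of a baseline that always predicts the mean score. The short sketch below uses made-up numbers, not the project's data, to show the reference points: 1.0 for a perfect fit, 0.0 for a model no better than the mean, and negative values for a model that is worse.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


scores = [1.0, 2.0, 3.0, 4.0]                      # hypothetical ESG scores
perfect = r_squared(scores, scores)                # model matches every score
baseline = r_squared(scores, [2.5] * 4)            # model always predicts the mean
reversed_fit = r_squared(scores, [4.0, 3.0, 2.0, 1.0])  # model gets the ordering backwards
```

Read this way, an R² of 0.215 is a modest improvement over the mean baseline, and a dimension the model "cannot predict" corresponds to an R² at or below zero.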

## Comparison: Methodological Innovations vs. Existing Research

Existing studies often achieve high R² values by relying on financial ratios or on historical ESG scores, the latter being a self-reinforcing mechanism because past ratings are used to predict future ones. This project instead uses only raw public data and partially replicates commercial rating services through a transparent process. Its core innovation is removing the reliance on proprietary data while keeping the methodology transparent.

## Limitations and Future Improvement Directions

Current limitations:

- Predictive performance needs improvement.
- The model cannot predict the Governance dimension.
- It applies only to U.S. companies.
- It makes no industry adjustments.

Future directions:

- Integrate data from sustainability reports and news.
- Develop industry-specific models.
- Supplement the inputs with Governance data.
- Try more complex models that remain interpretable (e.g., Transformer regression).

## Significance for Retail Investors and Conclusion

This project gives retail investors an independent ESG assessment framework built on zero-cost data and transparent logic, so they can evaluate the ESG performance of their own portfolios and understand what drives the scores. The project promotes the democratization of financial information. Although the current model has limitations, it lays a foundation for open tools that let more people make investment decisions aligned with their values.
