Zing Forum

Predicting Corporate ESG Scores Using Public Data: A Transparent and Interpretable Machine Learning Solution

This article introduces a machine learning project that predicts corporate ESG (Environmental, Social, Governance) scores using public data. The project uses only free SEC 10-K report texts and financial data, extracts financial sentiment features via FinBERT, and combines them with the ElasticNet regression model to provide retail investors with a low-cost, interpretable alternative for ESG assessment.

ESG, Machine Learning, Natural Language Processing, FinBERT, Corporate Financial Reports, ElasticNet, Sustainable Investing
Published 2026-05-24 02:45 | Recent activity 2026-05-24 02:48 | Estimated read 6 min

Section 01

[Introduction] Predicting Corporate ESG Scores Using Public Data: A Transparent and Interpretable Machine Learning Solution

This article introduces a machine learning project that predicts corporate ESG scores from free public data, published on GitHub by authors including Caden Lippie (project link: https://github.com/clippie/ESG_Prediction_Public). The project uses only SEC 10-K report text and financial data: it extracts financial sentiment features with FinBERT and feeds them into an ElasticNet regression model. The goal is to give retail investors a low-cost, interpretable alternative for ESG assessment that addresses the cost barriers and methodological opacity of commercial ratings.

Section 02

Background: Two Core Dilemmas of ESG Scores

Commercial ratings from mainstream ESG rating agencies (e.g., MSCI, Sustainalytics) have two major problems: 1) Cost barriers: annual fees range from $5,000 to $30,000, which puts them out of reach for retail investors; 2) Methodological opacity: the algorithms are proprietary black boxes, and the correlation between ratings from different agencies is below 50%, far lower than the 94% seen for credit ratings. As a result, most participants in the $30 trillion global ESG asset market cannot access key information.

Section 03

Methodology: Data Sources and Feature Extraction

The project draws all of its data from the SEC EDGAR database and builds its features in three steps: 1. Text data: extract sections of 10-K reports, including risk factor disclosures, MD&A (Management's Discussion and Analysis), and financial statement footnotes; 2. Feature engineering: use FinBERT (a BERT model fine-tuned for the financial domain) to extract financial sentiment scores and ESG keyword features from that text; 3. Structured financial indicators: obtain fundamental data such as profitability and leverage ratios via the XBRL API.
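
To make the idea of ESG keyword features concrete, here is a minimal sketch that turns a passage of 10-K text into keyword densities per ESG pillar. The article does not describe how the project computes these features, so the simple counting approach, the function name keyword_features, and the keyword lists are all assumptions for illustration. FinBERT sentiment scores would be a separate set of features and are not shown here.

```python
import re
from collections import Counter

# Hypothetical keyword lists; the project's actual vocabulary is not given in the article.
ESG_KEYWORDS = {
    "environmental": {"emissions", "climate", "waste", "renewable"},
    "social": {"diversity", "safety", "community", "employees"},
    "governance": {"board", "audit", "compensation", "shareholders"},
}


def keyword_features(text):
    """Return keyword hits per 1,000 tokens for each ESG pillar."""
    # Lowercase and keep alphabetic tokens only.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    features = {}
    for pillar, words in ESG_KEYWORDS.items():
        hits = sum(counts[word] for word in words)
        # Normalize by length so long and short filings are comparable.
        features[pillar] = 1000.0 * hits / total if total else 0.0
    return features


sample = "Climate risk and emissions targets affect our employees."
print(keyword_features(sample))
```

Normalizing by token count matters because 10-K filings vary widely in length; raw counts would mostly measure how long the filing is.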

Section 04

Methodology: Selection of the ElasticNet Regression Model

The project uses the ElasticNet regression model for the following reasons: 1. Interpretability: each coefficient directly shows the direction and size of a feature's contribution; 2. Regularization: it combines L1 and L2 penalties to reduce overfitting and handle multicollinearity; 3. Computational efficiency: training is fast and inference is cheap, which suits retail investors.
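
The sketch below illustrates how the combined L1 and L2 penalties behave, using a small pure-Python coordinate descent solver. It is not the project's code, which presumably relies on a library implementation. The objective follows the common parameterization (1/(2n))·||y − Xw||² + alpha·l1_ratio·||w||₁ + 0.5·alpha·(1 − l1_ratio)·||w||₂², and the sketch assumes centered data with no intercept. The toy data and both feature columns are made up.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; the source of L1 sparsity."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_elastic_net(X, y, alpha=0.1, l1_ratio=0.5, n_iter=100):
    """Fit ElasticNet coefficients by cyclic coordinate descent (no intercept)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            corr = 0.0  # correlation of feature j with the partial residual
            scale = 0.0  # mean squared value of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                corr += X[i][j] * (y[i] - pred_without_j)
                scale += X[i][j] ** 2
            corr /= n
            scale /= n
            denom = scale + alpha * (1.0 - l1_ratio)
            # The L1 part thresholds the coefficient; the L2 part shrinks it.
            w[j] = soft_threshold(corr, alpha * l1_ratio) / denom if denom else 0.0
    return w


# Made-up toy data: the target depends only on the first feature (y = 2 * x1).
X = [[-1.0, 0.5], [0.0, -1.0], [1.0, 0.5]]
y = [-2.0, 0.0, 2.0]

print(fit_elastic_net(X, y, alpha=0.1, l1_ratio=0.5))   # first weight shrunk below 2, second stays 0
print(fit_elastic_net(X, y, alpha=10.0, l1_ratio=1.0))  # heavy L1 penalty zeroes every weight
```

The fitted weights can be read directly as feature contributions, which is the interpretability argument made above: an irrelevant feature ends up with a coefficient of exactly zero rather than a small noisy value.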

Section 05

Evidence: Model Evaluation Results and Findings

The model performs best on the Social dimension, with an R² of 0.215, but it cannot predict Governance scores at all. This suggests that Governance scores depend on data not contained in 10-K reports (e.g., board composition, executive compensation), which points to a direction for future research.
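
For readers less familiar with R², the following sketch shows how it is computed and what the reference points mean. R² compares a model's squared error with that of always predicting the mean score, so an R² of 0.215 corresponds to explaining roughly 21.5% of the variance. A model that "cannot predict at all" would plausibly score at or below zero, although the article does not report the Governance figure. All numbers in the example are made up.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are identical")
    return 1.0 - ss_res / ss_tot


# Made-up ESG scores for illustration only.
actual = [40.0, 55.0, 60.0, 75.0]
baseline = [sum(actual) / len(actual)] * len(actual)  # always predict the mean
partial = [50.0, 55.0, 60.0, 65.0]                    # captures part of the spread

print(r_squared(actual, baseline))  # mean predictor: no variance explained
print(r_squared(actual, partial))   # partially informative predictor
```

A negative value is possible too: it means the predictions are worse than simply guessing the average score.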

Section 06

Comparison: Methodological Innovations vs. Existing Research

Existing studies often reach high R² values by relying on financial ratios or historical ESG scores; using past scores as inputs makes the prediction partly self-reinforcing. This project instead uses only raw public data and partially replicates commercial rating services through a transparent process. Its core innovation lies in breaking the reliance on proprietary data and making the methodology fully transparent.

Section 07

Limitations and Future Improvement Directions

Current limitations: predictive performance is modest, the Governance dimension cannot be predicted, the approach applies only to U.S. companies, and there are no industry adjustments. Future directions: integrate data from sustainability reports and news, develop industry-specific models, add Governance data sources, and try more complex models that remain interpretable (e.g., Transformer regression).

Section 08

Significance for Retail Investors and Conclusion

This project gives retail investors an independent ESG assessment framework built on zero-cost data and transparent logic, so they can evaluate the ESG performance of their own portfolios and understand what drives the scores. The project promotes the democratization of financial information. Although the current model has limitations, it lays a foundation for open tools that let more people make investment decisions aligned with their values.