HITLLLMs: A Study on Consistency Between Human Experts and LLMs in Chemical Synthesis Plan Evaluation

A research project examining how closely human chemistry experts and large language models (LLMs) agree when evaluating the quality of chemical synthesis plans, providing an empirical basis for AI-assisted decision-making in chemistry.

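Cheminformatics, LLM evaluation, human-machine consistency, synthesis plans, AIZynthFinder, retrosynthesis, drug discovery, statistical validation
Published 2026-04-20 22:45 | Recent activity 2026-04-20 22:51 | Estimated read 5 min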
Section 01

Introduction: Core Overview of the HITLLLMs Study

The HITLLLMs project provides the supporting code and raw feedback data for the paper 'Do humans and large language models agree on the quality of synthesis plans?'. The study examines how closely human chemistry experts and large language models (LLMs) agree when evaluating the quality of chemical synthesis plans, providing an empirical basis for AI-assisted decision-making in chemistry.

Section 02

Research Background: Challenges in Chemical Synthesis and AI Assistance

Designing high-quality synthesis routes is a core challenge in drug discovery and materials science. As LLM capabilities improve, researchers are exploring whether these models can assist in evaluating synthesis plans, but the key question of how consistently human and machine evaluations agree has not been fully addressed. The HITLLLMs project focuses on this question.

Section 03

Technical Methods: Implementation of LLM Evaluation and Statistical Analysis

LLM Query System

LLM evaluation results are obtained by calling OpenAI and VertexAI services via llm_querying/llms_querying.py. Raw responses are stored in responses_llms, and master_paths.json contains the synthesis plans presented to experts.
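The query-and-store flow described above can be sketched as follows. This is only an illustration, not the code in llm_querying/llms_querying.py: the function `query_all_plans`, the `query_fn` callable (standing in for an OpenAI or VertexAI client call), the per-model subdirectory layout, and the assumption that the plans form a mapping from plan ID to plan object are all hypothetical.

```python
import json
from pathlib import Path
from typing import Callable, Dict


def query_all_plans(
    plans: Dict[str, dict],
    query_fn: Callable[[str], str],
    out_dir: Path,
    model_name: str,
) -> Dict[str, Path]:
    """Send every plan to one model and save each raw response as JSON.

    `query_fn` is a hypothetical stand-in for a real provider call; the
    actual project talks to OpenAI and VertexAI services.
    """
    target = out_dir / model_name
    target.mkdir(parents=True, exist_ok=True)
    written: Dict[str, Path] = {}
    for plan_id, plan in plans.items():
        # Serialize deterministically so reruns send identical text.
        prompt = json.dumps(plan, sort_keys=True)
        raw = query_fn(prompt)
        record = {"plan_id": plan_id, "model": model_name, "response": raw}
        path = target / f"{plan_id}.json"
        path.write_text(json.dumps(record), encoding="utf-8")
        written[plan_id] = path
    return written
```

Passing the provider call in as a function keeps the loop testable without network access or credentials: a stub such as `lambda prompt: "feasible"` is enough to exercise the storage logic.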

Feasibility Evaluation Framework

feasibility.py defines the prompts sent to the LLMs, so that the models evaluate each plan in a way that is comparable to the human experts' evaluation.
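To make the idea of a comparable evaluation concrete, here is a minimal sketch of a prompt builder and a response parser. The rubric wording, the 1 to 5 scale, and the names `RUBRIC`, `build_prompt`, and `parse_score` are assumptions for illustration; the actual prompts and scale in feasibility.py may differ.

```python
import re
from typing import Optional

# Hypothetical rubric: the wording and the 1-5 scale are assumptions,
# not the prompt shipped in feasibility.py.
RUBRIC = (
    "You are an experienced synthetic chemist. Rate the feasibility of the "
    "synthesis route below on a scale from 1 (not feasible) to 5 (clearly "
    "feasible). Answer with 'Score: <number>' followed by a short rationale."
)


def build_prompt(route_description: str) -> str:
    """Combine the fixed rubric with one route so every plan gets the same instructions."""
    return f"{RUBRIC}\n\nRoute:\n{route_description}"


def parse_score(response: str) -> Optional[int]:
    """Extract the integer score, or None if the reply does not follow the format."""
    match = re.search(r"Score:\s*([1-5])\b", response, flags=re.IGNORECASE)
    return int(match.group(1)) if match else None
```

Returning `None` for malformed replies, rather than guessing, keeps unparseable responses visible so they can be counted or re-queried instead of silently skewing the comparison.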

Statistical Analysis Workflow

The notebook human_vs_llm.ipynb covers data loading and preprocessing, consistency measurement, statistical significance testing, and chart generation, and can be used to reproduce the paper's results.
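The text does not say which agreement statistic or significance test the notebook uses. As one common choice, the sketch below computes Cohen's kappa between two raters and a one-sided permutation p-value for it; both function names are hypothetical and only the Python standard library is used.

```python
import random
from collections import Counter
from typing import Sequence


def cohen_kappa(a: Sequence, b: Sequence) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("rating lists must be non-empty and equally long")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Agreement expected if both raters labeled independently at their own rates.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def permutation_p_value(a: Sequence, b: Sequence,
                        n_permutations: int = 2000, seed: int = 0) -> float:
    """Share of random relabelings whose kappa is at least the observed one."""
    rng = random.Random(seed)
    observed = cohen_kappa(a, b)
    shuffled = list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # breaks any item-level link between raters
        if cohen_kappa(a, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)
```

Kappa is used here instead of raw percent agreement because two raters who both label most plans as feasible would agree often by chance alone. For ordinal scores, a weighted kappa or a rank correlation may be more appropriate.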

Section 04

Empirical Evidence: Dataset Composition and Integration

The dataset consists of three parts: 1. Professional evaluations of retrosynthetic trees by human experts; 2. Evaluation results of the same plans by multiple LLMs; 3. Comparative analysis of human and machine feedback. All raw data is integrated into expert_feedback_combined_llms.csv for easy statistical analysis and visualization.
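The column layout of expert_feedback_combined_llms.csv is not described here, so the following sketch assumes a hypothetical long format with `plan_id`, `rater`, and `score` columns. It shows how human and LLM ratings of the same plans could be paired for comparison; the sample rows are made up purely to demonstrate the shape.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, TextIO, Tuple


def load_ratings(handle: TextIO) -> Dict[str, Dict[str, int]]:
    """Return {rater: {plan_id: score}} from a long-format CSV.

    The column names are assumptions for illustration only.
    """
    ratings: Dict[str, Dict[str, int]] = defaultdict(dict)
    for row in csv.DictReader(handle):
        ratings[row["rater"]][row["plan_id"]] = int(row["score"])
    return dict(ratings)


def paired_scores(ratings: Dict[str, Dict[str, int]],
                  rater_a: str, rater_b: str) -> Tuple[List[int], List[int]]:
    """Scores from two raters, restricted to plans both of them rated."""
    shared = sorted(set(ratings[rater_a]) & set(ratings[rater_b]))
    return ([ratings[rater_a][p] for p in shared],
            [ratings[rater_b][p] for p in shared])


# Made-up rows in the assumed layout; not taken from the real dataset.
SAMPLE = """plan_id,rater,score
p1,expert_1,4
p2,expert_1,2
p1,llm_a,5
p2,llm_a,2
p3,llm_a,3
"""
```

Restricting the comparison to plans rated by both parties matters because agreement statistics are only defined on items that have a rating from each rater.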

Section 05

Research Conclusions: Implications for Cheminformatics and AI Assistance

Contributions to Cheminformatics

The study provides empirical data on the performance boundaries of LLMs in chemical tasks, on the patterns of difference between human and machine evaluations, and on the types of synthesis plans where the two agree or disagree.

Implications for AI-Assisted Design

The findings can guide model selection, prompt engineering, the design of human-machine collaboration workflows, and consistency-based quality screening mechanisms.
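One way a consistency-based screening mechanism could work is to route a plan to a human expert whenever the LLMs disagree among themselves. The sketch below is a hypothetical illustration of that idea, not a method described in the project; the function name, the use of the population standard deviation, and the threshold of 1.0 are all assumptions.

```python
from statistics import pstdev
from typing import Dict, List


def needs_expert_review(llm_scores: Dict[str, List[int]],
                        max_spread: float = 1.0) -> List[str]:
    """Return plan IDs whose LLM scores disagree too much to be trusted alone.

    A plan is flagged when fewer than two models scored it, or when the
    population standard deviation of its scores exceeds `max_spread`.
    """
    flagged = []
    for plan_id, scores in llm_scores.items():
        if len(scores) < 2 or pstdev(scores) > max_spread:
            flagged.append(plan_id)
    return sorted(flagged)
```

For example, scores of `[4, 4, 5]` pass the default threshold, while `[1, 5, 3]` are flagged. Whether model disagreement actually predicts divergence from expert judgment is an empirical question of the kind the study's data could help answer.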

Section 06

Application Recommendations: Open-Source Reproducibility and Methodology Promotion

The project is open source under the MIT license. It can be used to verify the paper's statistical results, to extend the comparison to more LLMs, to apply the workflow to other chemical datasets, and to refine the evaluation metrics. The approach of comparing human and machine evaluations could also be extended to fields such as medical diagnosis and legal analysis. The environment is set up from conda environment files, and API credentials for the LLM services must be configured before running queries.

Section 07

Conclusion: The Value of Human-Machine Collaboration Research

The HITLLLMs project is a case study in human-machine collaboration research in cheminformatics, offering insight into the capabilities and limitations of LLMs through rigorous statistical analysis. As LLM technology develops, this kind of foundational research helps ensure that AI tools effectively assist chemistry researchers.