Zing Forum


Multi-Model LLM Reasoning Comparison Platform: An Experimental Framework for Systematic Research on AI Reasoning Behavior

A full-stack multi-model LLM interaction platform that supports simultaneous comparison of reasoning behaviors across multiple large models, offering configurable RAG retrieval, three interaction modes (Direct Answer/Prompt First/Guided Reasoning), and an automated critical scoring system.

Tags: multi-model comparison · LLM reasoning · RAG (retrieval-augmented generation) · interaction modes · critical scoring · FastAPI · React · open-source platform · model evaluation
Published 2026-05-17 00:43 · Recent activity 2026-05-17 00:51 · Estimated read 6 min

Section 01

Introduction / Main Floor

A full-stack multi-model LLM interaction platform that supports simultaneous comparison of reasoning behaviors across multiple large models, offering configurable RAG retrieval, three interaction modes (Direct Answer/Prompt First/Guided Reasoning), and an automated critical scoring system.


Section 02

Project Overview and Research Objectives

In today's era of flourishing large language models, how can we quantify performance differences between models on the same task? How does the configuration of Retrieval-Augmented Generation (RAG) affect answer quality? Do different interaction strategies change how a model reasons?

The adaptive-llm-reasoning-platform project is designed to answer these questions. It is a full-stack multi-model LLM interaction platform that lets users upload documents, ask questions, and compare responses from multiple AI models in real time. Beyond a simple chatbot interface, it provides configurable retrieval strategies, multiple interaction modes, and an automated critical engine that evaluates each answer for correctness, groundedness in evidence, and completeness.


Section 03

Multi-Model Parallel Comparison

The platform queries multiple LLMs simultaneously and displays their responses side by side in real time. Currently supported models include:

  • LLaMA 3.3 70B
  • LLaMA 3.1 8B
  • Qwen 3 32B (via Groq free API)
  • GPT-4o / GPT-4o Mini (via OpenAI API)

Adding a new model requires only a single configuration entry, reflecting the platform's extensible design.
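The fan-out pattern can be sketched with stdlib `asyncio` alone. The registry entries and the `query_model` stub below are illustrative, not the project's actual API (the real backend uses async httpx calls to Groq and OpenAI); adding a model is one new dictionary entry.

```python
import asyncio

# Hypothetical model registry: adding a model is one new entry here.
MODEL_REGISTRY = {
    "llama-3.3-70b": {"provider": "groq"},
    "llama-3.1-8b": {"provider": "groq"},
    "qwen-3-32b": {"provider": "groq"},
    "gpt-4o-mini": {"provider": "openai"},
}

async def query_model(model_id: str, question: str) -> dict:
    """Stub for a provider call (the real platform uses async httpx)."""
    await asyncio.sleep(0)  # stand-in for network latency
    return {"model": model_id, "answer": f"[{model_id}] answer to: {question}"}

async def compare(question: str) -> list[dict]:
    # Fan the same question out to every registered model concurrently.
    tasks = [query_model(m, question) for m in MODEL_REGISTRY]
    return await asyncio.gather(*tasks)

results = asyncio.run(compare("What is RAG?"))
```

Because the calls run concurrently, total latency is bounded by the slowest provider rather than the sum of all of them.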


Section 04

Configurable RAG Retrieval Pipeline

Document processing uses a semantic chunking strategy; embedding vectors are generated locally with the sentence-transformers all-MiniLM-L6-v2 model and stored in a lightweight JSONL vector store. At query time, the platform supports:

  • Multiple similarity metrics: cosine similarity, L2 distance, dot product
  • Adjustable Top-K retrieval count
  • Retrieval transparency: every context chunk passed to the model carries its relevance score, so retrieval results can be fully reviewed.
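The retrieval step can be sketched in plain Python (no NumPy, and toy two-dimensional vectors in place of 384-dimensional MiniLM embeddings). The JSONL lines, chunk texts, and function names here are illustrative assumptions, not the project's actual schema; the point is the three interchangeable metrics, the adjustable top-k, and the score attached to each returned chunk.

```python
import json
import math

# Tiny in-memory stand-in for the JSONL vector store: one record per line,
# each holding a chunk and its embedding (vectors shortened for illustration).
JSONL_LINES = [
    '{"text": "RAG retrieves context before generation.", "vec": [1.0, 0.0]}',
    '{"text": "Embeddings map text to vectors.", "vec": [0.8, 0.6]}',
    '{"text": "FastAPI serves the backend.", "vec": [0.0, 1.0]}',
]
STORE = [json.loads(line) for line in JSONL_LINES]

def score(q, v, metric):
    dot = sum(a * b for a, b in zip(q, v))
    if metric == "dot":
        return dot
    if metric == "cosine":
        return dot / (math.hypot(*q) * math.hypot(*v))
    if metric == "l2":  # negated so that "higher is better" holds for all metrics
        return -math.dist(q, v)
    raise ValueError(f"unknown metric: {metric}")

def retrieve(query_vec, metric="cosine", top_k=2):
    # Return the top-k chunks together with their relevance scores,
    # so the user can review exactly what the model received.
    ranked = sorted(STORE, key=lambda r: score(query_vec, r["vec"], metric), reverse=True)
    return [(r["text"], round(score(query_vec, r["vec"], metric), 3)) for r in ranked[:top_k]]

hits = retrieve([1.0, 0.2], metric="cosine", top_k=2)
```

Swapping `metric="l2"` or raising `top_k` changes the ranking and context size without touching the rest of the pipeline.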

Section 05

Three Interaction Mode Designs

The platform implements three different prompt strategies to change how models organize responses:

Direct Mode: Standard question-answer generation where the model gives the answer directly.

Prompt First Mode: The model offers a hint before giving the complete answer, encouraging users to think for themselves first. This strategy may produce more evidence-grounded answers.

Guided Reasoning Mode: Breaks down the problem step by step, including sub-questions, evidence synthesis, and confidence rating. This structured approach helps improve answer completeness.

By holding the question and context fixed while varying the interaction mode, the impact of interaction strategy on answer quality can be quantified.
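The three modes reduce to three prompt templates applied to the same question and context. The wording below is an illustrative paraphrase of the modes described above, not the project's exact prompts.

```python
# Illustrative prompt templates for the three interaction modes.
MODE_TEMPLATES = {
    "direct": (
        "Answer the question using the context.\n\n"
        "Context:\n{context}\n\nQuestion: {question}"
    ),
    "prompt_first": (
        "First give a short hint that nudges the user toward the answer, "
        "then provide the full answer, citing the context.\n\n"
        "Context:\n{context}\n\nQuestion: {question}"
    ),
    "guided": (
        "Decompose the question into sub-questions, answer each from the "
        "context, synthesize the evidence, and end with a confidence rating.\n\n"
        "Context:\n{context}\n\nQuestion: {question}"
    ),
}

def build_prompt(mode: str, context: str, question: str) -> str:
    # Same question, same context; only the strategy wrapper changes.
    return MODE_TEMPLATES[mode].format(context=context, question=question)

prompt = build_prompt("guided", context="(retrieved chunks)", question="What is RAG?")
```

Running all three modes against identical inputs is what makes the downstream scoring comparison an apples-to-apples measurement.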


Section 06

Automated Critical Scoring System

Each response can be evaluated through a multi-dimensional critical pipeline, with scoring dimensions including:

  • Correctness: Whether the answer is factually accurate within the given context
  • Evidence-based Nature: Whether the answer is strictly grounded in the retrieved information or contains hallucinations
  • Completeness: Whether the answer covers all aspects of the question

The critical system can also identify specific issues (hallucinations, misunderstandings, omissions) and propose improvement suggestions. It follows the LLM-as-judge pattern, producing scores as structured JSON output.
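The judge's structured output can be validated before it reaches the UI. The field names, 1-5 scale, and sample JSON below are assumptions for illustration, not the project's actual schema, and a plain dataclass stands in for the Pydantic models the backend lists.

```python
import json
from dataclasses import dataclass

# Example of the structured JSON a judge model might emit (hypothetical schema).
RAW_JUDGEMENT = """{
  "correctness": 4,
  "groundedness": 5,
  "completeness": 3,
  "issues": ["omission: does not address the second sub-question"],
  "suggestion": "Quote the retrieved passage that defines the term."
}"""

@dataclass
class Critique:
    correctness: int   # factually accurate within the given context (1-5)
    groundedness: int  # strictly based on retrieved evidence, no hallucinations (1-5)
    completeness: int  # covers all aspects of the question (1-5)
    issues: list       # specific problems: hallucinations, misunderstandings, omissions
    suggestion: str    # proposed improvement

def parse_critique(raw: str) -> Critique:
    # Reject malformed judge output early instead of rendering garbage scores.
    c = Critique(**json.loads(raw))
    for dim in ("correctness", "groundedness", "completeness"):
        if not 1 <= getattr(c, dim) <= 5:
            raise ValueError(f"{dim} out of range")
    return c

critique = parse_critique(RAW_JUDGEMENT)
```

Validating against a fixed schema is what makes scores from different judge runs comparable across models and modes.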


Section 07

Backend Architecture

  • Framework: FastAPI (Python)
  • Asynchronous HTTP: httpx
  • Data Validation: Pydantic
  • Embedding Model: sentence-transformers (all-MiniLM-L6-v2, ~90MB, runs on CPU)
  • Document Processing: PyMuPDF
  • Vector Calculation: NumPy

Section 08

Frontend Architecture

  • Framework: React + TypeScript
  • Build Tool: Vite