Zing Forum

Systematic Evaluation of Nine Prompt Strategies in Commonsense Reasoning Tasks

This article analyzes an open-source project that systematically evaluates nine prompt strategies (including zero-shot chain-of-thought, few-shot chain-of-thought, Plan-and-Solve, and Tree-of-Thought) on the CommonsenseQA dataset, using the DeepSeek-R1-Distill-Qwen-7B model for testing.

Tags: Prompt Engineering · Chain-of-Thought · Commonsense Reasoning · LLM Evaluation · DeepSeek · CommonsenseQA · Tree-of-Thought · Self-Consistency
Published 2026-04-30 17:11 · Last activity 2026-04-30 17:17 · Estimated read: 6 min

Section 01

[Introduction] Core Overview of Systematic Evaluation of Nine Prompt Strategies in Commonsense Reasoning Tasks

This article introduces an open-source project that runs a comprehensive comparative test of nine mainstream prompt strategies (including zero-shot chain-of-thought, few-shot chain-of-thought, Plan-and-Solve, and Tree-of-Thought) on the CommonsenseQA dataset, using the DeepSeek-R1-Distill-Qwen-7B model (7B parameters, distilled for reasoning). The core goal is to determine which prompt strategy performs best on commonsense reasoning tasks under resource constraints, providing empirical data for prompt engineering practice.


Section 02

Project Background and Objectives

Prompt engineering has a significant impact on LLM performance, especially in tasks like commonsense reasoning that require going beyond pattern matching. CommonsenseQA is a classic commonsense reasoning benchmark built on ConceptNet, containing 9,741 training samples and 1,221 development samples, with five options per question. The project's core question: which prompt strategy works best under resource constraints (a 7B model)? The DeepSeek-R1-Distill-Qwen-7B model was chosen because it balances reasoning ability, speed, and hardware accessibility.
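To make the benchmark concrete, the sketch below shows the shape of a CommonsenseQA record as exposed by the Hugging Face `datasets` library (field names per the dataset card: `question`, `choices` with parallel `label`/`text` lists, and `answerKey`), plus a minimal formatter; the sample values are illustrative, and the exact prompt layout the project uses may differ.

```python
# Schematic CommonsenseQA record (field names from the dataset card;
# the sample values are illustrative, not taken from the dataset files).
sample = {
    "question": "Where would you expect to find a pizzeria while shopping?",
    "choices": {
        "label": ["A", "B", "C", "D", "E"],
        "text": ["chicago", "street", "little italy", "food court", "capital cities"],
    },
    "answerKey": "D",
}

def format_question(record: dict) -> str:
    """Render a record as a multiple-choice prompt body."""
    options = "\n".join(
        f"{label}. {text}"
        for label, text in zip(record["choices"]["label"], record["choices"]["text"])
    )
    return f"Question: {record['question']}\n{options}\nAnswer:"

print(format_question(sample))
```

Every strategy in the comparison consumes this same formatted question-plus-options body, so differences in accuracy can be attributed to the prompt wrapper rather than to how the item is presented.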


Section 03

Detailed Explanation of Nine Prompt Strategies

  1. Baseline method: Directly ask the question without reasoning guidance, used as a control;
  2. Zero-shot chain-of-thought: Add "Let's think step by step" to guide the generation of intermediate steps, no examples needed;
  3. Few-shot chain-of-thought: Provide examples with complete reasoning for in-context learning;
  4. Restate and Expand (RE): Restate the question before answering, to reduce misreading;
  5. RE+: Add additional context on top of RE;
  6. Plan-and-Solve (PS): Split into planning and execution phases, suitable for complex problems;
  7. PS+: Add explicit constraints on top of PS;
  8. Tree-of-Thought (ToT): Explore multiple reasoning paths;
  9. Self-consistency: Aggregate results via voting after multiple independent reasoning attempts to improve reliability.
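The structural differences between several of these strategies come down to the template wrapped around the question. The templates below are illustrative sketches (the project's exact wording is not given in this article); only `zero_shot_cot`'s trigger phrase is standard in the literature.

```python
# Illustrative prompt templates for four of the nine strategies.
# Wording is a sketch; the project's actual templates may differ.
TEMPLATES = {
    # Baseline: no reasoning guidance, used as a control.
    "baseline": "{question}\nAnswer with the letter of the correct option.",
    # Zero-shot CoT: the classic step-by-step trigger, no examples.
    "zero_shot_cot": "{question}\nLet's think step by step.",
    # Restate and Expand: restate first to reduce misreading.
    "restate": (
        "First restate the question in your own words, then answer it.\n"
        "{question}"
    ),
    # Plan-and-Solve: separate planning from execution.
    "plan_and_solve": (
        "{question}\n"
        "Let's first devise a plan to solve the problem, "
        "then carry out the plan step by step."
    ),
}

def build_prompt(strategy: str, question: str) -> str:
    """Wrap a formatted question in the chosen strategy's template."""
    return TEMPLATES[strategy].format(question=question)
```

Few-shot CoT, ToT, and self-consistency differ at the pipeline level (prepended exemplars, branching search, and repeated sampling, respectively) rather than in a single template string, which is why they are not shown here.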

Section 04

Evaluation Methods and Experimental Design

Experimental process:

  1. Load the CommonsenseQA dataset using the Hugging Face datasets library;
  2. Design prompt templates for each strategy;
  3. The model receives formatted questions and options, generating answers with reasoning;
  4. Parse the output to extract answers, compare with standard answers to calculate accuracy;
  5. Save results as JSON and summarize into CSV reports for easy analysis and visualization.

Section 05

Key Findings and Practical Insights

Main conclusions:

  1. Structured prompts (such as chain-of-thought, PS, ToT) are significantly better than the baseline;
  2. In few-shot CoT, the quality of examples is more important than quantity;
  3. Self-consistency offers a strong cost-benefit trade-off: aggregating multiple samples by majority vote improves reliability;
  4. The PS strategy performs outstandingly in multi-step reasoning.
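The aggregation step behind the self-consistency finding is simple to state in code: sample several independent reasoning chains, then take the majority answer. A minimal sketch of that voting step:

```python
from collections import Counter

def self_consistency_vote(answers: list[str]) -> str:
    """Aggregate the final answers of several independently sampled
    reasoning chains by majority vote (the self-consistency step).

    Ties go to whichever answer appeared first, which is Counter's
    default ordering for equal counts.
    """
    return Counter(answers).most_common(1)[0][0]

# e.g. five sampled chains whose extracted answers were:
print(self_consistency_vote(["C", "B", "C", "C", "D"]))  # -> C
```

The cost-effectiveness observation follows from this structure: accuracy gains come purely from extra inference calls at the same model size, so the technique scales down to a 7B model without any retraining.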

Section 06

Application Scenarios and Limitations

Application scenarios: Edge-side LLM deployment (maximizing performance with limited computing power), prompt engineering research (providing comparative data), educational AI (designing effective tutoring systems). Limitations: Only tested English commonsense, results may not apply to other languages/domains; based on a 7B model, conclusions may not be applicable to larger-scale models.


Section 07

Conclusion and Future Directions

The value of the project lies in providing a reproducible evaluation framework to facilitate the objective assessment of new prompt strategies. Key insight: There is no one-size-fits-all strategy; selection must be based on specific tasks, model scale, and performance requirements. Prompt engineering requires making informed choices based on method characteristics rather than looking for a silver bullet.