Zing Forum

Systematic Evaluation of Nine Prompt Strategies in Commonsense Reasoning Tasks

This article analyzes an open-source project that systematically evaluates nine prompt strategies (including zero-shot chain-of-thought, few-shot chain-of-thought, Plan-and-Solve, and Tree-of-Thought) on the CommonsenseQA dataset, using the DeepSeek-R1-Distill-Qwen-7B model for testing.

Tags: Prompt Engineering · Chain-of-Thought · Commonsense Reasoning · LLM Evaluation · DeepSeek · CommonsenseQA · Tree-of-Thought · Self-Consistency
Published 2026-04-30 17:11 · Last activity 2026-04-30 17:17 · Estimated read: 6 min

Section 01

[Introduction] Core Overview of Systematic Evaluation of Nine Prompt Strategies in Commonsense Reasoning Tasks

This article introduces an open-source project that runs a comprehensive comparative test of nine mainstream prompt strategies (including zero-shot chain-of-thought, few-shot chain-of-thought, Plan-and-Solve, and Tree-of-Thought) on the CommonsenseQA dataset, using the DeepSeek-R1-Distill-Qwen-7B model (7B parameters, distilled for reasoning). The core goal is to determine which prompt strategy performs best on commonsense reasoning tasks under resource constraints, providing empirical data for prompt engineering practice.


Section 02

Project Background and Objectives

Prompt engineering has a significant impact on LLM performance, especially in tasks like commonsense reasoning that require going beyond pattern matching. CommonsenseQA is a classic commonsense reasoning benchmark built on ConceptNet, containing 9,741 training samples and 1,221 development samples, with five options per question. The project's core question: which prompt strategy works best under resource constraints (a 7B model)? The DeepSeek-R1-Distill-Qwen-7B model was chosen because it balances reasoning ability, speed, and hardware accessibility.
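To make the benchmark concrete, the sketch below shows the shape of a CommonsenseQA record as exposed by the Hugging Face `datasets` library (field names per the dataset card: `question`, `choices` with parallel `label`/`text` lists, and `answerKey`), plus a minimal formatter; the sample values are illustrative, and the exact prompt layout the project uses may differ.

```python
# Schematic CommonsenseQA record (field names from the dataset card;
# the sample values are illustrative, not taken from the dataset files).
sample = {
    "question": "Where would you expect to find a pizzeria while shopping?",
    "choices": {
        "label": ["A", "B", "C", "D", "E"],
        "text": ["chicago", "street", "little italy", "food court", "capital cities"],
    },
    "answerKey": "D",
}

def format_question(record: dict) -> str:
    """Render a record as a multiple-choice prompt body."""
    options = "\n".join(
        f"{label}. {text}"
        for label, text in zip(record["choices"]["label"], record["choices"]["text"])
    )
    return f"Question: {record['question']}\n{options}\nAnswer:"

print(format_question(sample))
```

Every strategy in the comparison consumes this same formatted question-plus-options body, so differences in accuracy can be attributed to the prompt wrapper rather than to how the item is presented.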


Section 03

Detailed Explanation of Nine Prompt Strategies

  1. Baseline method: Directly ask the question without reasoning guidance, used as a control;
  2. Zero-shot chain-of-thought: Add "Let's think step by step" to guide the generation of intermediate steps, no examples needed;
  3. Few-shot chain-of-thought: Provide examples with complete reasoning for in-context learning;
  4. Restate and Expand (RE): Restate the question before answering, to reduce misreading;
  5. RE+: Add additional context on top of RE;
  6. Plan-and-Solve (PS): Split into planning and execution phases, suitable for complex problems;
  7. PS+: Add explicit constraints on top of PS;
  8. Tree-of-Thought (ToT): Explore multiple reasoning paths;
  9. Self-consistency: Aggregate results via voting after multiple independent reasoning attempts to improve reliability.
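The structural differences between several of these strategies come down to the template wrapped around the question. The templates below are illustrative sketches (the project's exact wording is not given in this article); only `zero_shot_cot`'s trigger phrase is standard in the literature.

```python
# Illustrative prompt templates for four of the nine strategies.
# Wording is a sketch; the project's actual templates may differ.
TEMPLATES = {
    # Baseline: no reasoning guidance, used as a control.
    "baseline": "{question}\nAnswer with the letter of the correct option.",
    # Zero-shot CoT: the classic step-by-step trigger, no examples.
    "zero_shot_cot": "{question}\nLet's think step by step.",
    # Restate and Expand: restate first to reduce misreading.
    "restate": (
        "First restate the question in your own words, then answer it.\n"
        "{question}"
    ),
    # Plan-and-Solve: separate planning from execution.
    "plan_and_solve": (
        "{question}\n"
        "Let's first devise a plan to solve the problem, "
        "then carry out the plan step by step."
    ),
}

def build_prompt(strategy: str, question: str) -> str:
    """Wrap a formatted question in the chosen strategy's template."""
    return TEMPLATES[strategy].format(question=question)
```

Few-shot CoT, ToT, and self-consistency differ at the pipeline level (prepended exemplars, branching search, and repeated sampling, respectively) rather than in a single template string, which is why they are not shown here.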

Section 04

Evaluation Methods and Experimental Design

Experimental process:

  1. Load the CommonsenseQA dataset using the Hugging Face datasets library;
  2. Design prompt templates for each strategy;
  3. The model receives formatted questions and options, generating answers with reasoning;
  4. Parse the output to extract answers, compare with standard answers to calculate accuracy;
  5. Save results as JSON and summarize into CSV reports for easy analysis and visualization.

Section 05

Key Findings and Practical Insights

Main conclusions:

  1. Structured prompts (such as chain-of-thought, PS, ToT) are significantly better than the baseline;
  2. In few-shot CoT, the quality of examples is more important than quantity;
  3. Self-consistency offers a strong cost-benefit trade-off: aggregating multiple samples by majority vote improves reliability;
  4. The PS strategy performs outstandingly in multi-step reasoning.
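The aggregation step behind the self-consistency finding is simple to state in code: sample several independent reasoning chains, then take the majority answer. A minimal sketch of that voting step:

```python
from collections import Counter

def self_consistency_vote(answers: list[str]) -> str:
    """Aggregate the final answers of several independently sampled
    reasoning chains by majority vote (the self-consistency step).

    Ties go to whichever answer appeared first, which is Counter's
    default ordering for equal counts.
    """
    return Counter(answers).most_common(1)[0][0]

# e.g. five sampled chains whose extracted answers were:
print(self_consistency_vote(["C", "B", "C", "C", "D"]))  # -> C
```

The cost-effectiveness observation follows from this structure: accuracy gains come purely from extra inference calls at the same model size, so the technique scales down to a 7B model without any retraining.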

Section 06

Application Scenarios and Limitations

Application scenarios: Edge-side LLM deployment (maximizing performance with limited computing power), prompt engineering research (providing comparative data), educational AI (designing effective tutoring systems). Limitations: Only tested English commonsense, results may not apply to other languages/domains; based on a 7B model, conclusions may not be applicable to larger-scale models.


Section 07

Conclusion and Future Directions

The value of the project lies in providing a reproducible evaluation framework to facilitate the objective assessment of new prompt strategies. Key insight: There is no one-size-fits-all strategy; selection must be based on specific tasks, model scale, and performance requirements. Prompt engineering requires making informed choices based on method characteristics rather than looking for a silver bullet.