Section 01
IFHierBench: Guide to the Hierarchical Instruction-Following Capability Evaluation Benchmark
IFHierBench is an open-source benchmark for evaluating how well large language models (LLMs) follow hierarchical instructions. Its two core contributions are tree-structured output constraints, with nesting depths from 0 to 3, and deterministic Python validators that avoid subjective bias in scoring. It provides 600 test samples, evenly distributed across the 4 depth levels (150 per level), and an automated evaluation pipeline that helps identify the limits of a model's capabilities and guide improvements.
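To make the idea of tree-structured constraints checked by a deterministic validator concrete, here is a minimal sketch. It is not IFHierBench's actual schema or API, which this section does not describe: the `Constraint` class, the `depth` and `validate` helpers, and the example constraints are all hypothetical. It also assumes that depth counts the edges from the root constraint to its deepest descendant, so a single constraint with no children has depth 0.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Constraint:
    """One node of a hypothetical constraint tree (not the benchmark's real schema)."""
    name: str
    check: Callable[[str], bool]  # deterministic predicate over the model output
    children: List["Constraint"] = field(default_factory=list)


def depth(node: Constraint) -> int:
    """Edges from this node to its deepest descendant; a lone node has depth 0."""
    if not node.children:
        return 0
    return 1 + max(depth(child) for child in node.children)


def validate(node: Constraint, output: str) -> bool:
    """Pass only if this constraint and every nested constraint pass."""
    return node.check(output) and all(validate(c, output) for c in node.children)


def lines(text: str) -> List[str]:
    """Split the output into lines, ignoring leading and trailing whitespace."""
    return text.strip().splitlines()


# Illustrative depth-2 tree: 3 lines -> each line is a bullet -> bullets are short.
root = Constraint(
    "has_three_lines",
    lambda s: len(lines(s)) == 3,
    [
        Constraint(
            "each_line_is_bullet",
            lambda s: all(l.startswith("- ") for l in lines(s)),
            [
                Constraint(
                    "bullets_at_most_40_chars",
                    lambda s: all(len(l) <= 40 for l in lines(s)),
                )
            ],
        )
    ],
)

sample = "- alpha\n- beta\n- gamma"
print(depth(root))                         # nesting depth of the example tree
print(validate(root, sample))              # all three constraints hold
print(validate(root, "- alpha\n- beta"))   # fails the root constraint
```

Because every check is a plain Python predicate, the same output always receives the same verdict, which is the property that lets a validator of this kind replace a subjective grader.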