# LLM-SOTIF: Evaluation of Large Vision-Language Models for Object Detection Under SOTIF Conditions

> A comparative evaluation study on 2D object detection of large vision-language models under SOTIF (Safety Of The Intended Functionality) conditions, providing an important performance benchmark for visual perception systems in safety-critical applications such as autonomous driving.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-05-03T13:14:47.000Z
- Last activity: 2026-05-03T13:33:45.499Z
- Popularity: 157.7
- Keywords: Large Vision-Language Models, LVLM, SOTIF, object detection, autonomous driving, safety evaluation, GitHub
- Page URL: https://www.zingnex.cn/en/forum/thread/llm-sotif-sotif
- Canonical: https://www.zingnex.cn/forum/thread/llm-sotif-sotif
- Markdown source: floors_fallback

---

## [Introduction] LLM-SOTIF: Evaluation of Large Vision-Language Models for Object Detection Under SOTIF Conditions

LLM-SOTIF is an open-source project developed by the ftg team at Graz University of Technology that systematically evaluates the 2D object detection performance of Large Vision-Language Models (LVLMs) under SOTIF (Safety Of The Intended Functionality) conditions. The study constructs a SOTIF benchmark dataset, compares mainstream open-source and closed-source LVLMs, and provides a performance-analysis framework with an open-source implementation, offering an important benchmark for the selection and improvement of visual perception systems in safety-critical applications such as autonomous driving.

## Research Background and SOTIF Scenario Definition

### Research Background
Large Vision-Language Models (LVLMs) perform well in general visual tasks, but their performance in safety-critical applications such as autonomous driving needs in-depth evaluation. SOTIF (the ISO 21448 standard) focuses on safety risks caused by performance limitations when the system is operating normally, such as detection failures in severe weather or misidentification of rare objects. The LLM-SOTIF project aims to fill the gap in evaluating LVLMs under SOTIF scenarios.

### SOTIF Scenario Categories
- **Environmental condition challenges**: Severe weather (rain/snow/fog), lighting changes (glare/nighttime), sensor limitations
- **Object feature challenges**: Rare objects, occlusion, small objects, appearance changes
- **Scene complexity challenges**: Dense scenes, dynamic scenes, unstructured environments

## Evaluation Methodology

### Model Selection
- **Open-source models**: LLaVA series, InstructBLIP, Qwen-VL, Yi-VL, InternVL
- **Closed-source models**: GPT-4V, Gemini Pro Vision, Claude 3

### Evaluation Metrics
Besides standard detection metrics (mAP, recall, precision), the evaluation focuses on:
- SOTIF-specific metrics: Conditional performance, failure mode analysis, confidence calibration
- Safety-related metrics: Key object detection rate, trade-off between false acceptance vs false rejection, worst-case performance
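Two of the metrics above can be made concrete with a short sketch: confidence calibration is commonly measured as Expected Calibration Error (ECE), and worst-case performance can be taken as the minimum recall over SOTIF conditions. The function names and binning scheme below are illustrative assumptions, not the project's actual implementation.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin detections by predicted confidence, then average
    |accuracy - mean confidence| per bin, weighted by bin size.
    `confidences`: values in [0, 1]; `correct`: 1 for true positives, else 0."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences)
               if lo < c <= hi or (b == 0 and c == 0.0)]
        if not idx:
            continue
        avg_conf = sum(confidences[i] for i in idx) / len(idx)
        accuracy = sum(correct[i] for i in idx) / len(idx)
        ece += len(idx) / n * abs(accuracy - avg_conf)
    return ece

def worst_case_recall(recall_by_condition):
    """Worst-case performance: the minimum recall across SOTIF conditions."""
    return min(recall_by_condition.values())
```

A well-calibrated detector keeps ECE low even on difficult samples; the worst-case view prevents strong clear-weather performance from masking failures in, say, fog or nighttime scenes.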

### Prompt Strategy
The study explores how different prompting strategies (zero-shot, few-shot, chain-of-thought, structured output) affect detection performance.
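To make the comparison concrete, the strategies above can be expressed as prompt variants applied to the same image. The wording below is a hypothetical sketch, not the project's actual prompts; a few-shot variant would additionally prepend example annotations as context.

```python
# Hypothetical prompt variants for comparing prompting strategies.
PROMPTS = {
    "zero_shot": "Detect all traffic-relevant objects in the image and list them.",
    "chain_of_thought": (
        "First describe the scene conditions (weather, lighting), "
        "then reason about which objects may be hard to see, "
        "and finally list every detected object."
    ),
    "structured": (
        "Detect all objects and answer ONLY with a JSON list of "
        '{"label": ..., "bbox": [x1, y1, x2, y2], "confidence": ...} entries.'
    ),
}

def build_prompt(strategy, extra_context=""):
    """Select a prompt variant; few-shot prompting would pass example
    annotations via `extra_context`."""
    prefix = extra_context + "\n" if extra_context else ""
    return prefix + PROMPTS[strategy]
```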

## Key Findings: Model Performance and Failure Modes

### Overall Performance
- Closed-source models (GPT-4V, Gemini Pro Vision) lead, while open-source models (Qwen-VL, InternVL) are narrowing the gap
- Model parameter size is positively correlated with SOTIF performance, but marginal returns diminish
- Fine-tuned models outperform general models

### Scene Sensitivity
- **Most challenging**: Nighttime low light, severe occlusion, snow/fog weather
- **Relatively robust**: Standard vehicles in normal lighting, clear pedestrians, structured roads

### Failure Modes
- Systematic biases: Missing detection of trucks/motorcycles, difficulty in recognizing specific colors/textures
- Confidence issues: High confidence in wrong detections, poor calibration for difficult samples
- Localization accuracy: Poor bounding box regression, large errors for small objects

## Key Technical Implementation Points

### Dataset Construction
- Sources: Public datasets like nuScenes/KITTI/Waymo, synthetic data, real-world severe weather images from the internet
- Annotations: COCO-format bounding boxes, scene labels (weather/lighting), difficulty scores
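An annotation record in the style described above might look as follows. The COCO `bbox` convention (`[x, y, width, height]` in pixels) is standard; the `scene` and `difficulty` fields are assumed extensions for the SOTIF labels, and the exact field names may differ from the project's schema.

```python
# One annotation: COCO-format bbox plus SOTIF scene labels and a difficulty score.
annotation = {
    "image_id": 1042,
    "category_id": 3,                       # e.g. "truck" in the category map
    "bbox": [412.0, 230.5, 88.0, 64.0],     # COCO convention: [x, y, width, height]
    "scene": {"weather": "fog", "lighting": "night"},  # assumed SOTIF labels
    "difficulty": 0.8,                      # assumed scale: 0 = easy .. 1 = hard
}

def bbox_xywh_to_xyxy(bbox):
    """Convert a COCO [x, y, w, h] box to [x1, y1, x2, y2] corner form,
    which LVLM outputs often use."""
    x, y, w, h = bbox
    return [x, y, x + w, y + h]
```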

### Evaluation Framework
- Unified interface to call different LVLMs
- Parse text outputs into structured detection boxes
- SOTIF metric calculation and visualization tools
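The second point, turning free-form LVLM text into structured detection boxes, is the error-prone step in such a framework. A minimal sketch of the idea (not the project's actual parser, which likely needs more robust repair of malformed replies):

```python
import json
import re

def parse_detections(model_output):
    """Extract structured detections from a free-form LVLM reply.
    Finds the first JSON array in the text and keeps only entries that
    carry a label, a 4-element bbox, and a confidence."""
    match = re.search(r"\[.*\]", model_output, re.DOTALL)
    if not match:
        return []
    try:
        raw = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    detections = []
    for item in raw:
        if (isinstance(item, dict)
                and {"label", "bbox", "confidence"} <= item.keys()
                and len(item["bbox"]) == 4):
            detections.append(item)
    return detections
```

Because models often wrap JSON in conversational text ("Sure, here are the objects: ..."), scanning for the bracketed span rather than calling `json.loads` on the whole reply is the key design choice.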

### Prompt Templates
The framework adopts prompts optimized for autonomous driving scenarios, requiring models to return object category, bounding box, and confidence in JSON format.
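A driving-oriented prompt of this kind might read as follows. This wording is a hedged sketch; the project's actual template may differ.

```python
# Hypothetical detection prompt requesting JSON-structured output.
DETECTION_PROMPT = """You are the perception module of an autonomous vehicle.
Detect every traffic-relevant object (vehicles, pedestrians, cyclists,
traffic signs) in the image. Answer ONLY with a JSON array where each
element has the form:
  {"label": "<category>", "bbox": [x1, y1, x2, y2], "confidence": <0..1>}
Coordinates are pixels in the original image. Do not add any other text."""
```

Pinning the output schema in the prompt is what makes downstream parsing into detection boxes tractable.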

## Application Value and Impact

### Autonomous Driving R&D
- Technology selection: Provide data support for perception solutions
- Safety evaluation: Identify system performance boundaries
- Testing and verification: Standardized SOTIF test dataset and methods

### Model Development
- Improvement directions: Reveal weaknesses of LVLMs
- Benchmark competition: Open and transparent performance comparison
- Dataset reference: Methodology for constructing SOTIF datasets

### Standard Formulation
- Transform abstract safety concepts into quantitative metrics
- Provide test basis for autonomous driving safety certification

## Limitations and Future Work

### Limitations
- Limited sample size for some SOTIF scenarios
- Rapid emergence of new models requires continuous updates to evaluation
- Differences between synthetic/web images and real scenes
- No consideration of temporal consistency in video sequences

### Future Work
- Expand edge case datasets
- Develop SOTIF model fine-tuning methods
- Explore multi-modal fusion (camera + LiDAR)
- Video-level SOTIF evaluation

## Conclusion: Research Significance and Practical Guidance

LLM-SOTIF is the first systematic evaluation of the object detection performance of LVLMs under SOTIF conditions, revealing the strengths and limitations of current technology. The study provides valuable references for selecting and improving visual perception systems in safety-critical applications such as autonomous driving, and offers practical guidance for engineers and researchers. As LVLMs continue to develop, such safety evaluations will only become more important.
