In real-world industrial scenarios, many large language model applications are discriminative tasks, where the model only needs to output a single token to render a judgment. Typical application scenarios include:
- Reranking: Determine the relevance between documents and queries
- Retrieval/Embedding: Generate vector representations for semantic search
- Classification Tasks: Binary or multi-class classification
- Visual Question Answering: Answer yes/no questions about images
- Spatial Reasoning: Compare object sizes, positions, or relationships
- Attribute Recognition: Identify visual attributes like color and shape
With the rise of multimodal large models, these tasks are gradually shifting from traditional vision models to multimodal LLMs, with prompts such as: "Is there a dog in this picture?", "Which sign is the most eye-catching?", "Which picture best represents traditional Chinese architecture?"
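The single-token pattern behind all of these tasks can be sketched as follows: run one forward pass, read the logits of the candidate answer tokens, and softmax over just those candidates. The logit values below are made-up illustrative numbers, not the output of any real model.

```python
import math

# Hypothetical logits for the candidate answer tokens after one
# forward pass on a prompt like "Is there a dog in this picture?
# Answer yes or no." (values are invented for illustration).
logits = {"yes": 4.2, "no": 1.1}

def single_token_judgment(logits):
    """Decide by comparing candidate-token logits; a softmax over
    only the candidates turns the margin into a confidence score."""
    m = max(logits.values())  # subtract max for numerical stability
    exps = {tok: math.exp(score - m) for tok, score in logits.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    answer = max(probs, key=probs.get)
    return answer, probs[answer]

answer, confidence = single_token_judgment(logits)
print(answer, round(confidence, 3))  # → yes 0.957
```

Because the judgment is complete after one generated token, the decode phase that autoregressive serving is optimized for contributes almost nothing here; nearly all the work is in the prefill.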
However, when processing hundreds of millions of images, a standard vLLM deployment faces a severe challenge: the KV cache becomes the performance bottleneck.
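To see why, a back-of-the-envelope calculation of per-request KV cache memory helps. The dimensions below assume a dense 7B-class decoder (32 layers, 32 attention heads, head dimension 128, fp16) with full multi-head KV; real models often use grouped-query attention, which shrinks the KV cache by the head-grouping factor.

```python
# Assumed dims for a generic 7B-class decoder, not any specific model.
layers, heads, head_dim, dtype_bytes = 32, 32, 128, 2

# Both keys and values are cached, hence the factor of 2.
kv_bytes_per_token = 2 * layers * heads * head_dim * dtype_bytes

# Assume ~2048 tokens per request (visual tokens plus the question).
seq_len = 2048
per_request_bytes = kv_bytes_per_token * seq_len

print(kv_bytes_per_token)           # → 524288 (512 KiB per token)
print(per_request_bytes / 2**30)    # → 1.0 (GiB per request)
```

At roughly 1 GiB of KV cache per in-flight request, a single GPU can hold only a handful of requests at once, even though each request needs just one output token. The cache that exists to accelerate multi-token decoding is pure overhead for single-token discriminative workloads.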