The llm_service.py module demonstrates how to call LLM inference APIs robustly, with the following key features:
Intelligent Retry Mechanism: Uses the Tenacity library to implement exponential-backoff retries, automatically handling rate limits and transient failures. When the API returns a 429 error, the system waits 1 second, then 2, then 4, doubling the interval each time to avoid overwhelming upstream services.
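The backoff logic above can be sketched as follows. This is a minimal, dependency-free illustration of the same idea; the actual module uses Tenacity (e.g. its `retry`/`wait_exponential` decorators), and `RateLimitError` here is a hypothetical stand-in for whatever exception the real client raises on a 429.

```python
import time

class RateLimitError(Exception):
    """Hypothetical stand-in for an HTTP 429 response from the API client."""

def call_with_backoff(fn, max_attempts=4, base_delay=1.0):
    """Call fn(), retrying on RateLimitError with exponential backoff.

    Waits base_delay * 2**attempt between tries: 1s, 2s, 4s, ...
    Re-raises the error once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of retries; surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))
```

With Tenacity, the equivalent is roughly `@retry(wait=wait_exponential(multiplier=1), stop=stop_after_attempt(4), retry=retry_if_exception_type(RateLimitError))`, which also adds jitter and logging hooks that a hand-rolled loop lacks.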
Streaming Response Support: For real-time scenarios like chat interfaces, supports streaming token output to enhance user experience.
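The streaming pattern can be sketched as a generator that yields tokens as they arrive, which the UI consumes incrementally. The token source below is a hypothetical stand-in; a real client would iterate over the server's chunked or server-sent-events response.

```python
from typing import Iterator

def stream_completion(prompt: str) -> Iterator[str]:
    """Hypothetical token stream; a real implementation would yield
    tokens as the inference server sends them over the wire."""
    for token in ["Hello", ", ", "world", "!"]:
        yield token

def render_stream(tokens: Iterator[str]) -> str:
    """Consume tokens one at a time, as a chat UI would, and
    return the assembled text."""
    parts = []
    for tok in tokens:
        parts.append(tok)  # a UI would append tok to the screen here
    return "".join(parts)
```

The key design point is that the caller sees each token as soon as it is produced, so the perceived latency is the time to the first token rather than to the full response.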
Structured JSON Output: Through carefully designed prompt engineering, ensures that the LLM returns parsable JSON format, avoiding fragile string parsing.
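A minimal sketch of this pattern: instruct the model to emit only JSON, then parse defensively, since models sometimes wrap the object in extra prose. The schema in the instruction is illustrative, not taken from the module.

```python
import json

# Illustrative instruction appended to the prompt (example schema, not the module's).
JSON_INSTRUCTION = (
    'Respond ONLY with a JSON object of the form '
    '{"sentiment": "...", "confidence": 0.0}. Do not add any other text.'
)

def parse_llm_json(raw: str) -> dict:
    """Parse the model's reply as JSON, tolerating stray text around the object."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span, in case the model
        # prepended or appended commentary despite the instruction.
        start, end = raw.find("{"), raw.rfind("}")
        if start != -1 and end > start:
            return json.loads(raw[start:end + 1])
        raise
```

Parsing a declared structure like this is far less brittle than scraping values out of free-form text with regexes.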
Token Usage Tracking: Records token consumption for each call, facilitating cost analysis and usage control.
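The bookkeeping can be sketched as a small accumulator. The field names below mirror the usage block common to LLM API responses (`prompt_tokens`, `completion_tokens`); the module's actual accounting may differ.

```python
from dataclasses import dataclass

@dataclass
class UsageTracker:
    """Accumulate token counts across calls for cost analysis and quotas."""
    prompt_tokens: int = 0
    completion_tokens: int = 0
    calls: int = 0

    def record(self, usage: dict) -> None:
        """Record one API call's usage block, e.g. from the response payload."""
        self.prompt_tokens += usage.get("prompt_tokens", 0)
        self.completion_tokens += usage.get("completion_tokens", 0)
        self.calls += 1

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens
```

Multiplying these totals by the provider's per-token prices gives a running cost estimate, and comparing them against a budget enables hard usage limits.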