As Large Language Models (LLMs) and Computer Vision (CV) technologies have grown in popularity, more developers and researchers are choosing to run AI models locally. Compared to cloud APIs, local deployment offers better data privacy, no network round-trip latency, and lower long-term costs. It also raises a new challenge: how do you measure system performance accurately enough to confirm that it meets an application's Service Level Agreement (SLA) requirements?
Local GPU SLA Profiler was created to address this problem. It is a standalone Python benchmarking tool designed for single-GPU systems (e.g., a workstation equipped with an RTX 3090), and it analyzes three key performance dimensions:
- GPU Memory (VRAM) Usage
- Vector Search Latency
- Local LLM Inference Speed
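
To make the idea of checking a measurement against an SLA concrete, here is a minimal sketch of a latency check. It is not the tool's actual implementation: the names `measure_latency_ms`, `meets_sla`, and `toy_vector_search`, as well as the 50 ms budget, are hypothetical and chosen only for illustration. The workload is a brute-force search in pure Python that stands in for a real vector index, so the sketch runs without a GPU.

```python
import random
import statistics
import time
from typing import Callable, Dict


def measure_latency_ms(fn: Callable[[], object], runs: int = 50, warmup: int = 5) -> Dict[str, float]:
    """Call fn repeatedly and summarize wall-clock latency in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are executed but not recorded
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    # With n=100 there are 99 cut points; index 94 is the 95th percentile.
    # method="inclusive" interpolates between observed samples only.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {
        "p50_ms": statistics.median(samples),
        "p95_ms": cuts[94],
        "max_ms": max(samples),
    }


def meets_sla(stats: Dict[str, float], p95_budget_ms: float) -> bool:
    """Return True if the measured p95 latency is within the budget."""
    return stats["p95_ms"] <= p95_budget_ms


# Hypothetical stand-in workload: brute-force dot-product search
# over a small set of random vectors (assumption, not a real index).
_rng = random.Random(0)
_VECTORS = [[_rng.random() for _ in range(16)] for _ in range(200)]
_QUERY = [_rng.random() for _ in range(16)]


def toy_vector_search() -> int:
    """Return the index of the vector with the highest dot product."""
    scores = [sum(a * b for a, b in zip(vec, _QUERY)) for vec in _VECTORS]
    return scores.index(max(scores))


if __name__ == "__main__":
    stats = measure_latency_ms(toy_vector_search)
    print(stats)
    print("meets 50 ms p95 budget:", meets_sla(stats, p95_budget_ms=50.0))
```

Reporting a tail percentile such as p95 rather than the mean is a common choice for SLA checks, because occasional slow requests are what usually violate a latency target.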