Zing Forum

Intel Meteor Lake iGPU AI Inference Test: Performance Comparison Between OpenVINO and llama.cpp SYCL

Based on actual test data from the Intel Core Ultra 7 155H processor's integrated GPU, this article compares and analyzes the performance of OpenVINO and llama.cpp SYCL in Embedding, Reranker, and LLM generation tasks, providing references for edge AI deployment.

OpenVINO, Intel Meteor Lake, AI inference, iGPU, quantization, llama.cpp, SYCL, Embedding, Reranker, edge AI
Published 2026-06-13 12:42 | Recent activity 2026-06-13 12:54 | Estimated read 8 min

Section 01

Introduction: Intel Meteor Lake iGPU AI Inference Test: OpenVINO vs llama.cpp SYCL Performance Comparison

This article draws on test data measured on the integrated GPU of the Intel Core Ultra 7 155H processor. It compares OpenVINO and llama.cpp SYCL on Embedding, Reranker, and LLM generation tasks as a reference for edge AI deployment. The tests were published by Oaklight on June 13, 2026 in the open-source GitHub project "openvino-meteor-lake-ai-inference".

Section 02

Test Background and Environment Configuration

Background

Intel's Meteor Lake architecture pairs Core Ultra processors with an Arc Graphics iGPU, which improves their AI inference capability. The core question: can a laptop iGPU handle AI tasks such as Embedding, Reranker, and even LLM generation?

Test Environment

Component  | Specification
Laptop     | ThinkPad X1 Carbon Gen 12
Processor  | Intel Core Ultra 7 155H (6P+8E+2LPE, 22 threads)
GPU        | Intel Arc Graphics (Meteor Lake, 128 EU)
Memory     | 32GB DDR5 (CPU/GPU shared)
OS         | Arch Linux
Kernel     | 7.0.11-arch1-1
GPU Driver | xe (kernel module)
OpenVINO   | 2026.2.0
oneAPI     | 2026.0.0

This configuration is typical of a mainstream business laptop, so the results should carry over well to ordinary users.

Section 03

Embedding Task: Performance Advantages of INT8 Quantization and Batch Processing

Test Model and Results

The BGE-M3 model (568 million parameters) was run at FP32 and INT8 precision on both CPU and GPU:

Configuration | Single Sample (samples/s) | Batch16 (samples/s)
FP32 CPU      | 23.5                      | 27.0
FP32 GPU      | 41.1                      | 179.2
INT8 CPU      | 82.9                      | 128.3
INT8 GPU      | 67.6                      | 245.4

Key Conclusions

  1. INT8 quantization pays off: in single-sample mode, CPU throughput improves by ~3.5x (82.9 vs 23.5 samples/s, benefiting from the VNNI instruction set) and GPU throughput by ~1.6x (67.6 vs 41.1 samples/s);
  2. Batch processing unleashes GPU potential: INT8 GPU throughput reaches 245.4 samples/s at Batch16, 3.6x its single-sample rate;
  3. Scenario trade-off: choose INT8 CPU single-sample for low latency (82.9 samples/s, about 12 ms per sample) and INT8 GPU batch processing for offline bulk tasks.
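
The ratios quoted above can be re-derived from the table. The following is a minimal sketch, not the original benchmark code: the figures are copied from the table, and the latency estimate assumes one request at a time with no queueing or overhead, so latency is simply the inverse of throughput.

```python
# Throughput figures copied from the Embedding table (samples/s).
EMBEDDING = {
    ("FP32", "CPU"): {"single": 23.5, "batch16": 27.0},
    ("FP32", "GPU"): {"single": 41.1, "batch16": 179.2},
    ("INT8", "CPU"): {"single": 82.9, "batch16": 128.3},
    ("INT8", "GPU"): {"single": 67.6, "batch16": 245.4},
}


def speedup(new: float, old: float) -> float:
    """Ratio of two throughputs; values above 1 mean `new` is faster."""
    return new / old


def latency_ms(samples_per_s: float) -> float:
    """Per-sample latency, assuming strictly sequential requests."""
    return 1000.0 / samples_per_s


# INT8 vs FP32 in single-sample mode, per device.
cpu_int8_gain = speedup(EMBEDDING[("INT8", "CPU")]["single"],
                        EMBEDDING[("FP32", "CPU")]["single"])
gpu_int8_gain = speedup(EMBEDDING[("INT8", "GPU")]["single"],
                        EMBEDDING[("FP32", "GPU")]["single"])

# Batch16 vs single sample on INT8 GPU.
gpu_batch_gain = speedup(EMBEDDING[("INT8", "GPU")]["batch16"],
                         EMBEDDING[("INT8", "GPU")]["single"])

# Latency of the low-latency choice (INT8 CPU, single sample).
cpu_int8_latency = latency_ms(EMBEDDING[("INT8", "CPU")]["single"])

print(f"INT8 gain on CPU: {cpu_int8_gain:.1f}x")
print(f"INT8 gain on GPU: {gpu_int8_gain:.1f}x")
print(f"Batch16 gain on INT8 GPU: {gpu_batch_gain:.1f}x")
print(f"INT8 CPU latency: {cpu_int8_latency:.1f} ms")
```

The same helpers applied to the Batch16 column give different ratios, so it matters which column a quoted speedup refers to.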

Section 04

Reranker Task: Full Utilization of GPU Parallel Computing Capability

Test Model and Results

The model is BGE Reranker v2 M3 (568 million parameters), run at FP16 and INT8 precision on CPU and GPU:

Configuration | Single Sample (pairs/s) | Batch16 (pairs/s)
FP16 CPU      | 6.9                     | 6.4
FP16 GPU      | 27.4                    | 41.8
INT8 CPU      | 16.6                    | 19.2
INT8 GPU      | 33.0                    | 43.5

Key Findings

  1. GPU has a clear advantage: in single-sample mode it is about 4x faster than CPU at FP16 (27.4 vs 6.9 pairs/s) and about 2x faster at INT8 (33.0 vs 16.6 pairs/s), since the cross-encoder architecture makes full use of parallel computing;
  2. Limited batch processing gain: INT8 GPU rises only from 33.0 to 43.5 pairs/s (~1.3x) at Batch16, because the cross-encoder is already computationally heavy per pair;
  3. Production recommendation: INT8 GPU is recommended for real-time RAG applications (33 pairs/s, about 30 ms per pair).
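
To put these rates in RAG terms, the sketch below estimates how long it takes to rerank one query's candidate list. The throughput values come from the table. The candidate count of 20 is a hypothetical figure chosen for illustration, and the estimate ignores tokenization and retrieval time.

```python
# Reranker throughput copied from the table (query-document pairs/s).
RERANKER = {
    ("FP16", "CPU"): {"single": 6.9, "batch16": 6.4},
    ("FP16", "GPU"): {"single": 27.4, "batch16": 41.8},
    ("INT8", "CPU"): {"single": 16.6, "batch16": 19.2},
    ("INT8", "GPU"): {"single": 33.0, "batch16": 43.5},
}


def rerank_time_s(num_candidates: int, pairs_per_s: float) -> float:
    """Time to score every candidate against one query."""
    return num_candidates / pairs_per_s


# GPU vs CPU in single-sample mode, per precision.
gpu_vs_cpu = {
    precision: RERANKER[(precision, "GPU")]["single"]
    / RERANKER[(precision, "CPU")]["single"]
    for precision in ("FP16", "INT8")
}

# Hypothetical workload: 20 retrieved candidates per query.
CANDIDATES = 20
int8_gpu_single = rerank_time_s(CANDIDATES, RERANKER[("INT8", "GPU")]["single"])
int8_gpu_batch = rerank_time_s(CANDIDATES, RERANKER[("INT8", "GPU")]["batch16"])
int8_cpu_single = rerank_time_s(CANDIDATES, RERANKER[("INT8", "CPU")]["single"])

print(f"GPU vs CPU: FP16 {gpu_vs_cpu['FP16']:.1f}x, INT8 {gpu_vs_cpu['INT8']:.1f}x")
print(f"INT8 GPU, single: {int8_gpu_single:.2f} s")
print(f"INT8 GPU, batch16: {int8_gpu_batch:.2f} s")
print(f"INT8 CPU, single: {int8_cpu_single:.2f} s")
```

Under these assumptions the GPU keeps a 20-candidate rerank well under a second, while the INT8 CPU path takes more than a second.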

Section 05

LLM Generation Task: OpenVINO GenAI CPU Wins Unexpectedly

Test Model and Results

The Qwen3 8B model was run under several backends and model formats:

Backend                | Format | Quantization | Prompt Processing (tok/s) | Generation Speed (tok/s)
llama.cpp SYCL GPU     | GGUF   | Q4_K_M       | 70.2                      | 6.9
llama.cpp SYCL CPU     | GGUF   | Q4_K_M       | 88.5                      | 3.9
llama.cpp OpenVINO CPU | GGUF   | Q4_K_M       | 34.5                      | 5.3
llama.cpp OpenVINO GPU | GGUF   | Q4_K_M       | OOM                       | -
OpenVINO GenAI CPU     | OV IR  | INT4         | -                         | 8.5
OpenVINO GenAI GPU     | OV IR  | INT4         | -                         | 7.2

(- = no value reported)

Key Insights

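  1. Native format matters: OpenVINO GenAI with its native INT4 IR reaches 8.5 tok/s on CPU, ahead of llama.cpp SYCL GPU at 6.9 tok/s;
  2. CPU suits LLM generation under OpenVINO GenAI: CPU (8.5 tok/s) beats GPU (7.2 tok/s), likely because both share the same memory bandwidth while the CPU also benefits from its L3 cache and the VNNI instruction set. Under llama.cpp SYCL the order is reversed (GPU 6.9 tok/s vs CPU 3.9 tok/s);
  3. The llama.cpp OpenVINO backend is immature: its GPU path ran out of memory (OOM), while the SYCL backend was more stable.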
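
Prompt processing and generation speed combine into one end-to-end response time. The sketch below shows that arithmetic for the three backends that report both rates. The 512-token prompt and 128-token reply are hypothetical sizes, and the model of two sequential phases with constant rates is a simplification.

```python
# (prompt tok/s, generation tok/s) copied from the table.
# Only rows that report both rates are included.
LLM_RATES = {
    "llama.cpp SYCL GPU": (70.2, 6.9),
    "llama.cpp SYCL CPU": (88.5, 3.9),
    "llama.cpp OpenVINO CPU": (34.5, 5.3),
}


def response_time_s(prompt_tokens: int, new_tokens: int,
                    prompt_tps: float, gen_tps: float) -> float:
    """Prefill time plus decode time, assuming constant rates in each phase."""
    return prompt_tokens / prompt_tps + new_tokens / gen_tps


# Hypothetical request size, chosen only for illustration.
PROMPT_TOKENS = 512
NEW_TOKENS = 128

totals = {
    backend: response_time_s(PROMPT_TOKENS, NEW_TOKENS, prompt_tps, gen_tps)
    for backend, (prompt_tps, gen_tps) in LLM_RATES.items()
}
fastest = min(totals, key=totals.get)

for backend, seconds in totals.items():
    print(f"{backend}: {seconds:.1f} s")
print(f"Fastest for this request: {fastest}")
```

For this request size the decode phase dominates, which is why the SYCL CPU path loses overall despite having the highest prompt processing rate.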

Section 06

Comprehensive Recommendations: Optimal Deployment Solutions for Different AI Workloads

Embedding/Reranker Tasks

  • First choice: OpenVINO INT8 GPU
  • Performance: Embedding batch processing 245 samples/s, Reranker single sample 33 pairs/s
  • Applicable scenarios: RAG pipeline, semantic search, document vectorization

LLM Generation Task

  • First choice: OpenVINO GenAI CPU (native INT4 format)
  • Performance: 8.5 tok/s generation speed
  • Alternative: llama.cpp SYCL GPU (6.9 tok/s)

Hybrid Deployment Strategy

For a complete RAG system:

  • Use OpenVINO INT8 GPU for Embedding/Reranker
  • Use OpenVINO GenAI CPU for LLM generation
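
One way to encode this split in an application is a small routing table. This is a sketch only: the task names, the `Backend` class, and `backend_for` are hypothetical helpers for illustration, not part of OpenVINO or llama.cpp.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    runtime: str
    precision: str
    device: str


# Recommended placement per task, taken from the lists above.
# The task keys are hypothetical names chosen for this sketch.
ROUTES = {
    "embedding": Backend("OpenVINO", "INT8", "GPU"),
    "reranker": Backend("OpenVINO", "INT8", "GPU"),
    "llm_generation": Backend("OpenVINO GenAI", "INT4", "CPU"),
}


def backend_for(task: str) -> Backend:
    """Look up the recommended backend; unknown tasks raise ValueError."""
    try:
        return ROUTES[task]
    except KeyError:
        raise ValueError(f"no recommendation for task {task!r}") from None


for task in ROUTES:
    choice = backend_for(task)
    print(f"{task}: {choice.runtime} {choice.precision} on {choice.device}")
```

Keeping the mapping in one place makes it easy to revisit the placement when drivers or runtimes change.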

Section 07

Value and Limitations of Quantization Technology

Value

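  • Performance improvement: in these tests INT8 speedups over the FP32/FP16 baselines range from ~1.0-1.6x on GPU to ~2.4-4.8x on CPU
  • Memory saving: INT8 weights take half the space of FP16 and a quarter of FP32
  • Quality: Negligible precision loss in Embedding/Reranker tasks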
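
The memory saving follows directly from bytes per weight. The sketch below estimates the weight footprint of a 568-million-parameter model such as the ones tested here. It assumes every weight is stored at the stated precision and ignores quantization scales, activations, and file metadata, so real model files will differ somewhat.

```python
# Bytes needed to store one weight at each precision.
BYTES_PER_WEIGHT = {"FP32": 4, "FP16": 2, "INT8": 1}


def weight_gb(num_params: float, precision: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_WEIGHT[precision] / 1e9


PARAMS = 568e6  # parameter count quoted for BGE-M3 and BGE Reranker v2 M3

sizes = {precision: weight_gb(PARAMS, precision) for precision in BYTES_PER_WEIGHT}

for precision, gb in sizes.items():
    print(f"{precision}: {gb:.2f} GB")
```

On a machine where CPU and GPU share the same memory, the smaller footprint also leaves more room for other workloads.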

Limitations

Scenarios requiring high-precision numerical computation (e.g., scientific computing models) still need FP32/FP16 precision.