Intel Meteor Lake iGPU AI Inference Test: Performance Comparison Between OpenVINO and llama.cpp SYCL

Based on measured data from the integrated GPU of the Intel Core Ultra 7 155H processor, this article compares OpenVINO and llama.cpp SYCL on Embedding, Reranker, and LLM generation tasks as a reference for edge AI deployment.

Published 2026-06-13 12:42 | Recent activity 2026-06-13 12:54 | Estimated read 8 min

Section 01

Introduction

This article compares OpenVINO and llama.cpp SYCL on the integrated GPU of the Intel Core Ultra 7 155H across Embedding, Reranker, and LLM generation tasks, as a reference for edge AI deployment. The underlying test was published by Oaklight on June 13, 2026 as the open-source GitHub project "openvino-meteor-lake-ai-inference".

Section 02

Test Background and Environment Configuration

Background

Intel's Meteor Lake architecture pairs Core Ultra processors with an integrated Arc Graphics GPU, which significantly improves on-device AI inference capability. The core question: can a laptop iGPU handle AI tasks such as Embedding, Reranker, and even LLM generation?

Test Environment

Component	Specification
Laptop	ThinkPad X1 Carbon Gen 12
Processor	Intel Core Ultra 7 155H (6P+8E+2LPE, 22 threads)
GPU	Intel Arc Graphics (Meteor Lake, 128 EU)
Memory	32GB DDR5 (CPU/GPU shared)
OS	Arch Linux
Kernel	7.0.11-arch1-1
GPU Driver	xe (kernel module)
OpenVINO	2026.2.0
oneAPI	2026.0.0

This configuration is typical of a mainstream business laptop, so the results should transfer well to ordinary users' machines.

Section 03

Embedding Task: Performance Advantages of INT8 Quantization and Batch Processing

Test Model and Results

The test uses the BGE-M3 model (568 million parameters) and compares FP32 and INT8 precision on CPU and GPU:

Configuration	Single Sample (samples/s)	Batch16 (samples/s)
FP32 CPU	23.5	27.0
FP32 GPU	41.1	179.2
INT8 CPU	82.9	128.3
INT8 GPU	67.6	245.4

Key Conclusions

INT8 quantization pays off: single-sample throughput improves by ~3.5x on CPU (23.5 to 82.9 samples/s, helped by the VNNI instruction set) and by ~1.6x on GPU (41.1 to 67.6 samples/s);
Batch processing unleashes GPU potential: INT8 GPU throughput reaches 245.4 samples/s at Batch16, 3.6x its single-sample rate of 67.6 samples/s;
Scenario trade-off: choose INT8 CPU single sample for low latency (82.9 samples/s, about 12 ms per sample) and INT8 GPU batch processing for offline bulk tasks.
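
The ratios quoted in these conclusions can be reproduced directly from the table. The sketch below is only a worked check of that arithmetic: the numbers are copied from the table, while the function names are illustrative and not part of the original test project. Latency is approximated as the reciprocal of single-sample throughput.

```python
# Throughput figures copied from the Embedding table above (samples/s).
throughput = {
    ("FP32", "CPU"): {"single": 23.5, "batch16": 27.0},
    ("FP32", "GPU"): {"single": 41.1, "batch16": 179.2},
    ("INT8", "CPU"): {"single": 82.9, "batch16": 128.3},
    ("INT8", "GPU"): {"single": 67.6, "batch16": 245.4},
}


def int8_speedup(device: str, mode: str) -> float:
    """INT8 throughput divided by FP32 throughput on the same device."""
    return throughput[("INT8", device)][mode] / throughput[("FP32", device)][mode]


def batch_gain(precision: str, device: str) -> float:
    """Batch16 throughput divided by single-sample throughput."""
    row = throughput[(precision, device)]
    return row["batch16"] / row["single"]


def single_sample_latency_ms(precision: str, device: str) -> float:
    """Average time per sample in milliseconds, taken as 1 / throughput."""
    return 1000.0 / throughput[(precision, device)]["single"]


if __name__ == "__main__":
    print(f"INT8 speedup, CPU single: {int8_speedup('CPU', 'single'):.2f}x")
    print(f"INT8 speedup, GPU single: {int8_speedup('GPU', 'single'):.2f}x")
    print(f"Batch16 gain, INT8 GPU:   {batch_gain('INT8', 'GPU'):.2f}x")
    print(f"Latency, INT8 CPU single: {single_sample_latency_ms('INT8', 'CPU'):.1f} ms")
```

The same helpers also show that the INT8 gain differs by mode: calling `int8_speedup("CPU", "batch16")` gives the Batch16 ratio, which is not the single-sample figure quoted above.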

Section 04

Reranker Task: Full Utilization of GPU Parallel Computing Capability

Test Model and Results

The test uses BGE Reranker v2 M3 (568 million parameters) and compares FP16 and INT8 precision on CPU and GPU:

Configuration	Single Sample (pairs/s)	Batch16 (pairs/s)
FP16 CPU	6.9	6.4
FP16 GPU	27.4	41.8
INT8 CPU	16.6	19.2
INT8 GPU	33.0	43.5

Key Findings

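GPU has a clear advantage: in single sample mode it is about 4x faster than CPU at FP16 (27.4 vs 6.9 pairs/s) and about 2x faster at INT8 (33.0 vs 16.6 pairs/s), since the cross-encoder architecture makes full use of parallel computing;
Limited batch processing gain: INT8 GPU rises only from 33.0 to 43.5 pairs/s (~1.3x) at Batch16, because the cross-encoder has high computational complexity per pair;
Production recommendation: INT8 GPU is recommended for real-time RAG applications (33.0 pairs/s single sample, about 30 ms per pair).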
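
For a real-time RAG pipeline, the practical question is how many retrieved candidates can be reranked within a latency budget. The sketch below estimates this from the INT8 GPU row of the table. It assumes throughput stays constant regardless of candidate count, which is a simplification; the candidate count of 20 and the 0.5 s budget are hypothetical values chosen for illustration, not figures from the test.

```python
import math

# INT8 GPU reranker throughput from the table above, in pairs/s.
PAIRS_PER_SECOND = {"single": 33.0, "batch16": 43.5}


def rerank_seconds(num_candidates: int, mode: str) -> float:
    """Estimated time to score num_candidates query-document pairs."""
    return num_candidates / PAIRS_PER_SECOND[mode]


def max_candidates(budget_seconds: float, mode: str) -> int:
    """Largest candidate count that fits in the latency budget."""
    return math.floor(budget_seconds * PAIRS_PER_SECOND[mode])


if __name__ == "__main__":
    for mode in PAIRS_PER_SECOND:
        print(
            f"{mode}: 20 candidates in {rerank_seconds(20, mode):.2f} s, "
            f"{max_candidates(0.5, mode)} candidates fit in 0.5 s"
        )
```

Under these assumptions, the reranker rather than the embedding step is likely to set the ceiling on how many candidates a query can afford.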

Section 05

LLM Generation Task: OpenVINO GenAI CPU Wins Unexpectedly

Test Model and Results

The test compares Qwen3 8B across different backends and model formats:

Backend	Format	Quantization	Prompt Processing (tok/s)	Generation Speed (tok/s)
llama.cpp SYCL GPU	GGUF	Q4_K_M	70.2	6.9
llama.cpp SYCL CPU	GGUF	Q4_K_M	88.5	3.9
llama.cpp OpenVINO CPU	GGUF	Q4_K_M	34.5	5.3
llama.cpp OpenVINO GPU	GGUF	Q4_K_M	OOM	—
OpenVINO GenAI CPU	OV IR	INT4	—	8.5
OpenVINO GenAI GPU	OV IR	INT4	—	7.2

Key Insights

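Native format matters: OpenVINO GenAI on CPU with the native INT4 IR format reaches 8.5 tok/s, exceeding llama.cpp SYCL GPU's 6.9 tok/s;
CPU is more suitable for LLM generation under OpenVINO GenAI (8.5 tok/s vs 7.2 tok/s on GPU): CPU and iGPU share the same memory bandwidth, so the CPU's L3 cache and VNNI instruction set give it the edge. Under llama.cpp SYCL the order is reversed (GPU 6.9 tok/s vs CPU 3.9 tok/s);
llama.cpp's OpenVINO backend is immature: it runs out of memory (OOM) on GPU, while the SYCL backend is more stable.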
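
Prompt processing and generation speeds pull in different directions for some backends, so it can help to convert them into wall-clock time for one request. The sketch below does that with the table's figures. The request size (512 prompt tokens, 256 generated tokens) is a hypothetical example, and prefill time is left as `None` where the table reports no prompt processing speed.

```python
from typing import Optional, Tuple

# (prompt processing tok/s, generation tok/s) from the table above.
# None marks a value the table does not report.
BACKENDS = {
    "llama.cpp SYCL GPU": (70.2, 6.9),
    "llama.cpp SYCL CPU": (88.5, 3.9),
    "llama.cpp OpenVINO CPU": (34.5, 5.3),
    "OpenVINO GenAI CPU": (None, 8.5),
    "OpenVINO GenAI GPU": (None, 7.2),
}


def request_seconds(
    backend: str, prompt_tokens: int, new_tokens: int
) -> Tuple[Optional[float], float]:
    """Return (prefill seconds or None, generation seconds) for one request."""
    prompt_speed, gen_speed = BACKENDS[backend]
    prefill = None if prompt_speed is None else prompt_tokens / prompt_speed
    return prefill, new_tokens / gen_speed


if __name__ == "__main__":
    for name in BACKENDS:
        prefill, generation = request_seconds(name, 512, 256)
        prefill_text = "n/a" if prefill is None else f"{prefill:.1f} s"
        print(f"{name}: prefill {prefill_text}, generation {generation:.1f} s")
```

For this request shape, generation time dominates in every row that reports both speeds, which is why the recommendations focus on generation tok/s.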

Section 06

Comprehensive Recommendations: Optimal Deployment Solutions for Different AI Workloads

Embedding/Reranker Tasks

First choice: OpenVINO INT8 GPU
Performance: Embedding batch processing 245 samples/s, Reranker single sample 33 pairs/s
Applicable scenarios: RAG pipeline, semantic search, document vectorization

LLM Generation Task

First choice: OpenVINO GenAI CPU (native INT4 format)
Performance: 8.5 tok/s generation speed
Alternative: llama.cpp SYCL GPU (6.9 tok/s)

Hybrid Deployment Strategy

For a complete RAG system:

Use OpenVINO INT8 GPU for Embedding/Reranker
Use OpenVINO GenAI CPU for LLM generation
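
The hybrid strategy amounts to a small routing table from task to runtime, device, and precision. The sketch below expresses it as plain data. The `Target` class, the `ROUTES` mapping, and the task keys are hypothetical names introduced for illustration; only the runtime, device, and precision values come from the recommendations above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    runtime: str    # inference stack recommended above
    device: str     # "CPU" or "GPU"
    precision: str  # weight format used in the test


# Task keys are illustrative; the values follow the recommendations above.
ROUTES = {
    "embedding": Target("OpenVINO", "GPU", "INT8"),
    "reranker": Target("OpenVINO", "GPU", "INT8"),
    "generation": Target("OpenVINO GenAI", "CPU", "INT4"),
}


def target_for(task: str) -> Target:
    """Look up the recommended deployment target for a RAG pipeline stage."""
    try:
        return ROUTES[task]
    except KeyError:
        raise ValueError(f"unknown task: {task!r}") from None


if __name__ == "__main__":
    for task in ROUTES:
        print(task, "->", target_for(task))
```

Keeping the mapping as data makes it easy to swap in the llama.cpp SYCL GPU alternative for generation without touching pipeline code.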

Section 07

Value and Limitations of Quantization Technology

Value

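Performance improvement: about 2.4x to 3.5x single-sample speedup on CPU and 1.2x to 1.6x on GPU in the tests above
Memory saving: INT8 weights take half the space of FP16 and a quarter of FP32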
Quality loss: Negligible precision loss in Embedding/Reranker tasks
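
The memory saving follows from bytes per weight. The sketch below estimates weight storage for the 568-million-parameter models used above. It counts weights only and ignores quantization scales, activations, and any layers kept at higher precision, so real file sizes will differ somewhat.

```python
# Parameter count of BGE-M3 and BGE Reranker v2 M3, as stated above.
PARAMS = 568_000_000

# Storage per weight for each precision, in bytes.
BYTES_PER_WEIGHT = {"FP32": 4, "FP16": 2, "INT8": 1}


def weight_gigabytes(num_params: int, precision: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * BYTES_PER_WEIGHT[precision] / 1e9


if __name__ == "__main__":
    for precision in BYTES_PER_WEIGHT:
        print(f"{precision}: {weight_gigabytes(PARAMS, precision):.2f} GB")
```

On a machine where CPU and GPU share 32GB of memory, this reduction also leaves more room for the LLM running alongside the Embedding and Reranker models.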

Limitations

Scenarios requiring high-precision numerical computation (e.g., scientific computing models) still need FP32/FP16 precision.
