Zing Forum

Intel Arc Pro B70 Local Large Model Inference Tuning Practice: From Performance Bottlenecks to Production-Level Deployment

This article provides an in-depth analysis of the complete tuning solution for running large language models (LLMs) on the Intel Arc Pro B70 graphics card under Ubuntu Server, covering SYCL and Vulkan backend selection, application of key patches, environment variable configuration, and multi-level inference architecture design, helping developers fully unleash the 32GB VRAM potential of the B70.

Tags: Intel Arc Pro B70 · llama.cpp · SYCL · Vulkan · Local Inference · Xe2 · MoE · Large Language Models · Ubuntu · GPU Optimization
Published 2026-04-19 04:45 · Recent activity 2026-04-19 04:50 · Estimated read: 9 min

Section 01

Introduction: Core of Intel Arc Pro B70 Local LLM Inference Tuning Practice

This article provides an in-depth analysis of the complete tuning solution for running large language models (LLMs) on the Intel Arc Pro B70 graphics card under Ubuntu Server, covering SYCL and Vulkan backend selection, application of key patches, environment variable configuration, and multi-level inference architecture design. It helps developers fully unleash the 32GB VRAM potential of the B70 and close the gap where the default configuration delivers only 15%-50% of the hardware's capability.


Section 02

Background: Hardware Potential of B70 and Performance Gap Under Default Configuration

The Intel Arc Pro B70 pairs the BMG G31 core (Xe2 architecture) with 32GB of GDDR6 VRAM, which in principle is ample for running LLMs. In practice, however, llama.cpp's performance under the default configuration falls far short of expectations. The gap stems from a lack of software-stack tuning: bottlenecks can hide anywhere from the Mesa drivers to SYCL compilation options, kernel patches, and environment variables. The solution in this article comes from a real production environment: an inference server built from 4 B70 cards, running 5 llama-server instances at different levels simultaneously, covering scenarios such as chat and code generation.


Section 03

Core Pain Points: Analysis of Performance Traps Under Default Configuration

The B70 faces three major performance traps:

  • Architecture compatibility: the native subgroup size of Xe2 is 16, but the K-quant kernels in the SYCL backend hard-code 32, causing a 20-25% performance loss;
  • MoE model support defect: llama.cpp's SYCL implementation has an initialization race condition when processing MoE models, leading to segmentation faults;
  • VRAM management limitation: the Level Zero backend caps single memory allocations at 4GB by default, which cannot accommodate the large KV caches needed for long-context scenarios.
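To see why the 4GB allocation cap bites, a back-of-the-envelope KV-cache calculation helps. The formula is standard (K and V tensors per layer, per token); the model dimensions below are hypothetical, chosen only to illustrate the order of magnitude:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV-cache size: one K and one V tensor per layer,
    each n_kv_heads * head_dim wide per token, stored as fp16 (2 bytes)."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem

# Hypothetical 26B-class model: 48 layers, 8 KV heads of dim 128, 128K context.
size = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, ctx_len=131072)
print(f"{size / 2**30:.1f} GiB")  # 24.0 GiB -- far beyond a 4GB single-allocation cap
```

Even at more modest context lengths, a single contiguous KV-cache buffer easily crosses 4GB, which is why the relaxed-allocation environment variable described later is mandatory for long contexts.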


Section 04

Key Patches: Performance Optimization for B70 Architecture

The core of tuning lies in 11 patches, with the most impactful ones including:

  • BF16 GET_ROWS support: Adding a native BF16 path speeds up prompt processing of Gemma4 26B by 40% and token generation by 15%;
  • MoE matrix multiplication fusion: Fusing separate operations into a single kernel speeds up token generation of Qwen3-Coder-30B by 47%;
  • K-quant subgroup size adaptation: Changing to Xe2's native 16 improves K-quant model performance by 20-25%;
  • Small matrix oneMKL routing: Switching small-scale matrix multiplication to oneMKL reduces the first token latency by 30ms;
  • Vulkan Xe2 thread block configuration: Adjusting the warptile size improves Vulkan backend performance by 15-25%.

Section 05

Runtime Environment: Guide to Key Variable Configuration

Environment variables that must be set:

  • GGML_SYCL_DISABLE_OPT=1: Avoids segmentation faults during MoE model initialization (costs about 5% performance for dense models);
  • UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1: Lifts the Level Zero 4GB single allocation limit, supporting large KV caches for long-context scenarios;
  • SYCL_CACHE_PERSISTENT=0: Prevents segmentation faults caused by kernel cache pollution across restarts; the first run compilation cost is about 30 seconds.
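The three variables can be set before launching llama-server; a minimal sketch (the launch command itself is omitted, and in production these would typically live in each instance's systemd unit or launch script):

```shell
# Required environment for llama.cpp on the B70, with the trade-off each setting makes.
export GGML_SYCL_DISABLE_OPT=1                    # avoids the MoE init segfault (~5% cost on dense models)
export UR_L0_ENABLE_RELAXED_ALLOCATION_LIMITS=1   # lifts the Level Zero 4GB single-allocation cap
export SYCL_CACHE_PERSISTENT=0                    # avoids stale kernel-cache segfaults (~30s first-run compile)
```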

Section 06

Backend Selection: Applicable Scenarios for SYCL and Vulkan

Backend selection rules:

  • Prefer SYCL for dense models: For example, Gemma4 26B Q8_0 reaches 26.4 tok/s on SYCL;
  • Prefer Vulkan for MoE models: SYCL has stability issues, while Vulkan can enable Flash Attention;
  • Mixed deployment of multiple instances on the same card: running two SYCL instances on one card causes roughly a 10x slowdown; use Vulkan for light models and SYCL for heavy models, or Vulkan for all;
  • Speculative decoding: Using SYCL for both target and draft models is prone to crashes; it is recommended to use SYCL for the target model and Vulkan for the draft model, or use Vulkan for both.
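The mixed-deployment rules above can be sketched as per-process device pinning. `ONEAPI_DEVICE_SELECTOR` and `GGML_VK_VISIBLE_DEVICES` are the standard selector variables for the SYCL runtime and llama.cpp's Vulkan backend respectively; the llama-server paths and flags below are illustrative, not taken from the article:

```shell
# Heavy dense model on GPU 0 via SYCL (set per process, not globally):
export ONEAPI_DEVICE_SELECTOR="level_zero:0"
# ./llama-server -m gemma-26b-q8_0.gguf -ngl 99 --port 8080

# Light model on the same physical card via Vulkan, avoiding the
# two-SYCL-instances-per-card slowdown:
export GGML_VK_VISIBLE_DEVICES=0
# ./llama-server -m qwen3-4b-q6_k.gguf -ngl 99 --port 8081

# Speculative decoding: the stable combination is a SYCL target model with a
# Vulkan draft model (llama-server loads the draft via -md / --model-draft):
# ./llama-server -m target.gguf -md draft-0.6b.gguf -ngl 99 --port 8082
```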

Section 07

Production Deployment: Five-Level Instance Architecture Design

Architecture of a 4-card server running five llama-server instances:

| Level | Model | Backend | GPU Allocation | Performance | Description |
| --- | --- | --- | --- | --- | --- |
| chat | Gemma-4-26B-A4B Q8_0 | SYCL | 1 card | 26.4 tok/s | Dense model; clear SYCL advantage |
| code | Qwen3-Coder-30B-A3B Q5_K_M | SYCL | 3 cards | 57.7 tok/s | MoE model; requires GGML_SYCL_DISABLE_OPT=1 |
| fast | Qwen3-4B-Instruct Q6_K | Vulkan | 3 cards | 33.0 tok/s | Shares GPUs with the code level |
| agentic | Qwen3.6-35B-A3B Q6_K_XL + 0.6B draft | Vulkan | 0 cards | 25.0 tok/s | Speculative decoding |
| reasoning | Qwen3-Next-80B-A3B IQ3_XXS | SYCL | 2 cards | 21.2 tok/s | 80B MoE, 3B active parameters |
This design fully utilizes resources and enables efficient operation of multi-concurrent services.

Section 08

Summary and Recommendations: Implementation Path for B70 Tuning

The B70 is a cost-effective local inference graphics card, but it requires targeted tuning. The solution in this article raises performance to near the hardware limit through patches, environment variables, backend selection, and architecture design. Recommended implementation steps:

  • Ensure a Mesa 26+ driver (with BF16 and integer dot product enabled);
  • Apply the patches and recompile llama.cpp;
  • Configure the key environment variables;
  • Select the backend based on model type.

A single B70 can smoothly run 30B-class MoE models, and four cards can support enterprise-level concurrency requirements.