Zing Forum


llmvectorapi4j: Local Large Language Model Inference Using Java Vector API

A pure Java implementation of large language model inference built on the Java Vector API. It supports models such as DeepSeek, Llama, and Qwen, ships with a built-in HTTP server and MCP protocol support, and runs without any external libraries.

Tags: Java · Large Language Model · Vector API · Local Inference · DeepSeek · Llama · Qwen · MCP · Attention Visualization · SIMD
Published 2026-03-30 06:09 · Recent activity 2026-03-30 06:21 · Estimated read: 6 min

Section 01

Introduction / Main Floor

A pure Java implementation of large language model inference built on the Java Vector API. It supports models such as DeepSeek, Llama, and Qwen, ships with a built-in HTTP server and MCP protocol support, and runs without any external libraries.


Section 02

Project Overview

llmvectorapi4j is a pure Java large language model inference engine that uses the Java Vector API to accelerate neural network computations. Developed by srogmann, the project builds on Alfonso Peterssen's llama3.java implementation with extensive extensions. Most notably, it depends on no external libraries: a Java runtime environment is all you need to run large language models.


Section 03

Technical Background and Motivation

In AI development, Python has long dominated, but Java still holds an irreplaceable position in enterprise applications. llmvectorapi4j gives Java developers a way to integrate large language models without cross-language calls. The Java Vector API is a relatively new JDK feature (still an incubator module, `jdk.incubator.vector`) that lets developers use SIMD (Single Instruction, Multiple Data) instructions to accelerate numerical computation, which is crucial for matrix-computation-intensive neural network inference.
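To make this concrete, here is a minimal sketch (not code from the project) of a SIMD dot product written with the Vector API, the kernel at the heart of the matrix-vector multiplications that dominate transformer inference. Compiling and running it requires `--add-modules jdk.incubator.vector`.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class VectorDot {
    // Widest vector shape supported by the current CPU.
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float dot(float[] a, float[] b) {
        var acc = FloatVector.zero(SPECIES);
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        // Process SPECIES.length() floats per iteration with fused multiply-add.
        for (; i < upper; i += SPECIES.length()) {
            var va = FloatVector.fromArray(SPECIES, a, i);
            var vb = FloatVector.fromArray(SPECIES, b, i);
            acc = va.fma(vb, acc);
        }
        float sum = acc.reduceLanes(VectorOperators.ADD);
        // Scalar tail for lengths that are not a multiple of the lane count.
        for (; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f, 4f, 5f};
        float[] b = {2f, 2f, 2f, 2f, 2f};
        System.out.println(dot(a, b)); // 30.0
    }
}
```

On hardware with AVX2 or AVX-512, a loop like this processes 8 or 16 floats per instruction instead of one, which is where the speedup over plain scalar Java comes from.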


Section 04

Multi-Model Support

The project supports multiple popular large language model architectures:

  • DeepSeek-R1-Distill-Qwen-1.5B: A distilled version of DeepSeek, suitable for local execution
  • Llama-3 Series: Including Llama-3.2 and Llama-3.3
  • Phi-3: Microsoft's open-source model (currently only supports CLI mode)
  • Qwen-2.5 and Qwen3: Alibaba's open-source models (non-MoE versions)

Section 05

Built-in HTTP Server

The project includes a built-in HTTP server that exposes an interface similar to the OpenAI API, so llmvectorapi4j can serve as a backend for existing AI applications. The server supports an interactive chat mode and an instruction mode and is easy to embed in a variety of application scenarios.
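A client for an OpenAI-style endpoint needs nothing beyond the JDK's `java.net.http.HttpClient`. The sketch below is hypothetical: the port, the `/v1/chat/completions` path, and the model name are assumptions modeled on the OpenAI API shape, not values documented by the project.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ChatClient {
    // Assumed endpoint; adjust host, port, and path to your server's config.
    static final String ENDPOINT = "http://localhost:8080/v1/chat/completions";

    // Builds a minimal OpenAI-style request body (model name is illustrative).
    static String buildBody(String userMessage) {
        return """
            {"model": "llama-3.2", "messages": [
              {"role": "user", "content": "%s"}
            ]}""".formatted(userMessage);
    }

    public static void main(String[] args) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(ENDPOINT))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(buildBody("Hello!")))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```

Because the interface mimics the OpenAI API, existing OpenAI-compatible client libraries can usually be pointed at the local server by overriding their base URL.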


Section 06

Attention Visualization

This is a distinctive feature: llmvectorapi4j can display which input tokens the model attends to most when generating each output token. For example, in an English-to-Chinese translation task, when the model generates the token "三" (sān, meaning "three"), it highlights the association with the English word "three". This is very helpful for understanding how Transformer models work and provides an intuitive tool for model debugging and optimization.
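The mechanism behind such a display can be sketched as follows; this is an illustrative toy, not the project's actual code. One row of raw attention scores is softmax-normalized into weights over the input tokens, and the highest-weighted token is the one to highlight.

```java
import java.util.Arrays;

public class AttentionPeek {
    // Numerically stable softmax: subtract the max before exponentiating.
    static double[] softmax(double[] scores) {
        double max = Arrays.stream(scores).max().orElse(0);
        double[] exp = Arrays.stream(scores).map(s -> Math.exp(s - max)).toArray();
        double sum = Arrays.stream(exp).sum();
        return Arrays.stream(exp).map(e -> e / sum).toArray();
    }

    // Returns the input token with the largest attention weight.
    static String mostAttended(String[] inputTokens, double[] scores) {
        double[] weights = softmax(scores);
        int best = 0;
        for (int i = 1; i < weights.length; i++) {
            if (weights[i] > weights[best]) best = i;
        }
        return inputTokens[best];
    }

    public static void main(String[] args) {
        String[] input = {"Translate", ":", "three", "apples"};
        double[] scores = {0.1, 0.0, 2.5, 0.4}; // made-up attention logits
        System.out.println(mostAttended(input, scores)); // prints "three"
    }
}
```

A real model has many attention heads and layers, so a visualizer must choose which head (or which aggregate over heads) to display; the idea per row is the same.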


Section 07

KV Cache Persistence

The project supports saving KV cache (key-value cache) to files. This is particularly useful in long conversation scenarios, as it avoids re-computing previous contexts and significantly improves response speed. Cache files are stored in .ggsc format and can be reused across sessions.
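The internals of the .ggsc format are not described here, so the following is only a generic sketch of the persistence idea: write the cached key and value tensors to disk once, then reload them instead of re-running the prompt through the model.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class KvCacheStore {
    // Writes a simple header (positions, head dimension) followed by
    // the key rows and then the value rows as raw floats.
    static void save(Path file, float[][] keys, float[][] values) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeInt(keys.length);
            out.writeInt(keys[0].length);
            for (float[] row : keys)   for (float f : row) out.writeFloat(f);
            for (float[] row : values) for (float f : row) out.writeFloat(f);
        }
    }

    // Returns {keys, values} as a 3-D array indexed [k-or-v][position][dim].
    static float[][][] load(Path file) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            int positions = in.readInt();
            int dim = in.readInt();
            float[][][] kv = new float[2][positions][dim];
            for (int t = 0; t < 2; t++)
                for (int p = 0; p < positions; p++)
                    for (int d = 0; d < dim; d++)
                        kv[t][p][d] = in.readFloat();
            return kv;
        }
    }
}
```

The payoff is that prefill cost scales with prompt length, so for a long, stable context (a system prompt, a document under discussion) reloading the cache replaces the most expensive part of each new request with a file read.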


Section 08

MCP Protocol Support

llmvectorapi4j also implements the client and server sides of the Model Context Protocol (MCP). This means:

  • Tool Calling: You can implement custom MCP tools to allow the language model to call external functions
  • Function Forwarding: The UiServer class can forward function calls from external servers like llama.cpp to custom functions implemented in Java
  • Service Discovery: Load tool implementations via the ServiceLoader mechanism for easy extension

Note that these MCP features are mainly for local testing and are not recommended for direct use in production environments.
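The ServiceLoader pattern mentioned above can be sketched like this; the `McpTool` interface name and its methods are invented for illustration and are not the project's actual API.

```java
import java.util.ServiceLoader;

// Hypothetical tool contract: a name for the model to call, and a method
// that receives the call's arguments and returns a result.
interface McpTool {
    String name();
    String call(String jsonArguments);
}

// Example implementation: echoes its arguments back.
class EchoTool implements McpTool {
    public String name() { return "echo"; }
    public String call(String jsonArguments) { return jsonArguments; }
}

public class ToolRegistry {
    public static void main(String[] args) {
        // Implementations are declared in META-INF/services/McpTool and
        // discovered at runtime with no registration code in the host.
        for (McpTool tool : ServiceLoader.load(McpTool.class)) {
            System.out.println("discovered tool: " + tool.name());
        }
    }
}
```

The appeal of this design is that adding a tool means dropping a jar on the classpath with a one-line provider file; the inference engine itself never has to be recompiled.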