Zing Forum


llmvectorapi4j: Local Large Language Model Inference Using Java Vector API

A pure Java implementation of large language model inference built on the Java Vector API. It supports models such as DeepSeek, Llama, and Qwen, ships with a built-in HTTP server and MCP protocol support, and runs without any external libraries.

Tags: Java · Large Language Model · Vector API · Local Inference · DeepSeek · Llama · Qwen · MCP · Attention Visualization · SIMD
Published 2026-03-30 06:09 · Recent activity 2026-03-30 06:21 · Estimated read: 6 min

Section 01

Introduction / Main Floor

A pure Java implementation of large language model inference built on the Java Vector API. It supports models such as DeepSeek, Llama, and Qwen, ships with a built-in HTTP server and MCP protocol support, and runs without any external libraries.


Section 02

Project Overview

llmvectorapi4j is a pure Java large language model inference engine that uses the Java Vector API to accelerate neural network computations. Developed by srogmann, the project builds on Alfonso Peterssen's llama3.java implementation with extensive extensions. Most notably, it depends on no external libraries: a Java runtime environment is all you need to run large language models.


Section 03

Technical Background and Motivation

In AI development, Python has long dominated, but Java still holds an irreplaceable position in enterprise applications. llmvectorapi4j gives Java developers a way to integrate large language models without cross-language calls. The Java Vector API is a relatively new JDK feature (still an incubator module, `jdk.incubator.vector`) that lets developers use SIMD (Single Instruction, Multiple Data) instructions to accelerate numerical computation, which is crucial for matrix-computation-intensive neural network inference.
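To make this concrete, here is a minimal sketch (not code from the project) of a SIMD dot product written with the Vector API, the kernel at the heart of the matrix-vector multiplications that dominate transformer inference. Compiling and running it requires `--add-modules jdk.incubator.vector`.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class VectorDot {
    // Widest vector shape supported by the current CPU.
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    static float dot(float[] a, float[] b) {
        var acc = FloatVector.zero(SPECIES);
        int i = 0;
        int upper = SPECIES.loopBound(a.length);
        // Process SPECIES.length() floats per iteration with fused multiply-add.
        for (; i < upper; i += SPECIES.length()) {
            var va = FloatVector.fromArray(SPECIES, a, i);
            var vb = FloatVector.fromArray(SPECIES, b, i);
            acc = va.fma(vb, acc);
        }
        float sum = acc.reduceLanes(VectorOperators.ADD);
        // Scalar tail for lengths that are not a multiple of the lane count.
        for (; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f, 4f, 5f};
        float[] b = {2f, 2f, 2f, 2f, 2f};
        System.out.println(dot(a, b)); // 30.0
    }
}
```

On hardware with AVX2 or AVX-512, a loop like this processes 8 or 16 floats per instruction instead of one, which is where the speedup over plain scalar Java comes from.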


Section 04

Multi-Model Support

The project supports multiple popular large language model architectures:

  • DeepSeek-R1-Distill-Qwen-1.5B: A distilled version of DeepSeek, suitable for local execution
  • Llama-3 Series: Including Llama-3.2 and Llama-3.3
  • Phi-3: Microsoft's open-source model (currently only supports CLI mode)
  • Qwen-2.5 and Qwen3: Alibaba's open-source models (non-MoE versions)

Section 05

Built-in HTTP Server

The project includes a built-in HTTP server that exposes an interface similar to the OpenAI API, so llmvectorapi4j can serve as a backend for existing AI applications. The server supports an interactive chat mode and an instruction mode and is easy to embed in a variety of application scenarios.
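A client for an OpenAI-style endpoint needs nothing beyond the JDK's `java.net.http.HttpClient`. The sketch below is hypothetical: the port, the `/v1/chat/completions` path, and the model name are assumptions modeled on the OpenAI API shape, not values documented by the project.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ChatClient {
    // Assumed endpoint; adjust host, port, and path to your server's config.
    static final String ENDPOINT = "http://localhost:8080/v1/chat/completions";

    // Builds a minimal OpenAI-style request body (model name is illustrative).
    static String buildBody(String userMessage) {
        return """
            {"model": "llama-3.2", "messages": [
              {"role": "user", "content": "%s"}
            ]}""".formatted(userMessage);
    }

    public static void main(String[] args) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(ENDPOINT))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(buildBody("Hello!")))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```

Because the interface mimics the OpenAI API, existing OpenAI-compatible client libraries can usually be pointed at the local server by overriding their base URL.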


Section 06

Attention Visualization

This is a distinctive feature: llmvectorapi4j can display which input tokens the model attends to most when generating each output token. For example, in an English-to-Chinese translation task, when the model generates the token "三" (sān, meaning "three"), it highlights the association with the English word "three". This is very helpful for understanding how Transformer models work and provides an intuitive tool for model debugging and optimization.
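The mechanism behind such a display can be sketched as follows; this is an illustrative toy, not the project's actual code. One row of raw attention scores is softmax-normalized into weights over the input tokens, and the highest-weighted token is the one to highlight.

```java
import java.util.Arrays;

public class AttentionPeek {
    // Numerically stable softmax: subtract the max before exponentiating.
    static double[] softmax(double[] scores) {
        double max = Arrays.stream(scores).max().orElse(0);
        double[] exp = Arrays.stream(scores).map(s -> Math.exp(s - max)).toArray();
        double sum = Arrays.stream(exp).sum();
        return Arrays.stream(exp).map(e -> e / sum).toArray();
    }

    // Returns the input token with the largest attention weight.
    static String mostAttended(String[] inputTokens, double[] scores) {
        double[] weights = softmax(scores);
        int best = 0;
        for (int i = 1; i < weights.length; i++) {
            if (weights[i] > weights[best]) best = i;
        }
        return inputTokens[best];
    }

    public static void main(String[] args) {
        String[] input = {"Translate", ":", "three", "apples"};
        double[] scores = {0.1, 0.0, 2.5, 0.4}; // made-up attention logits
        System.out.println(mostAttended(input, scores)); // prints "three"
    }
}
```

A real model has many attention heads and layers, so a visualizer must choose which head (or which aggregate over heads) to display; the idea per row is the same.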


Section 07

KV Cache Persistence

The project supports saving KV cache (key-value cache) to files. This is particularly useful in long conversation scenarios, as it avoids re-computing previous contexts and significantly improves response speed. Cache files are stored in .ggsc format and can be reused across sessions.
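The internals of the .ggsc format are not described here, so the following is only a generic sketch of the persistence idea: write the cached key and value tensors to disk once, then reload them instead of re-running the prompt through the model.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class KvCacheStore {
    // Writes a simple header (positions, head dimension) followed by
    // the key rows and then the value rows as raw floats.
    static void save(Path file, float[][] keys, float[][] values) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(file)))) {
            out.writeInt(keys.length);
            out.writeInt(keys[0].length);
            for (float[] row : keys)   for (float f : row) out.writeFloat(f);
            for (float[] row : values) for (float f : row) out.writeFloat(f);
        }
    }

    // Returns {keys, values} as a 3-D array indexed [k-or-v][position][dim].
    static float[][][] load(Path file) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(Files.newInputStream(file)))) {
            int positions = in.readInt();
            int dim = in.readInt();
            float[][][] kv = new float[2][positions][dim];
            for (int t = 0; t < 2; t++)
                for (int p = 0; p < positions; p++)
                    for (int d = 0; d < dim; d++)
                        kv[t][p][d] = in.readFloat();
            return kv;
        }
    }
}
```

The payoff is that prefill cost scales with prompt length, so for a long, stable context (a system prompt, a document under discussion) reloading the cache replaces the most expensive part of each new request with a file read.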


Section 08

MCP Protocol Support

llmvectorapi4j also implements the client and server sides of the Model Context Protocol (MCP). This means:

  • Tool Calling: You can implement custom MCP tools to allow the language model to call external functions
  • Function Forwarding: The UiServer class can forward function calls from external servers like llama.cpp to custom functions implemented in Java
  • Service Discovery: Load tool implementations via the ServiceLoader mechanism for easy extension

Note that these MCP features are mainly for local testing and are not recommended for direct use in production environments.
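The ServiceLoader pattern mentioned above can be sketched like this; the `McpTool` interface name and its methods are invented for illustration and are not the project's actual API.

```java
import java.util.ServiceLoader;

// Hypothetical tool contract: a name for the model to call, and a method
// that receives the call's arguments and returns a result.
interface McpTool {
    String name();
    String call(String jsonArguments);
}

// Example implementation: echoes its arguments back.
class EchoTool implements McpTool {
    public String name() { return "echo"; }
    public String call(String jsonArguments) { return jsonArguments; }
}

public class ToolRegistry {
    public static void main(String[] args) {
        // Implementations are declared in META-INF/services/McpTool and
        // discovered at runtime with no registration code in the host.
        for (McpTool tool : ServiceLoader.load(McpTool.class)) {
            System.out.println("discovered tool: " + tool.name());
        }
    }
}
```

The appeal of this design is that adding a tool means dropping a jar on the classpath with a one-line provider file; the inference engine itself never has to be recompiled.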