SpikingLLM: Reducing Conversion Error of Spiking-Driven Large Language Models via Distribution-Aware Multi-Granularity Phase Coding

Open-source implementation of a paper accepted at ICLR 2026. It proposes a distribution-aware multi-granularity phase coding method that reduces ANN-to-SNN conversion error, enabling efficient spiking neural network inference on LLaMA-2 and LLaMA-3 models.

Spiking Neural Networks, SNN, Large Language Models, Phase Coding, ANN-to-SNN Conversion, ICLR 2026, LLaMA, Neuromorphic Computing, Edge Computing

Published 2026-06-16 17:13 | Recent activity 2026-06-16 17:27 | Estimated read 10 min

Section 01

Introduction

Section 02

Original Authors and Source

Original Author/Maintainer: njzhenghy
Source Platform: GitHub
Original Title: SpikingLLM
Original Link: https://github.com/njzhenghy/SpikingLLM
Source Publication/Update Time: 2026-06-16T09:13:56Z

Section 03

Research Background: Challenges in Integrating Spiking Neural Networks with Large Language Models

Spiking Neural Networks (SNNs), often described as the third generation of neural networks, have attracted attention for their event-driven computation and biological interpretability.

Unlike traditional Artificial Neural Networks (ANNs), SNNs consume energy only when neurons fire spikes. This sparse activation gives them a substantial energy-efficiency advantage and makes them well suited to edge computing and deployment on neuromorphic chips.

However, applying SNNs to Large Language Models (LLMs) is difficult. Because the discrete spike mechanism of SNNs differs fundamentally from the continuous activation functions of LLMs, directly converting a pre-trained LLM to an SNN causes significant accuracy loss, a problem known as 'ANN-to-SNN conversion error'. Existing conversion methods often struggle to achieve efficient spike-based inference while preserving model performance.
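To make the notion of conversion error concrete, here is a minimal sketch of a textbook integrate-and-fire (IF) neuron with rate coding standing in for a ReLU activation. It is a generic illustration, not code from SpikingLLM; the function name `if_neuron_rate`, the soft-reset rule, and the threshold of 1.0 are assumptions chosen for the example.

```python
def if_neuron_rate(x, T, threshold=1.0):
    """Approximate ReLU(x), clipped at `threshold`, by an IF neuron's firing rate.

    The neuron receives the constant input x for T steps, fires when its
    membrane potential reaches the threshold, and is reset by subtraction.
    """
    v = 0.0       # membrane potential
    spikes = 0    # number of spikes emitted so far
    for _ in range(T):
        v += x
        if v >= threshold:
            v -= threshold  # soft reset keeps the residual charge
            spikes += 1
    return threshold * spikes / T


target = 0.3  # ReLU(0.3) in the original ANN
for T in (8, 64):
    approx = if_neuron_rate(target, T)
    print(f"T={T}: output={approx}, error={abs(approx - target):.6f}")
```

With only 8 steps the neuron can emit 0 to 8 spikes, so its output is limited to multiples of 1/8 and the gap to the true activation is large. More steps shrink the gap but cost latency and energy, which is the trade-off that conversion methods try to improve.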

NJ Zheng et al. address this problem with the 'Distribution-Aware Multi-Granularity Phase Coding' method, which enables efficient spike-driven inference for LLaMA-series models. The work has been accepted to ICLR 2026.

Section 04

Basic Principles of Phase Coding

Phase Coding is an important temporal coding method in SNNs, which uses the timing of spike firing to encode information.

Compared to traditional Rate Coding, Phase Coding can transmit more information in fewer time steps, thereby improving the inference efficiency of SNNs.
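The difference in information per time step can be sketched with one common formulation of phase coding from the ANN-to-SNN conversion literature, in which a spike at step t carries the binary weight 2^-(t+1). Whether SpikingLLM uses exactly this weighting is an assumption; the function names below are hypothetical and the values are assumed to lie in [0, 1).

```python
def rate_encode(x, T):
    """Rate coding: represent x in [0, 1] by the number of spikes in T steps."""
    n_spikes = round(x * T)
    return [1] * n_spikes + [0] * (T - n_spikes)


def rate_decode(spikes):
    return sum(spikes) / len(spikes)


def phase_encode(x, T):
    """Binary-weighted phase coding: a spike at step t carries weight 2**-(t+1)."""
    spikes, residual = [], x
    for t in range(T):
        weight = 2.0 ** -(t + 1)
        fire = residual >= weight
        spikes.append(int(fire))
        if fire:
            residual -= weight
    return spikes


def phase_decode(spikes):
    return sum(s * 2.0 ** -(t + 1) for t, s in enumerate(spikes))


x, T = 0.7, 8
print(rate_decode(rate_encode(x, T)))    # 0.75
print(phase_encode(x, T))                # [1, 0, 1, 1, 0, 0, 1, 1]
print(phase_decode(phase_encode(x, T)))  # 0.69921875
```

With the same 8 steps, rate coding resolves 9 distinct values while the weighted scheme resolves 256, so its reconstruction error here is below 2^-8 compared with 0.05 for rate coding.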

In phase coding, a neuron's activation value is encoded as the firing time of a spike within a fixed time window. For example, a higher activation value can map to an earlier firing time and a lower value to a later one. A single spike within the window can therefore carry an analog value, rather than requiring many spikes as in rate coding, which improves information transmission efficiency.
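The firing-time mapping described above can be sketched as follows. This is an illustrative toy under stated assumptions (activations clipped to [0, `x_max`], a window of `T` discrete steps, hypothetical function names), not the encoder used in SpikingLLM.

```python
def encode_firing_time(x, T=8, x_max=1.0):
    """Map an activation to a firing step in 0..T-1; larger values fire earlier."""
    x = min(max(x, 0.0), x_max)          # clip to the representable range
    level = round(x / x_max * (T - 1))   # quantize to one of T levels
    return (T - 1) - level               # highest level fires at step 0


def decode_firing_time(t, T=8, x_max=1.0):
    """Recover the quantized activation from the firing step."""
    return ((T - 1) - t) / (T - 1) * x_max


for x in (0.0, 0.4, 1.0):
    t = encode_firing_time(x)
    print(f"x={x}: fires at step {t}, decoded as {decode_firing_time(t):.4f}")
```

Each value is sent with a single spike, and the reconstruction error is bounded by half a quantization step, `x_max / (2 * (T - 1))`.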

Section 05

Multi-Granularity Coding Strategy

The research team found that single-granularity phase coding adapts poorly to the differences in activation distributions across layers and neurons in LLMs.

To address this, they proposed the 'Multi-Granularity Phase Coding' strategy, which allows the model to adaptively select the coding granularity based on the distribution characteristics of activation values.

Specifically, this method groups neurons, and each group uses a different coding granularity (grain). For example, some groups may use 2-level granularity (dividing the activation range into 2 intervals), while others may use 3-level granularity (dividing the activation range into 3 intervals). This flexible grouping strategy allows the coding to better match the actual activation distribution of each neuron group.
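A minimal sketch of per-group granularity, reading "grain" as the number of equal-width intervals that a group's activation range is divided into, as the description above suggests. The grouping, the midpoint reconstruction, and the function names are assumptions made for illustration and are not taken from the SpikingLLM code.

```python
def quantize_group(values, grain):
    """Snap each value to the midpoint of one of `grain` equal-width intervals
    spanning the group's own [min, max] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)  # a constant group needs no quantization
    width = (hi - lo) / grain
    quantized = []
    for v in values:
        idx = min(int((v - lo) / width), grain - 1)  # the max value joins the last interval
        quantized.append(lo + (idx + 0.5) * width)
    return quantized


def quantize_multi_granularity(groups, grains):
    """Apply a separate grain to each neuron group."""
    return [quantize_group(g, k) for g, k in zip(groups, grains)]


groups = [[0.0, 0.1, 0.9, 1.0], [-2.0, 0.0, 0.5, 4.0]]  # two hypothetical neuron groups
grains = [2, 3]                                          # one grain per group
print(quantize_multi_granularity(groups, grains))
# [[0.25, 0.25, 0.75, 0.75], [-1.0, 1.0, 1.0, 3.0]]
```

Because each group is quantized over its own range with its own grain, a narrow group is not forced to share intervals sized for a wide one.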

Section 06

Distribution-Aware Optimization

'Distribution Awareness' is one of the core innovations of this method.

The research team analyzed the statistical distribution of activation values in each layer of LLMs and identified that neurons in different layers and positions have different activation distribution characteristics. Based on this distribution information, they designed an optimization algorithm that automatically selects the most appropriate coding granularity for each neuron group.

This distribution-aware method ensures that coding resources are reasonably allocated: for neuron groups with a relatively concentrated activation distribution, using a coarser granularity is sufficient to ensure accuracy; for neuron groups with a relatively dispersed activation distribution, a finer granularity is needed to fully express the information.
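The allocation rule above can be sketched as a simple selection loop: try the coarsest candidate grain first and accept it if the quantization error is small enough, otherwise fall back to a finer one. The source does not describe the actual optimization algorithm, so the error metric (mean squared error), the `tolerance` parameter, and the function names are stand-in assumptions.

```python
def quantization_mse(values, grain):
    """Mean squared error of midpoint quantization with `grain` equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0
    width = (hi - lo) / grain
    total = 0.0
    for v in values:
        idx = min(int((v - lo) / width), grain - 1)
        total += (v - (lo + (idx + 0.5) * width)) ** 2
    return total / len(values)


def select_grain(values, candidates=(2, 3), tolerance=0.01):
    """Return the coarsest grain whose error is within tolerance,
    or the finest candidate if none qualifies."""
    for grain in sorted(candidates):
        if quantization_mse(values, grain) <= tolerance:
            return grain
    return max(candidates)


concentrated = [0.48, 0.50, 0.52, 0.51]  # tightly clustered activations
dispersed = [-2.0, -0.5, 0.5, 4.0]       # widely spread activations
print(select_grain(concentrated))  # 2
print(select_grain(dispersed))     # 3
```

The concentrated group is served well by the coarse grain, while the dispersed group is pushed to the finer one, matching the intuition in the paragraph above.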

Section 07

Supported Models and Configurations

This project provides complete training and conversion code, supporting ANN-to-SNN conversion for the LLaMA-2-7B and LLaMA-3-8B models. Reported results on several benchmarks are as follows:

LLaMA-2-7B Experimental Results (using 8 time steps, T=8):

WikiText-2 Perplexity: 5.50 (grain=2) / 5.50 (grain=3)
WinoGrande Accuracy: 70.48%
ARC-Challenge Accuracy: 46.50% (grain=2) / 46.33% (grain=3)
ARC-Easy Accuracy: 73.91% (grain=2) / 73.86% (grain=3)
PIQA Accuracy: 78.29% (grain=2) / 78.35% (grain=3)

LLaMA-3-8B Experimental Results (using 8 time steps, T=8):

WikiText-2 Perplexity: 6.34 (grain=2) / 6.33 (grain=3)
WinoGrande Accuracy: 72.93% (grain=2) / 73.72% (grain=3)
ARC-Challenge Accuracy: 54.01% (grain=2) / 53.41% (grain=3)
ARC-Easy Accuracy: 77.44% (grain=2) / 77.36% (grain=3)
PIQA Accuracy: 80.63% (grain=2) / 80.36% (grain=3)

These results indicate that the method maintains strong model performance with a small number of time steps (T=8 in the results above; the stated range is roughly 6-10 steps), which the project presents as a significant improvement over traditional ANN-to-SNN conversion methods.
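As a quick arithmetic check on the LLaMA-3-8B figures listed above, the snippet below averages the four accuracy benchmarks for each grain setting. The numbers are copied from the results list; the dictionary layout and the helper name `mean_accuracy` are just for this example.

```python
# (grain=2, grain=3) accuracy in percent, LLaMA-3-8B, T=8
llama3_8b_accuracy = {
    "WinoGrande": (72.93, 73.72),
    "ARC-Challenge": (54.01, 53.41),
    "ARC-Easy": (77.44, 77.36),
    "PIQA": (80.63, 80.36),
}


def mean_accuracy(results, column):
    """Average one column (0 for grain=2, 1 for grain=3) across benchmarks."""
    return sum(scores[column] for scores in results.values()) / len(results)


mean_grain2 = mean_accuracy(llama3_8b_accuracy, 0)
mean_grain3 = mean_accuracy(llama3_8b_accuracy, 1)
print(f"grain=2: {mean_grain2:.2f}%  grain=3: {mean_grain3:.2f}%")
# grain=2: 71.25%  grain=3: 71.21%
```

The two settings differ by about 0.04 percentage points on average, so on these benchmarks neither grain is uniformly better: grain=3 is ahead on WinoGrande, grain=2 on the other three.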

Section 08

Key Technical Components

Fast Hadamard Transform: The project uses the fast-hadamard-transform library developed by Dao-AILab for efficient computation of Hadamard transforms, which is a key mathematical tool for implementing phase coding.
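For readers unfamiliar with the transform itself, here is a pure-Python sketch of the fast Walsh-Hadamard butterfly. It does not use or reproduce the API of the fast-hadamard-transform library, which provides a CUDA implementation; the function name `fwht` is hypothetical. The example also shows a property often exploited in LLM quantization work, namely that an orthonormal Hadamard rotation spreads a single large value across all coordinates. Whether SpikingLLM relies on the transform for this reason is an assumption.

```python
import math


def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform in O(n log n).

    The length of x must be a power of two.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(x)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b  # butterfly step
        h *= 2
    return out


x = [8.0, 0.0, 0.0, 0.0]                            # one outlier coordinate
rotated = [v / math.sqrt(len(x)) for v in fwht(x)]  # orthonormal scaling
print(rotated)                                      # [4.0, 4.0, 4.0, 4.0]
```

After the rotation the largest magnitude drops from 8 to 4 while the vector norm is unchanged, which narrows the range that any subsequent coding or quantization step has to cover.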

Grain Analysis Optimization: The research team uses the Grain Analysis module to analyze neuron activation distributions and select the optimal coding granularity for each neuron group. According to the project, the optimized parameter configuration yields better results than those reported in the original paper.

Training Framework: The project is built on PyTorch 2.4.1 with CUDA 12.4 and integrates efficient attention implementations such as Flash Attention.
