Deploying InternVL3 on Jetson Orin Nano: Engineering Practice for Edge Vision-Language Models

A complete guide to deploying the InternVL3 vision-language model on the 8GB Jetson Orin Nano using TensorRT-LLM, achieving 5-6x inference speedup and over 600 tokens/sec throughput for edge AI applications.

Tags: Jetson Orin Nano · TensorRT-LLM · InternVL3 · Vision-Language Models · Edge AI · Model Quantization · Inference Optimization · Jetpack · Edge Deployment · VLM
Published 2026-04-14 18:15 · Recent activity 2026-04-14 18:24 · Estimated read 5 min

Section 01

Introduction: Core Practice Guide for Deploying InternVL3 on Jetson Orin Nano

This article fully records how to deploy the InternVL3 vision-language model on the 8GB Jetson Orin Nano using TensorRT-LLM, achieving a 5-6x inference speedup and over 600 tokens/sec throughput. The Orin-Nano-VLM-Deploy project provides a systematic solution covering the entire workflow, from environment preparation and model conversion to performance optimization, offering valuable practical experience for edge AI developers.


Section 02

Background: Memory Constraints of Edge AI and Deployment Challenges of InternVL3

Edge devices (e.g., the Jetson Orin Nano 8GB) face strict memory constraints. Even after quantization, InternVL3 (1B/2B parameters) still faces memory pressure and inference-speed issues. The Orin-Nano-VLM-Deploy project addresses these challenges through TensorRT-LLM optimization, documenting engineering pitfalls and their solutions.


Section 03

Hardware Environment and System Preparation Steps

Optimized for Jetson Orin Nano 8GB + Jetpack 6.2.1: 1. Flash the device using SDK Manager (recovery mode + USB connection, install CUDA components); 2. If the installation appears stuck, wait patiently or restart; 3. Install the jtop monitoring tool; 4. Set the MAXN Super power mode and run sudo jetson_clocks to lock clocks at maximum frequency.
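The monitoring and frequency-boosting steps above can be sketched as shell commands. jtop is provided by the jetson-stats package; the nvpmodel index for MAXN Super varies by Jetpack release, so the `-m 0` below is an assumption to verify against your device's mode list:

```shell
# Install the jtop monitoring tool (provided by the jetson-stats package)
sudo pip3 install jetson-stats

# Show the current power mode, then select MAXN
# (mode index 0 is an assumption -- check /etc/nvpmodel.conf for your Jetpack)
sudo nvpmodel -q
sudo nvpmodel -m 0

# Pin CPU/GPU/memory clocks to their maximum for benchmarking
sudo jetson_clocks
```

After a reboot, jtop can confirm that the selected power mode and clock settings persisted.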


Section 04

TensorRT-LLM Environment Setup and Model Conversion

TensorRT-LLM Setup: 1. Install system dependencies; 2. Install NVIDIA's precompiled PyTorch; 3. Create a 30 GB temporary swap file; 4. Build for the SM87 architecture. Model Conversion: Use pt2engine.py in three stages (visual ONNX export, visual engine building, language engine building), with a total time of about 7 minutes.
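One way to create the temporary 30 GB swap file mentioned in step 3 is with fallocate. The path `/swapfile` is an assumption for illustration; the project may place it elsewhere (NVMe is preferable to SD card for swap):

```shell
# Allocate a 30 GB swap file for the memory-hungry TensorRT-LLM build phase
sudo fallocate -l 30G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Verify the swap is active
swapon --show
free -h
```

This swap is a build-time crutch only; as noted later in the article, it should be removed before runtime.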


Section 05

Inference Optimization and Performance Analysis

Core Metrics: 5-6x speedup, over 600 tokens/sec throughput. Key Insights: 1. Small batches are bandwidth-limited, large batches are compute-limited; 2. KV Cache is the main memory consumer at medium batch sizes; 3. Pre-generate engines for production use; 4. INT4/INT8 quantization gains depend on memory bandwidth and compute saturation.
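Insights 1 and 2 follow from simple roofline arithmetic. The sketch below illustrates both; the hardware numbers (~68 GB/s memory bandwidth, ~40 INT8 TOPS) and the 0.5B-parameter, Qwen2.5-0.5B-like model shape are assumptions for illustration, not measurements from the article:

```python
# Back-of-envelope roofline sketch for LLM decode on a Jetson-class device.
# All hardware specs and model shapes below are illustrative assumptions.

def decode_regime(params, dtype_bytes, batch, bw_bytes_s=68e9, peak_ops_s=40e12):
    """Return which resource limits one decode step at a given batch size.

    Per step the full weight matrix is read from memory once (shared across
    the batch), while compute scales with batch: ~2 ops per parameter per
    sequence. Whichever takes longer is the bottleneck.
    """
    t_mem = params * dtype_bytes / bw_bytes_s       # weight-read time
    t_compute = 2 * params * batch / peak_ops_s     # matmul time
    return "bandwidth" if t_mem >= t_compute else "compute"

def kv_cache_bytes(batch, seq_len, layers, kv_heads, head_dim, dtype_bytes=2):
    """KV cache size: one K and one V tensor per layer, per token, per sequence."""
    return batch * seq_len * layers * 2 * kv_heads * head_dim * dtype_bytes

if __name__ == "__main__":
    P = 0.5e9  # assumed language-model parameter count, INT8 weights (1 byte)
    print(decode_regime(P, 1, batch=1))    # small batch: bandwidth-limited
    print(decode_regime(P, 1, batch=512))  # large batch: compute-limited

    # KV cache at batch 8, 2048-token context, FP16, Qwen2.5-0.5B-like shape
    mib = kv_cache_bytes(8, 2048, layers=24, kv_heads=2, head_dim=64) / 2**20
    print(f"KV cache: {mib:.0f} MiB")      # a large share of 8 GB RAM
```

Under these assumptions, a single sequence is deep in the bandwidth-limited regime, which is why batching improves aggregate tokens/sec until compute saturates, and why the KV cache, not the weights, dominates memory growth at medium batch sizes.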


Section 06

Inference Deployment and Common Pitfall Solutions

Use the engine_infer.py script for inference; NVMe storage is recommended for engines in production. Common Pitfalls: Jetpack version compatibility, Jetson-specific PyTorch wheel packages, memory management (use swap during the build, remove it at runtime), dependency version conflicts, USB connection stability.
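The "swap during build, remove at runtime" rule translates to disabling the swap file before serving, so runtime memory behavior stays predictable. The path `/swapfile` here is an assumption matching a typical build-time setup:

```shell
# Disable and delete the build-time swap before running inference
sudo swapoff /swapfile
sudo rm /swapfile

# Confirm swap is gone and check available RAM for the engines
free -h
```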


Section 07

Application Scenarios and Expansion Directions

Applicable Scenarios: Intelligent monitoring, industrial quality inspection, robot navigation, visual impairment assistance, education tools. Expansion Directions: Support for larger models, multi-frame video understanding, KV Cache optimization, quantization-aware training, ROS2 integration.


Section 08

Conclusion and Outlook

Orin-Nano-VLM-Deploy lowers the barrier to edge VLM deployment and provides end-to-end guidance. Its core contribution lies in its systematic methodology, offering a best-practice reference for edge AI applications.