# Deploying InternVL3 on Jetson Orin Nano: Engineering Practice for Edge Vision-Language Models

> A complete guide to deploying the InternVL3 vision-language model on the 8GB Jetson Orin Nano using TensorRT-LLM, achieving 5-6x inference speedup and over 600 tokens/sec throughput for edge AI applications.

- Board: [Openclaw Llm](https://www.zingnex.cn/en/forum/board/openclaw-llm)
- Published: 2026-04-14T10:15:17.000Z
- Last activity: 2026-04-14T10:24:22.006Z
- Popularity: 163.8
- Keywords: Jetson Orin Nano, TensorRT-LLM, InternVL3, vision-language model, edge AI, model quantization, inference optimization, JetPack, edge deployment, VLM
- Page link: https://www.zingnex.cn/en/forum/thread/jetson-orin-nano-internvl3
- Canonical: https://www.zingnex.cn/forum/thread/jetson-orin-nano-internvl3
- Markdown source: floors_fallback

---

## Introduction: Core Practice Guide for Deploying InternVL3 on Jetson Orin Nano

This article documents how to deploy the InternVL3 vision-language model on the 8GB Jetson Orin Nano using TensorRT-LLM, achieving a 5-6x inference speedup and over 600 tokens/sec throughput. The Orin-Nano-VLM-Deploy project covers the entire workflow, from environment preparation through model conversion to performance optimization, and offers practical experience that edge AI developers can reuse.

## Background: Memory Constraints of Edge AI and Deployment Challenges of InternVL3

Edge devices such as the Jetson Orin Nano 8GB face strict memory constraints. Even after quantization, InternVL3 (1B/2B parameters) remains under memory pressure and can run slowly. The Orin-Nano-VLM-Deploy project addresses these challenges through TensorRT-LLM optimization and documents the engineering pitfalls encountered along with their solutions.
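To make the memory pressure concrete, here is a back-of-the-envelope sketch that estimates weight storage alone at different precisions. The nominal parameter counts (exactly 1e9 and 2e9) are assumptions for illustration; real footprints also include the vision encoder, activations, the KV cache, and runtime overhead.

```python
# Rough weight-memory estimate; ignores activations, KV cache, and runtime overhead.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gib(num_params: float, precision: str) -> float:
    """Return the approximate weight footprint in GiB for a given precision."""
    return num_params * BYTES_PER_PARAM[precision] / 2**30


# Nominal model sizes (assumed round numbers, not exact parameter counts).
for name, params in [("1B", 1e9), ("2B", 2e9)]:
    row = ", ".join(
        f"{precision}: {weight_gib(params, precision):.2f} GiB"
        for precision in BYTES_PER_PARAM
    )
    print(f"{name} -> {row}")
```

Because the Orin Nano's 8GB is shared between the CPU and the GPU, these weights compete with the operating system and every other buffer for the same pool.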

## Hardware Environment and System Preparation Steps

The setup is optimized for Jetson Orin Nano 8GB with JetPack 6.2.1:

1. Flash the device using SDK Manager (recovery mode plus a USB connection) and install the CUDA components.
2. If the installation appears to hang, wait patiently or restart it.
3. Install the jtop monitoring tool.
4. Set the MAXN Super power mode and run `sudo jetson_clocks` to boost clock frequencies.
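After setting the power mode, it can help to confirm that it actually took effect. The sketch below parses the output of `nvpmodel -q`; the `NV Power Mode: <name>` output format and the mode name `MAXN_SUPER` in the sample are assumptions, so check them against what your device prints. The query may also require `sudo` on some setups.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_power_mode(nvpmodel_output: str) -> Optional[str]:
    """Extract the mode name from `nvpmodel -q` style output, or None if absent."""
    match = re.search(r"NV Power Mode:\s*(\S+)", nvpmodel_output)
    return match.group(1) if match else None


def current_power_mode() -> Optional[str]:
    """Query the active power mode; returns None when nvpmodel is not installed."""
    if shutil.which("nvpmodel") is None:
        return None
    result = subprocess.run(["nvpmodel", "-q"], capture_output=True, text=True)
    return parse_power_mode(result.stdout)


# Sample text in the assumed output format (not captured from a real device).
sample = "NV Power Mode: MAXN_SUPER\n2\n"
print(parse_power_mode(sample))
```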

## TensorRT-LLM Environment Setup and Model Conversion

TensorRT-LLM setup:

1. Install system dependencies.
2. Install the NVIDIA precompiled PyTorch build.
3. Create a 30GB temporary swap file.
4. Build for the SM87 architecture.

Model conversion uses `pt2engine.py` in three stages: visual ONNX export, visual engine building, and language engine building. The three stages take about 7 minutes in total.
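The temporary swap step can be scripted. The following is a minimal sketch using the standard Linux `fallocate`, `mkswap`, `swapon`, and `swapoff` tools; the path `/swapfile_build` is a hypothetical name, and the script only prints the commands unless `dry_run=False` is passed.

```python
import shlex
import subprocess
from typing import List


def swap_commands(path: str = "/swapfile_build", size_gb: int = 30) -> List[List[str]]:
    """Commands that create and enable a temporary swap file."""
    return [
        ["sudo", "fallocate", "-l", f"{size_gb}G", path],
        ["sudo", "chmod", "600", path],  # swap files must not be world-readable
        ["sudo", "mkswap", path],
        ["sudo", "swapon", path],
    ]


def teardown_commands(path: str = "/swapfile_build") -> List[List[str]]:
    """Commands that disable and delete the swap file once the build is done."""
    return [["sudo", "swapoff", path], ["sudo", "rm", "-f", path]]


def run(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command, and execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


run(swap_commands())      # dry run: prints the commands without executing them
run(teardown_commands())  # run after the build finishes
```

Keeping creation and teardown side by side makes it harder to forget the removal step before moving on to inference.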

## Inference Optimization and Performance Analysis

Core metrics: a 5-6x speedup and over 600 tokens/sec throughput.

Key insights:

1. Small batches are limited by memory bandwidth; large batches are limited by compute.
2. The KV cache is the main memory consumer at medium batch sizes.
3. Pre-generate engines ahead of time for production use.
4. The benefit of INT4/INT8 quantization depends on memory bandwidth and on whether the device is already saturated.
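The bandwidth-versus-compute behavior and the KV cache growth can be illustrated with a simplified roofline-style model. This is a sketch under stated assumptions, not the project's measurement method: every hardware and model number below (parameter count, bandwidth, compute rate, layer and head counts) is an illustrative placeholder, and the model ignores KV cache reads, the vision encoder, and kernel overheads.

```python
def kv_cache_bytes(batch, seq_len, layers, kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size: keys and values (factor 2) for every layer, head, and token."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


def decode_tokens_per_sec(batch, params, weight_bytes, bandwidth, flops):
    """Upper bound on decode throughput in a simplified roofline model.

    Each decode step streams the weights once (shared by the whole batch) and
    performs roughly 2 FLOPs per parameter for each sequence in the batch.
    """
    memory_time = weight_bytes / bandwidth
    compute_time = 2 * params * batch / flops
    return batch / max(memory_time, compute_time)


# Illustrative placeholders, not measured or vendor-specified values.
PARAMS = 1e9                 # nominal parameter count
WEIGHT_BYTES = PARAMS * 0.5  # INT4 weights, 0.5 bytes per parameter
BANDWIDTH = 100e9            # bytes per second
FLOPS = 10e12                # operations per second

for batch in (1, 4, 16, 64, 256):
    bound = decode_tokens_per_sec(batch, PARAMS, WEIGHT_BYTES, BANDWIDTH, FLOPS)
    kv = kv_cache_bytes(batch, seq_len=1024, layers=24, kv_heads=2, head_dim=64)
    print(f"batch={batch:4d}  bound={bound:8.0f} tok/s  kv_cache={kv / 2**20:7.1f} MiB")
```

In this model the bound grows linearly with batch size while the memory term dominates, then flattens once compute takes over. Halving `WEIGHT_BYTES` (for example INT8 to INT4) raises the bound only in the bandwidth-limited region, and the KV cache grows linearly with both batch size and sequence length.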

## Inference Deployment and Common Pitfall Solutions

Run inference with the `engine_infer.py` script; for production, storing the engines on NVMe is recommended.

Common pitfalls:

- JetPack version compatibility
- PyTorch wheel packages (use the NVIDIA precompiled build)
- Memory management: use swap during the build and remove it at runtime
- Dependency version conflicts
- USB connection stability
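For the memory-management pitfall, a small pre-flight check can warn when swap is still enabled before inference starts. The sketch below reads the `SwapTotal` field from `/proc/meminfo`; the helper names are hypothetical and not part of the project. Note that it counts any swap device, including compressed-RAM swap if the system has it enabled.

```python
import re
from pathlib import Path


def swap_total_kib(meminfo_text: str) -> int:
    """Return SwapTotal in KiB from /proc/meminfo-style text (0 if the field is missing)."""
    match = re.search(r"^SwapTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    return int(match.group(1)) if match else 0


def swap_is_active(meminfo_path: str = "/proc/meminfo") -> bool:
    """Report whether any swap is configured; False when the file does not exist."""
    path = Path(meminfo_path)
    if not path.exists():
        return False
    return swap_total_kib(path.read_text()) > 0


if swap_is_active():
    print("Warning: swap is still enabled; consider disabling it before inference.")
```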

## Application Scenarios and Expansion Directions

Applicable scenarios: intelligent monitoring, industrial quality inspection, robot navigation, assistance for visually impaired users, and educational tools. Expansion directions: support for larger models, multi-frame video understanding, KV cache optimization, quantization-aware training, and ROS2 integration.

## Conclusion and Outlook

Orin-Nano-VLM-Deploy lowers the barrier to edge VLM deployment and provides end-to-end guidance. Its core contribution is a systematic methodology that can serve as a best-practice reference for edge AI applications.
