OpenRTLSet: An Open-Source Verilog Dataset for Hardware Design with Large Language Models

OpenRTLSet is an open-source hardware design dataset with over 130,000 Verilog samples. Each sample is paired with a natural language description generated by DeepSeek-R1, and the dataset supports fine-tuning of models such as Qwen and Granite.

Tags: hardware design, Verilog, open-source dataset, chip design, DeepSeek-R1, HDL, code generation
Published 2026-06-09 09:17. Recent activity 2026-06-10 09:18. Estimated read 6 min.

Section 01

[Introduction] OpenRTLSet: An Open-Source Verilog Dataset for Hardware Design with Large Language Models

OpenRTLSet is an open-source Verilog dataset for training large language models on hardware design. It contains over 130,000 samples, each paired with a natural language description generated by DeepSeek-R1, and it supports fine-tuning of models such as Qwen and Granite. It targets two bottlenecks in hardware design automation: the scarcity of HDL training data, and the commercial licensing restrictions on the data that does exist.

Section 02

Background: Data Bottlenecks in Hardware Design Automation

As large language models have improved at code generation, hardware design automation has become an important direction for AI-assisted chip design. However, training data in hardware description languages (HDLs) such as Verilog is scarce, and much of it is restricted by commercial licensing, which slows research. Existing datasets have three major issues:

  1. Limited scale: There are too few public Verilog samples to support large model training;
  2. Narrow sources: Most samples are textbook examples or simple circuits and lack the complexity of industrial designs;
  3. Restricted licensing: Commercial datasets carry strict usage terms, which impedes openness and reproducibility.

Section 03

Methodology: Composition of OpenRTLSet Dataset and Generation of Natural Language Descriptions

OpenRTLSet is presented as the largest fully open-source hardware design dataset to date. It contains about 131,000 Verilog samples drawn from three sources:

  1. GitHub repository code (102,000 modules): Real code from open-source hardware projects, covering complex designs such as processors and interfaces;
  2. VHDL-translated modules (5,000 modules): Automated tools convert VHDL to Verilog to broaden diversity;
  3. Synthesizable C/C++ translations (24,000 modules): High-level synthesis (HLS) tools generate synthesizable Verilog from C/C++, introducing algorithm-level semantics.

In addition, the DeepSeek-R1 reasoning model is used to generate a paired natural language description for each sample. The resulting "description-code" pairs are suitable for instruction fine-tuning of large models.
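
The composition figures and the description-code pairing described above can be sketched in a few lines of Python. This is a minimal illustration, not the OpenRTLSet pipeline itself: the record fields (`instruction`, `output`, `source`), the prompt wording, and the helper names `source_shares` and `make_instruction_pair` are assumptions introduced here, and the sample description is written by hand rather than generated by DeepSeek-R1.

```python
import json

# Module counts per source, as reported for OpenRTLSet (rounded to thousands).
SOURCE_COUNTS = {
    "github": 102_000,
    "vhdl_translated": 5_000,
    "hls_c_cpp": 24_000,
}


def source_shares(counts):
    """Return each source's share of the total, as a percentage rounded to 0.1."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


def make_instruction_pair(description, verilog_code, source):
    """Build one description-to-code record (field names are hypothetical)."""
    prompt = (
        "Write a Verilog module that matches this description:\n"
        + description.strip()
    )
    return {"instruction": prompt, "output": verilog_code.strip(), "source": source}


# Hand-written example; in the dataset the description would come from the LLM.
example = make_instruction_pair(
    description="A 2-to-1 multiplexer with 1-bit inputs a and b, select s, and output y.",
    verilog_code="""
module mux2(input a, input b, input s, output y);
  assign y = s ? b : a;
endmodule
""",
    source="github",
)

if __name__ == "__main__":
    print(sum(SOURCE_COUNTS.values()))   # total number of modules
    print(source_shares(SOURCE_COUNTS))  # percentage per source
    print(json.dumps(example))           # one JSONL line
```

Writing one JSON object per line (JSONL) is a common layout for instruction fine-tuning data, which is why the sketch serializes the record with `json.dumps`. The actual file format used by OpenRTLSet is not stated in this summary.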

Section 04

Evidence: Technical Explorations and Experimental Results

The team carried out three technical explorations around OpenRTLSet:

  1. Verilator context enhancement: C++ files generated by Verilator are supplied as additional context to help models understand the dynamic behavior of the hardware;
  2. Quantization comparison: INT4 (smaller model size and lower inference cost) is compared with BF16 (higher precision) to measure the effect on performance;
  3. Model scale effect: Models from the Qwen series (Alibaba's open-source model family) and the Granite series (IBM's open-source model family, which includes code models) are evaluated at sizes from 7B to 32B parameters.

According to the reported results, open-source methods can match or even outperform proprietary solutions.
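
To give a sense of what is at stake in the INT4 versus BF16 comparison, the weight storage for the smallest and largest model sizes mentioned above can be estimated with simple arithmetic. This is a back-of-the-envelope sketch, not a measurement from the OpenRTLSet experiments: it assumes exactly 16 or 4 bits per weight and ignores activations, the KV cache, and the scale and zero-point metadata that real INT4 formats store. The helper name `weight_memory_gb` is introduced here for illustration.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate memory for model weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


BITS = {"BF16": 16, "INT4": 4}
MODEL_SIZES = {"7B": 7e9, "32B": 32e9}

if __name__ == "__main__":
    for name, params in MODEL_SIZES.items():
        for fmt, bits in BITS.items():
            print(f"{name} {fmt}: {weight_memory_gb(params, bits):.1f} GB")
```

Under these assumptions a 7B model needs about 14 GB of weights in BF16 and about 3.5 GB in INT4, while a 32B model needs about 64 GB and 16 GB respectively. The fourfold reduction is the cost saving that the comparison weighs against any loss in generation quality.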

Section 05

Conclusion: Application Value and Industry Significance of OpenRTLSet

The release of OpenRTLSet matters in three ways:

  1. Lower research barriers: A large-scale open dataset lets researchers run studies such as large model fine-tuning without relying on expensive commercial data;
  2. A stronger open-source ecosystem: The dataset lays a foundation for open-source research in hardware design and could lead to assistant tools for hardware engineers similar to GitHub Copilot;
  3. Faster chip design iteration: Fine-tuned models can help engineers generate Verilog prototypes quickly, reduce repetitive coding, and shorten the design cycle.

Section 06

Outlook: Intelligence and Democratization of Hardware Design Automation

OpenRTLSet fills a gap in the hardware design field, which has lacked large-scale open-source datasets. By combining the reasoning capabilities of DeepSeek-R1 with diverse data sources, it provides a solid foundation for training high-quality Verilog generation models. As more work builds on this dataset, hardware design automation is expected to advance quickly, making chip design both more intelligent and more widely accessible.