Zing Forum


AstroLLM: A Domain-Specific Large Language Model for Astronomical Research

AstroLLM is an open-source domain-specific large language model for astronomy and astrophysics research. It is deeply integrated with astronomical databases such as NASA ADS and SIMBAD via RAG technology, providing retrieval-augmented answers with real citations.

Large language models · Astronomy · Astrophysics · RAG · Domain-specific models · NASA ADS · SIMBAD · Open-source project
Published 2026-04-05 20:13 · Recent activity 2026-04-05 20:20 · Estimated read: 6 min

Section 01

[Main Floor] AstroLLM: A Domain-Specific Large Language Model for Astronomical Research

AstroLLM is an open-source domain-specific large language model system for astronomy and astrophysics research, designed to address the hallucination problem of general-purpose large language models in professional scientific research scenarios. It is deeply integrated with astronomical databases like NASA ADS and SIMBAD through RAG technology, providing retrieval-augmented answers with real citations, and is positioned as an intelligent research assistant for scientists.


Section 02

Project Background and Core Positioning

In astronomy, general-purpose large models struggle to provide accurate, reliable research assistance, and hallucination is especially damaging: a fabricated citation or celestial parameter can quietly derail a literature review or analysis. AstroLLM's design goal is to be a research assistant that cites real papers, queries real databases, and refuses to answer when evidence is insufficient rather than making up information. Compared with existing astronomical models (e.g., AstroSage), its differentiators include tool integration (connecting to databases such as SIMBAD and NASA ADS), a RAG architecture (real-time knowledge updates), educational adaptability (Socratic teaching for users at different levels), and hardware friendliness (the 8B-parameter model runs on consumer-grade hardware).
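The "refuse when evidence is insufficient" policy can be sketched as a simple retrieval-confidence gate. This is a hypothetical illustration of the behavior, not AstroLLM's actual implementation; the threshold, minimum-source count, and scoring are made up:

```python
# Hypothetical evidence gate: answer only when retrieval produced enough
# sufficiently relevant sources, otherwise decline. Threshold and scoring
# are illustrative placeholders, not AstroLLM's actual values.
from typing import List, Tuple

def answer_or_decline(retrieved: List[Tuple[str, float]],
                      min_score: float = 0.75,
                      min_sources: int = 2) -> str:
    """retrieved = [(citation, relevance_score), ...] from the RAG step."""
    evidence = [cite for cite, score in retrieved if score >= min_score]
    if len(evidence) < min_sources:
        return "Insufficient evidence in the literature to answer reliably."
    return "Answer grounded in: " + "; ".join(evidence)

print(answer_or_decline([("Paper A", 0.91), ("Paper B", 0.83)]))
print(answer_or_decline([("Paper C", 0.40)]))
```

The key design point is that declining is a first-class output, scored on the same retrieval evidence as a normal answer.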


Section 03

Technical Architecture Analysis

AstroLLM adopts a layered architecture:

Data and Model Layer

The base model is Qwen3-4B/8B, supervised fine-tuned with QLoRA on an astronomical literature corpus; domain knowledge is injected through low-rank (LoRA) adapters rather than full-parameter updates.
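The low-rank update at the heart of (Q)LoRA can be shown in a few lines of plain Python. The shapes and scaling follow the standard LoRA formulation, W' = W + (α/r)·B·A; this is an illustrative sketch, not AstroLLM's training code:

```python
# Minimal LoRA merge sketch (illustrative; not AstroLLM's training code).
# A frozen weight matrix W (d_out x d_in) is adapted by two small trainable
# matrices: B (d_out x r) and A (r x d_in), with rank r << min(d_out, d_in).
from typing import List

Matrix = List[List[float]]

def matmul(X: Matrix, Y: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_merge(W: Matrix, A: Matrix, B: Matrix, alpha: float, r: int) -> Matrix:
    """Return W' = W + (alpha / r) * B @ A, the merged adapted weight."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]

# Tiny example: d_out = d_in = 2, rank r = 1.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [0.0]]   # d_out x r
A = [[0.5, 0.5]]     # r x d_in
print(lora_merge(W, A, B, alpha=2.0, r=1))  # [[2.0, 1.0], [0.0, 1.0]]
```

Because only A and B are trained (and QLoRA additionally quantizes the frozen W to 4-bit), the memory footprint stays within consumer-grade hardware, which is what makes the 8B target practical.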

Retrieval and Tool Layer

The RAG system stores embeddings in PostgreSQL with the pgvector extension. The tool integration layer bridges multiple data sources: NASA ADS (15 million+ papers), SIMBAD (20 million+ celestial objects), the NASA Exoplanet Archive (5,800+ planets), NED (extragalactic object data), and VizieR (23,000+ catalogs).
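The retrieval step amounts to nearest-neighbour search over embeddings, illustrated below with a self-contained cosine-similarity ranking in plain Python. In the actual stack this ranking would run inside PostgreSQL via pgvector's cosine-distance operator (`<=>`); the document labels and 3-dimensional embeddings here are made-up placeholders:

```python
# Toy nearest-neighbour retrieval sketch. Placeholder data; in a
# pgvector-backed system the equivalent ranking runs in SQL via `<=>`.
import math
from typing import List, Tuple

def cosine_sim(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query: List[float],
          corpus: List[Tuple[str, List[float]]], k: int = 2) -> List[str]:
    """Rank (doc_id, embedding) pairs by cosine similarity to the query."""
    ranked = sorted(corpus, key=lambda item: cosine_sim(query, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

corpus = [
    ("paper_a", [0.9, 0.1, 0.0]),
    ("paper_b", [0.1, 0.9, 0.0]),
    ("m31_doc", [0.7, 0.3, 0.0]),
]
print(top_k([1.0, 0.0, 0.0], corpus))  # ['paper_a', 'm31_doc']
```

The retrieved documents (with their real bibliographic identifiers) are then placed in the model's context, which is what lets answers carry genuine citations.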

Service Layer

Inference can be served via vLLM or llama.cpp, and the web interface is built on the TanStack Start + Elysia stack.


Section 04

Development Roadmap

AstroLLM is developed in phases and is currently in Phase 0:

Phase | Timeline | Core deliverables
Phase 1 (v1) | Months 1-3 | Retrieval-augmented assistant: QLoRA SFT, RAG + ADS/SIMBAD, beta launch
Phase 2 (v2) | Months 4-8 | Serious astronomical model: full LoRA 8B, DPO training, expanded toolset
Phase 3 (v3) | Months 9-18 | Scientific tool ecosystem: model family (Nano 3B + Core 8B + Pro 32B), continuous learning
Phase 4+ (v4+) | From Year 2 | Multimodal knowledge base: AION-1 visual bridge, spectrum and light-curve processing

Section 05

Application Scenarios and Value

AstroLLM's application scenarios include:

  1. Literature review: Quickly locate relevant research based on ADS and generate review summaries with citations
  2. Celestial object query: Use natural language to query SIMBAD for astrophysical parameters
  3. Teaching assistance: Adjust the depth of explanations according to user level to support astronomy education
  4. Data analysis: Perform basic astronomical calculations and data processing in combination with Astropy
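To give a flavor of the "data analysis" scenario, a textbook astronomical calculation such an assistant might delegate to code is the parallax-to-distance conversion, d[pc] = 1/p[arcsec]. Shown here in plain Python for self-containment; AstroLLM itself would lean on Astropy's units and coordinates machinery:

```python
# Parallax -> distance, the kind of basic astronomical calculation the
# "data analysis" scenario describes (plain Python; the real assistant
# would use Astropy's units machinery).

def parallax_to_distance_pc(parallax_mas: float) -> float:
    """Distance in parsecs from a parallax in milliarcseconds:
    d[pc] = 1000 / p[mas]."""
    if parallax_mas <= 0:
        raise ValueError("parallax must be positive")
    return 1000.0 / parallax_mas

# Proxima Centauri's parallax is about 768 mas -> roughly 1.3 pc.
print(round(parallax_to_distance_pc(768.0), 2))  # 1.3
```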

Section 06

Open Source Ecosystem and Community

AstroLLM is an open-source project licensed under Apache 2.0 and actively integrates into the astronomical AI ecosystem: it draws on AstroMLab's benchmarking methodology, Multimodal Universe's multimodal datasets, and AION-1's experience as a multimodal foundation model, and it welcomes adoption and contributions from academia and industry.


Section 07

Conclusion

AstroLLM represents a typical paradigm for domain-specific large models: building a complete system of tool integration, retrieval augmentation, and knowledge updates, rather than simply fine-tuning general-purpose models. For astronomical researchers, a trustworthy AI assistant is moving from concept to reality.