Zing Forum


Building a Mini Search Engine from Scratch: A Hands-On Project to Deeply Understand Google's Core Technologies

An open-source project by an SEO practitioner to understand Google's underlying technical principles by building a complete search engine from scratch, covering the full workflow including crawlers, inverted indexes, PageRank, BM25 ranking, and AI overview generation.

Search Engine, SEO, Crawler, Inverted Index, PageRank, BM25, RAG, AI Overview, Information Retrieval, Open-Source Project
Published 2026-03-27 20:14 | Recent activity 2026-03-27 20:19 | Estimated read 6 min

Section 01

[Introduction] Building a Mini Search Engine from Scratch: An SEO Practitioner's Hands-On Project on Google's Core Technologies

Hoang Duc Viet, AI Director at the Vietnamese agency SEONGON, launched the open-source project mini-search-engine in March 2026. By building a complete search engine from scratch, covering the crawler, inverted index, PageRank, BM25 ranking, and AI overview generation, he aims to understand the technical principles underlying Google. The project is both a technical experiment and a way to learn the core search algorithms, and it shows how traditional search can be combined with AI.


Section 02

Project Background: Why Should an SEO Practitioner Build a Search Engine from Scratch?

As the AI Director of SEONGON, Vietnam's largest Google Ads and SEO agency, Hoang Duc Viet chose to build a mini search engine from scratch out of a desire to explore Google's core mechanisms. He believes that modern AI breakthroughs such as the Transformer architecture and BERT stem from the fundamental problems that search needs to solve (understanding language and ranking massive document collections), and he points out that "Attention Is All You Need", the paper that laid the foundation for the Transformer, was published by Google. Launched in March 2026, the project is an open-source experiment in understanding the essence of search technology.


Section 03

System Architecture & Core Components: Replicating Google's Search Pipeline

The project builds an end-to-end search system split into an offline pipeline (crawling, index building, PageRank calculation, vector embedding) and an online pipeline (query tokenization, index lookup, BM25 scoring, AI overview generation). Core components include:

  1. Crawler: breadth-first (BFS) strategy, respects robots.txt, waits 1.5 seconds between requests, crawls 1,000 football-related pages;
  2. Inverted Index: 145,736 unique terms, 1,057,023 records, millisecond-level keyword lookup;
  3. PageRank: damping factor of 0.85, 20 iterations, handles dangling nodes;
  4. BM25: k1=1.2, b=0.75, combines term frequency, inverse document frequency, and document length;
  5. Semantic Search: Voyage-3-lite generates 768-dimensional vectors, stored with pgvector, so that queries can match semantically similar concepts;
  6. AI Overview: hybrid retrieval plus Groq API calls to Llama 3.3 70B, generating streaming answers with citations that are cached for 24 hours.
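
To make items 2 and 4 more concrete, here is a minimal Python sketch of an inverted index scored with BM25, using the k1=1.2 and b=0.75 values quoted above. It is not the project's code: the whitespace tokenizer, the three sample documents, the function names, and the particular IDF variant are all assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict

K1, B = 1.2, 0.75  # parameter values quoted in the component list


def tokenize(text):
    # Hypothetical tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def build_index(docs):
    """Build an inverted index and document lengths from {doc_id: text}."""
    index = defaultdict(dict)  # term -> {doc_id: term frequency}
    lengths = {}
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        lengths[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return index, lengths


def bm25_search(query, index, lengths):
    """Score only the documents found in the postings of the query terms."""
    n_docs = len(lengths)
    avg_len = sum(lengths.values()) / n_docs
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        # One common IDF variant (assumed here); always positive.
        idf = math.log(1 + (n_docs - len(postings) + 0.5) / (len(postings) + 0.5))
        for doc_id, tf in postings.items():
            # Length normalisation: longer-than-average documents are penalised.
            norm = 1 - B + B * lengths[doc_id] / avg_len
            scores[doc_id] += idf * tf * (K1 + 1) / (tf + K1 * norm)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up sample documents, not taken from the project's crawl.
docs = {
    "a": "messi scores twice as barcelona win",
    "b": "transfer news ronaldo joins new club",
    "c": "messi and ronaldo rivalry in football history messi messi",
}
index, lengths = build_index(docs)
results = bm25_search("messi", index, lengths)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

Because the lookup touches only the postings of the query terms, document "b" is never scored at all, which is the property that keeps keyword lookup fast on a much larger index.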

Section 04

Technology Selection & Visualization: Modern Web Development & Transparent Search Process

The tech stack uses a Python 3.12 + FastAPI backend, a Next.js 16 + React 19 frontend, and Tailwind v4 for styling; AI capabilities rely on Groq's Llama 3.3 70B and Voyage AI embeddings. The project is deployed on the Railway platform with an automated pipeline. Index building takes 35 minutes in total, including 25 minutes for crawling and 5 minutes for RAG. A highlight is the visualization interface: the left side shows the offline process and the right side shows online queries, and clicking a node reveals inverted index records, PageRank scores, and other data, which is useful for both teaching and debugging.


Section 05

Future Plans: Feature Enhancement & Community Knowledge Sharing

The author's planned enhancements include Sports OneBox real-time match cards, automatic crawling with freshness tracking, query intent detection, incremental indexing, stemming, knowledge graphs, and spell correction. He is also writing a series of blog posts; "Why Build a Search Engine" and "Designing a Web Crawler" have been published, and future topics will cover inverted indexes and BM25 + PageRank ranking, among others.


Section 06

SEO Insights: Understand Search Logic, Grasp AI Search Trends

The project offers two insights for SEO practitioners. First, understanding how crawlers discover pages, how indexes process content, and how ranking algorithms evaluate relevance is the foundation of any optimization strategy; seeing PageRank and BM25 calculated firsthand reshapes one's understanding of "high-quality backlinks" and "content relevance". Second, the project demonstrates a hybrid architecture that combines traditional search with large language models (RAG), keeping the accuracy of search while gaining the flexibility of generation. This may represent the future direction of search.
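
The backlink point can be seen in a small PageRank calculation. The Python sketch below uses the damping factor of 0.85 and the 20 iterations mentioned in Section 03; the link graph, the page names, and the choice to spread the rank of dangling pages evenly over all pages are assumptions for illustration, not details confirmed by the project.

```python
def pagerank(links, damping=0.85, iterations=20):
    """Iterative PageRank over {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Assumed dangling-node handling: rank held by pages without
        # outlinks is redistributed evenly across all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if not targets:
                continue
            # Each page splits its rank equally among its outlinks.
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical site graph: "news" and "fan_page" each have exactly one
# backlink, but from pages of very different importance.
links = {
    "home": ["news", "stats"],
    "news": ["home"],
    "stats": ["home"],
    "blog": ["home", "fan_page"],
    "fan_page": [],  # dangling page with no outlinks
}
ranks = pagerank(links)
for page, score in sorted(ranks.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{page:9s} {score:.4f}")
```

In this toy graph "news" ends up ranked above "fan_page" even though both have a single backlink, because the link to "news" comes from the heavily linked "home" page while the link to "fan_page" comes from a page nobody links to.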