Zing Forum

ai-browser-profile: Extract Personal Knowledge from Browser Data and Build a Semantic Search Database

A tool that reads local browser data (autofill, history, bookmarks, WhatsApp contacts, etc.) and builds a self-sorting SQLite database, supporting semantic search and entity association

ai-browser-profile, browser data, SQLite, semantic search, personal knowledge base, data extraction, Nomic embeddings
Published 2026-04-09 09:43 | Recent activity 2026-04-09 10:33 | Estimated read 7 min

Section 01

Introduction / Original Post: ai-browser-profile: Extract Personal Knowledge from Browser Data and Build a Semantic Search Database

ai-browser-profile is a tool that reads local browser data (autofill, history, bookmarks, WhatsApp contacts, and more) and builds a self-sorting SQLite database that supports semantic search and entity association.

Section 02

Core Concept of the Project

The tool's design goal is to consolidate data scattered across the browser into a structured knowledge base. It reads autofill information, login credentials, browsing history, bookmarks, WhatsApp contacts, LinkedIn connections, and Notion workspace data, and stores everything in a single SQLite database. On top of that, it adds a self-sorting mechanism and semantic search, so the entries that matter most are the easiest to retrieve.

Section 03

Supported Data Sources and Browsers

ai-browser-profile extracts data from several browser profile files. The Web Data SQLite file contains autofill addresses and credit card information; the Login Data SQLite file stores saved accounts and passwords; the History SQLite file records browsing history, which can be used to analyze how often each tool is used; and the Bookmarks JSON file holds bookmark data, which reflects the user's areas of interest.
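
As an illustration of the usage-frequency idea, here is a minimal sketch that aggregates visit counts per domain from a Chromium-style History file. It is not taken from the project: the function name top_domains is hypothetical, and the sketch assumes the standard Chromium urls table with url and visit_count columns. It copies the file first because a running browser usually keeps the live database locked.

```python
import shutil
import sqlite3
import tempfile
from collections import Counter
from pathlib import Path
from urllib.parse import urlparse


def top_domains(history_path, limit=10):
    """Return (domain, total_visits) pairs, most visited first.

    Hypothetical helper; assumes a Chromium-style `urls` table.
    """
    # Query a temporary copy so a lock held by the browser does not matter.
    with tempfile.TemporaryDirectory() as tmp:
        copy_path = Path(tmp) / "History"
        shutil.copy2(history_path, copy_path)
        conn = sqlite3.connect(copy_path)
        try:
            rows = conn.execute("SELECT url, visit_count FROM urls").fetchall()
        finally:
            conn.close()

    # Sum visit counts per host name.
    visits = Counter()
    for url, visit_count in rows:
        domain = urlparse(url).netloc
        if domain:
            visits[domain] += visit_count
    return visits.most_common(limit)
```

On macOS, Chrome's default profile keeps this file at ~/Library/Application Support/Google/Chrome/Default/History, so top_domains(Path.home() / "Library/Application Support/Google/Chrome/Default/History") would be a typical call.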

For more complex data sources, the project reads the browser's LevelDB stores directly: WhatsApp contacts come from IndexedDB and LinkedIn connections from Local Storage. It can also extract workspace users and page data from Notion's IndexedDB. Supported browsers include Arc, Chrome, Brave, Edge, Safari, and Firefox; macOS is currently the primary supported platform.

Section 04

Installation and Initialization

The project is installed from the command line. Running npx ai-browser-profile init sets up the working environment in one step: it creates the ~/ai-browser-profile directory, configures a Python virtual environment, and installs the core dependencies. If semantic search is needed, run npx ai-browser-profile install-embeddings to install the embedding model (approximately 180MB). System requirements: Python 3.10+ and Node.js 16+.

Section 05

Data Extraction Process

After activating the virtual environment, run python extract.py to scan all browsers and extract data. You can also limit the run to particular browsers, for example python extract.py --browsers arc chrome, or skip certain data sources to speed up the process. Extracted data is saved to memories.db by default; a custom output path can also be specified.

Section 06

Self-sorting and Semantic Deduplication Mechanism

This is one of the project's most distinctive features. Each memory record tracks appeared_count (the number of times it appeared during extraction) and accessed_count (the number of times it was accessed via queries). The most relevant memories are surfaced by calculating hit_rate, the ratio of accessed_count to appeared_count. This self-sorting mechanism lets data that is queried often naturally rise to the top.
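
The ranking described above can be sketched with plain SQL. This is an illustration under assumptions rather than the project's actual query: the column names come from the text, but the helper names, the tie-breaking on accessed_count, and the guard against a zero appeared_count are choices made for this example.

```python
import sqlite3


def demo_connection():
    """Build a tiny in-memory memories table with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE memories ("
        "key TEXT, value TEXT, appeared_count INTEGER, accessed_count INTEGER)"
    )
    conn.executemany(
        "INSERT INTO memories VALUES (?, ?, ?, ?)",
        [
            ("email", "a@example.com", 4, 2),   # hit_rate 0.5
            ("tool:vscode", "VS Code", 10, 1),  # hit_rate 0.1
            ("phone", "555-0100", 2, 2),        # hit_rate 1.0
        ],
    )
    return conn


def record_access(conn, key, value):
    """Count one query hit for a memory (hypothetical helper)."""
    conn.execute(
        "UPDATE memories SET accessed_count = accessed_count + 1 "
        "WHERE key = ? AND value = ?",
        (key, value),
    )


def top_memories(conn, limit=5):
    """Return (key, value, hit_rate) rows, highest hit_rate first."""
    return conn.execute(
        """
        SELECT key, value,
               CAST(accessed_count AS REAL) / MAX(appeared_count, 1) AS hit_rate
        FROM memories
        ORDER BY hit_rate DESC, accessed_count DESC
        LIMIT ?
        """,
        (limit,),
    ).fetchall()
```

With the demo rows, the phone entry ranks first because every appearance was matched by an access, while the frequently seen but rarely queried tool entry ranks last until record_access raises its count.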

Semantic deduplication uses the nomic-embed-text-v1.5 model to generate 768-dimensional embedding vectors. When a new entry has a cosine similarity of ≥0.92 with an existing entry and the two share the same key prefix, the old entry is marked as superseded rather than the new one being stored as a plain duplicate. This avoids redundant data while preserving the history of earlier versions.
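
The supersede check can be sketched as follows. This is a minimal illustration, not the project's code: it uses short toy vectors in place of real 768-dimensional embeddings, the dictionary layout and function names are hypothetical, and it assumes that the key prefix is the part of the key before the first colon.

```python
import math

SIMILARITY_THRESHOLD = 0.92  # threshold quoted in the text


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def key_prefix(key):
    # Assumption: "account:github.com" has the prefix "account".
    return key.split(":", 1)[0]


def find_superseded(new_entry, existing_entries):
    """Return the existing entries that the new entry would supersede."""
    return [
        entry
        for entry in existing_entries
        if not entry.get("superseded")
        and key_prefix(entry["key"]) == key_prefix(new_entry["key"])
        and cosine_similarity(entry["embedding"], new_entry["embedding"])
        >= SIMILARITY_THRESHOLD
    ]


def apply_dedup(new_entry, existing_entries):
    """Mark near-duplicates as superseded, then store the new entry."""
    for entry in find_superseded(new_entry, existing_entries):
        entry["superseded"] = True
    existing_entries.append(new_entry)
    return existing_entries
```

Requiring the same key prefix keeps two similar-looking strings of different kinds, such as an address and a contact note, from superseding each other.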

Section 07

Key-Value Pattern and Entity Association

The project uses a structured key pattern to manage different types of data. Single-value keys such as first_name and last_name automatically replace the old value with the new one; multi-value keys such as email, phone, account:github.com, and tool:vscode allow multiple values to coexist. This keeps single-valued attributes unique while still supporting attributes that legitimately have several values.
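
The difference between the two kinds of key can be sketched with an in-memory store. The names below are hypothetical and the set of single-value keys is limited to examples from the text; the actual project may keep the replaced value as a superseded record rather than dropping it.

```python
# Keys that hold exactly one current value (examples from the text).
SINGLE_VALUE_KEYS = {"first_name", "last_name"}


def is_single_value(key):
    return key in SINGLE_VALUE_KEYS


def upsert(store, key, value):
    """Insert a value into `store`, a dict mapping key -> list of values."""
    values = store.setdefault(key, [])
    if is_single_value(key):
        values[:] = [value]      # the new value replaces the old one
    elif value not in values:
        values.append(value)     # multi-value keys accumulate distinct values
    return store
```

For example, two calls with first_name leave only the latest name, while two calls with phone and different numbers leave both numbers in the list.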

The entity association function automatically identifies accounts sharing the same username or email, linking them via the same_identity relationship. This association capability is crucial for building a complete user profile.
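
One way to picture this linking step is to group accounts by their identifier and link every pair within a group. The sketch below is an assumption about how such linking could work, not the project's implementation: the function name and the tuple layout are hypothetical, and comparing identifiers case-insensitively after trimming whitespace is a choice made for this example.

```python
from collections import defaultdict
from itertools import combinations


def link_same_identity(accounts):
    """Link accounts that share a username or email.

    `accounts` is a list of (memory_id, username_or_email) pairs.
    Returns (id_a, id_b, "same_identity") tuples.
    """
    by_identifier = defaultdict(list)
    for memory_id, identifier in accounts:
        # Assumption: identifiers match case-insensitively.
        by_identifier[identifier.strip().lower()].append(memory_id)

    links = []
    for ids in by_identifier.values():
        # Every pair inside a group refers to the same identity.
        for id_a, id_b in combinations(sorted(ids), 2):
            links.append((id_a, id_b, "same_identity"))
    return links
```

Tuples of this shape could then be stored as rows in a link table, one row per related pair.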

Section 08

Database Schema Design

The database includes four core tables. The memories table stores all memory data, including keys, values, confidence, sources, count information, timestamps, and deduplication relationships. The memory_tags table tags memories with categories such as identity, contact_info, address, payment, account, tool, contact, work, knowledge, communication, social, and finance. The memory_links table records association relationships between memories. The memory_embeddings table stores the 768-dimensional embedding vectors.
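
To make the layout concrete, here is a hedged sketch of what such a schema could look like. Only the four table names and the appeared_count and accessed_count columns come from the text; every other column name, type, and constraint is an assumption for illustration, as are the create_schema and pack_embedding helpers.

```python
import sqlite3
import struct

# Illustrative schema; column names other than the counts are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS memories (
    id INTEGER PRIMARY KEY,
    key TEXT NOT NULL,
    value TEXT NOT NULL,
    confidence REAL,
    source TEXT,
    appeared_count INTEGER NOT NULL DEFAULT 0,
    accessed_count INTEGER NOT NULL DEFAULT 0,
    created_at TEXT,
    superseded_by INTEGER REFERENCES memories(id)
);
CREATE TABLE IF NOT EXISTS memory_tags (
    memory_id INTEGER NOT NULL REFERENCES memories(id),
    tag TEXT NOT NULL,
    PRIMARY KEY (memory_id, tag)
);
CREATE TABLE IF NOT EXISTS memory_links (
    source_id INTEGER NOT NULL REFERENCES memories(id),
    target_id INTEGER NOT NULL REFERENCES memories(id),
    relation TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS memory_embeddings (
    memory_id INTEGER PRIMARY KEY REFERENCES memories(id),
    embedding BLOB NOT NULL
);
"""


def create_schema(conn):
    """Create the four tables on an open SQLite connection."""
    conn.executescript(SCHEMA)
    return conn


def pack_embedding(vector):
    """Pack a list of floats as float32 bytes for the BLOB column."""
    return struct.pack(f"{len(vector)}f", *vector)
```

Stored this way, a 768-dimensional vector occupies 3072 bytes per memory, and a same_identity relationship becomes one row in memory_links.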