Zing Forum


PostgreSQL Database Auto-Discovery and Classification System Based on Large Language Models

A containerized REST API built with FastAPI that automatically discovers PostgreSQL database schemas, extracts metadata, and uses LLM services to classify data for PII (Personally Identifiable Information), recognizing 13 distinct PII categories.

Tags: LLM · PostgreSQL · Data Discovery · PII Classification · FastAPI · Data Governance · Docker · SQLAlchemy
Published 2026-04-30 02:41 · Recent activity 2026-04-30 02:53 · Estimated read: 7 min

Section 01

[Introduction] Core Overview of the LLM-Based PostgreSQL Data Discovery and Classification System

The PostgreSQL Database Auto-Discovery and Classification System Based on Large Language Models is an open-source project that addresses a core data-governance challenge: quickly understanding database content, structure, and sensitivity. Built with FastAPI as a containerized REST API, the system automatically discovers PostgreSQL database schemas, extracts metadata, and uses LLM services to classify data for PII across 13 categories. The tech stack includes FastAPI, PostgreSQL 15, SQLAlchemy, and Docker, providing an efficient tool for enterprise data compliance and management.


Section 02

Project Background and Core Objectives

With data privacy regulations like GDPR and CCPA becoming increasingly strict, enterprises need to identify sensitive information in databases. Traditional manual auditing or rule-based matching is inefficient and prone to omissions. The goal of this project is to leverage the intelligent understanding capabilities of LLMs to automatically analyze database content, identify 13 PII categories (such as emails, phone numbers, ID numbers, etc.), and improve data governance efficiency.


Section 03

System Architecture and Technical Implementation

The system adopts a modern architecture with the following core tech stack: FastAPI (Python 3.11) providing asynchronous APIs on the backend; PostgreSQL 15 as the system database storing metadata; SQLAlchemy as the ORM with Psycopg2 as the PostgreSQL driver; LLM services compatible with the OpenAI API; Docker and Docker Compose for containerized deployment; and HTTP Basic Auth for security. The modular design follows the MVC pattern and is divided into six modules: core (environment and security), routers (routes), services (business logic), database (engine and sessions), models (ORM models), and schemas (Pydantic validation), ensuring maintainability and scalability.


Section 04

Detailed Explanation of Core Functions

The system provides six core service endpoints:

1. Authentication (GET /auth): manage security tokens;
2. Metadata extraction (POST /db/metadata): connect to the target database, extract table and column metadata, and assign unique IDs;
3. Metadata list (GET /metadata): display an overview of stored records;
4. Metadata details (GET /metadata/{id}): retrieve the table and column structure of a specific record;
5. Metadata deletion (DELETE /metadata/{id}): clean up stored data;
6. Data classification (POST /classify): extract sample data and use the LLM to return a probability distribution over PII categories, with confidence evaluation.
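The metadata-extraction step (POST /db/metadata) can be sketched with SQLAlchemy's inspector, which reads table and column metadata from a live connection. The project targets PostgreSQL via Psycopg2; this sketch uses an in-memory SQLite database only so it runs standalone, and the table is invented for illustration:

```python
from sqlalchemy import create_engine, inspect, text

# Stand-in for the target database (a real deployment would use a
# postgresql+psycopg2:// URL supplied in the request).
engine = create_engine("sqlite://")
with engine.begin() as conn:
    conn.execute(text("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)"))

# The inspector discovers schemas without any hand-written catalog queries.
inspector = inspect(engine)
metadata = {
    table: [
        {"name": col["name"], "type": str(col["type"]), "nullable": col["nullable"]}
        for col in inspector.get_columns(table)
    ]
    for table in inspector.get_table_names()
}
```

The resulting `metadata` dictionary maps each table name to its column descriptions, which the service could then persist with assigned IDs.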


Section 05

Unique Advantages of LLM in Data Classification

Traditional regex matching struggles with complex data formats (e.g., distinguishing encrypted credit card numbers from product codes). LLMs bring semantic understanding that lets them: 1. distinguish data types with similar formats; 2. identify sensitive information in non-standard formats (e.g., hand-transcribed ID numbers); 3. output probability distributions instead of binary judgments, aiding risk assessment; 4. adapt to new formats without rule-base updates, overcoming the limitations of traditional methods.
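The probability-distribution idea can be sketched as follows: column samples are sent to an OpenAI-compatible LLM, and its JSON reply is normalized into a distribution over PII categories rather than a yes/no verdict. The prompt wording, the category list, and the canned reply below are assumptions for illustration, not the project's actual prompt or taxonomy:

```python
import json

# Illustrative subset of PII categories (the project supports 13).
PII_CATEGORIES = ["email", "phone_number", "national_id", "credit_card", "none"]

def build_prompt(column_name: str, samples: list[str]) -> str:
    """Assemble a classification prompt asking for a JSON probability map."""
    return (
        f"Classify the following column samples into the PII categories "
        f"{PII_CATEGORIES}. Reply with a JSON object mapping each category "
        f"to a probability summing to 1.\n"
        f"Column: {column_name}\nSamples: {samples}"
    )

def parse_distribution(raw: str) -> dict[str, float]:
    """Parse the model's JSON reply, drop unknown keys, renormalize to sum 1."""
    scores = {
        k: max(float(v), 0.0)
        for k, v in json.loads(raw).items()
        if k in PII_CATEGORIES
    }
    total = sum(scores.values()) or 1.0
    return {k: v / total for k, v in scores.items()}

# Canned model reply; a real call would go through the LLM service.
reply = '{"email": 0.92, "phone_number": 0.03, "none": 0.05}'
dist = parse_distribution(reply)
best = max(dist, key=dist.get)  # highest-probability category
```

Returning the full distribution (rather than only `best`) is what enables the confidence evaluation mentioned above: a flat distribution signals an ambiguous column that may need manual review.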


Section 06

Deployment Steps and Application Scenarios

Deployment is straightforward: 1. clone the code repository; 2. create a .env file based on .env.example and configure the database connection and OpenAI API key; 3. run docker-compose up --build to start the service. The service listens on local port 8000 and ships Swagger UI documentation at http://localhost:8000/docs. Application scenarios include data compliance auditing, data migration assessment, data asset management, and security risk assessment.
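A .env file for step 2 might look like the fragment below; the variable names are guesses at what .env.example could contain, not the project's actual keys:

```
# Hypothetical .env sketch -- consult the project's .env.example for real names.
DATABASE_URL=postgresql+psycopg2://postgres:postgres@db:5432/discovery
OPENAI_API_KEY=sk-...
OPENAI_BASE_URL=https://api.openai.com/v1
BASIC_AUTH_USERNAME=admin
BASIC_AUTH_PASSWORD=change-me
```

Because the LLM endpoint is OpenAI-compatible, a base-URL variable like the one above is a common way to point the same client at a self-hosted model.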


Section 07

Summary and Future Outlook

This project combines LLM intelligence with traditional data engineering to address data governance challenges. Its modular architecture, containerized deployment, and clear API design make it easy to integrate into existing data infrastructure. As LLM technology advances, the system can further improve in accuracy, supported data types, and real-time processing, making it a practical open-source tool for enterprises handling legacy databases or compliance reviews.