Data Acquisition and Preprocessing Process
The data source for this project is the Lichess public database, which contains millions of complete game records, some of them annotated. The raw data is distributed in PGN (Portable Game Notation), a standard plain-text chess format that stores each game's metadata (players, ratings, result) as header tags, followed by the moves in algebraic notation together with optional comments and annotations such as clock times.
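To make the PGN layout concrete, here is a minimal sketch that splits a single game into its header tags and a flat list of moves using only the Python standard library. The sample game, the player names, and the function name parse_pgn_game are illustrative assumptions, not part of the project. The sketch ignores variations and numeric annotation glyphs, so a production pipeline would more likely rely on a dedicated PGN library.

```python
import re
from typing import Dict, List, Tuple

# Illustrative sample only; player names and ratings are made up.
SAMPLE_PGN = """
[Event "Rated Blitz game"]
[White "player_a"]
[Black "player_b"]
[Result "1-0"]
[WhiteElo "1500"]
[BlackElo "1480"]

1. e4 { [%clk 0:05:00] } 1... e5 { [%clk 0:05:00] } 2. Qh5 Nc6 3. Bc4 Nf6 4. Qxf7# 1-0
"""

TAG_RE = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')   # header line: [Key "Value"]
COMMENT_RE = re.compile(r"\{[^}]*\}")             # brace comments, e.g. clock times
MOVE_NUMBER_RE = re.compile(r"\d+\.+")            # "1." or "1..."
RESULTS = {"1-0", "0-1", "1/2-1/2", "*"}          # game termination markers


def parse_pgn_game(text: str) -> Tuple[Dict[str, str], List[str]]:
    """Return (header tags, moves) for one PGN game.

    Simplified: variations in parentheses and NAGs ($n) are not handled.
    """
    headers: Dict[str, str] = {}
    movetext_lines: List[str] = []
    for raw_line in text.strip().splitlines():
        line = raw_line.strip()
        match = TAG_RE.match(line)
        if match:
            headers[match.group(1)] = match.group(2)
        elif line:
            movetext_lines.append(line)

    movetext = COMMENT_RE.sub(" ", " ".join(movetext_lines))
    movetext = MOVE_NUMBER_RE.sub(" ", movetext)
    moves = [token for token in movetext.split() if token not in RESULTS]
    return headers, moves


headers, moves = parse_pgn_game(SAMPLE_PGN)
print(headers["Result"], len(moves), "plies")
```

The header dictionary gives direct access to fields such as WhiteElo and BlackElo, which later steps can use for rating-based filtering.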
The data preprocessing phase involves several key tasks. The first is game parsing: converting PGN text into a machine-readable representation of the board state. Because PGN records moves rather than positions, each game has to be replayed from the starting position to recover the board after every move. A common representation is an 8x8 matrix encoding, in which each square holds a number identifying the piece on it (e.g., 1 = white pawn, -1 = black pawn, 2 = white knight, -2 = black knight, and so on), with empty squares typically encoded as 0.
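The matrix encoding can be sketched as follows. The example assumes the position is already available as a FEN string (for instance, produced by a chess library while replaying the PGN moves), since FEN lists the pieces rank by rank. Only the pawn and knight codes come from the text above; the codes for bishop, rook, queen, and king are assumptions chosen to continue the pattern, and fen_to_matrix is a hypothetical helper name.

```python
from typing import List

# 1 = pawn and 2 = knight follow the text; 3..6 are assumed for illustration.
PIECE_CODES = {"p": 1, "n": 2, "b": 3, "r": 4, "q": 5, "k": 6}

START_FEN = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"


def fen_to_matrix(fen: str) -> List[List[int]]:
    """Encode the piece-placement field of a FEN string as an 8x8 matrix.

    Row 0 is rank 8 and row 7 is rank 1. White pieces are positive,
    black pieces are negative, and empty squares are 0.
    """
    placement = fen.split()[0]
    ranks = placement.split("/")
    if len(ranks) != 8:
        raise ValueError("expected 8 ranks in FEN piece placement")

    matrix: List[List[int]] = []
    for rank in ranks:
        row: List[int] = []
        for ch in rank:
            if ch.isdigit():
                row.extend([0] * int(ch))      # a digit means that many empty squares
            else:
                code = PIECE_CODES[ch.lower()]
                row.append(code if ch.isupper() else -code)
        if len(row) != 8:
            raise ValueError("expected 8 squares per rank")
        matrix.append(row)
    return matrix


for row in fen_to_matrix(START_FEN):
    print(row)
```

A signed integer per square is compact and easy to inspect, but it imposes an artificial ordering on piece types, so it may be worth comparing against a one-plane-per-piece encoding before settling on a network input format.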
The second is feature engineering. In addition to the raw board state, the system extracts auxiliary features such as king safety, control of the center, piece activity, and pawn structure. These features help the neural network capture the strategic meaning of a position rather than merely memorizing where specific pieces stand.
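As a rough illustration of how such features might be computed from the 8x8 matrix, the sketch below derives two deliberately crude proxies: occupancy of the four central squares (standing in for control of the center) and a doubled-pawn count (standing in for pawn structure). These definitions and all function names are assumptions for the example; the project's actual feature definitions are not specified here and are likely more refined.

```python
from typing import Dict, List

# Board convention: row 0 = rank 8, column 0 = file a,
# positive = white, negative = black, absolute value 1 = pawn, 0 = empty.
Board = List[List[int]]

CENTER_SQUARES = [(3, 3), (3, 4), (4, 3), (4, 4)]  # d5, e5, d4, e4


def center_occupancy(board: Board) -> int:
    """White pieces minus black pieces standing on the four central squares."""
    score = 0
    for row, col in CENTER_SQUARES:
        value = board[row][col]
        if value > 0:
            score += 1
        elif value < 0:
            score -= 1
    return score


def doubled_pawns(board: Board, pawn_code: int) -> int:
    """Count surplus pawns sharing a file (pawn_code is 1 for white, -1 for black)."""
    total = 0
    for col in range(8):
        pawns_on_file = sum(1 for row in range(8) if board[row][col] == pawn_code)
        total += max(0, pawns_on_file - 1)
    return total


def extract_features(board: Board) -> Dict[str, int]:
    return {
        "center_occupancy": center_occupancy(board),
        "white_doubled_pawns": doubled_pawns(board, 1),
        "black_doubled_pawns": doubled_pawns(board, -1),
    }


# Toy position: white pawns on e4 and e3, black pawn on d6.
board: Board = [[0] * 8 for _ in range(8)]
board[4][4] = 1    # e4
board[5][4] = 1    # e3
board[2][3] = -1   # d6

features = extract_features(board)
print(features)
```

Features like these can be concatenated with the flattened board matrix or fed to the network as separate inputs.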
The third is data cleaning. Overly short games (for example, games that ended abnormally through timeout or disconnection), duplicate games, and games suspected of engine cheating need to be filtered out. Depending on the target application, stratified sampling by player rating may also be required so that the training data covers playing styles from beginner to grandmaster level.
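The length filter, de-duplication, and rating-stratified sampling could look roughly like the sketch below. It assumes each game is a dictionary with hypothetical keys "moves", "white_elo", and "black_elo"; the thresholds MIN_PLIES and BUCKET_WIDTH are placeholder values, not tuned settings. Treating games with identical move sequences as duplicates is itself a simplifying assumption, and detecting engine cheating is out of scope for this example.

```python
import random
from collections import defaultdict
from typing import Dict, List

Game = Dict[str, object]

MIN_PLIES = 20       # assumed cutoff for "overly short" games
BUCKET_WIDTH = 400   # assumed width of each rating stratum


def clean_games(games: List[Game]) -> List[Game]:
    """Drop games that are too short or repeat an already seen move sequence."""
    seen = set()
    kept: List[Game] = []
    for game in games:
        moves = tuple(game["moves"])
        if len(moves) < MIN_PLIES or moves in seen:
            continue
        seen.add(moves)
        kept.append(game)
    return kept


def stratified_sample(games: List[Game], per_bucket: int, seed: int = 0) -> List[Game]:
    """Sample up to per_bucket games from each average-rating bucket."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for game in games:
        average = (game["white_elo"] + game["black_elo"]) // 2
        buckets[average // BUCKET_WIDTH].append(game)

    sample: List[Game] = []
    for key in sorted(buckets):
        group = buckets[key]
        sample.extend(rng.sample(group, min(per_bucket, len(group))))
    return sample


raw_games: List[Game] = [
    {"moves": ["e4", "e5"] * 15, "white_elo": 1000, "black_elo": 1000},
    {"moves": ["e4", "e5"] * 15, "white_elo": 1000, "black_elo": 1000},  # duplicate
    {"moves": ["e4", "e5", "Qh5"], "white_elo": 1500, "black_elo": 1500},  # too short
    {"moves": ["d4", "d5"] * 15, "white_elo": 2000, "black_elo": 2100},
]

cleaned = clean_games(raw_games)
training_sample = stratified_sample(cleaned, per_bucket=1)
print(len(raw_games), len(cleaned), len(training_sample))
```

Sampling a fixed number of games per bucket keeps the large mid-rating population from dominating the training set, at the cost of discarding data from the most populated strata.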