Zing Forum

How ChatGPT's Launch Changed the AI Information Ecosystem on Reddit: An Information Retrieval Study Based on BM25

This article introduces an information retrieval project that studies the impact of ChatGPT's launch on Reddit search result rankings. By comparing Reddit post data before and after ChatGPT's release and using the BM25 algorithm to analyze changes in retrieval results for AI-related queries, the research team found that major AI product launches can significantly alter the information environment users encounter through search systems.

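Tags: Information Retrieval, BM25, ChatGPT, Reddit, Information Exposure, Search Engines, Natural Language Processing, Machine Learning, Data Mining, Social Media Analysis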
Published 2026-04-24 02:13 | Recent activity 2026-04-24 02:19 | Estimated read 7 min

Section 01

[Main Floor] Study on the Impact of ChatGPT's Launch on Reddit's AI Information Ecosystem (Based on BM25 Retrieval Analysis)

This study examines how ChatGPT's launch affected AI-related search results on Reddit. It compares posts from before and after the launch (October 5 to November 29, 2022 vs. November 30, 2022 to January 24, 2023) and uses the BM25 algorithm to analyze how retrieval results changed. The main finding is that a major AI product launch significantly alters the information environment in four ways: a surge in information volume (20,245 posts after launch vs. 4,278 before, roughly 4.7x), a shift in content structure (the news category's share rises while the technical category's share falls), reduced diversity of retrieval results, and a change in measured retrieval quality that reflects the underlying difference in the information environment.

Section 02

Research Background: The AI Discussion Boom Triggered by ChatGPT and the Issue of Information Environment Changes

ChatGPT's launch on November 30, 2022, triggered a boom in AI discussions. The core question: when a major AI product is launched, does the type of content surfaced by information retrieval systems change in a measurable way? The EECS767 team at the University of Kansas chose Reddit for the study because it hosts active AI discussions and its data is openly accessible, and quantitatively analyzed the launch's impact on AI-related search results.

Section 03

Research Design: Data Sources, Time Windows, and Retrieval System Settings

Data Sources: 5 AI subreddits (r/ChatGPT, etc.), crawled via the Arctic Shift API, retaining only posts that contain AI keywords.

Time Windows: Pre (October 5 to November 29, 2022; 4,278 posts) and Post (November 30, 2022 to January 24, 2023; 20,245 posts). The post-launch volume is roughly 4.7x the pre-launch volume. A search for "ChatGPT" returns zero results in the pre period, which confirms that the window boundary cleanly separates the two periods.

Retrieval System: Pyserini BM25 index (k₁=0.9, b=0.4). The pre and post periods are indexed separately to ensure a fair comparison.

Query Set: 30 queries in three categories: entity-based (ChatGPT, GPT-4, etc.), topic-based (AI regulation, etc.), and scenario-based (how to use ChatGPT, etc.). The top 10 documents are retrieved for each query.
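
To make the roles of k₁ and b concrete, here is a minimal pure-Python sketch of BM25 scoring with the reported values (k₁=0.9, b=0.4). It is not the study's Pyserini pipeline: the whitespace tokenizer, the toy documents, and the exact IDF form are assumptions for illustration, and the Lucene implementation behind Pyserini may differ in details.

```python
import math
from collections import Counter

K1 = 0.9  # term-frequency saturation (value reported for the study's index)
B = 0.4   # document-length normalization (value reported for the study's index)


def tokenize(text):
    """Hypothetical whitespace tokenizer; real analyzers also stem and drop stopwords."""
    return text.lower().split()


def bm25_scores(query, docs, k1=K1, b=B):
    """Score each document in `docs` against `query` with a common BM25 formulation."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # IDF variant that stays non-negative even for very common terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Longer-than-average documents are penalized in proportion to b.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy corpus (made up for illustration, not taken from the study's data).
docs = [
    "how to use chatgpt for coding",
    "new ai regulation proposed in europe",
    "chatgpt chatgpt chatgpt hype thread",
]
scores = bm25_scores("chatgpt", docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking, [round(s, 3) for s in scores])
```

With a low k₁ such as 0.9, repeated occurrences of a term saturate quickly, so the third document gains only modestly from mentioning the query term three times.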

Section 04

Content Classification System and Core Evaluation Metrics

Document Classification: Based on rule matching, divided into four categories:

  1. News/external content (contains external URLs)
  2. Q&A/help (contains question words/question marks)
  3. Personal experience (first-person narrative)
  4. Technical content (tutorials, code, etc.)

Evaluation Metrics:

  • JSD (Jensen-Shannon divergence): Measures the difference between category distributions
  • Entropy: Measures diversity
  • Overlap: Degree of overlap between document sets

Statistical significance was tested using t-tests and bootstrap confidence intervals (2000 iterations).
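
The three metrics can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the study's code: entropy and JSD are computed in bits, "overlap" is taken to be Jaccard similarity (the post does not say which overlap measure was used), and the sample input is the rounded category shares reported in Section 05, so the printed values will not reproduce the study's per-query statistics.

```python
import math


def normalize(shares):
    """Rescale shares so they sum to 1 (rounded percentages may not)."""
    total = sum(shares)
    return [s / total for s in shares]


def entropy(p):
    """Shannon entropy in bits; zero-probability categories contribute nothing."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def jsd(p, q):
    """Jensen-Shannon divergence in bits: entropy of the mixture minus mean entropy."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(m) - (entropy(p) + entropy(q)) / 2


def overlap(docs_a, docs_b):
    """Jaccard similarity between two sets of document IDs (assumed overlap measure)."""
    a, b = set(docs_a), set(docs_b)
    return len(a & b) / len(a | b) if a | b else 0.0


# Category order: news/external, technical, Q&A/help, personal experience.
pre = normalize([0.806, 0.157, 0.035, 0.003])
post = normalize([0.820, 0.137, 0.037, 0.006])

print("entropy pre :", entropy(pre))
print("entropy post:", entropy(post))
print("JSD         :", jsd(pre, post))
print("overlap     :", overlap(["a", "b", "c"], ["b", "c", "d"]))
```

JSD is symmetric and bounded between 0 and 1 when measured in bits, which makes it convenient for comparing the pre and post distributions without choosing a reference side.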

Section 05

Key Findings: Content Structure Shift, Reduced Diversity, and Changes in Retrieval Quality

Changes in Content Distribution:

Category              Pre-launch Share   Post-launch Share   Change
News/External         80.6%              82.0%               +1.4 pp
Technical Content     15.7%              13.7%               -2.0 pp
Q&A/Help              3.5%               3.7%                +0.2 pp
Personal Experience   0.3%               0.6%                +0.3 pp

(pp = percentage points)

Reduced Diversity: JSD was significantly greater than 0 at k = 3, 5, and 10 (p<0.001), and post-launch entropy values were lower (p=0.027 at k=10).

Retrieval Quality: Post-launch nDCG@10 (0.981) was higher than pre-launch (0.873). The lower pre-launch score arises because concepts such as "how to use ChatGPT" did not yet exist in the pre period, so the gap reflects a change in the information environment rather than in the retrieval algorithm.
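
The effect described above can be seen in a small nDCG@10 sketch. The relevance labels are hypothetical, the gain is linear, and the ideal ranking is built from the retrieved list only; the post does not say how relevance was judged or which nDCG variant was used, so treat this as an illustration of the mechanism rather than a reproduction.

```python
import math


def dcg_at_k(rels, k=10):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels[:k]))


def ndcg_at_k(rels, k=10):
    """DCG divided by the DCG of the best possible ordering of the same labels."""
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0


# Hypothetical labels for a query like "how to use ChatGPT":
post_like = [1] * 10  # every retrieved post-period document is relevant
pre_like = [0] * 10   # the concept does not exist yet, so nothing is relevant

per_query = [ndcg_at_k(post_like), ndcg_at_k(pre_like)]
mean_ndcg = sum(per_query) / len(per_query)
print(per_query, mean_ndcg)
```

A query with no relevant documents in the collection scores 0 regardless of how well the ranker works, which pulls the period's average down without saying anything about BM25 itself.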

Section 06

Research Limitations and Future Improvement Directions

Limitations:

  1. The rule-based classifier is sensitive to external URLs, which may overestimate the share of the news category.
  2. The 30 queries do not cover all AI-related queries.

Future Directions:

  • Expand to platforms like Twitter and Zhihu.
  • Analyze the impact of events like GPT-4 and Sora.
  • Introduce machine learning classifiers and larger query sets.
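
The first limitation is easy to reproduce with a toy version of the rule-based classifier from Section 04. The regular expressions, the order in which rules fire, and the "other" fallback are all hypothetical; the post only names the four categories and their cues, so this shows the failure mode rather than the team's actual rules.

```python
import re

# Hypothetical patterns standing in for the four rules described in Section 04.
URL_RE = re.compile(r"https?://\S+")
TECH_RE = re.compile(r"\b(tutorial|code|api|python)\b", re.IGNORECASE)
QUESTION_RE = re.compile(r"\?|^\s*(how|what|why|when|where|who|which)\b", re.IGNORECASE)
FIRST_PERSON_RE = re.compile(r"\b(i|i'm|i've|my|me)\b", re.IGNORECASE)


def classify(text):
    """Return the first matching category; the rule order is an assumption."""
    if URL_RE.search(text):
        return "news_external"
    if TECH_RE.search(text):
        return "technical"
    if QUESTION_RE.search(text):
        return "qa_help"
    if FIRST_PERSON_RE.search(text):
        return "personal_experience"
    return "other"  # hypothetical fallback, not described in the post


samples = [
    "Python tutorial with code samples",
    "Tutorial: call the API from Python https://example.com/repo",
    "How do I use ChatGPT?",
    "I tried it for a week and my workflow improved",
]
for text in samples:
    print(classify(text), "<-", text)
```

The second sample is a tutorial, yet the link alone moves it into the news/external bucket. A learned classifier, as proposed in the future directions, could weigh such cues instead of letting one of them decide.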

Section 07

Conclusions and Implications: Impact of Major AI Events on the Information Ecosystem

Conclusions:

  1. Surge in information volume: Post-launch data volume is roughly 4.7x pre-launch (20,245 vs. 4,278 posts).
  2. Content structure shift: Significant changes in category distribution (JSD p<0.001) and reduced diversity (entropy p=0.027).
  3. Retrieval quality paradox: Higher post-launch metrics are due to the lack of relevant concepts in the pre period, not algorithm differences.

Implications:

  • Developers need to pay attention to the impact of temporal changes on retrieval results, maintaining diversity and quality.
  • Users need critical thinking and actively seek diverse information sources.