Zing Forum

Chicago Taxi Operation AI Prediction Audit: From 77GB Big Data to Interpretable Machine Learning

An end-to-end data engineering and machine learning project that uses Google BigQuery to process 77GB of Chicago taxi trip data, build prediction models, and implement transparent model deployment.

Tags: Machine Learning · Data Engineering · Taxi Prediction · BigQuery · Explainable AI · Chicago · Transportation Big Data · Time Series Forecasting
Published 2026-05-09 04:55 · Recent activity 2026-05-09 05:00 · Estimated read: 15 min

Section 01

Introduction to the Chicago Taxi Operation AI Prediction Audit Project

This is an end-to-end data engineering and machine learning project that uses Google BigQuery to process 77GB of Chicago taxi trip data, build prediction models, and deploy them transparently. The project aims to improve taxi operation efficiency, support regulatory audits, combine technical implementation with interpretability, and provide a scientific basis for urban traffic management.

Section 02

Project Background and Significance

The taxi industry is an important part of urban transportation, and its operational efficiency directly affects citizens' travel experience and urban traffic management. As the third-largest city in the United States, Chicago generates massive amounts of taxi trip data every day, which contains rich information on travel patterns, demand forecasting, and operational optimization. However, traditional data analysis methods struggle to handle such large-scale datasets, let alone extract predictive insights from them.

This project was born in this context, aiming to conduct comprehensive data auditing and predictive analysis of the Chicago taxi industry through modern data engineering techniques and machine learning algorithms. The project not only focuses on technical implementation but also emphasizes model interpretability and transparency, providing a scientific basis for industry regulation and operational decision-making.

Section 03

Dataset Overview and Technical Challenges

The project's core data source is the public Chicago taxi trip dataset hosted on Google BigQuery, with a raw data volume of 77GB. This is large by the standards of public transportation datasets and poses serious challenges for data processing and modeling.

The dataset contains years of taxi trip records, covering key fields such as pickup time, drop-off time, trip distance, fare amount, payment method, and pickup location coordinates. The spatiotemporal characteristics of these fields give the data obvious time-series and spatial distribution features, providing a rich space for feature engineering in predictive modeling.

The main technical challenges in processing such large-scale data include memory limitations in data cleaning, time complexity of feature calculation, computational resource requirements for model training, and real-time requirements for prediction services. The project uses distributed computing and incremental processing strategies to address these challenges.
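
The incremental-processing idea can be sketched in a few lines of Python. The chunk size and field names below are illustrative assumptions, not the project's actual schema: running totals are accumulated per chunk so the full 77GB never needs to sit in memory at once.

```python
from itertools import islice

def iter_chunks(records, chunk_size=50_000):
    """Yield fixed-size chunks from any record iterator, so only one
    chunk is held in memory at a time."""
    it = iter(records)
    while chunk := list(islice(it, chunk_size)):
        yield chunk

def incremental_fare_stats(record_chunks):
    """Accumulate trip counts and fare sums per pickup hour across chunks,
    keeping only running totals rather than the raw rows."""
    totals = {}  # hour -> [trip_count, fare_sum]
    for chunk in record_chunks:
        for rec in chunk:
            acc = totals.setdefault(rec["hour"], [0, 0.0])
            acc[0] += 1
            acc[1] += rec["fare"]
    return {h: {"trips": c, "avg_fare": s / c} for h, (c, s) in totals.items()}
```

The same pattern extends to any associative aggregate (counts, sums, min/max), which is what makes it compatible with distributed execution.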

Section 04

Data Engineering Architecture Design

The project adopts an end-to-end data engineering architecture, forming a complete closed loop from raw data ingestion to final model deployment. The data flow is divided into the following key stages:

First, the data ingestion layer, which efficiently reads massive raw data through BigQuery connectors and performs preliminary data quality checks, including missing value detection, outlier identification, and data type verification. This stage ensures that subsequent processing is based on reliable data.
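
A minimal sketch of the kind of quality checks described, assuming a simplified record layout (the field names and the fare cutoff are hypothetical, not the dataset's real schema):

```python
def quality_report(records, fare_max=500.0):
    """Classify each record as ok / missing / bad type / fare outlier.
    Checks run in order: missing fields, then type, then range."""
    required = ("pickup_time", "trip_miles", "fare")
    report = {"missing": 0, "bad_type": 0, "outlier_fare": 0, "ok": 0}
    for rec in records:
        if any(rec.get(k) is None for k in required):
            report["missing"] += 1
        elif not isinstance(rec["fare"], (int, float)):
            report["bad_type"] += 1
        elif rec["fare"] < 0 or rec["fare"] > fare_max:
            report["outlier_fare"] += 1
        else:
            report["ok"] += 1
    return report
```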

Second, the feature engineering layer, the core stage of the pipeline. The team extracted periodic features such as hour of day, day of week, month, and holidays from the time dimension; computed geographic features such as pickup hotspots, trip-distance distributions, and inter-regional flows from the spatial dimension; and built lag features and sliding-window statistics to capture the temporal dependence of demand.
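
The time-dimension and lag features might look like this in outline (feature names and lag choices are illustrative assumptions, not the project's actual feature set):

```python
from datetime import datetime

def time_features(ts: datetime) -> dict:
    """Periodic features extracted from a pickup timestamp."""
    return {
        "hour": ts.hour,
        "weekday": ts.weekday(),  # 0 = Monday
        "month": ts.month,
        "is_weekend": int(ts.weekday() >= 5),
    }

def lag_features(series, lags=(1, 24), window=3):
    """Lagged demand values plus a trailing-window mean for the most
    recent point of a demand series (None when history is too short)."""
    feats = {}
    for lag in lags:
        feats[f"lag_{lag}"] = series[-lag] if len(series) >= lag else None
    tail = series[-window:]
    feats[f"mean_{window}"] = sum(tail) / len(tail)
    return feats
```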

Third, the model training layer, which uses a distributed training framework to process large-scale samples. The project compared multiple algorithms, including gradient boosting trees, random forests, and deep learning models, and finally selected a solution that balances prediction accuracy and interpretability.

Finally, the model deployment layer, which encapsulates the trained model as an API service to support real-time prediction requests, and integrates model monitoring and drift detection mechanisms.
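
Drift detection is commonly implemented with the population stability index (PSI), which compares a feature's training distribution against its serving distribution. A self-contained sketch of that idea, not the project's actual monitoring code:

```python
import math

def population_stability_index(expected, actual, bins=10):
    """PSI between a training (expected) and serving (actual) sample of one
    feature. Values above roughly 0.2 are conventionally treated as drift."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def hist(xs):
        counts = [0] * bins
        for x in xs:
            idx = min(int((x - lo) / width), bins - 1)
            counts[idx] += 1
        # small floor keeps the log well-defined for empty bins
        return [max(c / len(xs), 1e-6) for c in counts]

    e, a = hist(expected), hist(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```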

Section 05

Prediction Models and Algorithm Selection

In terms of model selection, the project team conducted systematic experimental comparisons. Considering the scenario characteristics of taxi demand forecasting—needing to capture periodic patterns, the impact of unexpected events, and spatial correlations simultaneously—the team finally adopted an ensemble learning approach.

Gradient Boosted Decision Trees (GBDT) were selected as the core baseline algorithm, owing to their strong performance on tabular data and the feature importances they output natively. The model automatically learns nonlinear interactions between features and remains stable when predicting special scenarios such as peak hours and bad weather.

To further improve prediction accuracy, the project also tried deep learning solutions, including Long Short-Term Memory (LSTM) networks and spatiotemporal graph neural networks. These models show advantages in handling long-range dependencies and inter-regional correlations, but their computational cost is high, so they were retained as alternative solutions.

Model evaluation uses time-series cross-validation to ensure that the evaluation results can truly reflect the model's performance on future data. Evaluation metrics include Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE), and business-oriented prediction accuracy.
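
Rolling-origin (expanding-window) splits and the two error metrics can be written compactly. This is a generic sketch of the evaluation idea rather than the project's actual harness:

```python
import math

def rolling_origin_splits(n, n_folds=3, test_size=1):
    """Expanding-window time-series splits over n ordered samples:
    each fold's test block lies strictly after its training block."""
    for k in range(n_folds):
        test_end = n - (n_folds - 1 - k) * test_size
        test_start = test_end - test_size
        yield list(range(test_start)), list(range(test_start, test_end))

def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    """Mean absolute percentage error (as a fraction; assumes no zero targets)."""
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)
```

Because each test block follows its training block in time, no fold ever leaks future information into training, which is what makes the scores representative of live performance.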

Section 06

Model Interpretability and Transparency

Unlike many black-box machine learning projects, this audit project places special emphasis on model interpretability. In public transportation regulation scenarios, prediction results need to be understandable and verifiable by auditors, and decision-making basis must be transparent and traceable.

The project uses SHAP (SHapley Additive exPlanations) value analysis to decompose the contribution of model predictions, identifying the direction and intensity of each feature's impact on a single prediction. This method not only explains "why the model predicts this way" but also discovers potential data biases and model prejudices.
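
For intuition, exact Shapley values can be computed by brute force on a toy model; in practice the `shap` library approximates this efficiently for tree models. The sketch below masks absent features with a baseline value, mirroring how SHAP decomposes a single prediction into per-feature contributions:

```python
from itertools import combinations
from math import factorial

def shapley_values(model, x, baseline):
    """Exact Shapley decomposition of one prediction f(x). Features outside
    a coalition are replaced by their baseline value; the returned phi
    values sum to f(x) - f(baseline)."""
    n = len(x)
    phi = [0.0] * n

    def f(subset):
        masked = [x[j] if j in subset else baseline[j] for j in range(n)]
        return model(masked)

    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for combo in combinations(others, size):
                s = set(combo)
                weight = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi[i] += weight * (f(s | {i}) - f(s))
    return phi
```

The exponential cost in the number of features is exactly why production tooling uses specialized approximations such as TreeSHAP.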

In addition, the project built a global feature importance analysis to show which factors have the greatest impact on taxi demand. The analysis confirmed intuitive patterns, such as strong demand during weekday morning and evening peaks and a surge in taxi demand in bad weather, while also surfacing some counterintuitive patterns that offer new angles for operational optimization.

Another dimension of model transparency is fairness auditing. The project detected whether the model has systematic biases in predictions for different regions and time periods, ensuring that the algorithm does not exacerbate inequalities in service resource allocation.
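
One simple bias check of this kind is the mean prediction residual per group; a sketch with a hypothetical record format, where a group mean far from zero signals systematic over- or under-prediction for that region or time period:

```python
def group_bias(records):
    """Mean residual (predicted - actual) per region. A value near zero
    means errors cancel out for that group; a large magnitude suggests
    systematic bias."""
    sums = {}  # region -> [residual_sum, count]
    for rec in records:
        acc = sums.setdefault(rec["region"], [0.0, 0])
        acc[0] += rec["predicted"] - rec["actual"]
        acc[1] += 1
    return {g: s / n for g, (s, n) in sums.items()}
```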

Section 07

Practical Application Scenarios and Business Value

The project's predictive capabilities can support a variety of practical application scenarios. In terms of capacity scheduling, taxi companies can pre-deploy vehicles to high-demand areas based on predicted demand, reducing empty driving rates and increasing driver income. At the urban planning level, traffic management departments can identify travel hotspots and congestion patterns, optimizing bus routes and infrastructure layout.

For regulatory audits, the transparency analysis provided by the project helps identify abnormal operational behavior. For example, by comparing actual trips against the prediction model's expectations, potential violations such as deliberate detours and ride refusals can be surfaced. This data-driven audit approach is more efficient and comprehensive than traditional spot checks.
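
The deviation check can be as simple as a relative threshold on predicted versus actual fares; the threshold, field names, and trip-id format below are illustrative:

```python
def flag_anomalous_trips(trips, rel_threshold=0.5):
    """Flag trips whose actual fare deviates from the model's expected fare
    by more than rel_threshold (e.g. a possible detour), returning
    (trip_id, relative_deviation) pairs for auditor review."""
    flagged = []
    for trip in trips:
        expected = trip["expected_fare"]
        deviation = abs(trip["actual_fare"] - expected) / expected
        if deviation > rel_threshold:
            flagged.append((trip["trip_id"], round(deviation, 2)))
    return flagged
```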

The project also explored auxiliary decision support for dynamic pricing. Although it does not directly participate in the pricing algorithm, the demand prediction results provide input for price elasticity analysis, helping to understand demand responses at different price levels.
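
Price elasticity from two observed (price, demand) points is commonly estimated with the arc (midpoint) formula; a generic sketch of that calculation, not the project's pricing code:

```python
def arc_elasticity(q1, q2, p1, p2):
    """Arc price elasticity of demand: percentage change in quantity over
    percentage change in price, both measured against midpoints."""
    dq = (q2 - q1) / ((q1 + q2) / 2)
    dp = (p2 - p1) / ((p1 + p2) / 2)
    return dq / dp
```

A value below -1 indicates elastic demand (riders are price-sensitive at that level), which is the kind of signal demand forecasts can feed into.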

Section 08

Technical Gains and Future Outlook

Through this project, the team accumulated valuable experience in large-scale data engineering practice. The processing flow of 77GB data was optimized from the initial hour-level to minute-level, the feature engineering pipeline achieved a high degree of automation, and the model training process supports one-click reproduction and version management.

The project's open-source release gives the data science community a complete reference implementation, showing how to build the full pipeline from raw data to production-grade models. The code structure is clear and the documentation is thorough, making it suitable as a teaching case or a basis for further development.

Future work directions include: introducing real-time data stream processing to reduce prediction latency from hour-level to minute-level; integrating multi-source data such as weather, event calendars, and public transportation information to improve prediction accuracy; and exploring causal inference methods to move from prediction to decision optimization.