eta-engine: A New York Taxi Trip Time Prediction System Fusing Neural Networks and LightGBM

An ensemble learning framework that combines deep neural networks with gradient boosting trees and predicts New York taxi trip times using spatial embedding representations learned from raw trip data.

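Tags: machine learning, trip time prediction, neural networks, LightGBM, ensemble learning, spatial embedding, taxi data, New York, deep learning, gradient boosting trees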

Published 2026-05-17 05:15 | Recent activity 2026-05-17 05:18 | Estimated read 7 min

Section 01

[Overview] eta-engine: A New York Taxi Trip Time Prediction System Fusing Neural Networks and LightGBM

In urban traffic management and ride-sharing services, accurately predicting taxi trip times is a key challenge. Traditional methods struggle to capture complex traffic dynamics, and a single model has limited performance. The eta-engine project fuses deep neural networks with LightGBM gradient boosting trees: the neural network learns spatial embedding representations, and the combined system predicts New York taxi trip times by drawing on the strengths of both model types.

Section 02

Project Background and Problem Definition

As one of the world's largest taxi markets, New York City generates millions of trip records with spatiotemporal information daily. Trip time prediction faces multiple challenges: highly dynamic urban traffic (fluctuations caused by weather, events, etc.), complex spatial relationships that Euclidean distance describes poorly, and noise and outliers in the data. To address these challenges, eta-engine builds a complete machine learning system that extracts spatial embeddings and combines the advantages of multiple algorithms for prediction.

Section 03

Core Architecture: Dual-Model Ensemble Design

eta-engine adopts an ensemble architecture of neural networks and LightGBM, with complementary advantages:

Neural networks: Learn complex spatial relationships and nonlinear feature interactions; convert discrete geographic locations into continuous vectors via embedding layers to automatically capture latent relationships.
LightGBM: Process structured features (time, distance, high-level features from neural networks) efficiently and interpretably.

The two models are fused at the feature level: the embeddings learned by the neural network become input features for LightGBM. The two models' predictions can also be combined by weighted averaging or stacking to improve robustness.
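The feature-level fusion and weighted averaging described here can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the project's implementation: the zone names, the embedding values, and the helper names `build_feature_row` and `blend` are all hypothetical, and in a real system the embeddings would come from the trained network and the row would be passed to a LightGBM model.

```python
from typing import Dict, List, Sequence

# Hypothetical embeddings; in practice these would be read from the trained
# neural network's embedding layer. The values here are made up.
ZONE_EMBEDDINGS: Dict[str, List[float]] = {
    "midtown": [0.8, -0.1],
    "jfk": [-1.2, 0.9],
}


def build_feature_row(
    pickup_zone: str,
    dropoff_zone: str,
    structured: Sequence[float],
    embeddings: Dict[str, List[float]] = ZONE_EMBEDDINGS,
) -> List[float]:
    """Concatenate structured features with pickup and dropoff embeddings.

    The resulting flat row is what a gradient boosting model would consume.
    """
    return list(structured) + embeddings[pickup_zone] + embeddings[dropoff_zone]


def blend(nn_pred: float, gbdt_pred: float, nn_weight: float = 0.5) -> float:
    """Weighted average of the two models' predictions (e.g. in seconds)."""
    if not 0.0 <= nn_weight <= 1.0:
        raise ValueError("nn_weight must be in [0, 1]")
    return nn_weight * nn_pred + (1.0 - nn_weight) * gbdt_pred


# Example: hour of day and distance in km as the structured features.
row = build_feature_row("midtown", "jfk", [17.0, 21.5])
eta_seconds = blend(nn_pred=1000.0, gbdt_pred=1200.0, nn_weight=0.25)
```

The blend weight would normally be chosen on a validation set; stacking replaces the fixed weight with a small model trained on the two predictions.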

Section 04

Spatial Embedding: Data-Driven Geographic Relationship Learning

Traditional methods represent locations as raw latitude/longitude or fixed partitions, which carry no semantic information. eta-engine uses data-driven embeddings: each geographic location is a learnable vector, optimized against trip times during training. For example, the origin and destination vectors of fast trips end up close together, while those of long-duration trips end up far apart. Without manual rules, the model discovers patterns from historical data (e.g., clustering of commercial centers and transportation hubs) and captures rich semantics such as regional functions and traffic connections.
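To make the idea of learnable location vectors concrete, here is a toy sketch in plain Python. It is deliberately simpler than a neural network embedding layer and is not the project's model: it assumes the predicted (scaled) duration is the squared Euclidean distance between the pickup and dropoff vectors and fits the vectors by stochastic gradient descent. The zone names, the scaled targets, and the function names are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

Trip = Tuple[str, str, float]  # (pickup zone, dropoff zone, scaled duration)


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_embeddings(
    trips: Sequence[Trip],
    zones: Sequence[str],
    dim: int = 2,
    lr: float = 0.005,
    epochs: int = 2000,
    seed: int = 0,
) -> Dict[str, List[float]]:
    """Fit one vector per zone so that sq_dist(pickup, dropoff) ~ duration."""
    rng = random.Random(seed)
    emb = {z: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for z in zones}
    for _ in range(epochs):
        for pickup, dropoff, target in trips:
            a, b = emb[pickup], emb[dropoff]
            diff = [x - y for x, y in zip(a, b)]
            err = sum(d * d for d in diff) - target
            for i, d in enumerate(diff):
                # d/da_i of (pred - target)^2 is 4 * err * diff_i;
                # the gradient with respect to b_i has the opposite sign.
                grad = 4.0 * err * d
                a[i] -= lr * grad
                b[i] += lr * grad
    return emb


# Made-up data: A-B is a fast trip, A-C a slow one, B-C in between.
TRIPS: List[Trip] = [("A", "B", 1.0), ("A", "C", 9.0), ("B", "C", 4.0)]
embeddings = train_embeddings(TRIPS, ["A", "B", "C"])
```

After training, the vectors for A and B should sit closer together than those for A and C, which mirrors the fast-trip versus long-trip behavior described above.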

Section 05

Feature Engineering and Model Training Optimization

Feature Engineering: Decompose time features (hour, week, month); compute distance features (Haversine and Manhattan distance); integrate historical statistics (average trip time for specific time periods and routes).
Data Preprocessing: Filter out unreasonable trips (negative or excessively long durations), records outside New York, and duplicate or corrupted data.
Training Strategy: The neural network is pre-trained to learn initial embeddings and then fine-tuned end to end; LightGBM uses cross-validation, hyperparameter search, and early stopping to prevent overfitting; ensemble strategies (simple average, weighted average, stacking) improve accuracy.
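The time features, distance features, and trip filter listed above might look like the following sketch, using only the Python standard library. Several details are assumptions rather than facts from the project: "week" is interpreted as day of week, Manhattan distance is approximated as a north-south leg plus an east-west leg on the sphere, and the bounding box and the three-hour duration cap are illustrative thresholds.

```python
import math
from datetime import datetime
from typing import Dict

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

# Assumed thresholds, for illustration only.
NYC_LAT_RANGE = (40.4, 41.0)
NYC_LON_RANGE = (-74.3, -73.6)
MAX_DURATION_S = 3 * 3600


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def manhattan_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Approximate grid distance: north-south leg plus east-west leg."""
    return haversine_km(lat1, lon1, lat2, lon1) + haversine_km(lat2, lon1, lat2, lon2)


def time_features(pickup_time: datetime) -> Dict[str, int]:
    """Decompose a pickup timestamp (weekday: Monday is 0)."""
    return {
        "hour": pickup_time.hour,
        "weekday": pickup_time.weekday(),
        "month": pickup_time.month,
    }


def in_nyc(lat: float, lon: float) -> bool:
    """True if the point falls inside the assumed bounding box."""
    return (NYC_LAT_RANGE[0] <= lat <= NYC_LAT_RANGE[1]
            and NYC_LON_RANGE[0] <= lon <= NYC_LON_RANGE[1])


def is_valid_trip(duration_s: float, lat1: float, lon1: float,
                  lat2: float, lon2: float) -> bool:
    """Reject non-positive or overlong durations and out-of-area endpoints."""
    return (0 < duration_s <= MAX_DURATION_S
            and in_nyc(lat1, lon1) and in_nyc(lat2, lon2))
```

Duplicate removal and the historical averages per route and time period are left out here; they are typically done with a grouped aggregation over the cleaned trip table.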

Section 06

Application Scenarios and Practical Value

eta-engine has practical value in several areas:

Ride-sharing platforms: Improve ETA prediction accuracy, optimize user experience and dispatch efficiency.
Urban planning: Identify congestion hotspots and bottlenecks, assist in traffic signal timing and infrastructure planning.
Driver decision-making: Support selecting favorable orders and planning optimal routes to maximize revenue.

Section 07

Technical Insights and Future Outlook

Technical Insights: The fusion of deep learning and traditional machine learning (representation learning plus structured data processing) can be extended to other spatiotemporal problems such as delivery time prediction and bus arrival estimation. The open-source implementation also serves as a practical reference.
Future Outlook: Incorporate multi-modal data (real-time traffic, weather); model road topology with graph neural networks (GNNs); introduce reinforcement learning for dynamic prediction strategies, moving toward more intelligent and accurate traffic prediction.
