# Practical Oil and Gas MLOps: A Complete Machine Learning Pipeline from Data to Production Environment

> An MLOps project for production prediction in Argentina's Vaca Muerta unconventional oil and gas field. It demonstrates Airflow orchestration, a Feast feature store, MLflow experiment tracking, and a FastAPI inference service working together end to end.

- Board: [Openclaw Geo](https://www.zingnex.cn/en/forum/board/openclaw-geo)
- Published: 2026-05-27T18:12:25.000Z
- Last activity: 2026-05-27T18:19:29.036Z
- Popularity: 154.9
- Keywords: MLOps, machine learning engineering, feature store, Airflow, Feast, MLflow, time-series forecasting, energy industry, XGBoost, FastAPI
- Page link: https://www.zingnex.cn/en/forum/thread/mlops-5eb9c17f
- Canonical: https://www.zingnex.cn/forum/thread/mlops-5eb9c17f

---

## Introduction

This project comes from the Master's program in Artificial Intelligence at the University of San Andrés, Argentina. It is maintained by fedehofmann and open-sourced on GitHub (project link: https://github.com/fedehofmann/oil_and_gas_mlops_pipeline). It targets production prediction for Argentina's Vaca Muerta unconventional oil and gas field and builds a complete MLOps pipeline that integrates Airflow orchestration, a Feast feature store, MLflow experiment tracking, and a FastAPI inference service. The pipeline covers the full path from raw data to a production environment and offers a reference MLOps architecture template for the energy industry.

## Project Background and Business Scenario

In the energy industry, accurate prediction of monthly well production is crucial for production planning, investment decisions, and resource allocation. Argentina's Vaca Muerta region is one of the world's largest unconventional oil and gas fields, and its production data has strong time-series characteristics and complex feature dependencies. The project targets this real scenario: given a well's historical data, predict its natural gas or oil production (in cubic meters per month) for a specified month. It is a typical example of moving machine learning from the laboratory into production.

## Data Architecture and Feature Engineering

### Data Sources
The project uses two core datasets publicly available from Argentina's Ministry of Energy:
1. Production data (main dataset): Contains monthly production readings of natural gas, oil, and water from unconventional oil and gas wells
2. Well location information (supplementary data): Contains static metadata such as operating company, geological structure, basin, and coordinates

### Core Feature Design
- **Raw measurement features**: Extraction type (categorical), well depth (meters), effective flow time (days), associated water production (cubic meters/month)
- **Time-series aggregation features**: Average natural gas/oil production over the past 10 months, latest natural gas/oil production, cumulative reading count (proxy indicator for well maturity)
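
The time-series aggregation features listed above can be sketched in plain Python. This is a minimal illustration under assumptions, not the project's actual code: the function name `aggregate_features` and the feature keys are hypothetical, and the input is assumed to contain only readings from months before the one being predicted.

```python
from statistics import mean


def aggregate_features(monthly_production, window=10):
    """Build aggregate features from one well's production history.

    monthly_production: values ordered oldest to newest, containing only
    months strictly before the month being predicted (assumption).
    """
    if not monthly_production:
        raise ValueError("at least one historical reading is required")
    recent = monthly_production[-window:]  # at most the last `window` months
    return {
        "avg_last_window": mean(recent),       # average over the recent window
        "last_value": monthly_production[-1],  # latest reading
        "n_readings": len(monthly_production), # proxy for well maturity
    }


# Hypothetical history of four monthly readings (cubic meters/month).
history = [100.0, 120.0, 110.0, 90.0]
features = aggregate_features(history)
```

A young well with fewer than `window` readings simply averages over what is available, which is one reason the reading count is useful as a separate maturity signal.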

### Data Quality Processing
- Automatically exclude 2020 data (to avoid distortion from the COVID-19 pandemic)
- Flag technically heterogeneous wells that predate Vaca Muerta's maturation
- Support overriding the default filtering parameters when triggering the DAG
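
One way to picture the default filter and its override is the sketch below. The names `DEFAULT_FILTERS`, `resolve_filters`, and `keep_reading` are hypothetical. In Airflow, parameters supplied at trigger time are typically available through the DAG run's configuration, but how this project reads them is an assumption here.

```python
from datetime import date

# Hypothetical defaults: drop every reading taken in 2020.
DEFAULT_FILTERS = {"excluded_years": [2020]}


def resolve_filters(overrides=None):
    """Merge trigger-time overrides (if any) over the defaults."""
    return {**DEFAULT_FILTERS, **(overrides or {})}


def keep_reading(reading_date, filters):
    """Return True when a reading passes the year filter."""
    return reading_date.year not in filters["excluded_years"]


readings = [date(2019, 12, 1), date(2020, 6, 1), date(2021, 1, 1)]

# Default run: the 2020 reading is removed.
default_kept = [d for d in readings if keep_reading(d, resolve_filters())]

# Overridden run: an empty exclusion list keeps everything.
override = resolve_filters({"excluded_years": []})
override_kept = [d for d in readings if keep_reading(d, override)]
```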

## Technical Architecture and MLOps Practices

### FTI Architecture Pattern
The project adopts the Feature-Training-Inference (FTI) architecture, which splits the system into three pipelines:
- **Feature pipeline**: Handles data ingestion, transformation, and storage (download dataset → compute features → Parquet offline storage → Feast apply → SQLite online materialization)
- **Training pipeline**: Reads historical features from the offline store, trains XGBoost incrementally in monthly chunks, runs 10 experiments in parallel (5 per target variable), and selects the model with the best R² score
- **Inference pipeline**: Exposes a FastAPI REST API that reads from the online feature store to serve real-time predictions (single-well prediction and a query for the list of available wells)

### Feature Store (Feast)
Feast addresses training-serving skew and acts as the contract between the pipelines:
- Offline storage: Parquet format, stores historical features for training
- Online storage: SQLite, stores latest features for real-time inference
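
The relationship between the two stores can be illustrated without Feast itself. The sketch below is a conceptual stand-in, not Feast's API: the offline store keeps every historical row, while the online store keeps only the latest row per well, which is roughly what a materialization step produces. The field names `well_id` and `month` are assumptions.

```python
def materialize_latest(offline_rows):
    """Reduce full history to the newest feature row per well.

    Months are "YYYY-MM" strings, so string comparison orders them correctly.
    """
    online = {}
    for row in offline_rows:
        current = online.get(row["well_id"])
        if current is None or row["month"] > current["month"]:
            online[row["well_id"]] = row
    return online


# Hypothetical offline history (what training would read in full).
offline_rows = [
    {"well_id": "W1", "month": "2023-01", "last_gas": 10.0},
    {"well_id": "W1", "month": "2023-03", "last_gas": 12.0},
    {"well_id": "W2", "month": "2023-02", "last_gas": 7.0},
    {"well_id": "W1", "month": "2023-02", "last_gas": 11.0},
]

# What inference would look up by well_id.
online_store = materialize_latest(offline_rows)
```

Because both stores are derived from the same feature definitions, training and serving see features computed the same way, which is the point of using the feature store as a contract.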

### Experiment Tracking and Model Management (MLflow)
- Experiment tracking: Records metrics (R², RMSE, MAE) and hyperparameters for each experiment
- Model registry: Manages model versions and automatically promotes the best model to the production version
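
The promotion rule can be sketched independently of MLflow. The example assumes that the best run is chosen per target variable by R², which the summary suggests but does not state explicitly. The run records and their metric values are made up for illustration and are not results from the project.

```python
def best_run_per_target(runs):
    """Pick the run with the highest R² for each target variable."""
    best = {}
    for run in runs:
        target = run["target"]
        if target not in best or run["r2"] > best[target]["r2"]:
            best[target] = run
    return best


# Illustrative, made-up run records (not real experiment results).
runs = [
    {"run_id": "gas-1", "target": "gas", "r2": 0.61},
    {"run_id": "gas-2", "target": "gas", "r2": 0.74},
    {"run_id": "oil-1", "target": "oil", "r2": 0.58},
    {"run_id": "oil-2", "target": "oil", "r2": 0.52},
]

promoted = best_run_per_target(runs)
```

In the real pipeline, the selected run's model would then be registered and moved to the production version in the model registry.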

## Model Training Strategy

### Incremental Learning Design
The core reason for choosing XGBoost over RandomForest is memory efficiency. The model is trained in monthly chunks, so only the current month's data is loaded at any time, which avoids running out of memory on the full history.
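
The chunked loop can be sketched as follows. To keep the example dependency-free, the "model" is a toy counter and `load_chunk` and `fit_chunk` are hypothetical callbacks. With XGBoost, `fit_chunk` would typically call `xgb.train(...)` and pass the previous booster through its `xgb_model` argument so that new trees are added to the existing model; whether the project does exactly this is an assumption.

```python
def train_in_chunks(months, load_chunk, fit_chunk):
    """Train incrementally, holding one month of data in memory at a time."""
    model = None
    for month in sorted(months):         # chronological order
        chunk = load_chunk(month)        # only this month's rows are loaded
        model = fit_chunk(model, chunk)  # continue from the previous model
    return model


# Toy stand-ins for the data source and the learner (assumptions).
FAKE_STORE = {"2023-01": [1.0, 2.0], "2023-02": [3.0], "2023-03": [4.0, 5.0, 6.0]}
loaded_order = []


def load_chunk(month):
    loaded_order.append(month)
    return FAKE_STORE[month]


def fit_chunk(model, chunk):
    # A real learner would add boosting rounds here; this one counts rows.
    seen = 0 if model is None else model["rows_seen"]
    return {"rows_seen": seen + len(chunk)}


final_model = train_in_chunks(["2023-03", "2023-01", "2023-02"], load_chunk, fit_chunk)
```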

### Experiment Configuration
Five experiments are run for each target variable (natural gas and oil production), ten in total. The experimental variables include:
- Number of trees added per chunk (n_estimators_per_chunk)
- Maximum depth of trees (max_depth)
- Feature subset combinations
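
An experiment grid of this shape might be declared as below. Only `n_estimators_per_chunk` and `max_depth` are named in the source; the concrete values, the feature names, and the `ExperimentConfig` class are hypothetical placeholders.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ExperimentConfig:
    target: str
    n_estimators_per_chunk: int
    max_depth: int
    features: tuple


TARGETS = ("gas", "oil")

# Hypothetical feature subsets.
BASE = ("well_depth", "flow_days", "water_production")
FULL = BASE + ("avg_last_10m", "last_value", "n_readings")

# Five hypothetical variants: (n_estimators_per_chunk, max_depth, features).
VARIANTS = [
    (10, 4, BASE),
    (10, 6, FULL),
    (20, 4, FULL),
    (20, 6, FULL),
    (50, 8, FULL),
]

# 2 targets x 5 variants = 10 experiments.
experiments = [ExperimentConfig(t, *v) for t, v in product(TARGETS, VARIANTS)]
```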

### Evaluation Metrics
- Test set: The most recent 20% of the time-ordered data
- Core metrics: R² (coefficient of determination), RMSE (root mean square error), MAE (mean absolute error)
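
The split and the three metrics can be written out in a few lines. This sketch uses the standard definitions of R², RMSE, and MAE. The helper names are hypothetical, and the rows are assumed to be sorted from oldest to newest before splitting.

```python
from math import sqrt


def time_ordered_split(rows, test_fraction=0.2):
    """Hold out the most recent fraction of rows (rows sorted oldest first)."""
    n_test = max(1, round(len(rows) * test_fraction))
    cut = len(rows) - n_test
    return rows[:cut], rows[cut:]


def regression_metrics(y_true, y_pred):
    """Compute R², RMSE and MAE from paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R² is undefined when all true values are equal")
    return {
        "r2": 1 - ss_res / ss_tot,
        "rmse": sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
    }


train_rows, test_rows = time_ordered_split(list(range(10)))
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
```

Splitting by time rather than at random keeps future readings out of the training set, which matters for data with strong temporal dependence.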

## Engineering Highlights and Maturity Assessment

### Engineering Practice Highlights
- **Reproducibility**: Docker Compose orchestrates all services (Airflow, MLflow, Feast, API), a single command triggers the complete training process, and feature definitions are centrally managed
- **Documentation completeness**: features.py documents every feature, training can be triggered with one click from the Airflow UI, and Feast keeps training and serving features consistent

### MLOps Maturity Assessment
According to Google's MLOps maturity classification, the project reaches **Level 1 (Continuous Training)**:

| Capability Dimension | Project Status |
|----------------------|----------------|
| Model Building | ✅ Automated (Airflow DAG) |
| Training Process | ✅ Automated (Monthly Scheduling) |
| Feature Store | ✅ Available (Feast Dual Storage) |
| Metadata Management | ✅ Available (MLflow Registry) |
| Deployment Method | ⚠️ Manual (Docker API Startup) |
| CI/CD | ❌ None |
| Monitoring | ⚠️ Partial (MLflow Training Metrics) |

**Level 2 Improvement Directions**: Automated testing, canary or blue-green deployment, and active monitoring in the production environment (prediction bias and feature drift detection)

## Application Value and Conclusion

### Practical Application Value
The project provides a reference template for data science teams in the energy industry:
1. Time-series prediction scenario: Handling production data with strong time dependencies
2. Feature engineering best practices: Design ideas from raw readings to aggregated features
3. Productionization path: The route from experimental code to deployable services
4. Technology selection reference: The Airflow + Feast + MLflow + FastAPI combination

### Conclusion
This master's project from the University of San Andrés in Argentina demonstrates a complete MLOps implementation in a real business scenario and is a systematic exercise in machine learning engineering. For teams that want to move models into production, it is a reference case worth studying in depth.
