Practical Oil and Gas MLOps: A Complete Machine Learning Pipeline from Data to Production Environment

An MLOps project for production prediction in Argentina's Vaca Muerta unconventional oil and gas field. It demonstrates Airflow orchestration, a Feast feature store, MLFlow experiment tracking, and a FastAPI inference service.

Tags: MLOps, Machine Learning Engineering, Feature Store, Airflow, Feast, MLFlow, Time-Series Forecasting, Energy Industry, XGBoost, FastAPI

Published 2026-05-28 02:12 | Recent activity 2026-05-28 02:19 | Estimated read 10 min

Section 01

Introduction

This project was developed for the master's program in Artificial Intelligence at the University of San Andrés, Argentina. It is maintained by fedehofmann and open-sourced on GitHub (project link: https://github.com/fedehofmann/oil_and_gas_mlops_pipeline). It targets production prediction for Argentina's Vaca Muerta unconventional oil and gas field and builds a complete MLOps pipeline that integrates Airflow orchestration, a Feast feature store, MLFlow experiment tracking, and a FastAPI inference service. The pipeline covers the path from raw data to a production environment and offers a reference MLOps architecture for the energy industry.

Section 02

Project Background and Business Scenario

In the energy industry, accurate prediction of monthly oil and gas well production is crucial for production planning, investment decisions, and resource allocation. Argentina's Vaca Muerta region is one of the world's largest unconventional oil and gas fields, and its production data shows strong time-series characteristics and complex feature dependencies. The project addresses this real scenario: given the historical data of a well, predict its natural gas or oil production (in cubic meters per month) for a specified month. It is a typical case of machine learning moving from the laboratory to production.

Section 03

Data Architecture and Feature Engineering

Data Sources

The project uses two core datasets publicly available from Argentina's Ministry of Energy:

Production data (main dataset): Contains monthly production readings of natural gas, oil, and water from unconventional oil and gas wells
Well location information (supplementary data): Contains static metadata such as operating company, geological structure, basin, and coordinates

Core Feature Design

Raw measurement features: Extraction type (categorical), well depth (meters), effective flow time (days), associated water production (cubic meters/month)
Time-series aggregation features: Average natural gas/oil production over the past 10 months, latest natural gas/oil production, cumulative reading count (proxy indicator for well maturity)
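The aggregation features above can be illustrated with a minimal, standard-library sketch. It is not the project's features.py: the Reading fields, the column names, and the choice to compute each feature only from months before the one being predicted are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Reading:
    well_id: str   # hypothetical well identifier
    month: str     # "YYYY-MM", which sorts chronologically as text
    gas_m3: float  # monthly gas production in cubic meters


def build_features(readings, window=10):
    """Return one feature row per reading, built only from earlier months."""
    by_well = defaultdict(list)
    for r in sorted(readings, key=lambda r: (r.well_id, r.month)):
        by_well[r.well_id].append(r)

    rows = []
    for well_id, history in by_well.items():
        for i, current in enumerate(history):
            # Up to `window` readings strictly before the current month.
            past = [h.gas_m3 for h in history[max(0, i - window):i]]
            rows.append({
                "well_id": well_id,
                "month": current.month,
                "avg_gas_last_n": sum(past) / len(past) if past else None,
                "last_gas": past[-1] if past else None,
                "n_readings": i,  # prior reading count, a maturity proxy
                "target_gas": current.gas_m3,
            })
    return rows
```

Restricting every aggregate to earlier months keeps the target month out of its own features, which is the usual precaution against leakage in time-series problems. The same pattern would apply to oil production.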

Data Quality Processing

Automatically excludes 2020 data (to avoid distortion from the COVID-19 pandemic)
Flags wells from the period before Vaca Muerta matured, when drilling and completion techniques were heterogeneous
Supports overriding the default filtering parameters when triggering the DAG
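As a rough sketch of these rules, the function below drops excluded years and flags early readings, with both thresholds exposed as parameters so a caller can override the defaults. The function name, the row layout, the pre_maturity flag, and the maturity_year parameter are hypothetical. The article does not state which year marks maturity, so no default is assumed.

```python
def clean_readings(rows, excluded_years=(2020,), maturity_year=None):
    """Drop rows from excluded years and flag rows before maturity_year.

    rows: iterable of dicts with a "month" key formatted as "YYYY-MM".
    maturity_year: hypothetical cutoff; None disables the flag.
    """
    cleaned = []
    for row in rows:
        year = int(row["month"][:4])
        if year in excluded_years:
            continue  # e.g. skip pandemic-affected readings
        flagged = dict(row)  # copy so the input rows stay untouched
        flagged["pre_maturity"] = (
            maturity_year is not None and year < maturity_year
        )
        cleaned.append(flagged)
    return cleaned
```

In an Airflow deployment, values supplied when a run is triggered are typically read from the run's configuration and passed to a function like this one in place of the defaults.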

Section 04

Technical Architecture and MLOps Practices

FTI Architecture Pattern

Adopts the Feature-Training-Inference (FTI) architecture, divided into three pipelines:

Feature pipeline: Responsible for data ingestion, transformation, and storage (download dataset → compute features → Parquet offline storage → Feast apply → SQLite online materialization)
Training pipeline: Reads data from the historical (offline) feature store, trains XGBoost incrementally in monthly chunks, runs 10 experiments in parallel (5 per target variable), and selects the model with the best R² score
Inference pipeline: Builds a REST API based on FastAPI, combined with online feature storage to provide real-time predictions (supports single well prediction and available well list query)

Feature Store (Feast)

Feast addresses training-serving skew and acts as the contract between the pipelines:

Offline storage: Parquet format, stores historical features for training
Online storage: SQLite, stores latest features for real-time inference
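The relationship between the two stores can be pictured with a toy, in-memory sketch. This is deliberately not the Feast API: it only shows the idea that the online store holds the most recent feature row per well, derived from the same historical rows used for training. The function name and row layout are assumptions.

```python
def materialize_latest(offline_rows):
    """Build a toy online store: the newest feature row for each well.

    offline_rows: iterable of dicts with "well_id" and "month" ("YYYY-MM").
    """
    online = {}
    for row in sorted(offline_rows, key=lambda r: r["month"]):
        # Rows are visited oldest first, so later months overwrite earlier ones.
        online[row["well_id"]] = row
    return online
```

Because both stores are filled from the same feature rows, the model sees identically computed features at training and at inference time, which is what guards against training-serving skew.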

Experiment Tracking and Model Management (MLFlow)

Experiment tracking: Records metrics (R², RMSE, MAE) and hyperparameters for each experiment
Model registry: Manages versions and automatically promotes the optimal model to production version

Section 05

Model Training Strategy

Incremental Learning Design

The main reason for choosing XGBoost over RandomForest is memory efficiency. The model is trained in monthly chunks, loading only the current month's data at each step, which avoids memory issues.
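One way to implement chunked training is sketched below, using the xgb_model argument of xgboost.train to continue boosting from the previous chunk's model. This is an illustration under assumptions, not the project's training code: the row layout, function names, and rounds_per_chunk default are hypothetical, and for brevity the sketch groups rows already in memory, whereas a real pipeline would load one month at a time.

```python
from itertools import groupby


def monthly_chunks(rows):
    """Yield (month, rows_for_that_month) pairs in chronological order."""
    ordered = sorted(rows, key=lambda r: r["month"])
    for month, group in groupby(ordered, key=lambda r: r["month"]):
        yield month, list(group)


def train_incrementally(rows, feature_names, target, params,
                        rounds_per_chunk=10):
    """Fit one booster by adding trees chunk by chunk."""
    # Imported lazily so monthly_chunks can be used without these packages.
    import numpy as np
    import xgboost as xgb

    booster = None
    for _, chunk in monthly_chunks(rows):
        X = np.array([[r[f] for f in feature_names] for r in chunk],
                     dtype=float)
        y = np.array([r[target] for r in chunk], dtype=float)
        dtrain = xgb.DMatrix(X, label=y)
        # Passing the previous booster continues training instead of
        # starting over; on the first chunk booster is None.
        booster = xgb.train(params, dtrain,
                            num_boost_round=rounds_per_chunk,
                            xgb_model=booster)
    return booster
```

A call might look like train_incrementally(rows, ["depth_m", "flow_days"], "gas_m3", {"objective": "reg:squarederror", "max_depth": 4}), where the feature names and parameter values are placeholders.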

Experiment Configuration

For each target variable (natural gas or oil production), 5 experiments are run, 10 in total. The variables explored include:

Number of iterations per chunk (n_estimators_per_chunk)
Maximum depth of trees (max_depth)
Feature subset combinations
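The experiment layout can be sketched as a small grid: five configurations crossed with two targets, followed by a best-R² selection per target. The configuration values and feature-subset labels below are hypothetical placeholders, since the article does not list the actual settings.

```python
from itertools import product

TARGETS = ("gas", "oil")

# Hypothetical settings; the real values are not given in the article.
CONFIGS = [
    {"n_estimators_per_chunk": 10, "max_depth": 3, "features": "raw"},
    {"n_estimators_per_chunk": 10, "max_depth": 6, "features": "raw"},
    {"n_estimators_per_chunk": 20, "max_depth": 3, "features": "all"},
    {"n_estimators_per_chunk": 20, "max_depth": 6, "features": "all"},
    {"n_estimators_per_chunk": 50, "max_depth": 6, "features": "all"},
]


def build_experiments(targets=TARGETS, configs=CONFIGS):
    """Cross every target with every configuration."""
    return [{"target": t, **c} for t, c in product(targets, configs)]


def best_per_target(results):
    """Keep the result with the highest R² for each target."""
    best = {}
    for res in results:
        current = best.get(res["target"])
        if current is None or res["r2"] > current["r2"]:
            best[res["target"]] = res
    return best
```

Each entry returned by build_experiments would be handed to one training run, and the evaluated results (with an added "r2" key) would then go through best_per_target to pick the model to promote.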

Evaluation Metrics

Test set: The latest 20% of time-series data
Core metrics: R² (coefficient of determination), RMSE (root mean square error), MAE (mean absolute error)
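The evaluation setup can be made concrete with a short sketch: a chronological split that holds out the latest 20% of rows, and the three metrics written out from their definitions. Splitting by row count after sorting by month is an assumption, as the article does not say whether the 20% is measured in rows or in calendar time.

```python
import math


def temporal_split(rows, test_fraction=0.2):
    """Hold out the most recent rows as the test set."""
    ordered = sorted(rows, key=lambda r: r["month"])
    n_test = max(1, round(len(ordered) * test_fraction))
    return ordered[:-n_test], ordered[-n_test:]


def regression_metrics(y_true, y_pred):
    """Compute R², RMSE, and MAE; assumes y_true is not constant."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {
        "r2": 1 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }
```

Holding out the most recent period, rather than a random sample, mirrors how the model is used in practice: it is always asked about months later than anything it was trained on.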

Section 06

Engineering Highlights and Maturity Assessment

Engineering Practice Highlights

Reproducibility: Docker Compose orchestrates all services (Airflow, MLFlow, Feast, API), a single command triggers the complete training process, and feature definitions are centrally managed
Documentation completeness: features.py fully records features, Airflow UI triggers training with one click, and Feast ensures consistency between training and serving features

MLOps Maturity Assessment

Measured against Google's MLOps maturity levels, the project reaches Level 1 (continuous training):

Capability Dimension	Project Status
Model Building	✅ Automated (Airflow DAG)
Training Process	✅ Automated (Monthly Scheduling)
Feature Store	✅ Available (Feast Dual Storage)
Metadata Management	✅ Available (MLFlow Registry)
Deployment Method	⚠️ Manual (Docker API Startup)
CI/CD	❌ None
Monitoring	⚠️ Partial (MLFlow Training Metrics)

Directions for reaching Level 2: automated testing, canary or blue-green deployment, and active monitoring in production (prediction bias and feature drift detection)

Section 07

Application Value and Conclusion

Practical Application Value

Provides a reference template for data science teams in the energy industry:

Time-series prediction scenario: Handling production data with strong time dependencies
Feature engineering best practices: Design ideas from raw readings to aggregated features
Productionization path: Evolution route from experimental code to deployable services
Technical selection reference: Airflow+Feast+MLFlow+FastAPI combination

Conclusion

Developed as a master's project at the University of San Andrés in Argentina, this project demonstrates a complete MLOps implementation in a real business scenario and applies machine learning engineering thinking systematically. For teams that want to move models into production, it is a reference case worth studying in depth.