Zing Forum

AWS-based Sensor Data Lakehouse Architecture: STEDI Gait Trainer Data Analysis Practice

This article details how to build a lakehouse solution for sensor data, using the STEDI gait trainer as a case study to demonstrate the complete data engineering pipeline from data collection to machine learning model training.

Tags: Data Lakehouse · AWS · Sensor Data · Machine Learning · Data Engineering · Gait Analysis
Published 2026-04-28 09:44 · Recent activity 2026-04-28 09:50 · Estimated read: 7 min
Section 01

[Introduction] AWS-based STEDI Gait Trainer Data Lakehouse Architecture Practice

This article uses the STEDI gait trainer as a case study to show how to build a lakehouse solution for sensor data, covering the complete data engineering pipeline from data collection to machine learning model training. Sensor data is high-frequency, heterogeneous in its sources, and subject to real-time requirements, which strains traditional data warehouse architectures; the project meets these demands with a lakehouse architecture built on AWS managed services. The resulting platform provides data support for medical rehabilitation and elderly care, and can be extended to other IoT data analysis scenarios.

Section 02

Project Background: Value of Gait Data and Challenges of Traditional Architectures

Gait and balance ability is an important indicator of fall risk in the elderly and of progress in rehabilitation treatment. The STEDI gait trainer collects gait data in real time through built-in sensors, while a companion mobile application records user behavior; together these form a valuable data asset. However, the high frequency, heterogeneous sources, and real-time requirements of sensor data pose severe challenges to traditional data warehouse architectures.

Section 03

Lakehouse Architecture Selection and AWS Tech Stack Components

The project adopts a lakehouse architecture, which combines the flexibility of data lakes with the structured management capabilities of data warehouses. Raw data is stored in low-cost object storage, and the metadata layer provides schema constraints and ACID transaction support. The tech stack is based on AWS: Amazon S3 as the storage layer, AWS Glue for data catalog and ETL, Amazon Athena for SQL queries, AWS Lambda for real-time stream processing, and Amazon SageMaker for machine learning development and training.

Section 04

Data Collection & Ingestion and Cleaning & Transformation Process

Data Collection: The STEDI device samples IMU data at 50-100 Hz and transmits it to the mobile app via Bluetooth; the app then uploads it to the cloud through API Gateway. Both batch and streaming ingestion are supported to guarantee data integrity. The app also records user interaction events, which must be time-aligned with the sensor data.
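A minimal sketch of the cloud-side validation step, assuming a Lambda-style handler behind API Gateway that receives JSON batches; the field names (`device_id`, `ts`, accelerometer axes) are illustrative, not the actual STEDI payload schema:

```python
import json
import time

# Illustrative required fields for one IMU sample (assumed schema).
REQUIRED_FIELDS = {"device_id", "ts", "ax", "ay", "az"}

def validate_record(record: dict) -> bool:
    """Accept a record only if all required fields are present and the
    timestamp is a plausible epoch-millisecond value (not in the far future)."""
    if not REQUIRED_FIELDS.issubset(record):
        return False
    ts = record["ts"]
    return isinstance(ts, (int, float)) and 0 < ts < time.time() * 1000 * 2

def ingest_batch(payload: str) -> dict:
    """Split an uploaded JSON batch into accepted and rejected records,
    so bad records never reach the Bronze layer silently."""
    records = json.loads(payload)
    accepted = [r for r in records if validate_record(r)]
    rejected = [r for r in records if not validate_record(r)]
    return {"accepted": accepted, "rejected": rejected}
```

In a real handler the accepted list would be written to S3 and the rejected list routed to a dead-letter queue for inspection.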

Cleaning & Transformation: Raw data undergoes validation (timestamp, value range), anomaly detection, missing value handling (interpolation/forward filling), and unit standardization; feature engineering extracts time-domain (mean, variance, etc.), frequency-domain (FFT spectrum), time-frequency (wavelet transform), and gait cycle (step length, step frequency, etc.) features.
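The cleaning and feature-extraction steps above might look like the following sketch for a single acceleration channel, assuming a 50 Hz sample rate and NumPy; the project's actual feature set (wavelets, gait-cycle features) is broader:

```python
import numpy as np

FS = 50  # assumed sampling rate in Hz

def clean(signal: np.ndarray) -> np.ndarray:
    """Fill missing samples (NaN) by linear interpolation."""
    x = signal.astype(float)
    nans = np.isnan(x)
    x[nans] = np.interp(np.flatnonzero(nans), np.flatnonzero(~nans), x[~nans])
    return x

def extract_features(signal: np.ndarray) -> dict:
    """Time-domain stats plus the dominant frequency from an FFT spectrum."""
    x = clean(signal)
    spectrum = np.abs(np.fft.rfft(x - x.mean()))  # drop DC before peak search
    freqs = np.fft.rfftfreq(len(x), d=1.0 / FS)
    return {
        "mean": float(x.mean()),
        "var": float(x.var()),
        "dominant_freq_hz": float(freqs[spectrum.argmax()]),
    }
```

For a walking signal, the dominant frequency roughly corresponds to step frequency, which is why it is a useful gait feature.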

Section 05

Data Layered Management and Machine Learning Training Support

Data Layered Management: The Bronze layer stores raw JSON/Parquet files (partitioned by date and device ID); the Silver layer contains cleaned and standardized data suitable for exploratory analysis; the Gold layer is an aggregated feature table (daily user summaries, training session summaries, etc.) that directly serves model training and reporting.
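The Bronze-layer partitioning convention (by date and device ID) can be sketched as an S3 key builder; the bucket and prefix names are hypothetical:

```python
from datetime import datetime, timezone

def bronze_key(device_id: str, ts_ms: int, seq: int) -> str:
    """Build an S3 object key partitioned by date and device ID, using
    Hive-style `col=value` path segments so Glue/Athena can prune
    partitions on both columns."""
    day = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).date().isoformat()
    return f"bronze/date={day}/device_id={device_id}/part-{seq:05d}.parquet"
```

Silver and Gold layers would use the same convention under their own prefixes, with the Gold layer typically partitioned by date only, since aggregates span devices.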

Machine Learning Training Support: Data scientists query data independently via Glue Data Catalog, and SageMaker Studio integration simplifies workflows; automated training dataset generation supports time window sampling, class balancing, time-series segmentation, and feature version management.
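A sketch of the automated training-set generation described above, combining sliding time-window sampling with naive class balancing by downsampling; the window size, stride, and label scheme are assumed values:

```python
import random

def windows(samples, labels, size=100, stride=50):
    """Yield (window, majority_label) pairs over a labelled sample stream."""
    for start in range(0, len(samples) - size + 1, stride):
        w = samples[start:start + size]
        l = labels[start:start + size]
        yield w, max(set(l), key=l.count)  # majority label for the window

def balance(pairs, seed=0):
    """Downsample every class to the size of the rarest class."""
    by_label = {}
    for w, label in pairs:
        by_label.setdefault(label, []).append(w)
    n = min(len(ws) for ws in by_label.values())
    rng = random.Random(seed)  # fixed seed for reproducible datasets
    return [(w, label) for label, ws in by_label.items() for w in rng.sample(ws, n)]
```

Time-series segmentation (splitting train/test by time rather than at random) would be layered on top of this, to avoid leakage between overlapping windows.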

Section 06

Technical Challenges and Countermeasures

High-throughput Writing: Use Parquet columnar storage, target file sizes of 128-256 MB, and parallel writes in Glue to avoid small-file accumulation and write-performance bottlenecks.
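The compaction side of the small-file fix can be reduced to a sizing heuristic: choose an output file count so each compacted Parquet file lands in the 128-256 MB range. The 192 MB midpoint target below is an assumption:

```python
TARGET_BYTES = 192 * 1024 * 1024  # midpoint of the 128-256 MB range

def output_file_count(total_bytes: int) -> int:
    """Number of compacted files to write for one partition's worth of
    small files; always at least one, even for tiny partitions."""
    return max(1, round(total_bytes / TARGET_BYTES))
```

In a Glue/Spark job this count would feed `repartition(n)` (or `coalesce(n)`) before the Parquet write.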

Real-time & Offline Consistency: Share transformation code between Spark Structured Streaming and batch jobs, and use idempotent writes so the streaming and offline paths produce consistent results.
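The idempotent-write idea can be illustrated with a deterministic record key, so replaying a micro-batch (streaming retry or batch backfill) overwrites instead of duplicating; here a plain dict stands in for the object store, and the key fields are assumptions:

```python
import hashlib
import json

def record_key(record: dict) -> str:
    """Deterministic key from the record's identifying fields, so the
    same logical record always maps to the same storage location."""
    ident = json.dumps([record["device_id"], record["ts"]], sort_keys=True)
    return hashlib.sha256(ident.encode()).hexdigest()

def idempotent_write(sink: dict, records: list) -> None:
    """Writing the same records twice leaves the sink unchanged."""
    for r in records:
        sink[record_key(r)] = r
```

The same principle applies with S3 object keys or a merge-on-key table format: retries become harmless overwrites rather than duplicate rows.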

Data Privacy Compliance: Apply multi-layer security controls, including S3 server-side encryption, IAM permission control, VPC isolation, and user ID hashing.
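User ID hashing is sketched below with a keyed HMAC rather than a bare SHA-256, since a small ID space is trivially brute-forced without a secret; the salt shown is a placeholder that would live in a secrets manager, outside the lakehouse:

```python
import hashlib
import hmac

# Placeholder secret; in production this would be fetched from a
# secrets manager, never stored alongside the data.
SALT = b"replace-with-secret-from-secrets-manager"

def pseudonymize(user_id: str) -> str:
    """Deterministic keyed pseudonym: the same user always maps to the
    same token, but tokens cannot be reversed without the salt."""
    return hmac.new(SALT, user_id.encode(), hashlib.sha256).hexdigest()
```

Determinism matters here: analytics can still join a user's records across tables, while the raw ID never enters the Bronze/Silver/Gold layers.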

Section 07

Summary and Generalizable Best Practices

This project built a production-grade lakehouse for IoT sensor data. Key lessons include: a layered architecture to manage the data lifecycle, managed services to reduce operational burden, and well-designed data interfaces to support downstream machine learning. The same architecture extends to other IoT analytics scenarios, such as predictive maintenance for industrial equipment and sports performance analysis.