Zing Forum

MultimodalHugs Pipelines: An Experiment Management Framework for Sign Language Processing Research

The NLP team at the University of Zurich open-sourced an experiment management codebase for multimodal sign language processing, supporting model training, hyperparameter search, and reproducibility verification based on MultimodalHugs.

Tags: Sign Language Processing · Multimodal Learning · MultimodalHugs · Experiment Management · Reproducibility · PHOENIX Dataset · NLP Research
Published 2026-04-20 19:44 · Last activity 2026-04-20 19:55 · Estimated read: 6 min

Section 01

[Introduction] MultimodalHugs Pipelines: An Experiment Management Framework for Sign Language Processing Research

The NLP team at the University of Zurich has open-sourced the MultimodalHugs Pipelines experiment management framework, built on the MultimodalHugs extension framework. It supports training of sign language processing models, hyperparameter search, and reproducibility verification. It provides standardized benchmark tests for mainstream sign language datasets like PHOENIX, aiming to address infrastructure pain points in sign language processing research, lower the barrier to entry, and promote result comparability.


Section 02

1. Research Background of Multimodal Sign Language Processing

Sign language, the primary means of communication for deaf communities, combines hand movements, facial expressions, and body posture, which makes automatic recognition and translation challenging. Deep learning has advanced sign language processing in recent years, but mainstream frameworks offer limited support for the field's multimodal data (video, skeletal keypoints, gloss annotations). Hugging Face Transformers, in particular, has little built-in support for this combination of modalities, forcing researchers to repeatedly re-implement infrastructure code and raising the barrier to entry.


Section 03

2. MultimodalHugs Framework and the Value of the Pipelines Project

MultimodalHugs (MMH) is an extension framework for the Hugging Face ecosystem developed by the sign language processing community, providing a unified multimodal data representation, model extensions tailored to sign language, and Trainer integration. The University of Zurich's multimodalhugs-pipelines project is a layer of experiment-management code built on top of MMH. Its core contributions are: 1) experiment reproducibility through scripted workflows and versioned configurations; 2) automated hyperparameter search on SLURM clusters; 3) built-in support for datasets such as PHOENIX, enabling standardized benchmarks.
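The versioned-configuration idea can be sketched in a few lines: fingerprint each experiment configuration so a run can always be traced back to the exact settings that produced it. This is a minimal illustration, not the project's actual API; the function names and config fields here are hypothetical.

```python
import hashlib
import json
import random


def config_fingerprint(config: dict) -> str:
    """Return a short, stable hash of an experiment configuration.

    Serializing with sorted keys makes the fingerprint independent of
    dict insertion order, so identical configs always hash the same.
    """
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def seed_everything(seed: int) -> None:
    """Seed the RNGs a run depends on (extend for numpy/torch as needed)."""
    random.seed(seed)


# Hypothetical experiment config for a PHOENIX run.
config = {"dataset": "phoenix", "lr": 5e-4, "batch_size": 16, "seed": 42}
seed_everything(config["seed"])
print(config_fingerprint(config))  # same config -> same fingerprint
```

Logging this fingerprint alongside each run (and committing the config file itself) is one simple way to make "which settings produced this score?" answerable months later.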


Section 04

3. Technical Architecture and Workflow of the Pipelines Project

The project uses a modular architecture with a four-stage workflow: 1) environment management: automated virtual-environment creation and dependency installation to ensure consistency; 2) data pipeline: automatic download of the PHOENIX dataset, with preprocessing steps such as video decoding, frame sampling, and keypoint extraction; 3) training management: SLURM integration, supporting distributed training and a dry-run mode for configuration verification; 4) evaluation: repeatability test scripts that quantify the impact of randomness.
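As one concrete illustration of the data-pipeline stage, uniform frame sampling from a decoded video clip can be sketched as below. This is a simplified stand-in for illustration only, not the project's actual preprocessing code.

```python
def sample_frames(num_frames: int, k: int) -> list[int]:
    """Pick k evenly spaced frame indices from a clip of num_frames frames.

    The first and last frames are always included, so the sampled
    subset spans the whole sign sequence.
    """
    if num_frames <= 0 or k <= 0:
        raise ValueError("num_frames and k must be positive")
    if k == 1:
        return [num_frames // 2]
    step = (num_frames - 1) / (k - 1)
    return [round(i * step) for i in range(k)]


# Sample 8 frames from a 100-frame clip.
print(sample_frames(100, 8))  # -> [0, 14, 28, 42, 57, 71, 85, 99]
```

Real pipelines often add temporal jitter during training and fall back to this deterministic sampling at evaluation time, which is one of the places the project's dry-run and repeatability checks earn their keep.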


Section 05

4. Reproducibility Research and Benchmark Test Results

The reproducibility study identified several sources of non-determinism: single-process data loaders still introduce run-to-run differences, the choice of FP16 versus FP32 precision affects training dynamics, and weight initialization varies slightly between runs. Benchmark results: the base model reached a BLEU score of 10.691 on the PHOENIX dataset; the hyperparameter search covered 50 configurations (each taking about 2 hours); and three repeated runs scored BLEU 10.199, 10.217, and 10.472, so results are stable but do fluctuate.
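The run-to-run fluctuation in those three repeated-run BLEU scores can be quantified with basic statistics (computed here independently from the numbers above, not taken from the project's reports):

```python
from statistics import mean, stdev

# BLEU scores from the three repeated runs reported above.
scores = [10.199, 10.217, 10.472]

avg = mean(scores)
spread = stdev(scores)  # sample standard deviation

print(f"mean BLEU = {avg:.3f}, std = {spread:.3f}")
# -> mean BLEU = 10.296, std = 0.153
```

A standard deviation of roughly 0.15 BLEU is a useful yardstick: score differences between configurations that fall within this band may be noise rather than genuine improvements, which is exactly why the project ships repeatability test scripts.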


Section 06

5. Community Significance and Future Development Directions

Significance to the community: Lowering the research barrier (focus on innovation rather than infrastructure), promoting result comparability, supporting open-source collaboration, and providing educational cases. Future directions: Building larger-scale sign language datasets, exploring self-supervised pre-training strategies, developing real-time applications, and researching cross-sign-language transfer learning.