Section 01
[Introduction] MADE Benchmark: A New Paradigm for Trustworthy Evaluation of Medical AI
MADE is a dynamic multi-label classification benchmark for medical device adverse events. It prevents data contamination through strict temporal partitioning, systematically evaluates the predictive performance and uncertainty quantification (UQ) of more than 20 models, and reveals a complex trade-off between model size and UQ quality. Its core innovation is a continuous update mechanism that addresses the saturation and data-contamination problems of existing static benchmarks, providing a realistic and trustworthy evaluation platform for medical AI.
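The strict temporal partitioning mentioned above can be sketched as a simple date-based split: any report filed on or after a cutoff date goes to the test split, so no test event can have leaked into training data. This is a minimal illustration with hypothetical field names (`report_date`, `labels`), not MADE's actual pipeline.

```python
from datetime import date

def temporal_split(records, cutoff):
    """Partition records by report date: strictly earlier than the
    cutoff -> train, on or after the cutoff -> test. This guarantees
    the test set contains only events unseen at training time."""
    train = [r for r in records if r["report_date"] < cutoff]
    test = [r for r in records if r["report_date"] >= cutoff]
    return train, test

# Toy adverse-event records with illustrative multi-label annotations.
records = [
    {"id": 1, "report_date": date(2022, 5, 1), "labels": ["malfunction"]},
    {"id": 2, "report_date": date(2023, 2, 1), "labels": ["injury"]},
    {"id": 3, "report_date": date(2023, 8, 1), "labels": ["malfunction", "injury"]},
]

train, test = temporal_split(records, cutoff=date(2023, 1, 1))
print([r["id"] for r in train])  # → [1]
print([r["id"] for r in test])   # → [2, 3]
```

A continuously updated benchmark would re-run this split as new reports arrive, moving the cutoff forward so the test set stays ahead of any model's training data.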