Zing Forum

Multimodal Real Estate Valuation Model Integrating CLIP Visual Features

By combining traditional tabular data with visual features extracted zero-shot by a CLIP model, this study reports a statistically significant improvement over a tabular-only baseline on 730 property listings from Gijón, Spain.

Tags: multimodal, CLIP, real-estate, zero-shot, valuation
Published 2026-05-20 02:19 | Recent activity 2026-05-20 02:52 | Estimated read 7 min

Section 01

[Original Post] Core Results of the Multimodal Real Estate Valuation Model Integrating CLIP Visual Features

This paper proposes a multimodal real estate valuation model that combines traditional tabular data with visual features extracted zero-shot by a CLIP model. On 730 property listings from Gijón, Spain, adding the CLIP features raises R² from 0.59 to 0.62 and lowers MAPE from 25.2% to 23.8% relative to the tabular-only baseline, a difference the study reports as statistically significant. The core idea is to use CLIP's zero-shot capability to capture visual information such as decoration and lighting from property photos, giving the valuation model features that tabular data alone does not provide.

Section 02

Research Background and Motivation

Traditional real estate valuation relies on structured tabular data such as location, area, and number of rooms. However, visual information in property photos (decoration status, lighting conditions, view) can substantially affect housing prices yet is difficult for traditional models to capture. The team from the Department of Mathematics at the University of Oviedo (Spain) therefore asks a key question: can computer vision improve real estate valuation models?

Section 03

Methodological Framework

Data Foundation

The study is based on 730 property listings from Gijón, Spain (as of January 19, 2026), with approximately 21,700 property photos from the Fotocasa platform (roughly 30 per listing), used in compliance with academic usage terms.

Visual Feature Extraction

A CLIP model (ViT-B/32 with laion2b_s34b_b79k weights; this tag denotes an OpenCLIP checkpoint trained on LAION-2B rather than OpenAI's original weights) is used to extract zero-shot visual scores across 6 dimensions: decoration status, lighting conditions, material quality, kitchen facilities, bathroom conditions, and view. Each score is the similarity between the image and a positive prompt minus its similarity with a negative prompt. The results are pre-cached for reproducibility.
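The positive-minus-negative scoring rule can be sketched as follows. This is a minimal illustration, not the project's clip_scorer.py: it assumes the image and prompt embeddings have already been produced by CLIP's image and text encoders, uses cosine similarity as the similarity measure, and the function names, toy vectors, and example prompts are all hypothetical.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def contrast_scores(image_emb, pos_emb, neg_emb):
    """Score = cos(image, positive prompt) - cos(image, negative prompt).

    image_emb: (n_images, d) image embeddings
    pos_emb, neg_emb: (n_dims, d) text embeddings, one prompt pair per dimension
    Returns an (n_images, n_dims) array with values in [-2, 2].
    """
    img = l2_normalize(np.asarray(image_emb, dtype=float))
    pos = l2_normalize(np.asarray(pos_emb, dtype=float))
    neg = l2_normalize(np.asarray(neg_emb, dtype=float))
    return img @ pos.T - img @ neg.T


# Toy 3-d vectors standing in for real CLIP embeddings (hypothetical values).
images = np.array([[1.0, 0.0, 0.0],    # aligned with the positive prompt
                   [0.0, 1.0, 0.0]])   # aligned with the negative prompt
positive = np.array([[1.0, 0.0, 0.0]])  # e.g. "a bright, sunlit room"
negative = np.array([[0.0, 1.0, 0.0]])  # e.g. "a dark, gloomy room"

scores = contrast_scores(images, positive, negative)
print(scores)  # first image scores +1, second image scores -1
```

With six prompt pairs the result has one column per visual dimension. How the per-photo scores are aggregated into one value per listing is not stated in this summary.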

Model Architecture

Ridge regression is fitted to log(price), since housing prices are approximately log-normally distributed. Features are min-max scaled to [-1, 1], the regularization strength alpha is selected with RidgeCV over 100 log-spaced values (exponents from -5 to 8), and Jensen's correction is applied when transforming predictions back to Euros, because simply exponentiating a log-scale prediction would underestimate the expected price.
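A rough sketch of this pipeline on synthetic data is shown below. Several details are assumptions rather than the study's code: the alpha grid is read as np.logspace(-5, 8, 100), a simple hold-out search stands in for RidgeCV, ridge is solved in closed form with NumPy, and the Jensen correction is taken to be the usual log-normal form exp(prediction + sigma²/2) with sigma² estimated from training residuals.

```python
import numpy as np


def minmax_scale(X, lo=-1.0, hi=1.0):
    """Linearly rescale each column to [lo, hi]."""
    X = np.asarray(X, dtype=float)
    col_min, col_max = X.min(axis=0), X.max(axis=0)
    span = np.where(col_max > col_min, col_max - col_min, 1.0)  # avoid 0-division
    return lo + (X - col_min) / span * (hi - lo)


def fit_ridge(X, y, alpha):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return coef, y_mean - x_mean @ coef


# Synthetic data (hypothetical, not the study's): log-price linear in two features.
rng = np.random.default_rng(0)
area = rng.uniform(40, 200, size=200)
rooms = rng.integers(1, 6, size=200)
log_price = 11.0 + 0.008 * area + 0.05 * rooms + rng.normal(0, 0.3, size=200)
X = minmax_scale(np.column_stack([area, rooms]))  # scaled on all rows for brevity

# Assumed reading of the alpha grid: 100 values from 1e-5 to 1e8.
alphas = np.logspace(-5, 8, 100)

# Simple hold-out search standing in for RidgeCV's internal cross-validation.
train, valid = slice(0, 150), slice(150, 200)


def valid_mse(alpha):
    c, b = fit_ridge(X[train], log_price[train], alpha)
    return np.mean((X[valid] @ c + b - log_price[valid]) ** 2)


best_alpha = min(alphas, key=valid_mse)
coef, intercept = fit_ridge(X[train], log_price[train], best_alpha)

pred_log = X[valid] @ coef + intercept
resid = log_price[train] - (X[train] @ coef + intercept)
sigma2 = resid.var()                             # residual variance in log space

naive_eur = np.exp(pred_log)                     # targets the median price
corrected_eur = np.exp(pred_log + 0.5 * sigma2)  # targets the mean price
```

The corrected prediction is always the naive one multiplied by the constant factor exp(sigma²/2), so the correction shifts the level of the Euro predictions without changing their ranking.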

Section 04

Experimental Results and Statistical Validation

Comparison of 10-fold cross-validation results:

| Model | R² (test set) | MAE (€) | RMSE (€) | MAPE (%) |
|---|---|---|---|---|
| M1: Pure Tabular Baseline | 0.59 | 58,181 | 95,688 | 25.2 |
| M2: Baseline + Feature Engineering | 0.59 | 57,410 | 92,059 | 24.8 |
| M3: Baseline + CLIP | 0.62 | 56,441 | 92,607 | 23.8 |
| M4: Baseline + FE + CLIP | 0.62 | 56,474 | 94,642 | 23.7 |
| M6: Ridge + XGBoost Cascade | 0.71 | 47,714 | 88,735 | 18.9 |

M3 is the study's main model, chosen per Occam's razor (M4 adds feature engineering without improving R² or MAE). Compared to M1, M3 reduces MAE by approximately 1,740 Euros (58,181 to 56,441) and MAPE by 1.4 percentage points (25.2% to 23.8%). A one-sided Wilcoxon signed-rank test indicates that M3 outperforms M1 (p = 0.0205), supporting the usefulness of the visual features.
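To illustrate what a one-sided Wilcoxon signed-rank test measures, here is a simplified exact version in plain Python. It assumes no zero differences and no tied absolute differences, and it pairs hypothetical per-fold MAE values. Those numbers are invented for the example, and the summary does not say whether the study paired errors per fold or per sample.

```python
def wilcoxon_greater(x, y):
    """Exact one-sided Wilcoxon signed-rank test for H1: x tends to exceed y.

    Simplified sketch: assumes no zero differences and no tied |differences|.
    Returns (W+, p-value), where W+ is the rank sum of positive differences.
    """
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    # Rank the absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    w_plus = sum(rank for rank, i in enumerate(order, start=1) if diffs[i] > 0)

    # Under H0 each rank carries a positive sign with probability 1/2, so
    # count the subsets of {1..n} by their sum with a small dynamic program.
    max_sum = n * (n + 1) // 2
    counts = [1] + [0] * max_sum
    for rank in range(1, n + 1):
        for s in range(max_sum, rank - 1, -1):
            counts[s] += counts[s - rank]
    return w_plus, sum(counts[w_plus:]) / 2 ** n


# Hypothetical per-fold MAE in Euros (NOT the study's data).
baseline_mae = [58100, 57900, 59300, 56000, 60200, 57500, 58800, 59900, 56700, 58400]
clip_mae = [56400, 55800, 58400, 56200, 57600, 56200, 56900, 56900, 56200, 57300]

w_plus, p_value = wilcoxon_greater(baseline_mae, clip_mae)
print(w_plus, p_value)  # 54 0.001953125
```

In practice, scipy.stats.wilcoxon with alternative="greater" runs the same one-sided test and also handles ties and zero differences.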

Section 05

Technical Implementation and Application Value

Technical Implementation

The project uses modular code design:

  • data.py: Data loading, IQR outlier filtering, one-hot encoding
  • features.py: Feature set management
  • clip_scorer.py: CLIP zero-shot scoring (supports caching)
  • models.py: Cross-validation encapsulation
  • evaluate.py: Evaluation metrics and statistical tests
  • plots.py: Visualization charts

The project also includes a complete Jupyter Notebook workflow (from EDA to result analysis).
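The preprocessing steps listed for data.py (IQR outlier filtering and one-hot encoding) might look roughly like the sketch below. The function names, the conventional 1.5 × IQR fence, and the sample values are assumptions for illustration, since the summary does not give the project's actual thresholds or columns.

```python
import numpy as np


def iqr_mask(values, k=1.5):
    """True for values inside [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is an assumed default."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values >= q1 - k * iqr) & (values <= q3 + k * iqr)


def one_hot(labels):
    """Return (sorted category names, 0/1 indicator matrix)."""
    categories = sorted(set(labels))
    index = {c: j for j, c in enumerate(categories)}
    encoded = np.zeros((len(labels), len(categories)), dtype=int)
    for i, label in enumerate(labels):
        encoded[i, index[label]] = 1
    return categories, encoded


# Hypothetical prices in Euros; the last one is an obvious outlier.
prices = [150_000, 180_000, 210_000, 240_000, 2_500_000]
keep = iqr_mask(prices)
print(keep)  # only the last price is flagged for removal

# Hypothetical categorical column.
categories, encoded = one_hot(["flat", "house", "flat"])
print(categories)
print(encoded)
```

Filtering before encoding keeps extreme listings from dominating the Ridge fit, which matters for a linear model on a sample of this size.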

Application Value

  1. Zero-shot capability: Extract visual features without labeled data
  2. Interpretability: Clear meaning of scores across 6 dimensions
  3. Performance improvement: Statistically significant improvement with a simple model
  4. Reproducibility: Open-source code and precomputed features facilitate verification and extension

Section 06

Limitations and Future Directions

Limitations

  • Data is limited to Gijón, Spain, with a small sample size
  • CLIP scoring relies on predefined prompt templates

Future Directions

  • Validation with larger-scale multi-city data
  • End-to-end fine-tuning of visual encoders
  • Fine-grained room-level visual analysis

Section 07

Research Summary

This study provides evidence that CLIP visual features improve real estate valuation, and offers a concise recipe for combining traditional tabular data with visual information: zero-shot visual scoring plus a traditional regression model. The approach is not specific to housing and may transfer to other valuation scenarios that combine structured data with visual perception, although this summary reports results for a single city only.