Zing Forum

HouseNet: A Multimodal House Price Prediction Model Fusing Visual and Structured Data

A multimodal deep learning model that fuses MobileNetV2 image features with tabular data, adds a 16-dimensional city embedding layer, and trains with the Huber loss. On a Southern California house price prediction task it reports an R² of 0.72-0.80 and an MAE of $100k-$130k.

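Multimodal learning, House price prediction, Computer vision, MobileNetV2, Embedding layer, Deep learning, Real estate valuation, Huber loss, Data fusion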
Published 2026-04-19 14:02 | Recent activity 2026-04-19 14:23 | Estimated read 5 min

Section 01

Introduction to the HouseNet Multimodal House Price Prediction Model

HouseNet is a multimodal deep learning model that fuses visual and structured data. It extracts image features with MobileNetV2, combines them with tabular data and a 16-dimensional city embedding, and trains with the Huber loss. On the Southern California house price prediction task it reports an R² of 0.72-0.80 and an MAE of $100k-$130k, which the project describes as a significant improvement in prediction accuracy.

Section 02

Project Background and Research Motivation

The Southern California real estate market is complex: houses in the same neighborhood can sell at very different prices because of differences in appearance and surroundings. Traditional models rely on structured data and ignore visual information. HouseNet starts from the hypothesis that house images carry value-related visual cues, such as building quality and landscaping, and that fusing visual and structured data can therefore improve prediction accuracy.

Section 03

Technical Architecture Design

HouseNet uses an end-to-end multimodal fusion architecture:
  1. Visual feature extraction: MobileNetV2, a lightweight and efficient CNN, extracts multi-scale image features.
  2. Structured data: tabular features are standardized and encoded, then concatenated with the visual features.
  3. City embedding: a 16-dimensional embedding layer maps city names to dense vectors that capture geo-economic similarities and are trained jointly with the rest of the model.
  4. Target and loss: a log transformation of the price handles its long-tailed distribution, and the Huber loss, which behaves like MSE for small errors and like MAE for large ones, makes training robust to outliers.
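
The fusion step can be illustrated with a minimal, framework-free sketch. It is not the project's code: the 1280-dimensional image vector assumes MobileNetV2 at its default width followed by global average pooling, the tabular columns (`sqft`, `beds`) and the city list are hypothetical, and the embedding table is only randomly initialised, whereas in the real model it would be learned jointly with the network.

```python
import random

IMAGE_DIM = 1280   # assumed: MobileNetV2 pooled feature size at width 1.0
EMBED_DIM = 16     # city embedding size stated in the article


def make_city_embeddings(cities, dim=EMBED_DIM, seed=0):
    """Build a lookup table of small random vectors, one per city.

    In a real model these rows are trainable parameters; here they are
    only initialised, never updated.
    """
    rng = random.Random(seed)
    return {city: [rng.gauss(0.0, 0.1) for _ in range(dim)] for city in cities}


def standardize(column):
    """Scale a numeric column to zero mean and unit variance."""
    mean = sum(column) / len(column)
    variance = sum((x - mean) ** 2 for x in column) / len(column)
    std = variance ** 0.5 or 1.0  # fall back to 1.0 for constant columns
    return [(x - mean) / std for x in column]


def fuse(image_features, tabular_features, city, embeddings):
    """Concatenate the three inputs into one vector for a regression head."""
    return list(image_features) + list(tabular_features) + list(embeddings[city])


# Hypothetical listings: living area in square feet and bedroom count.
cities = ["Los Angeles", "San Diego", "Irvine"]
sqft = standardize([1200.0, 2400.0, 1800.0])
beds = standardize([2.0, 4.0, 3.0])
embeddings = make_city_embeddings(cities)

# Placeholder standing in for the CNN output of the first listing's photo.
image_features = [0.0] * IMAGE_DIM

fused = fuse(image_features, [sqft[0], beds[0]], "Los Angeles", embeddings)
print(len(fused))  # 1280 + 2 + 16 = 1298
```

Plain concatenation is the simplest fusion choice; it leaves it to the dense layers after the concatenation to learn how the modalities interact.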

Section 04

Performance

On the Southern California house price prediction task, HouseNet reports an R² of 0.72-0.80 (the proportion of price variance explained), an MAE of $100k-$130k, and a MAPE of 14-18%. Given the wide price range in this market, the write-up considers this error level acceptable.
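
For readers who want to reproduce these three metrics on their own predictions, the standard definitions can be sketched as follows. The prices below are made-up illustrative values, not HouseNet results.

```python
def mae(y_true, y_pred):
    """Mean absolute error, in the same unit as the prices."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (true values must be non-zero)."""
    return 100.0 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual variance / total variance."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical sale prices and predictions in dollars.
y_true = [500_000, 750_000, 1_000_000, 1_250_000]
y_pred = [550_000, 700_000, 1_100_000, 1_150_000]

print(f"MAE:  ${mae(y_true, y_pred):,.0f}")
print(f"MAPE: {mape(y_true, y_pred):.2f}%")
print(f"R2:   {r2_score(y_true, y_pred):.2f}")
```

MAE weights every dollar of error equally, so expensive homes dominate it, while MAPE weights errors relative to each home's price; reporting both, as the article does, gives a more balanced picture.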

Section 05

Key Findings and Inferences from Ablation Experiments

  1. Value of multimodal fusion: Visual cues (e.g., decoration, landscape) supplement information missing from structured data.
  2. Role of city embedding: It is more flexible than simple encoding, capturing complex relationships between cities and helping the model generalize.
  3. Synergy of log transformation and Huber loss: The log transformation compresses extreme values and the Huber loss reduces the impact of abnormal samples, so training focuses on the patterns of typical houses.
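
The interaction between the log transformation and the Huber loss can be sketched in a few lines. This is an illustration under assumptions, not the project's implementation: the article does not state the log base or the Huber threshold, so the natural log and `delta=1.0` are placeholders.

```python
import math


def to_log_price(price):
    """Compress the long-tailed price scale (natural log is an assumption)."""
    return math.log(price)


def from_log_price(log_price):
    """Invert the transform to get a prediction back in dollars."""
    return math.exp(log_price)


def huber(residual, delta=1.0):
    """Huber loss: quadratic within +/- delta, linear outside it."""
    abs_r = abs(residual)
    if abs_r <= delta:
        return 0.5 * residual ** 2
    return delta * (abs_r - 0.5 * delta)


def mean_huber(y_true, y_pred, delta=1.0):
    """Average Huber loss over a batch of (already log-transformed) targets."""
    return sum(huber(t - p, delta) for t, p in zip(y_true, y_pred)) / len(y_true)


# A small residual is penalised like MSE, a large one only linearly.
print(huber(0.5))  # 0.125, same as 0.5 * 0.5**2
print(huber(3.0))  # 2.5, versus 4.5 under the 0.5 * r**2 squared loss

# Hypothetical prices: a typical home and a luxury outlier.
prices = [800_000.0, 12_000_000.0]
targets = [to_log_price(p) for p in prices]
print(targets[1] - targets[0])  # a 15x price gap becomes about 2.7 in log space
```

Because the loss is computed in log space, a fixed residual corresponds to a fixed percentage error rather than a fixed dollar amount, which fits the observation above that training then concentrates on typical houses.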

Section 06

Application Scenarios and Commercial Value

  1. Real estate valuation: Provides more accurate automatic valuation for platforms like Zillow.
  2. Investment decision-making: Identifies undervalued/overvalued properties.
  3. Market trend analysis: Discovers changes in visual factors affecting house prices.
  4. Insurance assessment: Assists in premium pricing.

Section 07

Technical Limitations and Improvement Directions

  1. Dependence on data quality: Image quality affects feature extraction.
  2. Temporal dynamics: Regular retraining is needed to adapt to market changes.
  3. Interpretability: Attention mechanisms could be introduced to enhance transparency.
  4. Cross-region generalization: Transfer to other regions still needs to be verified.

Section 08

Summary and Insights for Multimodal Learning

HouseNet demonstrates the potential of multimodal learning in real estate valuation, achieving its results by fusing visual and structured data with techniques such as city embedding. The main insights are the importance of modal complementarity, domain knowledge encoding, target engineering, and lightweight architecture, which offer a reference for other multi-source data prediction tasks.