Based on the above principles, the paper proposes an improved multimodal RoPE scheme with three components:

1. **2D RoPE extension.** For image patches, apply RoPE along both spatial axes. Pseudo-code illustration:

```python
def apply_2d_rope(patch_embed, pos_x, pos_y):
    # Apply a rotary encoding for the x and y directions separately,
    # then combine the two rotated representations.
    rotated_x = apply_rope(patch_embed, pos_x)
    rotated_y = apply_rope(patch_embed, pos_y)
    return combine(rotated_x, rotated_y)
```

2. **Modality-aware unified space.** Map two-dimensional image positions and one-dimensional text positions into a single high-dimensional position space:
   - Text position: (t) → mapped to a dedicated subspace;
   - Image position: (x, y) → mapped to a complementary subspace.

3. **Explicit modality identification.** Introduce a modality-type embedding so the model can distinguish whether it is processing image or text tokens.
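One common way to realize the 2D extension sketched above is to split the embedding dimensions in half and rotate one half by the x position and the other by the y position. The sketch below is a minimal NumPy illustration under that assumption; `rope_rotate` and `apply_2d_rope` are illustrative names, not functions from the paper:

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Standard 1-D RoPE: rotate interleaved pairs of x (last dim even)
    by angles pos * base^(-2i/d), the usual RoPE frequency schedule."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)   # [d/2] frequencies
    angles = pos * inv_freq
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]            # interleaved pairs
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def apply_2d_rope(patch_embed, pos_x, pos_y):
    """2-D RoPE sketch: first half of the dims encodes the x position,
    second half encodes the y position; concatenation combines them."""
    half = patch_embed.shape[-1] // 2
    first = rope_rotate(patch_embed[..., :half], pos_x)
    second = rope_rotate(patch_embed[..., half:], pos_y)
    return np.concatenate([first, second], axis=-1)
```

A text token at 1-D position t can be fed through the same function as `apply_2d_rope(embed, t, t)`, which is one simple way to place both modalities in the unified position space described above. Because each pairwise rotation is orthogonal, the encoding preserves vector norms, and dot products between rotated queries and keys depend only on relative (x, y) offsets.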