<20s
Generation Time (3 Images)
Single NVIDIA A6000 GPU
SIGGRAPH Asia '25
Publication Venue
1st
Multi-Image Region-Level 3D Control

Overview

Breaking the Single-Image Barrier
in 3D Content Generation

Fuse3D, presented in the paper Generating 3D Assets Controlled by Multi-Image Fusion, addresses a fundamental limitation shared by virtually all existing text-to-3D and image-to-3D pipelines: they accept only a single conditioning image as a global input, so creators cannot specify different visual characteristics for different spatial regions of a model in a single generation pass.


Developed at the State Key Laboratory of CAD&CG, Zhejiang University, Fuse3D introduces a principled multi-condition architecture that fuses visual features from multiple independent reference images and assigns them to precisely targeted 3D regions — without requiring any fine-tuning of the underlying generative model. The result is unprecedented local control over geometry, texture, and appearance, all within a single coherent 3D asset produced in under 20 seconds.


The framework is built upon TRELLIS, Microsoft's state-of-the-art image-to-3D model, and adopts 3D Gaussian Splatting (3DGS) as its core scene representation — a choice that enables photorealistic rendering at interactive frame rates while remaining fully compatible with downstream editing workflows.


Core Method

Three Innovations, One Unified Framework

The Fuse3D pipeline introduces three tightly integrated modules, each solving a distinct challenge inherent to multi-image 3D generation. Together, they form a complete system for region-level control without model retraining.

Multi-Condition Fusion Module (MCFM)

Integrates visual features from multiple distinct conditioning images into a unified set of condition tokens. The module supports hierarchical control spanning global structure down to fine local detail, enabling each reference image to contribute independently to the final 3D output while remaining harmonious with the whole.
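
As a rough illustration, the fusion step can be pictured as concatenating each reference's encoder tokens while tagging them with their target region. The sketch below is a minimal Python approximation assuming DINOv2-style patch tokens per image; fuse_condition_tokens and its shapes are illustrative stand-ins, not the released Fuse3D API.

from typing import List, Tuple

import torch

def fuse_condition_tokens(image_feats: List[torch.Tensor],
                          region_ids: List[int]) -> Tuple[torch.Tensor, torch.Tensor]:
    """Merge per-image condition tokens into one fused sequence.

    image_feats: one (N_i, D) token tensor per reference image, e.g. patch
                 features from the frozen image encoder.
    region_ids:  target-region label for each reference image, carried along
                 so later stages can route tokens to their 3D regions.
    """
    tokens = torch.cat(image_feats, dim=0)          # (sum N_i, D) fused condition tokens
    tags = torch.cat([torch.full((f.shape[0],), r, dtype=torch.long)
                      for f, r in zip(image_feats, region_ids)])
    return tokens, tags

# Three references (e.g. wing, fuselage, cockpit), 1024 tokens of width 768 each
feats = [torch.randn(1024, 768) for _ in range(3)]
cond, tags = fuse_condition_tokens(feats, region_ids=[0, 1, 2])
print(cond.shape, tags.shape)   # torch.Size([3072, 768]) torch.Size([3072])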

3D Semantic-Aware Alignment (Voxel Alignment)

Automatically aligns 2D image regions selected by the user with their spatially corresponding 3D voxel regions. Using attention maps extracted from the pretrained TRELLIS model, Fuse3D establishes semantic 2D-to-3D correspondences without manual 3D annotation, enabling accurate region-specific feature transfer.
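
A hedged sketch of the idea: given a cross-attention map from voxel tokens to image patch tokens (averaged over heads and layers), each voxel is scored by how much of its attention mass falls inside the user's 2D mask, and the top-scoring voxels form the matched 3D region. voxels_for_region and the quantile threshold are assumptions for illustration, not the paper's exact procedure.

import torch

def voxels_for_region(attn: torch.Tensor,
                      patch_mask: torch.Tensor,
                      quantile: float = 0.8) -> torch.Tensor:
    """Map a user-selected 2D region to a 3D voxel mask.

    attn:       (V, P) cross-attention map from V active voxel tokens to P
                image patch tokens, averaged over heads and layers.
    patch_mask: (P,) boolean mask of the patches inside the selected region.
    Returns a (V,) boolean mask of voxels semantically tied to the region.
    """
    score = attn[:, patch_mask].sum(dim=1) / attn.sum(dim=1).clamp_min(1e-8)
    return score >= torch.quantile(score, quantile)   # keep the top-scoring voxels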

Local Attention Enhancement Strategy

Resolves feature conflicts that arise when multiple conditioning signals target different regions simultaneously. By constructing localized attention matrices and exposing an adjustable Enhancement Factor, the strategy gives users fine-grained, real-time control over the relative influence of each reference image on the generated output.
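
One plausible realization, sketched under assumptions: add log(beta) to the attention logits wherever a voxel's region matches a condition token's region, which multiplies in-region attention weights by beta before renormalization and so localizes each image's influence. The paper's localized attention matrices may differ in detail; region_enhanced_attention is illustrative only.

import math

import torch
import torch.nn.functional as F

def region_enhanced_attention(q, k, v, voxel_region, token_region, beta=1.5):
    """Cross-attention with a per-region Enhancement Factor (illustrative).

    q: (V, D) voxel queries; k, v: (T, D) fused condition tokens.
    voxel_region: (V,) region id per voxel (from the alignment step).
    token_region: (T,) region id per condition token (from the fusion step).
    beta > 1 amplifies in-region attention; 0 < beta < 1 suppresses it.
    """
    logits = q @ k.T / math.sqrt(q.shape[-1])               # (V, T) scaled dot products
    match = voxel_region[:, None] == token_region[None, :]  # voxel/token region agreement
    # Shifting matched logits by log(beta) scales their post-softmax
    # attention weights by beta before renormalization.
    logits = logits + match.float() * math.log(beta)
    return F.softmax(logits, dim=-1) @ v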

Generation Pipeline

01 · Input

Multi-Image Conditioning

User provides multiple reference images and selects target regions for each via an interactive interface.

02 · MCFM

Feature Fusion

Visual features from all conditioning images are merged into a unified latent condition token representation.

03 · Alignment

2D-to-3D Mapping

TRELLIS attention maps semantically link each 2D region to its corresponding 3D voxel space location.

04 · Enhancement

Conflict Resolution

Local attention matrices enforce region boundaries, balancing competing features via the Enhancement Factor.

05 · Output

3D Asset

TRELLIS VAE decoder renders the final structured latent representation into a complete 3D Gaussian Splatting asset.
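
To make the five stages concrete, the toy driver below wires together the illustrative helpers from the Core Method section on random tensors. Every stand-in here (the random attention map, the shapes, the omitted sampler and decoder) is an assumption, not the real pipeline.

import torch

# Toy wiring of the sketches above. The real encoder, sampler, and VAE
# decoder are frozen TRELLIS components that this illustration replaces
# with random stand-ins.
V, P, D = 4096, 1024, 64                                   # voxels, patches per image, width
refs = [torch.randn(P, D) for _ in range(3)]               # 01 · encoded reference tokens
cond, tags = fuse_condition_tokens(refs, [0, 1, 2])        # 02 · MCFM fusion
attn = torch.rand(V, cond.shape[0])                        # stand-in TRELLIS attention map
voxel_region = torch.zeros(V, dtype=torch.long)
for r in range(3):                                         # 03 · 2D-to-3D alignment
    voxel_region[voxels_for_region(attn, tags == r)] = r
feats = region_enhanced_attention(torch.randn(V, D), cond, cond,
                                  voxel_region, tags, beta=1.5)   # 04 · enhanced attention
print(feats.shape)   # (V, D) features the sampler/decoder would turn into 3DGS (05)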


Applications

Designed for Real Creative Workflows

Fuse3D is engineered to address concrete production challenges in game development, VFX, virtual reality, and AI-assisted design. Its sub-20-second generation time on a single GPU makes it practical for iterative creative exploration.

Texture Generation

Multi-Reference Texture Synthesis

Generate high-fidelity, region-consistent textures for untextured 3D meshes using multiple localized reference images. Different parts of a mesh — such as a wing, fuselage, and cockpit of an aircraft — can each draw from independent photographic references in a single pass.

Mesh Editing

Region-Level Feature Transfer

Select any region in a 2D reference image and transfer its visual characteristics — color, material, pattern, structure — to a corresponding location on an existing 3D mesh. Enables non-destructive, artistically directed editing without manual UV unwrapping or texture painting.

Creative Design

Hybrid Concept Asset Creation

Combine visual attributes from entirely different sources into a single coherent 3D model. Create fantasy creatures with mixed animal markings, vehicles that blend historical and futuristic design languages, or characters with region-specific costume elements — all in one generation.

Interactive Control

Real-Time Enhancement Factor Tuning

The exposed Enhancement Factor parameter allows artists to dynamically adjust the influence of each conditioning region during the generation process — turning what was previously a binary on/off operation into a continuous, expressive creative dial.
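
A small numerical illustration of that dial, using the same log(beta) logit shift assumed in the Core Method sketch: as beta grows, the share of attention each voxel pays to its own region's tokens rises smoothly rather than flipping on or off. All shapes and values are toy stand-ins.

import math

import torch
import torch.nn.functional as F

torch.manual_seed(0)
q, cond = torch.randn(256, 64), torch.randn(96, 64)
# Random region assignments for voxels (rows) and condition tokens (columns)
match = (torch.randint(0, 3, (256, 1)) == torch.randint(0, 3, (1, 96))).float()
base = q @ cond.T / math.sqrt(64)
for beta in (0.5, 1.0, 2.0, 4.0):
    w = F.softmax(base + match * math.log(beta), dim=-1)
    print(f"beta={beta:.1f}: in-region attention mass = {(w * match).sum(1).mean():.2f}")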


Research Team

Zhejiang University
CAD&CG National Key Laboratory

The State Key Laboratory of CAD&CG at Zhejiang University is one of China's foremost research institutions in computer graphics, computer vision, and virtual reality. Established in 1992 as a national "Seventh Five-Year Plan" key project, the laboratory has produced foundational work in rendering, 3D reconstruction, and AI-driven content generation. Fuse3D represents the lab's latest contribution to the global AIGC research community.

Xuancheng Jin

金宣丞

First Author

Rengan Xie

谢仁干

Corresponding Author

Wenting Zheng

郑文婷

Author

Rui Wang

王锐

Professor

Hujun Bao

鲍虎军

Professor · Lab Director

Yuchi Huo

霍宇驰

Corresponding Author · ZJLab


Technical Specifications

System Requirements & Environment

Fuse3D is designed for deployment on professional Linux workstations. The TRELLIS base model weights are downloaded automatically from Hugging Face on first run; no manual model configuration is required. The system does not require fine-tuning of any pretrained components.

Operating System: Linux (officially tested)
GPU: NVIDIA GPU with ≥ 16 GB VRAM (validated on an NVIDIA A6000)
CUDA Version: CUDA 11.8 recommended
Python: Python 3.8 or higher
Key Dependencies: xformers, flash-attn, and standard PyTorch ecosystem libraries
Base Model: microsoft/TRELLIS-image-large (auto-downloaded from Hugging Face)
3D Representation: Structured Latent (SLat) + 3D Gaussian Splatting (3DGS)
Generation Latency: < 20 seconds for fusion of 3 conditioning images
Fine-tuning Required: None (TRELLIS weights are used as-is)
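
For orientation, loading the frozen base model follows the Python interface published in the upstream TRELLIS repository; the snippet below is a minimal single-image sketch using the model ID from the table. Fuse3D's own multi-image entry points are not shown, and the output keys follow the TRELLIS README.

from PIL import Image
from trellis.pipelines import TrellisImageTo3DPipeline

# Weights are fetched from Hugging Face on first run, as noted above.
pipeline = TrellisImageTo3DPipeline.from_pretrained("microsoft/TRELLIS-image-large")
pipeline.cuda()

image = Image.open("reference.png")
outputs = pipeline.run(image, seed=1)   # single-image baseline generation
gaussians = outputs["gaussian"][0]      # 3DGS representation of the asset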

Experimental Evaluation

Comparison Against State-of-the-Art

Because no prior method directly supports multi-image region-level 3D control, the Fuse3D team constructed a rigorous evaluation protocol: GPT-4o was used to generate descriptive text prompts from the multi-image inputs, which then guided four established baseline methods before their outputs were lifted to 3D via TRELLIS. Fuse3D consistently outperformed all baselines in regional precision and feature fidelity.
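
An illustrative version of the protocol's first step, assuming OpenAI's standard chat completions API: GPT-4o is asked to compress the multi-image condition set into one descriptive prompt that the 2D baselines (tabulated below) can consume. The prompt wording and file names are placeholders, not the paper's exact setup.

import base64

from openai import OpenAI

client = OpenAI()

def to_data_url(path):
    """Encode a local image as a base64 data URL for the vision API."""
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

content = [{"type": "text",
            "text": "Describe, in one prompt, a single object that combines "
                    "the region-specific appearance of these references."}]
content += [{"type": "image_url", "image_url": {"url": to_data_url(p)}}
            for p in ["ref_wing.png", "ref_fuselage.png", "ref_cockpit.png"]]

resp = client.chat.completions.create(model="gpt-4o",
                                      messages=[{"role": "user", "content": content}])
prompt_for_baselines = resp.choices[0].message.content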

IP-Adapter: Text + image condition fusion → SDXL 2D generation → TRELLIS 3D lift
MasaCtrl: SDXL-based real-image editing via mutual self-attention → TRELLIS
Prompt-to-Prompt: Stable Diffusion prompt-space editing → TRELLIS
Ctrl-X: Structure + appearance control-signal transfer → TRELLIS
Fuse3D (Ours): Direct multi-image region-level conditioning; no 2D intermediate required

Resources

Access the Research & Code

All materials associated with the Fuse3D project — including the full paper, source code, and interactive demonstrations — are publicly available. We encourage researchers, engineers, and creative practitioners to explore, reproduce, and build upon this work.

Research Paper

arXiv 2602.17040

Source Code

GitHub · JINNMnm/Fuse3D

Project Homepage

Interactive Demo & Visualizations

ACM Digital Library

DOI 10.1145/3757377.3763943

@inproceedings{jin2025fuse3d,
  title     = {Fuse3D: Generating 3D Assets Controlled by Multi-Image Fusion},
  author    = {Jin, Xuancheng and Xie, Rengan and Zheng, Wenting
               and Wang, Rui and Bao, Hujun and Huo, Yuchi},
  booktitle = {ACM SIGGRAPH Asia 2025},
  year      = {2025},
  doi       = {10.1145/3757377.3763943},
  url       = {https://arxiv.org/abs/2602.17040}
}