Variational Test-time Optimization for Diffusion Synchronization

Abstract

Collaborative generation coordinates multiple diffusion trajectories to extend pretrained priors beyond single-sample generation, and has emerged as a powerful paradigm for broadening the applicability of diffusion models. Among existing approaches, diffusion synchronization provides a scenario-agnostic solution by introducing general guidance mechanisms. However, current synchronization approaches rely heavily on heuristics and still require task-specific tailoring, which limits their generalizability and performance. In this work, we derive a synchronization framework from optimal control, providing a principled explanation of diffusion synchronization. During sampling, we optimize control variables that guide multiple trajectories toward coherent solutions while keeping them close to the underlying diffusion prior. Our method operates entirely at test time without additional training, so it applies broadly across generation scenarios when combined with strong pretrained priors. We demonstrate consistent improvements over baselines on three representative collaborative generation tasks spanning a range of modalities and applications. Beyond performance gains, our work establishes a principled foundation for extending pretrained generative models to new collaborative generation settings.

Introduction

Collaborative generation couples multiple diffusion trajectories so each one remains plausible on its own, while all of them agree as one larger structure, such as a wide image, an optical illusion, or a textured 3D mesh.
Diffusion synchronization tackles collaborative generation by coordinating those trajectories during sampling, but existing methods often rely on heuristic averaging or impractical Gaussian approximations.
SyncVC treats synchronization as controlled sampling: it optimizes small variational controls at test time so trajectories become consistent while staying close to the pretrained diffusion prior.

Method

A. Problem Formulation

The goal of collaborative generation is to generate a set of consistent elements rather than a single isolated sample. For example, in wide image generation, the elements are overlapping patches, while in optical illusion generation, the elements are transformed views of one image. Across these cases, a task-specific reward measures whether the elements agree with the collaborative structure required by the task.
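One way to make this precise, using notation that is assumed here rather than taken from the source: write the elements as $x^{(1)}, \dots, x^{(N)}$, let $p_\theta(\cdot \mid c_i)$ be the pretrained diffusion prior for element $i$ under its condition $c_i$, and let $r$ be the task-specific reward. A common way to formalize reward-driven sampling, which may differ from the exact objective used by SyncVC, is a reward-tilted joint distribution:

$$
p^{*}\left(x^{(1)}, \dots, x^{(N)}\right) \;\propto\; \left[\prod_{i=1}^{N} p_\theta\left(x^{(i)} \mid c_i\right)\right] \exp\left(\frac{1}{\lambda}\, r\left(x^{(1)}, \dots, x^{(N)}\right)\right)
$$

The product keeps each element plausible on its own, the exponential factor favors sets of elements that agree, and the temperature $\lambda > 0$ (a hypothetical hyperparameter) sets the trade-off between the two.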

B. Variational Controls

SyncVC introduces control variables at each denoising step. Inspired by the variational control view of steering diffusion trajectories, SyncVC optimizes these controls to align neighboring trajectories under a task-specific consistency reward while regularizing toward the pretrained diffusion prior. This turns synchronization into a test-time optimization problem rather than a manually selected heuristic rule.

[Figure: overview of SyncVC coupling multiple diffusion trajectories with variational controls during denoising.] **Overall mechanism.** Variational controls couple neighboring trajectories so generated elements agree while staying close to the pretrained diffusion prior.

Intuition: SyncVC does not retrain or modify the diffusion model. It only optimizes small steering directions during sampling, one timestep and one neighboring trajectory at a time.
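The trade-off between agreement and closeness to the prior can be illustrated with a toy quadratic problem. This is a minimal sketch under assumed names and an assumed objective, not the SyncVC objective itself: `mu` stands in for the uncontrolled denoising update, `target` for a neighboring trajectory, and `gamma` for a hypothetical regularization weight that penalizes large controls.

```python
import numpy as np

def optimize_control(mu, target, gamma=1.0, lr=0.1, steps=200):
    """Gradient descent on a toy control objective (illustrative only).

    Minimizes ||mu + u - target||^2 + gamma * ||u||^2 over the control u.
    The first term rewards agreement with the neighbor; the second keeps
    the controlled update close to the uncontrolled one.
    """
    u = np.zeros_like(mu)  # zero control reproduces the uncontrolled update
    for _ in range(steps):
        grad = 2.0 * (mu + u - target) + 2.0 * gamma * u
        u -= lr * grad
    return u

mu = np.array([0.0, 1.0])      # uncontrolled update (toy values)
target = np.array([2.0, 1.0])  # neighboring trajectory (toy values)
u = optimize_control(mu, target, gamma=1.0)

# For this quadratic the minimizer is known in closed form.
closed_form = (target - mu) / (1.0 + 1.0)
```

With `gamma=1.0` the control moves the update halfway toward the neighbor. A larger `gamma` shrinks the control toward zero, which is the sense in which the steering stays small, although the step size `lr` would then need to be reduced for the descent to remain stable.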

C. Rewards for Different Tasks

Wide image generation

Neighboring patches are rewarded when they agree in their overlapping regions, encouraging a coherent wide canvas.
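A simple way to express such a reward is a negative mean squared error over the shared columns of two horizontally adjacent patches. The sketch below assumes that form; the function name, the array layout, and the use of a plain pixel-space error are illustrative choices, and the source does not specify whether the reward is computed on latents, predicted clean images, or decoded pixels.

```python
import numpy as np

def overlap_reward(left, right, overlap):
    """Negative mean squared error over the shared columns.

    left, right: arrays of shape (H, W, C) for horizontally adjacent patches.
    overlap: number of columns the two patches share (assumed known).
    """
    left_strip = left[:, -overlap:, :]   # right edge of the left patch
    right_strip = right[:, :overlap, :]  # left edge of the right patch
    return -float(np.mean((left_strip - right_strip) ** 2))

rng = np.random.default_rng(0)
canvas = rng.random((8, 12, 3))
left, right = canvas[:, :8], canvas[:, 4:]  # two windows sharing 4 columns

consistent = overlap_reward(left, right, overlap=4)
shifted = overlap_reward(left, right + 0.5, overlap=4)
```

Patches cut from the same canvas score the maximum reward of zero, while a constant brightness shift of 0.5 in one patch lowers the reward to about -0.25.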

Optical illusion generation

Transformed views are rewarded when they agree visually while still satisfying different prompt conditions.
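For a rotation illusion, visual agreement can be sketched as follows: rotate the base view and compare it with the second view. This is an assumed form of the reward with hypothetical function names. It also assumes square images and that the different prompt conditions are handled by the diffusion prior of each view rather than by the reward.

```python
import numpy as np

def rotate_cw(image):
    """Rotate an (H, W, C) image by 90 degrees clockwise."""
    return np.rot90(image, k=-1, axes=(0, 1))

def illusion_reward(base_view, rotated_view):
    """Negative MSE between the rotated base view and the second view."""
    return -float(np.mean((rotate_cw(base_view) - rotated_view) ** 2))

rng = np.random.default_rng(0)
base = rng.random((4, 4, 3))

aligned = illusion_reward(base, rotate_cw(base))  # views agree under rotation
misaligned = illusion_reward(base, base)          # second view ignores rotation
```

Other transformations, such as a vertical flip, would only require swapping `rotate_cw` for the corresponding array operation.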

3D mesh texturing

Each new rendered view is rewarded for matching the texture implied by previous views of the source mesh.
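One plausible reading of this reward is a masked comparison: the texture accumulated from earlier views is rendered from the current camera, and the new view is scored only on pixels that those earlier views already cover. The sketch below assumes that reading. The renderer, the visibility mask, and all names are hypothetical and are not taken from the source.

```python
import numpy as np

def texture_reward(new_view, reprojected, visible):
    """Negative MSE restricted to pixels already textured by earlier views.

    new_view:    (H, W, C) image generated for the current camera.
    reprojected: (H, W, C) render of the texture accumulated so far, seen
                 from the current camera (the renderer is assumed, not shown).
    visible:     (H, W) boolean mask of pixels covered by earlier views.
    """
    if not visible.any():
        return 0.0  # nothing to agree with yet, e.g. for the first view
    diff = (new_view - reprojected)[visible]  # shape (num_visible, C)
    return -float(np.mean(diff ** 2))

new_view = np.zeros((4, 4, 3))
reprojected = np.ones((4, 4, 3))
visible = np.zeros((4, 4), dtype=bool)
visible[:, :2] = True  # only the left half was seen by earlier views

reward = texture_reward(new_view, reprojected, visible)
```

Pixels outside the mask do not affect the score, so the new view remains free to invent texture for parts of the mesh that no earlier view has covered.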

D. Algorithm: Test-time Optimization Loop

The actual implementation is a simple loop around a pretrained diffusion sampler. At each denoising timestep, SyncVC first samples the leading trajectory normally. Then, for each following trajectory, it optimizes a variational control variable conditioned on the previously synchronized trajectories, and uses that control to take the next denoising step.

[Figure: pseudocode for the SyncVC test-time sampler with greedy sequential optimization of variational controls.] **Practical sampler.** At each denoising step, controls are zero-initialized, optimized sequentially, and used to steer each trajectory toward a coherent solution.
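The structure of this loop can be sketched with a toy sampler. Everything below is an illustrative stand-in rather than the authors' implementation: `denoise_step` replaces a pretrained diffusion sampler with a deterministic pull toward a fixed mean, the reward is a squared distance to the immediately preceding trajectory, and `gamma` is a hypothetical weight on the control penalty.

```python
import numpy as np

def denoise_step(x, t, prior_mean):
    """Toy stand-in for one step of a pretrained sampler (hypothetical)."""
    return x + (prior_mean - x) / t  # pulls x toward prior_mean; exact at t == 1

def optimize_control(mu, anchor, gamma, lr=0.1, steps=50):
    """Zero-initialized control fitted by gradient descent.

    Objective: ||mu + u - anchor||^2 + gamma * ||u||^2, i.e. agree with the
    already synchronized neighbor while staying near the uncontrolled step.
    """
    u = np.zeros_like(mu)
    for _ in range(steps):
        u -= lr * (2.0 * (mu + u - anchor) + 2.0 * gamma * u)
    return u

def sync_sample(prior_means, num_steps=10, gamma=1.0, seed=0):
    """Greedy sequential synchronization of several toy trajectories."""
    rng = np.random.default_rng(seed)
    xs = [rng.standard_normal(m.shape) for m in prior_means]
    for t in range(num_steps, 0, -1):
        # Leading trajectory: ordinary, uncontrolled denoising step.
        xs[0] = denoise_step(xs[0], t, prior_means[0])
        # Following trajectories: one at a time, each conditioned on the
        # neighbor that was synchronized just before it.
        for i in range(1, len(xs)):
            mu = denoise_step(xs[i], t, prior_means[i])
            u = optimize_control(mu, xs[i - 1], gamma)
            xs[i] = mu + u
    return xs

xs = sync_sample([np.zeros(2), np.full(2, 2.0)])
```

In this toy run the leading trajectory ends at its own prior mean, and the following trajectory ends halfway between its prior mean and the leader, which shows the compromise between consistency and fidelity to the prior. Conditioning only on the immediate neighbor is a simplification; the text above describes conditioning on the previously synchronized trajectories more generally.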

Experimental Results

Wide Image Generation

SyncVC improves wide image generation by maintaining style and color consistency across horizontally arranged patches. Compared with baselines, it reduces discontinuities, object color shifts, and boundary artifacts.

Flexible Control with Additional Constraints

Because SyncVC is formulated through rewards and controls, it can naturally incorporate additional constraints such as style guidance. For stylized wide image generation, a style-transfer reward guides texture and color while preserving the prompt semantics.

[Figure: stylized wide image generation results with a style-transfer reward.] **Stylized wide image generation.** **SyncVC** transfers texture and overall color from the style reference while preserving prompt semantics without artifacts.
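Adding a constraint can be as simple as adding a weighted term to the reward. The sketch below assumes a weighted sum and uses a deliberately crude style score, the gap between per-channel mean colors, purely for illustration. The style-transfer reward used in the experiments is not specified here and would typically rely on richer statistics. All names and the `style_weight` parameter are hypothetical.

```python
import numpy as np

def overlap_term(left, right, overlap):
    """Negative MSE over the columns shared by two adjacent patches."""
    return -float(np.mean((left[:, -overlap:] - right[:, :overlap]) ** 2))

def style_term(image, style_ref):
    """Toy style score: negative squared gap between per-channel mean colors."""
    gap = image.mean(axis=(0, 1)) - style_ref.mean(axis=(0, 1))
    return -float(np.sum(gap ** 2))

def combined_reward(left, right, overlap, style_ref, style_weight=0.5):
    """Consistency reward plus a weighted style reward for both patches."""
    consistency = overlap_term(left, right, overlap)
    style = style_term(left, style_ref) + style_term(right, style_ref)
    return consistency + style_weight * style

left = np.zeros((4, 6, 3))
right = np.zeros((4, 6, 3))

matching_style = combined_reward(left, right, 2, np.zeros((2, 2, 3)))
different_style = combined_reward(left, right, 2, np.ones((2, 2, 3)))
```

Because the control optimization only sees a scalar reward, swapping in or reweighting terms like these leaves the sampling loop itself unchanged.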

Additional Comparisons

[Figure: additional qualitative wide image comparisons between SyncVC and synchronization baselines.] **Additional comparisons for wide image generation.** **SyncVC** maintains unified color and style in both examples; baselines show mountain/sky shifts, inconsistent tree and flower colors, or discontinuities.

Additional Results

[Figure: additional wide image examples generated by SyncVC across diverse prompts.] **Additional wide image generation results with Stable Diffusion.** Across diverse prompts, **SyncVC** generates wide images with strong style consistency.

[Figure: SANA-based SyncVC wide image result at 4096 × 1024 resolution.] **Wide image generation with the SANA model (4096 × 1024 resolution).** With the pretrained SANA model, **SyncVC** synthesizes high-resolution wide images from 1024² patches.

[Figure: SANA-based SyncVC wide image result at 8192 × 2048 resolution.] **Wide image generation with the SANA model (8192 × 2048 resolution).** Using 2048² SANA patches, **SyncVC** extends generation horizontally to ultra-high resolution.

Optical Illusion Generation

SyncVC produces images that clearly support multiple semantic interpretations under transformations such as rotation and vertical flip, while maintaining high visual quality.

[Figure: additional optical illusion examples generated by SyncVC under multiple transformations.] **Additional results for optical illusion generation.** **SyncVC** encodes the semantics of both prompts under rotation, producing table/waterfall and horse/snowy-village interpretations.

Text-guided 3D Mesh Texturing

SyncVC synchronizes multiple generated views during text-guided 3D mesh texturing, producing more realistic and detailed textures with fewer artifacts.

[Figure: additional SyncVC 3D mesh texturing results with realistic fine-grained object textures.] **Additional results for text-guided 3D mesh texturing.** **SyncVC** generates realistic, artifact-free textures across diverse meshes and text prompts.

Understanding the Roles of Controls

At early timesteps (large $t$), the controls focus on shaping the overall semantics of an image to satisfy the optimization objective, whereas at later timesteps (small $t$), they progressively manipulate fine-grained details.

[Figure: visualization of optimized SyncVC controls across denoising steps, from coarse structure to fine details.] **Optimized controls.** Under clockwise rotation, controls are optimized to align the prompts "an oil painting of a horse" and "an oil painting of a snowy mountain village".

Citation

@article{lee2026variational,
  title={Variational Test-time Optimization for Diffusion Synchronization},
  author={Hyunsoo Lee and Farrin Marouf Sofian and Kushagra Pandey and Stephan Mandt},
  journal={arXiv preprint arXiv:2606.15614},
  year={2026}
}