High-resolution content creation is rapidly emerging as a central challenge in both the vision and graphics communities. Images are the most fundamental modality for visual expression, and content generation that aligns with user intent requires effective, controllable mechanisms for high-resolution image manipulation. However, existing approaches remain limited to low-resolution settings, typically supporting resolutions only up to 1K. In this work, we introduce the task of high-resolution image editing and propose a test-time optimization framework to address it. Our method performs patch-wise optimization on high-resolution source images, followed by a fine-grained detail transfer module and a novel synchronization strategy to maintain consistency across patches. Extensive experiments show that our method produces high-quality edits, facilitating high-resolution content creation.
High-resolution image editing should preserve the source image's fine-grained details while applying the user's intended semantic change. Most existing editing methods, however, are designed for lower-resolution inputs and are difficult to apply directly to 2K or higher-resolution images.
A straightforward approach is to edit a downsampled image and then apply super-resolution. This separates the edit from the original high-resolution source: the super-resolution model only observes the low-resolution edited result, so it may struggle to recover source-specific details.
ScaleEdit instead uses the low-resolution edited image to capture the semantic change, then transfers fine-grained details from the high-resolution source during generation. The final output is a high-resolution image that follows the intended edit while preserving the visual fidelity of the original source.
ScaleEdit first divides the high-resolution source image into patches at the model-native resolution, allowing it to fully inherit the pretrained model's strong generative priors. Overall, our method generates detail-enhanced patches through patch-wise test-time optimization and refines them with a synchronization strategy for cross-patch consistency.
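The patch division described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `patch_starts` and `patch_boxes`, the 1024-pixel patch size, and the `min_overlap` parameter are all assumptions made for the example, since the source only states that patches are taken at the model-native resolution.

```python
from typing import List, Tuple


def patch_starts(length: int, patch: int, min_overlap: int = 0) -> List[int]:
    """Start offsets of patches of size `patch` that cover [0, length).

    Hypothetical helper: the overlap policy is an assumption, not from the source.
    """
    if length <= patch:
        return [0]
    stride = patch - min_overlap
    if stride <= 0:
        raise ValueError("min_overlap must be smaller than patch")
    # Ceiling division: how many strides are needed to reach the far border.
    count = -(-(length - patch) // stride) + 1
    # Spread the starts evenly so the last patch ends exactly at `length`.
    return [round(i * (length - patch) / (count - 1)) for i in range(count)]


def patch_boxes(
    height: int, width: int, patch: int, min_overlap: int = 0
) -> List[Tuple[int, int, int, int]]:
    """Return (top, left, bottom, right) boxes tiling a height x width image."""
    return [
        (top, left, min(top + patch, height), min(left + patch, width))
        for top in patch_starts(height, patch, min_overlap)
        for left in patch_starts(width, patch, min_overlap)
    ]


# Example: a 2048 x 2048 source with an assumed 1024-pixel native resolution.
boxes = patch_boxes(2048, 2048, patch=1024, min_overlap=128)
```

Spreading the start offsets evenly keeps every patch at the full native size, so each crop matches the resolution the pretrained model expects, at the cost of a larger overlap than the requested minimum.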
ScaleEdit uses the low-resolution edit to capture the desired semantic change. For each high-resolution patch, it learns a transfer function that maps the low-resolution source toward the corresponding high-resolution details. The learned transfer function is then applied during the reverse process that generates the target image, injecting source-specific fine-grained details into the high-resolution output.
High-resolution images are processed patch by patch, which can create visible seams when patches are generated independently. ScaleEdit synchronizes neighboring patches during denoising by blending their Tweedie estimates. As a result, adjacent patches agree on details and transitions before final assembly, producing coherent high-resolution target images without visible boundary artifacts.
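One simple way to blend overlapping per-patch estimates is to average them on a shared canvas and then re-crop each patch, as in the sketch below. The source does not specify the blending weights or where in the denoising loop the blend occurs, so the uniform averaging, the function name `blend_patch_estimates`, and the plain nested-list image format are all assumptions made for illustration. The `estimates` argument stands in for the per-patch Tweedie (clean-image) estimates at one denoising step.

```python
from typing import List, Tuple

Image = List[List[float]]
Box = Tuple[int, int, int, int]  # (top, left, bottom, right)


def blend_patch_estimates(
    estimates: List[Image], boxes: List[Box], height: int, width: int
) -> Tuple[Image, List[Image]]:
    """Average overlapping patch estimates, then re-crop consistent patches.

    Assumption: uniform weights. The actual method may weight pixels differently.
    """
    total = [[0.0] * width for _ in range(height)]
    weight = [[0.0] * width for _ in range(height)]

    # Accumulate every patch estimate at its location on the full canvas.
    for patch, (top, left, bottom, right) in zip(estimates, boxes):
        for y in range(top, bottom):
            for x in range(left, right):
                total[y][x] += patch[y - top][x - left]
                weight[y][x] += 1.0

    # Normalize by how many patches covered each pixel.
    canvas = [
        [t / w if w > 0 else 0.0 for t, w in zip(t_row, w_row)]
        for t_row, w_row in zip(total, weight)
    ]

    # Re-crop: overlapping regions are now identical across neighbors.
    synced = [
        [row[left:right] for row in canvas[top:bottom]]
        for (top, left, bottom, right) in boxes
    ]
    return canvas, synced


# Toy example: two 2 x 4 patches overlapping by two columns on a 2 x 6 canvas.
patch_a = [[0.0] * 4 for _ in range(2)]
patch_b = [[1.0] * 4 for _ in range(2)]
toy_boxes = [(0, 0, 2, 4), (0, 2, 2, 6)]
canvas, synced = blend_patch_estimates([patch_a, patch_b], toy_boxes, 2, 6)
```

In the toy example the two patches disagree in the shared columns before blending and hold the same averaged values afterward, which is the property that removes a seam at the patch boundary.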
ScaleEdit preserves fine-grained source details more faithfully than edit-then-super-resolution baselines, which reconstruct high-resolution target images from the edited low-resolution image alone.
We provide additional qualitative results for ScaleEdit across multiple resolutions. The results show that ScaleEdit effectively transfers fine-grained source details to the target image.
ScaleEdit also supports 8K image editing without additional training, demonstrating strong scalability. Its patch-wise detail transfer enables high-resolution edits beyond the limits of standard image editing pipelines.
We compare ScaleEdit with and without synchronization. The zoomed crops highlight patch consistency and detail preservation: with synchronization, neighboring patches form smoother transitions and avoid boundary artifacts.
[Figure: ScaleEdit with and without synchronization. Panels per example: Source, w/o Sync., Ours, w/o Sync. (Zoomed), Ours (Zoomed).]

| Source resolution | Editing prompt |
|---|---|
| 1K | "Replace the dragon with a jet plane in the sky." |
| 1K | "Replace the giraffe with a dragon under the lightning sky." |
| 1K | "Replace the mountain with a volcano under the castle." |
| 1K | "Replace the feather with a bird flying through the flames." |
| 2K | "Replace the dragon with a phoenix flying over the mountain." |
| 2K | "Replace the bridge with a stone bridge with lanterns." |
@InProceedings{Lee_2026_CVPR,
author = {Lee, Junsung and Lee, Hyunsoo and Lee, Yong Jae and Han, Bohyung},
title = {Low-Resolution Editing is All You Need for High-Resolution Editing},
booktitle = {CVPR},
year = {2026}
}