Flexible Visual Editing via Multimodal Instruction Following
Shufan Li¹*, Harkanwar Singh¹, Aditya Grover¹
¹University of California, Los Angeles


Fine-grained control over generating and editing visual imagery has profound implications for computer vision and its applications. Previous works have extended controllability in two directions: instruction tuning with text-based prompts and multi-modal conditioning. However, these works make one or more unnatural assumptions about the number or type of input modalities used to express control. We propose InstructAny2Pix, a flexible multi-modal instruction-following system that lets users edit an input image using instructions involving audio, images, and text. InstructAny2Pix consists of three building blocks: a multi-modal encoder that encodes different modalities, such as images and audio, into a unified latent space; a diffusion model that learns to decode representations in this latent space into images; and a multi-modal LLM that understands instructions involving multiple images and audio pieces and generates a conditional embedding of the desired output, which the diffusion decoder then consumes. To improve training efficiency and generation quality, we also include a refinement prior module that enhances the visual quality of LLM outputs. These designs are critical to the performance of our system. We demonstrate that our system can perform a series of novel instruction-guided editing tasks.
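
The data flow described above (encoder, multi-modal LLM, refinement prior, diffusion decoder) can be sketched as a toy pipeline. This is only an illustration of how the four components connect, not the InstructAny2Pix implementation: every function name, the latent size DIM, and the hash-based "encoder" are assumptions made for the sketch, and each stage is a deterministic stand-in for a learned model.

```python
import hashlib
import math
from typing import List, Sequence

DIM = 8  # toy latent dimensionality (assumption, not from the paper)
Vector = List[float]


def _normalize(v: Vector) -> Vector:
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def encode(modality: str, payload: bytes) -> Vector:
    """Stand-in for the multi-modal encoder: any modality -> one shared space."""
    digest = hashlib.sha256(modality.encode() + b":" + payload).digest()
    return _normalize([b / 255.0 - 0.5 for b in digest[:DIM]])


def llm_predict(instruction: str, inputs: Sequence[Vector]) -> Vector:
    """Stand-in for the multi-modal LLM: instruction + inputs -> target embedding."""
    text_vec = encode("text", instruction.encode())
    # Average the instruction vector with the input embeddings, per dimension.
    return [sum(col) / (len(inputs) + 1) for col in zip(text_vec, *inputs)]


def refine(embedding: Vector) -> Vector:
    """Stand-in for the refinement prior: clean up the raw LLM output."""
    return _normalize(embedding)


def decode(embedding: Vector, size: int = 4) -> List[List[float]]:
    """Stand-in for the diffusion decoder: embedding -> size x size 'image'."""
    return [[embedding[(r * size + c) % DIM] for c in range(size)]
            for r in range(size)]


def edit(instruction: str, image: bytes, audio: bytes) -> List[List[float]]:
    """Run the full toy pipeline on one image and one audio clip."""
    inputs = [encode("image", image), encode("audio", audio)]
    return decode(refine(llm_predict(instruction, inputs)))


result = edit("Add [audio] to [image]", b"fake-image-bytes", b"fake-audio-bytes")
```

The point of the sketch is the interface: the LLM never emits pixels, only an embedding in the same latent space the encoder writes to, so the decoder can be trained independently of the instruction-following stage.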


Add [clock] to [image]
Add [water] to [image]
Add [dog] to [image]
Remove [siren] from [image]
Remove [rain] from [image]
Fit [image] to [upbeat electronic music]
Fit [image] to [mysterious ambient sound]
Fit [image] to the art style of [anime]
Fit [image] to the art style of [painting]
Add [birds] to foreground and [waterfall] to background of [image]
Remove [rain] and add [thunder] to [image]
Replace [scientist] with [witch] in [image]
Add [wolf] and [moon] to [image]
In the soft glow of a virtual sunset, Lucas and Sophia found themselves together. The sound of waves crashing against the shore filled their ears as they watched the sun dip below the horizon, casting a warm and ethereal light upon the virtual world around them. In that serene moment, as the virtual world faded away, their real hearts beat in harmony, and their connection transcended the confines of technology, igniting a spark of love that neither of them could deny. Make [image] fit the story
Bathed in a warm spotlight, her eyes revealed a quiet resolve. She drew in a deep breath and began to play, her fingers moving effortlessly across the instrument. The music swelled, filling the space with a poignant, emotive melody that touched every heart in attendance. As the final note hung in the air, the audience erupted in thunderous applause, and tears of joy glistened in the musician's eyes. Years of dedication and hard work had finally borne fruit, and she had achieved the success she had longed for, in the most exquisite manner. Make the [image] fit the story, considering the music she plays is [audio].
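
The example instructions above reference non-text inputs through bracketed placeholders such as [image] or [rain]. A minimal sketch of how such placeholders could be extracted and paired with user-supplied files follows. The bracket-parsing convention, the function names, and the file paths are all hypothetical; the source does not specify how instructions are tokenized.

```python
import re
from typing import Dict, List, Tuple

# Matches one bracketed placeholder, e.g. "[image]" or "[upbeat electronic music]".
_PLACEHOLDER = re.compile(r"\[([^\[\]]+)\]")


def extract_placeholders(instruction: str) -> List[str]:
    """Return the bracketed placeholder names in order of appearance."""
    return [name.strip() for name in _PLACEHOLDER.findall(instruction)]


def bind_inputs(instruction: str,
                attachments: Dict[str, str]) -> List[Tuple[str, str]]:
    """Pair each placeholder with an attachment (hypothetical file paths)."""
    names = extract_placeholders(instruction)
    missing = [n for n in names if n not in attachments]
    if missing:
        raise KeyError(f"no attachment supplied for: {missing}")
    return [(n, attachments[n]) for n in names]


pairs = bind_inputs(
    "Remove [rain] and add [thunder] to [image]",
    {"rain": "rain.wav", "thunder": "thunder.wav", "image": "street.png"},
)
```

Keeping the placeholders in order of appearance preserves which input plays which role in the instruction, for instance what is removed versus what is added.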