[Figure: LaViDa-O demo panels. The panels show:
- Text-to-image speed: LaViDa-O (18 s/image) vs. BAGEL (45 s/image).
- Object detection: LaViDa-O (ours; 1-step detection on the query "a cute dog [mask][mask][mask][mask] a boy [mask][mask][mask][mask] ships [mask][mask][mask][mask]") vs. Qwen2.5-VL (autoregressive).
- Text-to-Image with Planning, Image-Editing with Planning, and Self-Reflection.
*Demo accelerated for better viewing experience.]