Stability AI releases Stable Diffusion XL, its next-gen image synthesis model

Several examples of images generated using Stable Diffusion XL 1.0. (Image credit: Stable Diffusion)

On Wednesday, Stability AI released Stable Diffusion XL 1.0 (SDXL), its next-generation open weights AI image synthesis model. It can generate novel images from text descriptions and produces more detail and higher-resolution imagery than previous versions of Stable Diffusion.

As with Stable Diffusion 1.4, which made waves last August with an open source release, anyone with the proper hardware and technical know-how can download the SDXL files and run the model locally on their own machine for free.

Local operation means that there is no need to pay for access to the SDXL model, there are few censorship concerns, and the weights files (which contain the neural network data that makes the model function) can be fine-tuned by hobbyists to generate specific types of imagery in the future.

For example, with Stable Diffusion 1.5, the default model (trained on a scrape of images downloaded from the Internet) can generate a broad range of imagery, but it doesn't perform as well with more niche subjects. To make up for that, hobbyists fine-tuned SD 1.5 into custom models (and later, LoRA models) that improved Stable Diffusion's ability to generate certain aesthetics, including Disney-style art, anime art, landscapes, bespoke pornography, images of famous actors or characters, and more. Stability AI expects that community-driven development trend to continue with SDXL, allowing people to extend its rendering capabilities far beyond the base model.
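For readers curious what applying one of those community fine-tunes looks like in practice, here is a minimal sketch using Hugging Face's diffusers library (one common way to script Stable Diffusion, not Stability's own tooling). The LoRA repository name below is a placeholder for illustration, not a real fine-tune:

    # Hypothetical sketch: applying a community LoRA on top of SDXL via the
    # diffusers library. "some-user/sdxl-anime-lora" is a placeholder name.
    import torch
    from diffusers import StableDiffusionXLPipeline

    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16,
        variant="fp16",
    ).to("cuda")
    pipe.load_lora_weights("some-user/sdxl-anime-lora")  # placeholder repo
    image = pipe("an anime-style mountain village at dusk").images[0]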

Upgrades under the hood

Like other latent diffusion image generators, SDXL starts with random noise and "recognizes" images in the noise based on guidance from a text prompt, refining the image step by step. But SDXL utilizes a "three times larger UNet backbone," according to Stability, with more model parameters to pull off its tricks than earlier Stable Diffusion models. In plain language, that means the SDXL architecture does more processing to get the resulting image.

To generate images, SDXL uses an "ensemble of experts" architecture that guides a latent diffusion process. Ensemble of experts refers to a methodology where a single model is trained first and then split into specialized models, each further trained for a different stage of the generation process, which improves image quality. In this case, there is a base SDXL model and an optional "refiner" model that can run after the initial generation to make images look better.
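To make that two-stage handoff concrete, here is a minimal sketch using Hugging Face's diffusers library (again, one popular way to script SDXL rather than Stability's own code). The base model handles most of the denoising steps, then passes its latents to the refiner for the final stretch:

    import torch
    from diffusers import DiffusionPipeline

    # Load the base model and the optional refiner, sharing components to save VRAM.
    base = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
    ).to("cuda")
    refiner = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-refiner-1.0",
        text_encoder_2=base.text_encoder_2, vae=base.vae,
        torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
    ).to("cuda")

    prompt = "a lighthouse on a stormy coast, dramatic lighting"

    # The base "expert" runs the first 80 percent of the denoising steps...
    latents = base(prompt=prompt, num_inference_steps=40,
                   denoising_end=0.8, output_type="latent").images
    # ...and the refiner "expert" finishes the remaining steps.
    image = refiner(prompt=prompt, num_inference_steps=40,
                    denoising_start=0.8, image=latents).images[0]
    image.save("lighthouse.png")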

Stable Diffusion XL includes two text encoders that can be combined. In this example by Xander Steenbrugge, an elephant and an octopus combine seamlessly into one concept.

Notably, SDXL also uses two different text encoders that make sense of the written prompt, helping to pinpoint associated imagery encoded in the model weights. Users can provide a different prompt to each encoder, resulting in novel, high-quality concept combinations. On Twitter, Xander Steenbrugge showed an example of a combined elephant and an octopus using this technique.
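In scripting terms, diffusers exposes the two encoders as separate prompt parameters. Assuming the base pipeline from the earlier sketch, an elephant-octopus blend in the spirit of Steenbrugge's example might look like this:

    # "prompt" feeds the first text encoder and "prompt_2" the second,
    # letting two concepts blend into a single image.
    image = base(prompt="a photo of an elephant",
                 prompt_2="an octopus with curling tentacles").images[0]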

And then there are improvements in image detail and size. While Stable Diffusion 1.5 was trained on 512×512 pixel images (making that the optimal image generation size but lacking detail for small features), Stable Diffusion 2.x increased that to 768×768. Now, Stability AI recommends generating 1024×1024 pixel images with Stable Diffusion XL, resulting in greater detail than an image of similar size generated by SD 1.5.
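In diffusers, that recommended size is the default for SDXL, though it can also be set explicitly (assuming the base pipeline from the earlier sketch):

    # SDXL is trained around 1024×1024; much smaller sizes tend to degrade quality.
    image = base(prompt="a red fox in tall grass, golden hour",
                 width=1024, height=1024).images[0]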

Local control, open philosophy

We downloaded the Stable Diffusion XL 1.0 model and ran it locally on a Windows machine using an RTX 3060 GPU with 12GB of VRAM. Interfaces such as ComfyUI and AUTOMATIC1111's Stable Diffusion web UI make the process more user-friendly than when Stable Diffusion first launched last year, but it still requires some technical finagling to get it working. If you want to try it, this tutorial can point you in the right direction.
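For script-based setups on similarly modest cards, the diffusers library offers a couple of memory savers, which trade a little speed for a smaller VRAM footprint (again assuming the base pipeline from the sketch above):

    # Call instead of .to("cuda"): submodules move to the GPU only while active.
    base.enable_model_cpu_offload()
    # Decode the large 1024×1024 latents in tiles to reduce peak memory.
    base.enable_vae_tiling()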

Overall, we saw image generations with a dreamlike quality, angling more toward the style of commercial AI image generator Midjourney. SDXL shines by providing greater detail in larger image sizes, as mentioned above. It also seems to follow prompts with more fidelity, although that's debatable.

Other notable improvements include slightly better hand rendering than in previous SD models and better text rendering within images. But as with earlier models, generating quality images is still like pulling a slot machine lever and hoping for a good result. Experts find that careful prompting (and lots of trial and error) is the key to better results.

There are also drawbacks to running SDXL locally on consumer hardware, such as higher memory requirements and slower generation times than with Stable Diffusion 1.x and 2.x. (On our test rig, a 1024×1024 image at 20 steps, Euler Ancestral, CFG 8, rendered in 23.3 seconds for SD 1.5 and 26.4 seconds for SDXL 1.0. The resulting SDXL image had fewer repeating elements than the SD 1.5 image.)
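For anyone who wants to reproduce those benchmark settings in a script, here is how they map onto diffusers, assuming the base pipeline from the earlier sketch:

    # 20 steps, Euler Ancestral sampler, classifier-free guidance scale of 8.
    from diffusers import EulerAncestralDiscreteScheduler

    base.scheduler = EulerAncestralDiscreteScheduler.from_config(
        base.scheduler.config)
    image = base(prompt="a city street at night, rain, neon signs",
                 num_inference_steps=20, guidance_scale=8.0).images[0]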

So far, SD hobbyists seem to lament that SDXL lacks the numerous fine-tuned models and LoRAs available for SD 1.5 that enhance certain aesthetics (such as a 3D-rendered style) or add more detailed backdrops for certain scenes, but they expect the community will fill in those gaps soon enough.

Community is key where Stable Diffusion is concerned since the model can run locally without oversight. That's a boon to an underground scene of amateur synthographers who utilize the software to craft interesting artwork. But it also means that the software can be used to create deepfakes, pornography, and disinformation. To Stability AI, the trade-off between some negative aspects and openness is worth it.

In a technical report on SDXL listed on arXiv earlier this month, Stability complains that "black box" models (such as OpenAI's DALL-E and Midjourney) that don't let users download the weights "make it challenging to assess the biases and limitations of these models in an impartial and objective way." They further claim that the closed nature of those models "hampers reproducibility, stifles innovation, and prevents the community from building upon these models to further the progress of science and art."

That kind of idealism is likely small comfort for artists who feel threatened by technology that utilizes scrapes of artists' work without permission to train models like SDXL. And it won't quiet the lawsuits over copyright. But even so, despite its ethical issues, image synthesis technology keeps rolling along, and that's exactly the way Stable Diffusion hobbyists like it.