Stability AI releases Stable Diffusion XL, its next-gen image synthesis model

Several examples of images generated using Stable Diffusion XL 1.0. (Image credit: Stable Diffusion)

On Wednesday, Stability AI released Stable Diffusion XL 1.0 (SDXL), its next-generation open weights AI image synthesis model. It can generate novel images from text descriptions and produces more detail and higher-resolution imagery than previous versions of Stable Diffusion.

As with Stable Diffusion 1.4, which made waves last August with an open source release, anyone with the proper hardware and technical know-how can download the SDXL files and run the model locally on their own machine for free.

Local operation means that there is no need to pay for access to the SDXL model, there are few censorship concerns, and the weights files (which contain the neural network data that makes the model function) can be fine-tuned by hobbyists to generate specific types of imagery in the future.

For example, with Stable Diffusion 1.5, the default model (trained on a scrape of images downloaded from the Internet) can generate a broad range of imagery, but it doesn't perform as well with more niche subjects. To make up for that, hobbyists fine-tuned SD 1.5 into custom models (and later, LoRA models) that improved Stable Diffusion's ability to generate certain aesthetics, including Disney-style art, anime art, landscapes, bespoke pornography, images of famous actors or characters, and more. Stability AI expects that community-driven development trend to continue with SDXL, allowing people to extend its rendering capabilities far beyond the base model.
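For readers curious what applying one of those community fine-tunes looks like in practice, here is a minimal sketch using Hugging Face's diffusers library (one common way to script Stable Diffusion, not Stability's own tooling). The LoRA repository name below is a placeholder for illustration, not a real fine-tune:

    # Hypothetical sketch: applying a community LoRA on top of SDXL via the
    # diffusers library. "some-user/sdxl-anime-lora" is a placeholder name.
    import torch
    from diffusers import StableDiffusionXLPipeline

    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16,
        variant="fp16",
    ).to("cuda")
    pipe.load_lora_weights("some-user/sdxl-anime-lora")  # placeholder repo
    image = pipe("an anime-style mountain village at dusk").images[0]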

Upgrades under the hood

Like other latent diffusion image generators, SDXL starts with random noise and "recognizes" images in the noise based on guidance from a text prompt, refining the image step by step. But SDXL utilizes a "three times larger UNet backbone," according to Stability, with more model parameters to pull off its tricks than earlier Stable Diffusion models. In plain language, that means the SDXL architecture does more processing to get the resulting image.

To generate images, SDXL uses an "ensemble of experts" architecture that guides a latent diffusion process. Ensemble of experts refers to a methodology where a single model is trained first and then split into specialized models, each further trained for a different stage of the generation process, which improves image quality. In this case, there is a base SDXL model and an optional "refiner" model that can run after the initial generation to make images look better.
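To make that two-stage handoff concrete, here is a minimal sketch using Hugging Face's diffusers library (again, one popular way to script SDXL rather than Stability's own code). The base model handles most of the denoising steps, then passes its latents to the refiner for the final stretch:

    import torch
    from diffusers import DiffusionPipeline

    # Load the base model and the optional refiner, sharing components to save VRAM.
    base = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
    ).to("cuda")
    refiner = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-refiner-1.0",
        text_encoder_2=base.text_encoder_2, vae=base.vae,
        torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
    ).to("cuda")

    prompt = "a lighthouse on a stormy coast, dramatic lighting"

    # The base "expert" runs the first 80 percent of the denoising steps...
    latents = base(prompt=prompt, num_inference_steps=40,
                   denoising_end=0.8, output_type="latent").images
    # ...and the refiner "expert" finishes the remaining steps.
    image = refiner(prompt=prompt, num_inference_steps=40,
                    denoising_start=0.8, image=latents).images[0]
    image.save("lighthouse.png")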

Stable Diffusion XL includes two text encoders that can be combined. In this example by Xander Steenbrugge, an elephant and an octopus combine seamlessly into one concept.

Notably, SDXL also uses two different text encoders that make sense of the written prompt, helping to pinpoint associated imagery encoded in the model weights. Users can provide a different prompt to each encoder, resulting in novel, high-quality concept combinations. On Twitter, Xander Steenbrugge showed an example of a combined elephant and an octopus using this technique.
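In scripting terms, diffusers exposes the two encoders as separate prompt parameters. Assuming the base pipeline from the earlier sketch, an elephant-octopus blend in the spirit of Steenbrugge's example might look like this:

    # "prompt" feeds the first text encoder and "prompt_2" the second,
    # letting two concepts blend into a single image.
    image = base(prompt="a photo of an elephant",
                 prompt_2="an octopus with curling tentacles").images[0]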

And then there are improvements in image detail and size. While Stable Diffusion 1.5 was trained on 512×512 pixel images (making that the optimal image generation size but lacking detail for small features), Stable Diffusion 2.x increased that to 768×768. Now, Stability AI recommends generating 1024×1024 pixel images with Stable Diffusion XL, resulting in greater detail than an image of similar size generated by SD 1.5.
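In diffusers, that recommended size is the default for SDXL, though it can also be set explicitly (assuming the base pipeline from the earlier sketch):

    # SDXL is trained around 1024×1024; much smaller sizes tend to degrade quality.
    image = base(prompt="a red fox in tall grass, golden hour",
                 width=1024, height=1024).images[0]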

Local control, open philosophy

We downloaded the Stable Diffusion XL 1.0 model and ran it locally on a Windows machine using an RTX 3060 GPU with 12GB of VRAM. Interfaces such as ComfyUI and AUTOMATIC1111's Stable Diffusion web UI make the process more user-friendly than when Stable Diffusion first launched last year, but it still requires some technical finagling to get it working. If you want to try it, this tutorial can point you in the right direction.
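For script-based setups on similarly modest cards, the diffusers library offers a couple of memory savers, which trade a little speed for a smaller VRAM footprint (again assuming the base pipeline from the sketch above):

    # Call instead of .to("cuda"): submodules move to the GPU only while active.
    base.enable_model_cpu_offload()
    # Decode the large 1024×1024 latents in tiles to reduce peak memory.
    base.enable_vae_tiling()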

Overall, we saw image generations with a dreamlike quality, angling more toward the style of commercial AI image generator Midjourney. SDXL shines by providing greater detail in larger image sizes, as mentioned above. It also seems to follow prompts with more fidelity, although that's debatable.

Other notable improvements include slightly better hand rendering than in previous SD models and better text rendering within images. But as with earlier models, generating quality images is still like pulling a slot machine lever and hoping for a good result. Experts find that careful prompting (and lots of trial and error) is the key to better results.

There are also drawbacks to running SDXL locally on consumer hardware, such as higher memory requirements and slower generation times than with Stable Diffusion 1.x and 2.x. (On our test rig, a 1024×1024 image at 20 steps, Euler Ancestral, CFG 8, rendered in 23.3 seconds for SD 1.5 and 26.4 seconds for SDXL 1.0. The resulting SDXL image had fewer repeating elements than the SD 1.5 image.)
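For anyone who wants to reproduce those benchmark settings in a script, here is how they map onto diffusers, assuming the base pipeline from the earlier sketch:

    # 20 steps, Euler Ancestral sampler, classifier-free guidance scale of 8.
    from diffusers import EulerAncestralDiscreteScheduler

    base.scheduler = EulerAncestralDiscreteScheduler.from_config(
        base.scheduler.config)
    image = base(prompt="a city street at night, rain, neon signs",
                 num_inference_steps=20, guidance_scale=8.0).images[0]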

So far, SD hobbyists seem to lament that SDXL lacks the numerous fine-tuned models and LoRAs available for SD 1.5 that enhance certain aesthetics (such as a 3D-rendered style) or add more detailed backdrops for certain scenes, but they expect the community will fill in those gaps soon enough.

Community is key where Stable Diffusion is concerned since the model can run locally without oversight. That's a boon to an underground scene of amateur synthographers who utilize the software to craft interesting artwork. But it also means that the software can be used to create deepfakes, pornography, and disinformation. To Stability AI, the trade-off between some negative aspects and openness is worth it.

In a technical report on SDXL listed on arXiv earlier this month, Stability complains that "black box" models (such as OpenAI's DALL-E and Midjourney) that don't let users download the weights "make it challenging to assess the biases and limitations of these models in an impartial and objective way." They further claim that the closed nature of those models "hampers reproducibility, stifles innovation, and prevents the community from building upon these models to further the progress of science and art."

That kind of idealism is likely small comfort for artists who feel threatened by technology that utilizes scrapes of artists' work without permission to train models like SDXL. And it won't quiet the lawsuits over copyright. But even so, despite its ethical issues, image synthesis technology keeps rolling along, and that's exactly the way Stable Diffusion hobbyists like it.