Stable Diffusion XL puts AI-generated visual worlds at your GPU’s command

unequivocal · Thursday at 7:05 PM

Up to now, I have found that Midjourney is about 3 months ahead of Stable Diffusion in terms of prompt coherence, image detail, and image quality. But I haven't compared them for a little while - has anything changed?

euzeka · Thursday at 7:19 PM

SDXL also offers drawbacks when running locally on consumer hardware, such as higher memory requirements and slower generation times

It would be nice to have some approximate numbers here.

Defenestrar · Thursday at 7:22 PM

The blending of two inputs seems to be extremely useful. Often times one wants to have a subject and a scene, for example and it's difficult (in my very limited experience) to merge those well. The county fair is coming up, perhaps I should revisit the Stability installation I played with a year ago and see what can be cooked up in the digital art realm.

I had previously even tried it out for work in synthesizing images for safety presentations - the kind of scenes where you'd never actually set up to shoot for actual safety reasons.

I was also in a training seminar today when HR said they've licensed software to read text in multiple types of AI trained voices for slide and video voiceovers. I remember some of the articles here a while back which introduced me to that tech. Now it's at least one fully fledged commercial product. That alone will save so much time versus the amateur voice acting we have to do for our web-based training sessions. Not to mention kick up the variety and possibilities for keeping it more engaging.

This is the fastest technology singularity I've been alive to witness so far. It's pretty amazing.

archtop · Thursday at 7:23 PM

The first test of any new image-creating AI program seems to be to have it create an image of Nicole Kidman (or a vague resemblance thereof).

Defenestrar · Thursday at 7:23 PM

euzeka said:
It would be nice to have some approximate numbers here.

I suspect it depends to some extent on each user's TPL (total patience level).

Mustachioed Copy Cat · Thursday at 7:29 PM

Linguistic pre-processing seems nice, but was already sorta hacked in with alternating prompts.

The community seems excited about. We’ll see what the checkpoint makers can do with it…

Dadlyedly · Thursday at 7:29 PM

Old Bitsmasher said:
And here we thought that social media had plumbed the depths of human depravity. Gentle reader, you ain't seen nothin' yet.

Every major advancement in media also seems to be an advancement in depravity in general, and I'm not sure how I'm going to feel about it in the next few years. To paraphrase: I just don't feel like I have any choice but to ride this wave while the scientists are busy figuring out how to make it work instead of asking themselves if they should.

Defenestrar · Thursday at 7:30 PM

unequivocal said:
Up to now, I have found that Midjourney is about 3 months ahead of Stable Diffusion in terms of prompt coherence, image detail, and image quality. But I haven't compared them for a little while - has anything changed?

There's a really good summary of recent updates to Stable Diffusion here. Very much worth a read.

Tall Dwarf · Thursday at 7:30 PM

archtop said:
The first test of any new image-creating AI program seems to be to have it create an image of Nicole Kidman (or a vague resemblance thereof).

Nah, its just the author's preference. Tons of examples, Kidman is nowhere to be seen.

Zomboe · Thursday at 7:31 PM

euzeka said:
It would be nice to have some approximate numbers here.

According to the release announcement, "This two-stage architecture allows for robustness in image generation without compromising on speed or requiring excess compute resources. SDXL 1.0 should work effectively on consumer GPUs with 8GB VRAM or readily available cloud instances."

ANNOUNCING SDXL 1.0 — Stability AI

The Stability AI team is proud to release as an open model SDXL 1.0, the next iteration in the evolution of text-to-image generation models. Following the limited, research-only release of SDXL 0.9, the full version of SDXL has been improved to be the world's best open image generation model.

stability.ai

Over time, the community has improved speed and lowered memory requirements for previous versions of Stable Diffusion, so hopefully that will be the case for SDXL as well.

Looking forward to trying this out. I think most people (including myself) have still been using the 512x512 models, so a jump to 1024x1024 is pretty significant!

Defenestrar · Thursday at 7:38 PM

Tall Dwarf said:
Nah, its just the author's preference. Tons of examples, Kidman is nowhere to be seen.

In the past I know Mr. Edwards has used the same prompt (maybe even seeds) for different generations of the software. I wouldn't be surprised if this is the case for the familiarity of certain images in the article series.

Zomboe · Thursday at 7:38 PM

Defenestrar said:
The blending of two inputs seems to be extremely useful. Often times one wants to have a subject and a scene, for example and it's difficult (in my very limited experience) to merge those well. The county fair is coming up, perhaps I should revisit the Stability installation I played with a year ago and see what can be cooked up in the digital art realm.

You might look into the ControlNet stuff. I haven't played with it myself, but supposedly it's good for distinguishing/manipulating aspects like subject vs. background, poses, depth (in 3D), etc.

GitHub - lllyasviel/ControlNet: Let us control diffusion models!

Let us control diffusion models! Contribute to lllyasviel/ControlNet development by creating an account on GitHub.

github.com

Rris · Thursday at 7:39 PM

"A man that holds a sign saying Ars Technica".... and 6 or 7 fingers on each hand.

Defenestrar · Thursday at 7:40 PM

Rris said:
"A man that holds a sign saying Ars Technica".... and 6 or 7 fingers on each hand.

Clearly a natural born Ars forum member. Extra fingers for faster typing.

Civitello · Thursday at 7:48 PM

Old Bitsmasher said:
And here we thought that social media had plumbed the depths of human depravity. Gentle reader, you ain't seen nothin' yet.

I followed the unstable diffusion NSFW discord a while ago and occasionally peek in to see what they are up to. Some interesting stuff for sure. This box has been opened and will never be closed again.

Tall Dwarf · Thursday at 7:53 PM

Civitello said:
I followed the unstable diffusion NSFW discord a while ago and occasionally peek in to see what they are up to. Some interesting stuff for sure. This box has been opened and will never be closed again.

The Internet is for porn, and so is AI. I doubt OP was concerned about NSFW, it seems to be about "stolen" pictures.

benjedwards · Thursday at 7:53 PM

Defenestrar said:
In the past I know Mr. Edwards has used the same prompt (maybe even seeds) for different generations of the software. I wouldn't be surprised if this is the case for the familiarity of certain images in the article series.

It's true, I was attempting to replicate the now-famous "Stable Diffusion lady" from my original Stable Diffusion article last year, but with SDXL.

VelvetGlove · Thursday at 8:07 PM

From the example, looks like stable diffusion still doesn't have hands figured out. Hands are notoriously difficult to render.

Fred Duck · Thursday at 8:08 PM

benjedwards said:
It's true, I was attempting to replicate the now-famous "Stable Diffusion lady" from my original Stable Diffusion article last year, but with SDXL.

I suppose this could be a younger version. She has the same evil eyebrows.

Still, "techinica?" It even takes liberties with specific text prompts?

randomcat · Thursday at 8:21 PM

I just got a local instance running. Taking requests.

Edit: just messing around with very basic prompts for a couple minutes, I can attest that this thing is very powerful, very fast, and has no obvious filters (I'm not going to push too far on that last one)

KWRussell · Thursday at 8:26 PM

Fred Duck said:
I suppose this could be a younger version. She has the same evil eyebrows.

Still, "techinica?" It even takes liberties with specific text prompts?

Clear text with an obvious typo is streets ahead of older versions of SD, which usually produce text that looks like worn-off labels from Amazon Scrabble-tile-jumble brands.

NotEntirelySirius · Thursday at 8:52 PM

Dadlyedly said:
Every major advancement in media also seems to be an advancement in depravity in general, and I'm not sure how I'm going to feel about it in the next few years. To paraphrase: I just don't feel like I have any choice but to ride this wave while the scientists are busy figuring out how to make it work instead of asking themselves if they should.

I can't wait for the images of a certain would be president being french kissed by another inmate with spider web tattoos to start showing up. Is that the kind of depravity you meant?

fosborb · Thursday at 8:56 PM

BobbyBobberson said:
Why is it so easy to tell that an image is AI generated?

focal point on the photorealistic stuff is an easy tell, but in general, the individual elements of AI generated images rarely feel intentionally composed

Dadlyedly · Thursday at 9:18 PM

NotEntirelySirius said:
I can't wait for the images of a certain would be president being french kissed by another inmate with spider web tattoos to start showing up. Is that the kind of depravity you meant?

No, I meant depravity I wouldn't enjoy seeing.

MNP · Thursday at 9:22 PM

I wish I had the money for this. Main girl on the image looks like an eviler version of late 90s Nicole Kidman.

Dapd Funk · Thursday at 9:26 PM

One might expect a difference between AI snd human art because the human learns from the universe of data differently from the AIs. Humans do not do statistical averaging but rather use induction. The human output is derived by combining and particularizing from these universals, rather than as a statistic of a dataset.

randomcat · Thursday at 9:27 PM

MNP said:
I wish I had the money for this. Main girl on the image looks like an eviler version of late 90s Nicole Kidman.

It requires a decent rig to run fast, but it's free and will probably run on anything if you have the patience.

Toastr · Thursday at 9:31 PM

Looking at Stable Diffusion's announcement, it appears that AMD GPU's are now supported on Linux (albeit with a much higher RAM requirement). As somebody that hasn't been all that closely following this space, is that AMD support likely to come to Windows in the foreseeable future, or am I going to have to spin up a Linux distro on my PC if I want to try this out?

Fatesrider · Thursday at 9:38 PM

euzeka said:
It would be nice to have some approximate numbers here.

You need both decent GPU/CPU speeds and lots of vRAM as well as regular RAM.

Yeah, not numbers, but this runs differently than a chat AI, which needs vRAM for best results.

As a comparison, I used Stable Diffusion on my old rig with an AMD 2700X, 32 GB RAM (probably 5400?) with an NVidia 2070 (8 GB vRAM), and it would take up to 10 minutes to run a batch of 64 500X500 images on default settings.

I loaded EasyDiffusion into my new Linux computer with an AMD 5900X with 64 GB (6000 something) RAM, Nvidia 3060 (12 GB vRAM) and it whips through a batch of 64 500X500 images on default settings in about a minute.

Times will vary depending on iterations, and engine,

So numbers are REALLY subjective and entirely based on your system's capacity. It's not that it WON'T run. It's that some things take longer to run, even on the same system, depending on the settings.

I'll probably have to wait for the Linux community to bring this to my system, because the information for getting it to run is Windows-centric, but I'll keep my eye out for that. It's entertaining when I'm bored, and it might help me with cover art (I do my own anyhow using other programs, so I'm not putting a starving artists out on the streets by doing that).

But I'll still probably resort to my CGI models and program to get it done anyhow. In the end, AI art is just an entertaining distraction for me.

Cheefachi · Thursday at 9:47 PM

How could a text engine that was explicitly told to show a sign with "Ars Technica" actually get the spelling wrong??

marsilies · Thursday at 10:02 PM

Cheefachi said:
How could a text engine that was explicitly told to show a sign with "Ars Technica" actually get the spelling wrong??

Because the image generation doesn't understand text. It knows the general patterns of text, but it doesn't understand what individual letters are or what words are, and such.

This is actually an improvement. I beleive before, anything that was supposed to be text just tended to be gibberish, often with nonsense symbols.

View: https://www.reddit.com/r/StableDiffusion/comments/112z3pt/eli5_please_why_does_ai_struggle_producing_text/

MNP · Thursday at 10:02 PM

randomcat said:
It requires a decent rig to run fast, but it's free and will probably run on anything if you have the patience.

All I have these days is an S22 but honestly makes me want to make it a medium term project.

ScifiGeek · Thursday at 10:05 PM

randomcat said:
It requires a decent rig to run fast, but it's free and will probably run on anything if you have the patience.

Isn't there a limit on the resolution of images generated, based on the amount of VRAM.

Anyone know the limit for 8GB of VRAM?

marsilies · Thursday at 10:18 PM

Defenestrar said:
In the past I know Mr. Edwards has used the same prompt (maybe even seeds) for different generations of the software. I wouldn't be surprised if this is the case for the familiarity of certain images in the article series.

I think that's likely, for example, from a September 2022 article:

With Stable Diffusion, you may never believe what you see online again

AI image synthesis goes open source, with big implications.

arstechnica.com

Benovite · Thursday at 10:27 PM

I'm gonna be the odd man out and say that there's been a regression instead of progression with what I'm seeing here. Although I gather it's perhaps easier to make worse AI-generated images than before? Is that the gist?

Trying to quantify all of this while laughing at the giant hand with backwards fingers. That's using an older version though right? ¯\(ツ)/¯
(my lack of understanding any of this; entirely possible)

A harmless pie · Thursday at 10:32 PM

BobbyBobberson said:
Why is it so easy to tell that an image is AI generated?

Because we've barely scratched the surface of what models like these are really capable of. In the blink of an eye, we've gone from barely coherent collections of almost human elements to fantastically detailed and coherent images of virtually anything you can type out. This tech is moving extremely fast, so it won't be long before wonky hands are no longer a tell.

SubWoofer2 · Thursday at 10:44 PM

I just got a local instance running. Taking requests.

Hi Randomcat, may I offer a suggestion on that, please? One of the ways this has been played with, has been to create "what happened outside the photo" in relation to LP covers. A few months back someone used a previous generative engine to show the wide view of Roxy Music's "Stranded" album cover (it showed more rocks and shoreline, who knew?). Perhaps a wide view of a favourite album cover of yours?

Stable Diffusion XL puts AI-generated visual worlds at your GPU’s command

Ars Praefectus

Ars Praefectus

Ars Legatus Legionis

Ars Scholae Palatinae

Ars Legatus Legionis

Ars Praefectus

Ars Tribunus Militum

Ars Legatus Legionis

Ars Centurion

Ars Tribunus Militum

Ars Legatus Legionis

Ars Tribunus Militum

Wise, Aged Ars Veteran

Ars Legatus Legionis

Ars Praetorian

Ars Centurion

Smack-Fu Master, in training

Ars Scholae Palatinae

Ars Praefectus

Ars Scholae Palatinae

Ars Centurion

Wise, Aged Ars Veteran

Ars Scholae Palatinae

Ars Tribunus Militum

Ars Scholae Palatinae

Wise, Aged Ars Veteran

Ars Scholae Palatinae

Ars Tribunus Militum

Ars Legatus Legionis

Wise, Aged Ars Veteran

Ars Tribunus Angusticlavius

Ars Scholae Palatinae

Ars Legatus Legionis

Ars Tribunus Angusticlavius

Smack-Fu Master, in training

Wise, Aged Ars Veteran

Ars Scholae Palatinae