I’m Sold On StableSwarmUI

Good Morning from my Robotics Lab! This is Shadow_8472 and I’ve made up my mind on StableSwarmUI as a replacement for A1111. Let’s get started!

Generative AI (Artificial Intelligence) is the technology buzzword of the decade so far, thanks to open-source models. Automatic1111 has an extensive community library, but ComfyUI’s flexibility may yet challenge it as the next favorite. While StableSwarmUI isn’t yet polished to A1111’s visual aesthetic, a total AI noob should find it navigable while still letting him or her peek at Comfy beneath.

Learning ComfyUI Basics

I’m taking that peek… ComfyUI looks like boxes and spaghetti. The correct term is “workflow.” Each node represents some unit of work, much like a control in any other UI. The power of Comfy is the ability to arbitrarily link and re-arrange nodes. Once my first impression (intimidation) wore off, I found that grouping familiar options by node and color coding their connections made the basic workflow more intuitive while highlighting my gaps in understanding of the Stable Diffusion process.

Let’s define some terms before continuing. Be warned: I’m still working on my intuition, so don’t quote me on this.

  • Latent Space: data structure for concepts trained by [counter]examples. Related concepts are stored close to each other for interpolation between them.
  • Latent Image: an image represented as a point in a latent space.
  • Model/Checkpoint: save files for a latent space. From what I can tell: checkpoints can be trained further, but finished models are more flexible.
  • CLIP: (Contrastive Language-Image Pretraining) a part of the model that turns text into concepts.
  • Sampler: explores the model’s latent space for a given number of “steps” with respect to concepts specified in the CLIP conditioning as well as additional sliders.
  • VAE: (Variational Autoencoder) a model that translates images to and from latent space.

The basic Stable Diffusion workflow starts with an Empty Latent Image node defining height, width, and batch size. Alongside this, a model or checkpoint is loaded. CLIP Text Encode nodes are used to enter prompts (typically both positive and negative). A KSampler node does the heavy lifting, iteratively denoising the latent image according to that conditioning (and showing a low-resolution preview if enabled). Finally, a VAE Decode node turns your latent image into a normal picture.
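
If it helps to see those stages as code instead of boxes, here is a minimal text-to-image sketch using the Hugging Face diffusers library rather than ComfyUI itself; the checkpoint name, prompts, and sampler settings are placeholders I chose for illustration.

import torch
from diffusers import StableDiffusionPipeline

# Load a checkpoint; its CLIP text encoder and VAE come bundled with the weights.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # placeholder model ID
    torch_dtype=torch.float16,
).to("cuda")

# One call covers the whole node graph: empty latent, CLIP conditioning,
# sampler steps, and the final VAE decode back into a normal picture.
image = pipe(
    prompt="a buff angel with a glowing sword",
    negative_prompt="blurry, low quality",
    width=512,
    height=512,
    num_inference_steps=20,  # the sampler's "steps"
    guidance_scale=7.5,      # how strongly to follow the CLIP conditioning
).images[0]
image.save("angel.png")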

While I’m still developing an intuition for how a latent space works, I’m imagining a tent held up by a number of poles defining its shape. You are free to interpolate between these points, but quirks can arise when concepts bleed into each other: like how you’d tend to imagine bald people as male.

ControlNet

The next process I wish to demystify for myself is ControlNet. A second model is loaded to extract information from an existing image, and that information is then applied to your positive prompt. (Let me know if you get any interesting results conditioning negative prompts.) Add a second ControlNet or more, and combining them presents its own artistic opportunities.
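
As a rough picture of what that extraction-plus-conditioning step looks like in code, here is a sketch using diffusers and OpenCV rather than ComfyUI’s nodes; the model IDs, file names, and settings are placeholders, not my exact setup.

import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Extract a Canny edge map from an existing image (placeholder file name).
source = np.array(Image.open("angel_v1.png").convert("RGB"))
edges = cv2.Canny(cv2.cvtColor(source, cv2.COLOR_RGB2GRAY), 100, 200)
canny_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

# Load the ControlNet alongside the main checkpoint.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# The edge map steers the positive conditioning toward the original composition.
image = pipe(
    prompt="a buff angel with a glowing sword, golden armor",
    image=canny_image,
    controlnet_conditioning_scale=0.8,  # how strongly the ControlNet is applied
    num_inference_steps=20,
).images[0]
image.save("angel_canny.png")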

For this exercise, I used a picture I made during my first attempt at Stable Diffusion: a buff angel with a glowing sword. As a challenge to myself, I redid it with SDXL (Stable Diffusion XL). I used matching ControlNet models for Canny and OpenPose. Some attempts came up with details I liked and tried to keep. I added the SDXL refiner model to try to fix his sword hand. It didn’t work, but in the end, I had made a generation I liked with a few golden armor pieces and a red, white, and blue “(kilt:1.4).” Happy 4th of July!

Practical Application

A recent event has inspired me to try making a landscape picture with a pair of mason jars –one full of gold coins, and the other empty– both on a wooden table in front of a recognizable background. It’s a bit complex to generate straight out of text, but it shouldn’t be too hard with regional conditioning, right?

Impossible. Even if my background came out true, I’d still want the mason jars to match, which didn’t happen. This would have been the end of the line if I were limiting myself to A1111 without researching additional plugins for my already confusing-to-manage cocktail. With Comfy, my basic idea is to generate a jar, generate another, filled jar based on it, and then generate them together in front of my background.

Again: easier said than done. Generating the initial mason jar was simple. I even arranged it into a tidy group. From there, I made a node group for ControlNet Canny and learned about Latent Composite, both of which allowed me to consistently put the same jar into a scene twice (once I figured out my dimensions and offsets), but filling or emptying one jar’s gold proved tricky. “Filling” it only ever gave me a quarter jar of coins (limited by the table visible through the glass), and emptying it left the glass surface horribly deformed. What’s more, the coins I did get would often morph into something else, such as maple syrup, with too high a denoise value in the KSampler. On the other hand, with too low a value, the halves of the image don’t fuse. I even had coins wind up in the wrong jar with an otherwise clean workflow.
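
For anyone curious why the offsets tripped me up: latent images are 8 times smaller than the final picture in each dimension, so every pixel offset has to be divided by 8. Here is a rough sketch of the idea behind the Latent Composite node (not ComfyUI’s actual code; the sizes and offsets are made-up examples, not my real workflow values).

import torch

# Stable Diffusion latents have shape [batch, 4, height/8, width/8].
scene = torch.zeros(1, 4, 768 // 8, 1024 // 8)  # empty latent for a 1024x768 scene
jar = torch.randn(1, 4, 512 // 8, 384 // 8)     # stand-in latent for a 384x512 jar render

def composite(dest, src, x, y):
    """Paste src into dest at pixel offset (x, y); offsets shrink by the VAE factor of 8."""
    x, y = x // 8, y // 8
    dest[:, :, y:y + src.shape[2], x:x + src.shape[3]] = src
    return dest

scene = composite(scene, jar, x=64, y=192)   # left jar
scene = composite(scene, jar, x=576, y=192)  # the same jar again, so the two match
# A KSampler pass at a moderate denoise then blends the seams into one picture.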

Even though I got a head start on this project, I must lay it down here, incomplete. I have seen backgrounds removed properly with masking, so I’ll be exploring that when I come back.

Takeaway

ComfyUI looks scary, but a clean workflow is its own work of art. Comfy’s path to mastery is clearer than A1111’s. Even if you stick to basics, StableSwarmUI has simpler interfaces: a simple prompt tab and an “unpolished A1111-esque” front panel for loading pre-made workflows.

Final Question

I’m probably being too hard on myself compositing glass in-workflow. Let me know what you think. What tips and tricks might you know for advanced AI composition? I look forward to hearing from you in the comments below or on my Socials!

Which Stable Diffusion UI is Right for Me?

Good Morning from my Robotics Lab! This is Shadow_8472 and today I am exploring Automatic1111 alternatives. Let’s get started!

A1111 is a nice baseline Stable Diffusion interface. A determined beginner should find it approachable, it provides easy access to a large toolbox for an intermediate audience, and the community library of extensions and video/text tutorials is large enough to keep experts honing their skills.

Stable Diffusion Forge vs. StableSwarmUI

But A1111 is hardly the only one around. Forge has had my attention as a direct improvement over A1111, if for nothing else than its bug fixes for switching models. I’ve bumped into that limitation while experimenting with ControlNet, and it gets in the way.

But another UI (User Interface) has caught my attention recently: StableSwarmUI. From around one hour of research, it appears to be a beginner friendly package built off ComfyUI, an interface I’d previously written off as well above my skill level. Installation threw an extra challenge when it assumed browser access and I was working over SSH. I recently learned graphical SSH though:

ssh -CY <user@host>

Otherwise, StableSwarmUI was very easy to install.

Out of the box, my installation of StableSwarmUI was set up to run SDXL models. When I tried SonicDiffusion (Stable Diffusion 1.5 base) from my A1111 installation, I kept getting 50% gray outputs. I took a peek at the ComfyUI backend. Yeah… I have no business making the all-out switch until I’ve properly introduced myself to ComfyUI. Time to research until I can make a basic workflow.

OK, don’t ask me about the gray boxes. Refreshing Firefox did nothing. Some people fixed similar issues by reinstalling or deleting one file or another. I left it over a weekend, then restarted the StableSwarmUI server while installing the Custom Node Manager for ComfyUI.

ComfyUI Workflows

ComfyUI is all about the workflow: a program you make by linking various nodes into a flowchart. I looked up consistent-character workflows to get a better idea of how they work. There are a couple of options, but YouTuber NerdyRodent’s Reposer Plus caught my attention first [1]. Custom Node Manager found most of its custom nodes, but NerdyRodent used a now-outdated plugin called IPAdapter. I had to study IPAdapter v2 (see the developer’s video [2]), but it wasn’t too difficult to swap out the relevant nodes once I’d taken my time.

Reposer Plus needed additional models – some of which I already had in A1111. I made a shared models directory and moved StableSwarmUI’s entire models directory over. I found a setting in StableSwarmUI at “Server/Server Configuration/Paths/ModelRoot” to point the UI at my models directory. A1111 would have me edit a .yaml file directly, but symbolic links are easier.
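
If you would rather script those links than make them by hand, here is a quick sketch of the idea in Python; the folder names and paths are placeholders for my setup, not anything StableSwarmUI or A1111 requires.

import os
from pathlib import Path

# Placeholder locations; adjust for your own installs.
shared = Path.home() / "ai-models"
a1111_models = Path.home() / "stable-diffusion-webui" / "models"

# Folder names are illustrative; check what your A1111 install actually uses.
for category in ["Stable-diffusion", "VAE", "Lora", "ControlNet"]:
    target = shared / category
    link = a1111_models / category
    target.mkdir(parents=True, exist_ok=True)
    if link.exists() and not link.is_symlink():
        link.rename(link.parent / (link.name + ".bak"))  # keep the old folder, just in case
    if not link.exists():
        os.symlink(target, link, target_is_directory=True)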

I set the workflow in motion with “Queue Prompt,” but the IPAdapter Advanced node I installed threw an error at me. It took an extra session, but experimentation identified a model mismatch (I had loaded a “Big G” CLIP Vision model when it needed the normal one). The workflow then ran normally, but the final upscale turned sepia. I tried a photorealistic upscale model (as opposed to one for anime), but it turned out this was another server-restart issue.

Takeaway

I played around with StableSwarmUI a bit more after a string of mediocre results with NerdyRodent’s workflow. Like with many tech projects, I’m interacting with a large and evolving ecosystem. Being on local hardware, I have both the liberty and the burden of being my own admin while still learning the user’s point of view. And until I know both, I cannot tell whether StableSwarmUI is there yet or not. I was all primed to complain about how I can’t readily draw into the beginner interface for a ControlNet input, but on closer inspection I was mistaken about how this UI works. I still haven’t found the feature, but that doesn’t mean it’s not there.

If you are a first-day beginner, I would still recommend EasyDiffusion for its easy installation, image history, and inpainting. If you want anything more, A1111 will let you explore further (Forge appears abandoned) at the cost of image history. If you want to try a cool ComfyUI workflow, StableSwarmUI may be right for you.

Final Question

What is your favorite ComfyUI workflow? I look forward to hearing your answers in the comments below or on my Socials!

Works Cited

[1] N. Rodent, “Stable Diffusion – Face + Pose + Clothing – NO training required!,” youtube.com, Oct. 14, 2023. [Online]. Available: https://youtu.be/ZcCfwTkYSz8. [Accessed Jun. 20, 2024].

[2] L. Vision, “IPAdapter v2: all the new features!,” youtube.com, Mar. 25, 2024. [Online]. Available: https://youtu.be/_JzDcgKgghY. [Accessed Jun. 20, 2024].

Self-Hosted AI Consistent Characters: Part 2

Good Morning from my Robotics Lab! This is Shadow_8472 and today I am continuing work towards a consistent character using Stable Diffusion AI image generation software. Let’s get started!

Previously

Last time I talked about making a consistent character on local hardware, I went over using the Automatic1111 (A1111) web interface (running on my father’s computer), installing the ControlNet extension for Stable Diffusion and equipping it with models for OpenPose, and then using OpenPose to generate eight skeletons based on screenshots from Sonic Forces. All common enough stuff, but for context, I am following a YouTube tutorial by Not4Talent, “Create consistent characters with Stable diffusion!!” [1].
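
For reference, the preprocessing that turns a screenshot into an OpenPose skeleton can also be run outside A1111. Here is a rough sketch using the controlnet_aux package; the folder names are placeholders, and I actually used the A1111 ControlNet extension rather than a script like this.

from pathlib import Path
from PIL import Image
from controlnet_aux import OpenposeDetector

# Downloads the pose-detection weights on first run.
openpose = OpenposeDetector.from_pretrained("lllyasviel/Annotators")

Path("skeletons").mkdir(exist_ok=True)
for shot in Path("screenshots").glob("*.png"):  # placeholder folder of game screenshots
    skeleton = openpose(Image.open(shot))       # returns the stick-figure pose image
    skeleton.save(Path("skeletons") / shot.name)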

Character Switch

While I had previously been working on a Sonic fan character for my sister whom I am calling Ms. T, I switched over to working on my own character in the same setting, whom I’m calling Smokey Fox. He’s just spent several years studying in a foreign culture with a human sense of modesty, so I generated an orange Mobian fox with blue eyes and wearing red sneakers, blue jeans, and a red trench coat while applying a bunch of little things I picked up along the way, such as quality prompts and negative prompts.

Along the way, the AI came up with details I liked, such as a white shirt, black gloves, black-tipped ears, and, some of the time, a red thing on one glove that I decided was some kind of accessory crystal. Quality was spotty. It took me a few attempts before I cleaned up the hair in his profile shots by prompting for a bald head. Only four poses consistently gave him a tail, and one of those was almost never usable. It also tried giving him a black tail tip a few times, but I didn’t like that.

Along the way, I grabbed pictures with poses I liked and stacked them in GIMP. Because I was using a fixed seed, I was able to assemble more poses until I had eight portraits I’d touched up. Notably, I had to extend his coat in his behind shot, his shoes needed a lot of help, and in one pose I had to draw his tail from scratch. The crystal thing on his glove also got interesting to transfer around, and I did have to draw it myself a few times. During this process, I took screenshots of my work in progress and shared them on Discord.

No Auto Save

Disaster!! At some point, my computer randomly crashed. I don’t remember the details, but it was several days later, when I returned to work on Smokey, that I learned GIMP doesn’t auto-save like LibreOffice does. Thankfully, I had the screenshots to work with. I also lost my original prompts through a side project where I helped my mother troll a friend from elementary school regarding a Noah’s Ark baby quilt she had made for her. In total, I made an island chain, a forest scene, and, just as the quilt was about to arrive, a beach scene with the Taco Bell logo embedded using a ControlNet model meant for making fancy QR codes.

Back to Smokey Fox: the next step in the tutorial was upscaling. Pain followed. The Not4Talent tutorial [1] didn’t make sense to me, so I spent a day or two unenthusiastically bumbling around, trying to learn enough to feel ready to post. I played around with several ControlNet models. Most are variations on making white-on-black detail maps. One late-night session landed me an upscale tutorial by Olivio Sarikas [2] that clicked with me. As with other tutorials, A1111 has moved (or removed) things between updates in the six months to a year since introducing ControlNet was the popular topic, not to mention the various plugins that may differ between our setups. Olivio’s tutorial rescued my project, and I got back to having fun cleaning up details with GIMP.
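
As I understand it, the gist of the upscale approach is to enlarge the image and then run img2img over it at a low denoise, so fine detail gets repainted without the composition changing. Here is a rough sketch of that idea using diffusers rather than A1111; the checkpoint, prompt, file names, and settings are placeholders, not Olivio’s exact workflow.

import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16  # placeholder checkpoint
).to("cuda")

# Enlarge first (a dedicated upscaler model would do better than plain resampling).
small = Image.open("smokey_pose_01.png")  # placeholder file name
big = small.resize((small.width * 2, small.height * 2), Image.LANCZOS)

# A low denoise strength repaints fine detail while keeping the composition.
result = pipe(
    prompt="orange Mobian fox, red trench coat, blue jeans, red sneakers",
    image=big,
    strength=0.3,        # the "denoise" slider; higher values start changing the pose
    guidance_scale=7.0,
).images[0]
result.save("smokey_pose_01_upscaled.png")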

Takeaway

I may need to take a closer look at Forge instead of A1111. A1111 has a known bug where it has trouble unloading models, and while I was playing with various ControlNet models, I managed to exceed the vRAM capacity of my GPU.

Final Question

Forge will require a virtual environment, which I don’t yet know how to set up properly. What tutorial would you recommend? I look forward to hearing your answers in the comments below or on my Socials!

Works Cited

[1] Not4Talent, “Create consistent characters with Stable diffusion!!,” youtube.com, Jun. 2, 2023. [Online]. Available: https://youtu.be/aBiGYIwoN_k. [Accessed Jun. 7, 2024].

[2] O. Sarikas, “ULTIMATE Upscale for SLOW GPUs – Fast Workflow, High Quality, A1111,” youtube.com, May 6, 2023. [Online]. Available: https://youtu.be/3z4MKUqFEUk. [Accessed Jun. 7, 2024].

Happy Birthday Stable Diffusion!

Good Morning from my Robotics Lab! This is Shadow_8472 and today I am spending a week with Stable Diffusion to improve my skills at it. Let’s get started!

The science of AI art goes back to around the time complete CPUs were first integrated onto a single computer chip in the late ’60s/early ’70s. At least a couple of waves of AI craze came and went, but on August 22, 2022, Stable Diffusion was released as free and open-source software.

In the year since, Stable Diffusion has proven to be quite the disruptive technology. I’ve never had the cash to commission an online artist, but with a little effort, a decent amount of patience, and only an ounce of experience, I’ve gotten subjectively better results than commissioned works posted by low-end digital artists. I feel sorry for the people losing their dream jobs to machines, but at the same time this is a frontier I can have fun exploring.

One Week of Study

I’m setting myself a goal of spending two hours dedicated to learning Stable Diffusion every day this week. We’ll see what happens.

Monday

We won’t talk about what didn’t happen on Monday.

Tuesday

I finally started researching this topic after midnight. I started up Easy Diffusion, an intuitive web UI for Stable Diffusion, and generated a number of images with a project for my sister in mind.

I ended up looking up tips and tutorials. It looks like the hot-shot web UI these days is Automatic1111. It has more options, but it is proportionally harder to use. I might try it later in the week. Otherwise, most of my time actually working today went to writing the introduction.

Wednesday

Easy Diffusion is definitely the way to go if all you’re looking to do is goof around, because that is exactly what I did. So far as I can tell, I am at the very bottom of the range of graphics cards that can do this. I’m finding it useful to generate at smaller sizes for faster feedback while learning to prompt. Conclusion: img2img has a tendency to muddle things.

Still, the draw of potentially more powerful techniques is calling. I found a piece of software called Stability Matrix, which supports a number of web UIs – including Automatic1111, which every Stable Diffusion tutorial out there tends to go after. I ran into trouble with its integrated Python while setting it up (portable, in the end). I’m hoping I can replace it with a later version tomorrow.

Thursday

I switched approaches from last night and did an online search for my error:

error while loading shared libraries: libcrypt.so.1: cannot open shared object file: No such file or directory

Multiple results came from people trying Python projects on Arch-family systems like the one I’m on. One source from December 2022 recommended a multi-step process involving the AUR. I figured rifling through the project’s GitHub issues was worth a shot – to report it if nothing else. I searched for ‘libcrypt.so.1’, and the fix was to install libxcrypt-compat; I found it in the more trusted pacman repository [1].

AUR: Arch User Repository

I installed Automatic1111 using Stability Matrix and loaded it up. My first impression when compared to Easy Diffusion: Wall of controls. Easy is easy in both the setup AND the relatively intuitive control scheme, but it seemingly doesn’t support a lot of the tools I’ve seen and want to learn.

Per tradition, I made a photo of an astronaut riding a horse. It was a flop, but I got an image nonetheless. Its immediate follow-up didn’t finish when I told it to fix faces, and I ran out of vRAM on my graphics card (to be fair, I was nowhere near having everything else closed).

Sabbath starts tomorrow, and I’ve been writing these mostly late at night. I can tell I’m not likely to meet my time goal of a couple hours every day, but I feel getting to this step is a major accomplishment. Word count says 700+ words, so I could end it here and feel fine about it. I’ll see what happens. I want to find the control that tells it my graphics card is barely up to this stuff.

Friday

Time to start optimizing! For context, I’m on an NVIDIA graphics card with 4GB of vRAM, which is enough to get a feel for the software if you have a minute or two of patience per image, but having more would be prudent. After trying a couple of online videos, I found AUTOMATIC1111’s GitHub has a list of optimizations [2], which I’ll be covering below as --flags for the COMMANDLINE_ARGS variable in my start script. I don’t have time this evening for a full test, but perhaps tomorrow night or Sunday I can do some benchmarking.

vRAM: Video RAM (Random Access Memory) *So glad to have finally looked this one up!*

xformers

For NVIDIA cards, there is a library called xformers. It speeds up image generation and lowers vRAM usage, but at the cost of consistent (deterministic) results, which may not be a bad thing depending on the situation.

opt-split-attention/opt-sub-quad-attention/opt-split-attention-v1

These are “black magic” optimizations that should be handled automatically. I’ll be selecting one via the webUI, though.

medvram/lowvram

This optimization breaks up the model to accommodate lesser graphics cards. The smaller the pieces, though, the more time it will need to swap them in and out. Side note: I believe it’s MEDvram as in MEDium, as opposed to the naive pronunciation I heard, MEDvram as in MEDical.

opt-channelslast

This is an experimental optimization; it’s literally unknown whether it’s worth using at this time. I’m skipping it.

Saturday Night

I took it off.

Sunday

I joined my father on a shopping trip, and we ran out of gas at a car wash. By the time I sat down to work on Stable Diffusion, I wasn’t up to much more than an unguided self-tour of the settings. I don’t know what most of the categories are supposed to do! I’ll look each one up in time.

Monday Morning

As usual in recent months, I spend a while writing the Takeaway and Final Question, dressing up the citations, and copying everything out of LibreOffice and into WordPress for publication at noon.

Takeaway

Progress! It might not be what I expected this week, but I’m still satisfied that I have something to show off. Where I’m at now is getting back to the same place I was with Easy Diffusion before looking up the toys I came to Automatic1111 for.

As one final note, this week is also the anniversary of this blog. It caused a bit of a delay in getting this post scheduled by noon, but that would make it the third instance I can remember of a late post in twice as many years. I feel bad about it, but at the same time, it’s still a decent track record.

Final Question

Do you have a favorite interface for using Stable Diffusion?

Works Cited

[1] PresTrembleyIIIEsq et al., “SD.Next / ComfyUI Install: Unexpected Error #54,” github.com, Jul. 30, 2023. [Online]. Available: https://github.com/LykosAI/StabilityMatrix/issues/54. [Accessed Aug. 8, 2023].

[2] AUTOMATIC1111, “Optimizations,” github.com, Aug. 2023. [Online]. Available: https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Optimizations. [Accessed Aug. 8, 2023].