I’m Sold On StableSwarmUI

Good Morning from my Robotics Lab! This is Shadow_8472 and I’ve made up my mind on StableSwarmUI as a replacement for A1111. Let’s get started!

Generative AI (Artificial Intelligence) is the technology buzzword of the decade so far, thanks to open-source models. Automatic1111 has an extensive community library, but ComfyUI’s flexibility may yet challenge it as the next favorite. While StableSwarmUI isn’t yet polished to A1111’s visual aesthetic, a total AI noob should find it navigable while peeking at the Comfy beneath.

Learning ComfyUI Basics

I’m taking that peek… ComfyUI looks like boxes and spaghetti. The correct term is “workflow.” Each node represents some unit of work, much like a widget in any other UI. The power of Comfy is the ability to arbitrarily link and re-arrange nodes. Once my first impression –intimidation– wore off, I found that grouping familiar options by node and color coding their connections made the basic workflow more intuitive while highlighting my gaps in understanding of the Stable Diffusion process.

Let’s define some terms before continuing. Be warned: I’m still working on my intuition, so don’t quote me on this.

  • Latent Space: data structure for concepts trained by [counter]examples. Related concepts are stored close to each other for interpolation between them.
  • Latent Image: a graphical point in a latent space.
  • Model/Checkpoint: save files for a latent space. From what I can tell: checkpoints can be trained further, but finished models are more flexible.
  • CLIP: (Contrastive Language-Image Pretraining) a part of the model that turns text into concepts.
  • Sampler: explores the model’s latent space for a given number of “steps” with respect to concepts specified in the CLIP conditioning as well as additional sliders.
  • VAE: (Variational AutoEncoder) a model that translates images to and from latent space.

The basic Stable Diffusion workflow starts with an Empty Latent Image node defining height, width, and batch size. Alongside this, a model or checkpoint is loaded. CLIP Text Encode nodes are used to enter prompts (typically both positive and negative). A KSampler node does the heavy lifting, combining everything while showing a low-resolution preview of the latent image as it works (if enabled). Finally, a VAE Decode node turns your latent image into a normal picture.
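The workflow above can be sketched in ComfyUI’s API (“prompt”) format, where each node is a JSON entry and links are `[node_id, output_index]` pairs. This is a minimal sketch: the node IDs, checkpoint filename, and prompt text are placeholders of my own, though the `class_type` names follow ComfyUI’s built-in nodes as I understand them.

```python
# A basic text-to-image workflow expressed as ComfyUI API-format JSON.
# Node IDs and the checkpoint filename are placeholders; each input link
# is a [source_node_id, output_slot] pair.
workflow = {
    "1": {"class_type": "CheckpointLoaderSimple",
          "inputs": {"ckpt_name": "sd_xl_base_1.0.safetensors"}},
    "2": {"class_type": "EmptyLatentImage",
          "inputs": {"width": 1024, "height": 1024, "batch_size": 1}},
    "3": {"class_type": "CLIPTextEncode",  # positive prompt
          "inputs": {"text": "a buff angel with a glowing sword",
                     "clip": ["1", 1]}},
    "4": {"class_type": "CLIPTextEncode",  # negative prompt
          "inputs": {"text": "blurry, low quality",
                     "clip": ["1", 1]}},
    "5": {"class_type": "KSampler",        # the heavy lifting
          "inputs": {"model": ["1", 0], "positive": ["3", 0],
                     "negative": ["4", 0], "latent_image": ["2", 0],
                     "seed": 42, "steps": 20, "cfg": 7.0,
                     "sampler_name": "euler", "scheduler": "normal",
                     "denoise": 1.0}},
    "6": {"class_type": "VAEDecode",       # latent -> normal picture
          "inputs": {"samples": ["5", 0], "vae": ["1", 2]}},
}
```

Seeing the links laid out flat like this made it click for me why the graph view exists: the spaghetti is just these references drawn as wires.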

While I’m still developing an intuition for how a latent space works, I’m imagining a tent held up by a number of poles defining its shape. You are free to interpolate between these points, but quirks can arise when concepts bleed into each other: like how you’d tend to imagine bald people as male.

ControlNet

The next process I wish to demystify to myself is ControlNet. A second model is loaded to extract information from an existing image. This information is then applied to your positive prompt. (Let me know if you get any interesting results conditioning negative prompts.) Add in a second or more ControlNets, and combining them presents its own artistic opportunity.
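In the same API-format notation, adding a ControlNet amounts to loading a second model and routing the positive conditioning through an apply node before it reaches the sampler. The node IDs, model filename, and image name below are hypothetical; the node types match ComfyUI’s stock ControlNet nodes as I understand them.

```python
# Hypothetical fragment of a ComfyUI API-format workflow: a ControlNet
# extracts structure from a reference image and is applied to the
# positive conditioning only.
controlnet_nodes = {
    "10": {"class_type": "ControlNetLoader",
           "inputs": {"control_net_name": "control-canny-sdxl.safetensors"}},
    "11": {"class_type": "LoadImage",
           "inputs": {"image": "reference.png"}},
    "12": {"class_type": "ControlNetApply",
           "inputs": {"conditioning": ["3", 0],   # the positive CLIP Text Encode
                      "control_net": ["10", 0],
                      "image": ["11", 0],
                      "strength": 0.8}},
}
# The KSampler's "positive" input would then point at node "12"
# instead of the raw text encode. Chaining a second ControlNetApply
# onto node "12" combines two ControlNets.
```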

For this exercise, I used a picture I made during my first attempt at Stable Diffusion: a buff angel with a glowing sword. As a challenge to myself, I redid it with SDXL (Stable Diffusion XL). I used matching ControlNet models for Canny and OpenPose. Some attempts came up with details I liked and tried to keep. I added the SDXL refiner model to try to fix his sword hand. It didn’t work, but in the end, I had made a generation I liked with a few golden armor pieces and a red, white, and blue “(kilt:1.4).” Happy 4th of July!

Practical Application

A recent event has inspired me to try making a landscape picture with a pair of mason jars –one full of gold coins, and the other empty– both on a wooden table in front of a recognizable background. It’s a bit complex to generate straight out of text, but it shouldn’t be too hard with regional conditioning, right?

Impossible. Even if my background came out true, I’d still want the mason jars to match, which didn’t happen. This would have been the end of the line if I were limiting myself to A1111 without researching additional plugins for my already confusing-to-manage cocktail. With Comfy, my basic idea is to generate a jar, generate another filled jar based off it, then generate the two together in front of my background.

Again: easier said than done. Generating the initial mason jar was simple. I even arranged it into a tidy group. From there, I made a node group for ControlNet Canny and learned about Latent Composite – both of which allowed me to consistently put the same jar into a scene twice (once I figured out my dimensions and offsets), but filling/emptying one jar’s gold proved tricky. “Filling” it only ever gave me a quarter jar of coins (limited by the table visible through the glass), and emptying it left the glass surface horribly deformed. What’s more, the coins I did get would often morph into something else –such as maple syrup– with too high a denoise in the KSampler. On the other hand, with too low a value, the halves of the image wouldn’t fuse. I even had coins wind up in the wrong jar with an otherwise clean workflow.
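Figuring out those dimensions and offsets was its own puzzle, because Stable Diffusion latents are 1/8 the pixel resolution, so pixel offsets have to divide down cleanly. Here is a toy illustration of what a Latent Composite does, with plain Python lists standing in for one channel of the real tensors; the sizes and offsets are made up for the example.

```python
# Toy illustration of a Latent Composite: paste one latent "image" into
# another at an offset. SD latents are 1/8 pixel resolution, so a
# 128-pixel offset becomes 16 latent cells.
def latent_composite(base, overlay, x_px, y_px):
    x, y = x_px // 8, y_px // 8              # pixel offsets -> latent offsets
    out = [row[:] for row in base]           # copy the base latent channel
    for j, row in enumerate(overlay):
        out[y + j][x:x + len(row)] = row     # overwrite the target region
    return out

base = [[0.0] * 64 for _ in range(64)]       # one channel of a 512x512 latent
jar = [[1.0] * 16 for _ in range(16)]        # a 128x128 crop of the jar
scene = latent_composite(base, jar, 128, 256)
```

The hard overwrite here also hints at why my image halves wouldn’t fuse at low denoise: nothing blends the seam until the sampler gets enough freedom to repaint it.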

Even though I got a head start on this project, I must lay it down here, incomplete. I have seen backgrounds removed properly with masking, so I’ll be exploring that when I come back.

Takeaway

ComfyUI looks scary, but a clean workflow is its own work of art. Comfy’s path to mastery is clearer than A1111’s. Even if you stick to basics, StableSwarmUI has simpler interfaces – a simple prompt tab and an “unpolished A1111-esque” front panel for loading pre-made workflows.

Final Question

I’m probably being too hard on myself compositing glass in-workflow. Let me know what you think. What tips and tricks might you know for advanced AI composition? I look forward to hearing from you in the comments below or on my Socials!

Generative AI: Ethics on the Frontier

Good Morning from my Robotics Lab! This is Shadow_8472 and today I have a few thoughts about ethics when living on a frontier. Let’s get started.

Law and New Technology

Law follows innovation. A world without motor vehicles or electricity won’t require cars to stop at a red light. Conversely, new technologies bring legal uncertainty. A nuclear-powered laptop might be ready for 20 years of abuse in an elementary classroom without leaking any radiation, but expect more courtroom pushback than a mind-reading camera – at least until the legal system can parse the respective technologies.

Generative AI in 2024 is data hungry. More training data makes for a better illusion of understanding. OpenAI’s GPT-4o reportedly can read a handwritten note and display emotion in a verbal reply in real time. If they haven’t already, they will soon have a model trained on every scrap of text, video, and audio freely available, as well as whatever databases they have access to. But the legal-moral question is: what is fair game?

Take drones as a recent, but more mature, point of comparison. Generally speaking, drones should be welcome wherever recreational R/C aircraft already are. Hover like you might be spying on someone expecting privacy, and there might be trouble. Laws defining the boundaries between these and similar behaviors protect drone enthusiasts and homeowners alike. Before that compromise was solidified, the best anyone could do was not be a jerk while flying/complaining.

The AI Art War

But not everyone’s idea of jerk behavior is the same. Many AI trainers echo the refrain, “It’s not illegal, so we can scrape.” Then digital artists on rough times see AI duplicating their individualized styles, and they fight back. Soon, jerks are being jerks to jerks because they’re both jerks.

Model trainers practically need automated scraping, precluding an opt-in consent model like what artists want. Trainers trying not to be jerks can respect name blacklists, but improperly tagged re-uploads sneak in anyway. Artists can use tools like Glaze and Nightshade to poison training sets, but it’s just a game of cat and mouse so far.

Those are the facts, stated as objectively as I can. My thoughts are that artists damage their future livelihood more by excluding their work from training data. The whole art market will be affected as they lose commissioners to a machine that does “good enough.” Regulars who care about authentic pieces will be unaffected. Somewhere between these two groups are would-be art forgers in their favorite style and people using AI to shop for authentic commissions. I expect the latter group to be larger, so the moral decision is to make an inclusive model.

At the same time, some countries have a right to be forgotten. Verbally abusing AI art jerks provides digital artists with a much-needed sense of control. While artists’ livelihoods are threatened on many sides, AI is approachable enough to attack, so they vent where they can. I believe most of the outcry is overreaction but remember I’m biased in favor of Team Technology, though I am not wholly unsympathetic to their cause. I am in favor of letting them exclude themselves, just not for the reasons they would rather hear.

Takeaway

I see the AI situation in 2024 as comparable to China’s near monopoly on consumer electronics and open secret about committing human rights violations. In theory you could avoid unethically sourced consumer goods, but often, going without is not an option. You can then see the situation as forcing you to support immoral practices, or you can see yourself as making the effort to find the best –though only– reasonable option available. The same applies to AI. All other factors equal, I intend to continue using AI tools as my conscience allows.

Final Question

Do you disagree with my stance? Feel free to let me know in the comments below or on my Socials!

Building Up My SillyTavern Suite

Good Morning from my Robotics Lab! This is Shadow_8472, and today I am going farther into the world of AI chat from the comfort of my own home. Let’s get started!

Last week, I got into Silly Tavern, a highly configurable AI chat playground with tons of features. Accomplishing a functional setup was rewarding on its own, but I am having my mind blown reading about some of the more advanced features. I want to explore farther. Notably: I am interested in a long term goal of posing characters with text and “taking pictures,” as well as locally powered AI web search.

Stable Diffusion

My first serious exploration into AI was image generation. Silly Tavern can have the LLM (Large Language Model) write a prompt for Stable Diffusion, then interface with tools such as Automatic1111 through an API (Application Programming Interface) to generate an image. Similarly to the LLM engine, A1111 must be launched with the --api flag. I haven’t yet spent much time on this since getting it working.
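For the curious, this is roughly the kind of call SillyTavern makes under the hood once A1111 is running with --api: a JSON POST to the txt2img endpoint. A minimal sketch, assuming default host and port; the prompt text and parameter values are my own placeholders, and the actual network call is left commented out since it needs a running server.

```python
import json

# Build a minimal txt2img request body for A1111's web API.
# Parameter values are illustrative, not recommendations.
payload = {
    "prompt": "a cozy tavern interior, candlelight",
    "negative_prompt": "blurry, low quality",
    "steps": 20,
    "width": 512,
    "height": 512,
}
body = json.dumps(payload).encode("utf-8")

# With A1111 running locally (default port 7860), you would POST it:
# import urllib.request
# req = urllib.request.Request(
#     "http://127.0.0.1:7860/sdapi/v1/txt2img", data=body,
#     headers={"Content-Type": "application/json"})
# result = json.load(urllib.request.urlopen(req))
# result["images"] holds the generated pictures as base64 strings.
```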

Web Search

It is possible with a plugin to give your AI character the ability to search the Web. While historically this was done through something called the Extras API, the official documentation notes that it is no longer maintained as of last month and that most of the plugins work without it. The step I missed on both this and Stable Diffusion last week was connecting to their repository to download extensions. Anything else I tried kept getting ignored.

I configured AI search to use DuckDuckGo through Firefox. Let’s just say that while my AI search buddies appear to have a knack for finding obscure documentation, they do suffer from hallucinations when asking for exact products, so always double check the AI’s work.

A favorite AI-search interaction was looking up how much I probably paid for my now-dying tablet (Praise God for letting me finish backing it up first!), a Samsung Galaxy Tab A 10.1 (2016). The bot said it originally sold for around $400, citing MSRP (Manufacturer’s Suggested Retail Price, a term I did not know previously). I went and found the actual price, which was $50 cheaper and closer to what I remember its price tag being.

LLM Censorship

While my experience with Artificial Intelligence so far has been a fun journey of discovery, I’ve already run into certain limitations. The companies releasing LLM’s typically install a number of guardrails. I used AI to find a cell phone’s IMEI number, but Crazy Grandpa Joe might instead ask it how to make bombs or crack in his son’s kitchen using common ingredients. This knowledge is legal, but the people training LLM’s don’t want to be seen as accessories to crime. So they draw a line.

But where should they draw this line? Every sub-culture differs in values. [Social] media platforms often only allow a subset of what’s legal for more universal appeal; your .pdf giveaway of Crime This From Home! will likely draw attention from moderators to limit the platform/community’s liability before someone does something stupid with it. In the same line of reasoning, if LLM trainers wish to self-censor, then that is their prerogative. However, progressive liberal American culture doesn’t distinguish between potential for danger and danger itself. LLM’s tend to be produced under this and similar mentalities. It is no surprise then that raw models –when given the chance– are ever eager to lecture about environmental hyper-awareness and promote “safe” environments.

It gets in the way, though. For example: I played a scenario in which the ruthless Princess Azula (Avatar: The Last Airbender) is after a fight. The initial prompt has her threatening to “…incinerate you where you stand…” for bonking her with a volleyball. I goaded her about my diplomatic immunity, and suddenly she merely wanted my head. At, “I will find a way to make you pay for this,” I jokingly tossed her a unit of currency. It went over poorly, but she still refused to get physical. I ended up taking her out for coffee. I totally understand the reasoning behind this kind of censorship, but it leaves the LLM so averse to causing harm that it cannot effectively play a bad guy doing bad things to challenge you as the hero.

Takeaway

AI is already a powerful genie. The “uncensored” LLM’s I have looked at draw their line at bomb and crack recipes, but sooner or later truly uncensored LLM’s will pop up as consumer-grade hardware grows powerful enough to train models from scratch. Or perhaps by then, access to precursor datasets will be restricted and distribution of such models regulated. For now though, those with the power to let technologies like LLM’s out of the AI bottle have chosen to do so slowly in the hope that we don’t destroy ourselves before we learn to respect and use them responsibly.

Final Question

I’ve looked around and read that the LLM I’m using (kunoichi-dpo-v2-7b) is one of the better ones for my hardware. I tested pacifist Azula against a few other cards in a group chat and found that fights can happen, but it gives {user} Mary Sue grade plot armor as elaborated above. Have you found a 7B model and configuration that gives interesting results? I look forward to hearing from you in the comments below or on my Socials!

A Game for Geeks (Silly Tavern)

Good Morning from my Robotics Lab! This is Shadow_8472 and today I am getting into the world of self-hosted AI chat. Let’s get started!

Welcome to the Jungles

The Linux ecosystem is a jungle compared to Windows or Mac. Granted: it’s had decades to mature atop GNU roots that predate the first Linux kernel. Emergent names such as Debian, Ubuntu, Arch, and Red Hat stand tall and visible above a canopy of other distros based off them, with smaller names searchable on rosters like DistroWatch forming the understory, and a jungle floor of personal projects beneath. Rinse and repeat for every kind of software from window managers to office tools. Every category has its tourist attractions, and an army of guides is often more than happy to advise newcomers on how to assemble a working system. The Linux ecosystem is a jungle I have learned to navigate, but I would be remiss to call it uncurated!

This isn’t my first week on AI. Nevertheless, the AI landscape feels by comparison like the playground/park my parents used to take me to, if it were scaled up so I were only a couple inches tall. ChatGPT, Gemini, Stable Diffusion, and other big names are the first ones anyone learns when researching AI – establishing them as the de facto benchmark standards everything else is judged by in their respective sub-fields. Growing in among the fractured giants is a comparatively short range of searchable shrubs, but if you wish to fully self-host, two-inch-tall you all but has to venture into a grass field of projects too short-lived to stand out before being passed up. The AI ecosystem is a jungle where future canopy and emergent layers are indistinguishable from shrubs and moss on the forest floor. The art of tour guiding is guesswork at best because the ecosystem isn’t mature enough to be properly pruned. I could be wrong of course, but this is my first impression of the larger landscape.

AI Driven Character Chat

My goal this week was to work towards an AI chat bot and see where things went from there. I expect most everyone reading this has used or heard of ChatGPT and/or similar tools. The user says something, and the computer responds based on the conversational context using a Large Language Model (LLM – a neural network trained on large amounts of data). While I have a medium-term goal of using AI to solve my NFS+rootless Podman issues, I found a much more fun twist: AI character chat.

LLM’s can be “programmed” by the user to respond in certain ways, strikingly similar to how Star Trek’s holodeck programs and characters are depicted working. One system I came across to facilitate this style of interaction is called Silly Tavern. Silly Tavern alone doesn’t do much – if a complete AI chatbot setup were a car, I’d compare Silly Tavern to the car interior. To extend the analogy, the LLM is the engine powering things, but it takes an LLM engine – the car frame – to connect the two.

Following the relevant Silly Tavern documentation for self-hosted environments, I located and deployed Oobabooga as an LLM engine and an LLM called Kunoichi-DPO-v2. Without going into the theory this week, I went with a larger and smarter version than is recommended for a Hello World setup because I had the vRAM available to run it. Each of these three parts has alternatives, of course. But for now, I’m sticking with Silly Tavern.

I doubt I will forget the first at-length conversation I had with my setup. It was directly on top of Oobabooga running the LLM, and we eventually got to talking about a baseball team themed after the “Who’s on First?” skit, but with positions taken up by fictional time travelers from different franchises. I had it populate the stadium with popcorn and chili dog vendors, butlers, and other characters – all through natural language. It wasn’t perfect, but it was certainly worth a laugh when, say, I had the pitcher, Starlight Glimmer (My Little Pony), trot over to Sonic’s chili dog stand for some food and conversation (I’m just going to pretend he had a vegetarian option, even though neither the bot nor I thought of it at the time).

But also importantly, I asked it a few all-but-mandatory questions about itself, which I plan on covering next week along with the theory. The day after the baseball team conversation, I went to re-create the error I’d previously gotten out of Silly Tavern, and I got a response. Normally, I’d call it magic, but in this conversation with the AI, I casually asked something like,

You know when something computer-related doesn’t work, it gets left alone for a while, and then it works without changing anything?

I was just making conversation as I might with a human, but it got back with a very reasonable sounding essay to the tune of:

Sometimes memory caches or temporary files are refreshed or cleaned up, letting things work when before they didn’t. [Rough summary without seeing it for days.]

Moving on, I had a stretch goal for the week of working towards one of Silly Tavern’s features: character group chat. For that purpose, I found a popular character designed to build characters. We tried to build a card for Sonic the Hedgehog. The process was mildly frustrating at times, but we eventually ended up talking about how to optimize the card for a smaller vRAM footprint, which changed wildly when I brought up my intention to group chat.

Takeaway

I learned more on this topic than I often do in a given week, so I am splitting the theory out to save for next week. Hopefully, I will have group chat working by then, as well as another feature I thought looked interesting.

Final Question

Love it or hate it, what are your thoughts about the growing role AI is taking on in society? I look forward to hearing from you in the comments below or on my Socials!