Self-Hosted AI Consistent Characters: Part 1

Good Morning from my Robotics Lab! This is Shadow_8472, and today I am on a quest to generate consistent characters using AI. Let’s get started!

It all started with wanting to learn how to make my own “consistent characters” I can summon with a keyword in my prompt. Before I can train the AI to make one, my subject needs consistent source pictures. One way to do that is to chop up a character sheet with multiple angles of the same character all generated at once. Expect all that in a future post. It sounded like a reasonable goal until I discovered just how many moving parts I needed before I could even approach it.

In particular, my first goal is Ms. T, a Sonic OC (Original Character) by my sister, but once I figure out a successful workflow, it shouldn’t be too hard to make more.

A1111

Automatic1111 (A1111) is the go-to Stable Diffusion (SD) image generation web interface for the vast majority of tutorials out there. While it’s not the easiest SD WebUI, A1111 is approachable by patient AI noobs and EasyDiffusion graduates alike. It exposes a few too many controls by default, but it packs enough power to keep an SD adept busy for a while. I also found Forge, an A1111 fork that reportedly has extra features, bug fixes, and grudging Linux support, but needs a virtual environment. At the top end, I found ComfyUI, which lets you design and share custom workflows.

As a warm-up exercise, I found SonicDiffusion, an SD model geared toward Sonic characters, generated a bunch of Ms. T portraits, and saved my favorites. Talking with my sister, I began cherry-picking for details the model doesn’t control for, such as “cyclops” designs where the eyes join at the whites vs. separate eyes (hedgehogs are usually cyclopses, but not in the live action movies). SonicDiffusion –to my knowledge– lacks a keyword to force this distinction. Eventually, my expectations outpaced my ability to prompt, and I had to move on.

ControlNet

A major contribution to A1111’s versatility is its ecosystem of extensions. Of interest this week is ControlNet, a tool to include visual data in a Stable Diffusion prompt for precision results. As of writing, I’m looking at 21 controller types – each needing its own model to work. I downloaded the ones for Canny, Depth, and OpenPose to get started.
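
Since I’ll be leaning on these later, here is a rough sketch of how I’d fetch one of those models from the command line, assuming a stock A1111 install with the sd-webui-controlnet extension and the widely mirrored ControlNet 1.1 weights on Hugging Face (paths and exact filenames may differ on your setup):

# hypothetical paths; adjust to wherever your A1111 copy lives
cd stable-diffusion-webui/extensions/sd-webui-controlnet/models
wget https://huggingface.co/lllyasviel/ControlNet-v1-1/resolve/main/control_v11p_sd15_canny.pth

The Depth and OpenPose models download the same way; only the filename changes.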

My first thought was to use an Xbox One Kinect (AKA Kinect v2) I bought from someone in my area a few Thanksgivings ago. If it works, I can easily pose for ControlNet. Long story short: I spent a couple of days either last week or the week before tossing code back and forth with a self-hosted AI chatbot in SillyTavern with no dice. The open source Linux driver for the Kinect v2 just isn’t maintained for Ubuntu 22.04 and distros built on it. I couldn’t even get it to turn on its infrared LEDs (visible to my PinePhone’s camera) because of broken linkages in the header files or something. Pro tip: don’t argue with a delusional LLM unless you can straighten it out in a reply or two. On the plus side, the AI did help me approach a job I’d expect to have taken weeks to months without it. If/when I return, I expect to bodge it with Podman, but I may need to update the driver anyway if the kernel matters.

Even if I had gotten the Kinect to work, I doubt it would have been the miracle I was hoping for. Sonic-style characters (Mobians) have different proportions than humans – most notably everything from the shoulders up. I ended up finding an embedding for making turnaround/character sheets, but it was again trained on humans, and I got less consistent results than before. I did find a turnaround template for chibi characters that gave me OK-ish results running it through Canny, but Ms. T kept generating facing the wrong way.

On another session, I decided to try making Ms. T up in Sonic Forces. I installed it (ProtonDB: Platinum) and loaded my 100% save. I put Ms. T on a white background in GIMP and gave it to ControlNet. Unsurprisingly, OpenPose is not a Sonic fan. It’s trained on human data (now with animals!), but my cartoon screenshot kept returning blank outputs until I used a preprocessor called dw_openpose_full, which –while it still doesn’t like cartoon animal people– did cooperate on Ms. T’s right hand. Most every other node I dragged into place manually. I then demonstrated that I could pose her left leg.

Character Sheet

From there, I opened OBS to record an .MP4 file. I used FFmpeg to convert it to .gif and loaded that in GIMP to… my computer slowed to a crawl, but it did comply without a major crash. I tried to crop and delete removed pixels… another slowdown, and GIMP crashed. I adjusted OBS to record just my region of interest. 500+ frames was still a no-fly, even though each layer only holds the changes from the last. I found options to record as .gif and record as slowly as I want. I then separated out my frames with FFmpeg, making sure to have a new directory:

# run this inside a fresh directory so the extracted frames stay together
ffmpeg -i fileName.gif -vf fps=1 frame_%04d.jpg

I chose ten frames and arranged them in a 5×2 grid in GIMP. I then manually aligned OpenPose skeletons for each and sent that off to ControlNet. Immediately, my results improved. I got another big boost by using my grid of .gif frames, but in both cases Ms. T kept eyes and feet facing toward the viewer – even when her skeleton was pointed the other way. My next thought was to clean up the background on the grid, but compression artifacts got in the way.
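
As an aside, this grid step can be scripted instead of done by hand. A rough sketch with ImageMagick’s montage tool, assuming the ten chosen frames were copied into their own directory first (I still built mine manually in GIMP):

# stitch the chosen frames into a 5x2 grid with no padding; requires ImageMagick
montage chosen/*.jpg -tile 5x2 -geometry +0+0 grid.png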

Start over. I made a new character with joint visibility and background removal in mind. She looked ridiculous running through a level, but I got her knee placement by moving diagonally toward the camera and jumping. I then put eight new screenshots in a grid. Select-by-color had the background cleared in a minute. I then used Canny on the silhouettes, intending to reinforce OpenPose. I still got characters generating the wrong way.

Takeaway

This week has had a lot of interplay between study and play. While it’s fun to run the AI and cherry-pick what comes out, the prospect of better consistency keeps me coming back to improve my “jungle gym” as I prepare to generate LoRA training images.

Final Question

The challenge that broke this topic into multiple parts is getting characters to face away from the viewer. Have you ever gotten this effect while making character sheets?

I look forward to hearing from you in the comments below or on my Socials!

Milestone: First Linux Phone Call

Good Morning from my Robotics Lab! This is Shadow_8472 and today I am messing around with my prototype PinePhone to see if I can’t get it on the cell network for good. Let’s get started!

My Carrier History

Around four years ago, my family had to switch away from a cellular company that let its coverage degrade. We’d been with them since I was small, but for whatever reason, they opted to wait for new technology before replacing a destroyed tower. They lost us as customers over it. I had just gotten my PinePhone at the time. I had made one short call on it.

I made an honest effort to research network compatibility and thought I had made a match, but our then-new carrier turned out to be very closed-minded about allowed 3rd party devices. I poked at it for a while, learning a little bit each time, but progress was very slow.

In recent months, the family’s phones have been succumbing to planned obsolescence. I found a carrier for my area on the PinePhone’s compatibility chart, and we made the switch.

Linux Phone Basics

Unlike phones in the Apple/Android ecosystems, Linux phones run Linux. It won’t argue if you install –say– the full version of Firefox instead of one optimized for a mobile desktop environment. While using an app store is an option, the command line is available for those who wish for a challenge on a touch screen.

I am the proud owner of a PinePhone UBports Edition, the second prototype phone produced by Pine64. It originally came with Ubuntu Touch installed, but the experience was kind of slow. This led me to look into lightweight options, and I flashed PostmarketOS/Plasma Mobile to an SD card to explore.

Recent Developments

I finally committed. While working on another project within the past month, I installed PostmarketOS internally. My first mistake was trying to approach this installation like a normal Linux installer. Nope. It had me configure everything from the command line. My second mistake was installing a desktop-grade version of XFCE. While I still had access to a terminal, the sub-compact on-screen keyboard was a crutch at best, but I used it to 1. connect to Wi-Fi, 2. update and install Plasma Mobile, and 3. remove XFCE – all while trying to get it ready to test at the new carrier’s store the next day.
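
For the record, that command-line dance went roughly like the sketch below. It is written from memory and leans on assumptions – postmarketOS’s apk package names, NetworkManager’s nmcli being present, sudo standing in for whatever privilege tool the install uses, and a placeholder Wi-Fi name and password – so treat it as a guide rather than a recipe:

# connect to Wi-Fi (hypothetical SSID and password)
nmcli device wifi connect "HomeNetwork" password "hunter2"
# update everything, pull in Plasma Mobile, then drop XFCE
sudo apk update && sudo apk upgrade
sudo apk add postmarketos-ui-plasma-mobile
sudo apk del postmarketos-ui-xfce4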

The next day came, and things worked out so I could be at the store. Good thing too, because I had previously disabled my modem with a DIP switch under the back cover. I also noticed a bunch of apps missing from my minimal Plasma Mobile installation, and I kept mistaking some sort of configuration app for a browser. I made the connection later.

Ultimately, Plasma Mobile kept crashing, so when I went back to my SD card, I did some more research and chose Phosh (Phone Shell), an even lighter-weight desktop environment developed by Purism for their Librem 5 phones. So far, no memorable crashes, but I’ve not stress tested it yet.

Access Point Name

So, I put my new SIM card into my PinePhone running PostmarketOS/Phosh, and I got an intermittent signal thanks in part to a combination of only using 4G technology and solar activity strong enough to decorate night skies across the US with aurora borealis. The catch was an error manifesting as an orange square with a black exclamation mark.

While waiting to help out at the church office for the afternoon, I reached out to the Pine64 community on a whim. Shortly after, a helpful user there walked me through setting up the correct Access Point Name based on my carrier. Minutes later, I received an important incoming call, and the connection held up for minutes, unlike the seconds I would get out of Plasma Mobile (Thank you, Jesus, for that timing!).
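
For anyone hitting the same orange-square wall, the fix boiled down to creating a cellular connection with the carrier’s APN. Here is one hedged way to do that from a terminal with NetworkManager – the connection name and APN string below are placeholders rather than what I actually typed, so use whatever your carrier publishes:

# hypothetical connection name and APN
nmcli connection add type gsm ifname '*' con-name cellular apn carrier.apn.example
nmcli connection up cellular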

Takeaway

I am thankful to have a working phone again. I still have challenges ahead, like filching apps from the Play Store using Waydroid (or a similar compatibility layer) and having a simple unlock password while using a longer password for disk encryption and administrative tasks.

Final Question

Did you get a chance to see the northern lights this time around? I look forward to hearing from you in the comments below or on my Socials!

Building Up My SillyTavern Suite

Good Morning from my Robotics Lab! This is Shadow_8472, and today I am going farther into the world of AI chat from the comfort of my own home. Let’s get started!

Last week, I got into Silly Tavern, a highly configurable AI chat playground with tons of features. Accomplishing a functional setup was rewarding on its own, but I am having my mind blown reading about some of the more advanced features. I want to explore farther. Notably: I am interested in a long term goal of posing characters with text and “taking pictures,” as well as locally powered AI web search.

Stable Diffusion

My first serious exploration into AI was image generation. Silly Tavern can have the LLM (Large Language Model) write a prompt for Stable Diffusion, then interface with tools such as Automatic1111 through an API (Application Programming Interface) to generate an image. Similarly to the LLM engine, A1111 must be launched with the --api flag. I haven’t yet spent much time on this since getting it working.
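
For reference, getting the two talking looks something like the sketch below, assuming the stock webui.sh launcher on Linux and A1111’s default port (Silly Tavern’s Image Generation settings then point at that address):

# from the stable-diffusion-webui directory; --listen is only needed if Silly Tavern runs on another machine
./webui.sh --api --listen
# Silly Tavern's Image Generation extension connects to http://127.0.0.1:7860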

Web Search

It is possible with a plugin to give your AI character the ability to search the Web. While historically this was done through something called the Extras API, the official documentation noted that it is no longer maintained as of last month and that most of the plugins work without it. The step I missed on both this and Stable Diffusion last week was connecting to their repository to download stuff. Anything else I tried kept getting ignored.

I configured AI search to use DuckDuckGo through Firefox. Let’s just say that while my AI search buddies appear to have a knack for finding obscure documentation, they do suffer from hallucinations when asked about exact products, so always double-check the AI’s work.

A favorite interaction I had with AI search was looking up how much I probably paid for my now-dying tablet (Praise God for letting me finish backing it up first!), a Samsung Galaxy Tab A 10.1 (2016). The bot said it originally sold for around $400, citing MSRP (Manufacturer’s Suggested Retail Price, a term I did not know previously). I went and found the actual price, which was $50 cheaper and closer to what I remember its price tag being.

LLM Censorship

While my experience with Artificial Intelligence so far has been a fun journey of discovery, I’ve already run into certain limitations. The companies releasing LLMs typically install a number of guardrails. I used AI to find a cell phone’s IMEI number, but the same sort of tool could teach Crazy Grandpa Joe to make bombs or crack in his son’s kitchen using common ingredients. That knowledge is legal, but the people training LLMs don’t want to be seen as accessories to crime. So they draw a line.

But where should they draw this line? Every sub-culture differs in values. [Social] media platforms often only allow a subset of what’s legal for more universal appeal; your .pdf giveaway of Crime This From Home! will likely draw attention from moderators trying to limit the platform/community’s liability before someone does something stupid with it. By the same line of reasoning, if LLM trainers wish to self-censor, then that is their prerogative. However, progressive liberal American culture doesn’t distinguish between potential for danger and danger itself. LLMs tend to be produced under this and similar mentalities. It is no surprise then that raw models –when given the chance– are ever eager to lecture about environmental hyper-awareness and promote “safe” environments.

It gets in the way, though. For example: I played a scenario in which the ruthless Princess Azula (Avatar: The Last Airbender) is after a fight. The initial prompt has her threatening to “…incinerate you where you stand…” for bonking her with a volleyball. I goaded her about my diplomatic immunity, and suddenly she merely wanted my head. At, “I will find a way to make you pay for this,” I jokingly tossed her a unit of currency. It went over poorly, but she still refused to get physical. I ended up taking her out for coffee. I totally understand the reasoning behind this kind of censorship, but it makes the LLM so averse to causing harm that it cannot effectively play a bad guy doing bad things to challenge you as the hero.

Takeaway


AI is already a powerful genie. The “uncensored” LLMs I have looked at draw their line at bomb and crack recipes, but sooner or later truly uncensored LLMs will pop up as consumer-grade hardware grows powerful enough to train models from scratch. Or perhaps by then, access to precursor datasets will be restricted and distribution of such models regulated. For now though, those with the power to let technologies like LLMs out of the AI bottle have chosen to do so slowly in the hopes we don’t destroy ourselves before we learn to respect and use them responsibly.

Final Question

I’ve looked around, and the LLM I’m using (kunoichi-dpo-v2-7b) reads as one of the better ones for my hardware. I tested pacifist Azula against a few other cards in a group chat and found that fights can happen, but the model gives {user} plot armor to the point of being a Mary Sue, as elaborated above. Have you found a 7B model and configuration that gives interesting results? I look forward to hearing from you in the comments below or on my Socials!

A Game for Geeks (Silly Tavern)

Good Morning from my Robotics Lab! This is Shadow_8472 and today I am getting into the world of self-hosted AI chat. Let’s get started!

Welcome to the Jungles

The Linux ecosystem is a jungle when compared to Windows or Mac. Granted: it’s had decades to mature atop its GNU roots that go back before the first Linux kernel. Emergent names such as Debian, Ubuntu, Arch, and Red Hat stand tall and visible above a canopy of other distros based off them; smaller names searchable on rosters like DistroWatch make up the understory, with a jungle floor of personal projects below. Rinse and repeat for every kind of software from window managers to office tools. Every category has its tourist attractions, and an army of guides is often more than happy to advise newcomers on how to assemble a working system. The Linux ecosystem is a jungle I have learned to navigate, but I would be remiss if I were to say it is not curated!

This isn’t my first week on AI. Nevertheless, the AI landscape feels by comparison like the playground/park my parents used to take me to, as if it were scaled up so that I stood only a couple inches tall. Names like ChatGPT, Gemini, and Stable Diffusion are the first anyone learns when researching AI – establishing them as the de facto benchmark standards everything else is judged by in their respective sub-fields. Growing in among the factionated giants is a comparatively short range of searchable shrubs, but if you wish to fully self-host, 2-inch-tall you just about has to venture into the grass field of projects too short-lived to stand out before being passed up. The AI ecosystem is a jungle where future canopy and emergent layers are indistinguishable from shrubs and moss on the forest floor. The art of tour guiding is guesswork at best because the ecosystem isn’t mature enough to be properly pruned. I could be wrong of course, but this is my first impression of the larger landscape.

AI Driven Character Chat

My goal this week was to work towards an AI chat bot and see where things went from there. I expect most everyone reading this has either used or heard of ChatGPT and/or similar tools. The user says something, and the computer responds based on the conversational context using a Large Language Model (LLM – a neural network trained from large amounts of data). While I have a medium-term goal of using AI to solve my NFS+rootless Podman issues, I found a much more fun twist: AI character chat.

LLMs can be “programmed” by the user to respond in certain ways, strikingly similar to how Star Trek’s holodeck programs and characters are depicted working. One system I came across to facilitate this style of interaction is called Silly Tavern. Silly Tavern alone doesn’t do much – if a complete AI chatbot setup were a car, I’d compare Silly Tavern to the car interior. To extend the analogy, the LLM is the engine powering things, but you also need an LLM engine – the backend that actually runs the model – to join the two, like a car frame.

Following the relevant Silly Tavern documentation for self-hosted environments, I located and deployed Oobabooga as an LLM engine and an LLM called Kunoichi-DPO-v2. Without going into the theory this week, I went with a larger and smarter version than is recommended for a Hello World setup because I had the VRAM available to run it. Each of these three parts has alternatives, of course, but for now, I’m sticking with Silly Tavern.
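
For anyone wanting to follow along, launching the stack looks roughly like the sketch below. It assumes the bundled start scripts, default ports, and a model folder named after Kunoichi-DPO-v2 – all details that vary with how each piece was installed:

# start the LLM engine with its API enabled and a model loaded (hypothetical folder name)
cd text-generation-webui && ./start_linux.sh --api --model kunoichi-dpo-v2-7b
# in a second terminal, start Silly Tavern, then browse to http://127.0.0.1:8000
cd SillyTavern && ./start.sh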

I doubt I will forget the first at-length conversation I had with my setup. It was directly on top of Oobabooga running the LLM, and we eventually got to talking about a baseball team themed after the “Who’s on First?” skit, but with positions taken up by fictional time travelers from different franchises. I had it populate the stadium with popcorn and chili dog vendors, butlers, and other characters – all through natural language. It wasn’t perfect, but it was certainly worth a laugh when, say, I had the pitcher, Starlight Glimmer (My Little Pony), trot over to Sonic’s chili dog stand for some food and conversation (I’m just going to pretend he had a vegetarian option, even though neither the bot nor I thought of it at the time).

But also importantly, I asked it a few all-but-mandatory questions about itself, which I plan on covering next week along with the theory. The day after the baseball team conversation, I went to re-create the error I’d previously gotten out of Silly Tavern, and I got a response. Normally, I’d call it magic, but in this conversation with the AI, I casually asked something like,

You know how when something on a computer doesn’t work, it gets left alone for a while, and then it works without anything changing?

I was just making conversation as I might with a human, but it got back with a very reasonable sounding essay to the tune of:

Sometimes memory caches or temporary files are refreshed or cleaned up, letting things work when before they didn’t. [Rough summary without seeing it for days.]

Moving on, I had a stretch goal for the week of working towards one of Silly Tavern’s features: character group chat. For that purpose, I found a popular character card designed to build other characters. We tried to build a card for Sonic the Hedgehog. The process was mildly frustrating at times, but we eventually ended up talking about how to optimize the card for a smaller VRAM footprint, which changed wildly when I brought up my intention to group chat.

Takeaway

I learned more on this topic than I often do in a given week, so I am splitting the theory out to save for next week. Hopefully, I will have group chat working by then, as well as another feature I thought looked interesting.

Final Question

Love it or hate it, what are your thoughts about the growing role AI is taking on in society? I look forward to hearing from you in the comments below or on my Socials!