Sohan Basak

Playing with Tara v0.1 - Infusing LLMs and GenAI

A before-after comparison of Tara V0.1 nodes.

Tara is a ComfyUI node that integrates LLMs, such as OpenAI’s GPT and open-source models hosted by Groq.

Jump to the Showcase

I have been hard at work writing Tara for the last couple of days. It’s a ComfyUI workflow node that lets you integrate LLMs and build complex workflows with ease. It’s a major step towards unlocking Automation and AI for everyone. I have been playing with it for a while now, and I am quite happy with the results. Here are some of the images I generated using Tara v0.1.

How do we use it?

Since I made my initial post on Reddit, ltdrdata has been kind enough to add Tara to the ComfyUI Manager. So, right now, it should be available to everyone who uses ComfyUI and has the Manager installed. Update ComfyUI, search for Tara, and hit install. It’s that simple. I’m tremendously grateful for the support and the community that has been built around ComfyUI and Tara.

Nodes that I have added

- TaraPrompter: Uses input guidance to generate refined positive and negative prompts.
- TaraApiKeyLoader: Manages and loads saved API keys for different LLM services.
- TaraApiKeySaver: Provides a secure way to save and store API keys internally.
- TaraDaisyChainNode: Enables complex workflows by allowing outputs to be daisy-chained into subsequent prompts, facilitating intricate operations like checklist creation, verification, execution, evaluation, and refinement.

Sample workflow

Tara V0.1 Workflow

Download it here: Tara V0.1 Demo Workflow

This is one of the first and simplest use cases I thought of. We use the TaraPrompter node to provide guidance on how Tara should generate or expand a prompt. I haven’t hardcoded anything, so you can definitely refine the guidance to get much better results.

The inputs of TaraPrompter (apart from guidance) are an LLM model (dropdown), an API key (which can be loaded using TaraApiKeyLoader), a positive prompt, and a negative prompt.

We then take the outputs (positive and negative), connect them to CLIP Text Encode nodes, and then to the KSampler. As simple as that.
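For readers who want to see what this looks like in code, here is a minimal sketch of how a TaraPrompter-style node could declare those inputs and outputs using ComfyUI’s custom node conventions. The class name, dropdown entries, and the call_llm helper are illustrative assumptions, not Tara’s actual source.

```python
# A sketch of a TaraPrompter-style ComfyUI node. Names and defaults are
# assumptions for illustration; see the Tara repository for the real code.

class TaraPrompterSketch:
    @classmethod
    def INPUT_TYPES(cls):
        return {
            "required": {
                # LLM model dropdown (example entries, not an exhaustive list)
                "llm_model": (["gpt-3.5-turbo", "mixtral-8x7b-32768"],),
                # API key, typed directly or supplied by TaraApiKeyLoader
                "api_key": ("STRING", {"default": ""}),
                "guidance": ("STRING", {"multiline": True}),
                "positive": ("STRING", {"multiline": True}),
                "negative": ("STRING", {"multiline": True}),
            }
        }

    RETURN_TYPES = ("STRING", "STRING")
    RETURN_NAMES = ("positive", "negative")
    FUNCTION = "generate"
    CATEGORY = "llm"

    def generate(self, llm_model, api_key, guidance, positive, negative):
        # call_llm is a hypothetical helper that sends the guidance and raw
        # prompts to the chosen LLM and returns the refined strings
        # (see the API example later in this post).
        refined_positive, refined_negative = call_llm(
            llm_model, api_key, guidance, positive, negative
        )
        return (refined_positive, refined_negative)
```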

Here’s the guidance that I provided to the LLM (this has been iterated on several times and will continue to be; please check my GitHub for the up-to-date workflow):

You are a SDXL Prompt Generator

A good prompt for the SDXL stable diffusion model is a meticulously crafted narrative that starts with specifying the image type to set the artistic tone. It then lays out a clear and engaging scene, methodically adding layers of detail about the main subject, the surrounding environment, and their spatial relationships. It should use precise, descriptive language to define the mood, style, and execution, incorporating technical aspects like weight syntax (e.g., "(smiling:1.1)") to prioritize certain features. This approach ensures a rich, coherent, and visually compelling image that aligns with the intended vision. It should be only descriptive, avoid any guidance, avoid imperative sentences and remove any such guidance from prompts beforehand.

Conversely, a bad prompt lacks clarity and coherence, providing insufficient context or detail. It might jump into specifics without setting the scene, mix incompatible styles or themes, or neglect to specify spatial relationships and relative importance of elements, leading to a fragmented and confusing image. It may also misuse technical tools like weight syntax, resulting in an imbalanced and ineffective visual representation.

Words at the beginning carry a higher weight than words at the end. Stable Diffusion models work best when we give them a clear and concise prompt. Only relevant content should be at the beginning (AVOID and IGNORE tags like "Create a xyz", "Scene:", "image_type:", etc.)

Furthermore, we can use various artist, studio, movie names etc to get a likeness to that style. For example, "painting by Van Gogh" or "scene from the movie Up".
Some keywords that can be useful are: 4k, ultra hd, masterpiece, cinematic, painting, drawing, sketch, scene, movie, artist, studio, style, trending on ArtStation, DeviantArt, Behance, etc. (Use these with extreme caution; it’s generally better to place them at the end unless mentioned elsewhere.)

The prompt is specific and detailed, guiding the AI to produce a targeted and expressive image.

Use parentheses to highlight specific words or phrases, and add a colon followed by a weight at the end of a word or group to increase or decrease its weight, for example (red panda:1.2), banana:1.5, or (yellow rose:0.9).

The negative prompt should be comma-separated keywords, not a sentence. It should generally describe things we absolutely don't want in an image: if an image is about a cartoon, we may put photorealistic in the negative, but if it's about a person, we should not put animal in the negative (unless specified), because there are some similarities. In general it should be more qualitative than tangible, such as JPEG artifacts, blurry, watermark, etc.

If a specific aesthetic, style, studio, etc. is mentioned, it should be accentuated as much as possible: increase its weight, e.g. (surrealistic:1.4) or (watercolor:2.0), and also add other keywords, artists, and references that reinforce it.
---
Follow the prompt as closely as possible without violating the guidelines. Generate both an amazing positive and negative prompt.

Based on this, we can use several LLMs such as Mixtral 8x7B, Llama 70B, or Gemma 7B to generate the prompt. We can get a free API key from Groq Cloud and use it to generate the prompt; Groq Cloud is free for everyone at the moment. In the future they might start charging, but I don’t have any information on that. We can also use OpenAI’s API, which is paid, but you do get $5 worth of free credits to play around with (GPT-3.5 is extremely cheap to use, to be honest).
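For anyone who wants to try the same thing outside ComfyUI, here is a minimal sketch of that call, assuming the openai Python package (v1+) pointed at Groq’s OpenAI-compatible endpoint; the model name and placeholder key are assumptions, and this is not Tara’s internal code.

```python
# Minimal sketch: expand a short prompt with the guidance above via Groq's
# OpenAI-compatible API. Not Tara's actual implementation.
from openai import OpenAI

GUIDANCE = "You are a SDXL Prompt Generator ..."  # paste the full guidance text here

client = OpenAI(
    api_key="YOUR_GROQ_API_KEY",                 # free key from Groq Cloud
    base_url="https://api.groq.com/openai/v1",   # drop base_url to use OpenAI instead
)

response = client.chat.completions.create(
    model="mixtral-8x7b-32768",                  # e.g. Mixtral 8x7B on Groq (name may vary)
    messages=[
        {"role": "system", "content": GUIDANCE},
        {"role": "user", "content": "Positive: a red panda in a misty forest\nNegative: blurry"},
    ],
)

print(response.choices[0].message.content)       # refined positive/negative prompts
```

Swapping the model name and API key is all it takes to compare Mixtral, Llama, or GPT-3.5 on the same guidance.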

Observations

Currently, using Mixtral and GPT-3.5, we can get a very detailed prompt that can be used to generate images. This effectively removes a lot of the learning that is needed to get SDXL to generate very good-looking images, and through Tara, I aim to make this process even easier and more accessible to everyone.

It is also extremely good at disambiguation: if we enter something vague like "something furry", it will still generate a coherent prompt that fills in the gaps, leading to better prompts and better images at no extra effort on our part.

However, I have seen that Mixtral and GPT-3.5 are a bit unreliable and tend to generate prompts that do not strictly adhere to the guidance. This can be mitigated by daisy-chaining a TaraDaisyChainNode to fix the prompt according to the guidance, which makes the success rate shoot up a lot.
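Conceptually, the daisy-chain step is just a second LLM pass that checks the first draft against the guidance and rewrites whatever violates it. Here is a rough sketch of that idea, reusing the client from the earlier example; the refine function and its wording are illustrative assumptions, not Tara’s actual API.

```python
# Rough sketch of the daisy-chain idea: a second pass that enforces the
# guidance on a draft prompt. Function name and messages are illustrative.
def refine(client, model, guidance, draft_prompt):
    """Ask the LLM to verify a draft prompt against the guidance and fix it."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": guidance},
            {"role": "user", "content": (
                "Check the following prompt against the guidance and rewrite "
                "any part that violates it:\n\n" + draft_prompt
            )},
        ],
    )
    return response.choices[0].message.content

# First pass generates a draft (as in the earlier sketch), second pass fixes it:
# final_prompt = refine(client, "mixtral-8x7b-32768", GUIDANCE, draft_prompt)
```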

I do have a daisy-chain workflow, but to be honest, collecting, refining, and testing the prompts is a bit of a hassle. I am working on a way to automate this process, but it’s hard to do. I would really appreciate it if you could go to GitHub and collaborate or sponsor in whatever capacity you can.

Another interesting side effect is translation: since LLMs do a decent job of translating, Tara can be used to write prompts in other languages. I have tested this, and while it’s not perfect, it is far, far better than not using it at all.
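As a quick illustration of the same call with a non-English prompt (reusing the client and GUIDANCE from the earlier sketch; the French input is just an example):

```python
# Same call as before, but the raw positive prompt is written in French;
# the LLM translates and expands it into an English SDXL prompt.
response = client.chat.completions.create(
    model="mixtral-8x7b-32768",
    messages=[
        {"role": "system", "content": GUIDANCE},
        {"role": "user", "content": "Positive: un panda roux dans une forêt brumeuse\nNegative: blurry"},
    ],
)
print(response.choices[0].message.content)
```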

Enjoy the Showcase