Trained adapter that allows using T5Gemma-2b as the text encoder for Rouwei 0.8 (and other SDXL models).

image

by firedotinc, masterpiece, senko (sewayaki kitsune no senko-san), shiro (sewayaki kitsune no senko-san), yellow eyes, red eyes, fox girl, facial marks, 2girls.
Two happy anime fox girls sitting together on a park bench, both licking the same ice cream cone from opposite sides. The ice cream cone is positioned in the center of the frame as the focal point. Their faces are smeared with ice cream and their tongues are sticking out. They're wearing white t-shirts and red miniskirts. The scene is set in a sunny park with a cheerful, playful friendship atmosphere.

A further development of the earlier llm adapter versions; it has now been moved to a separate repository.

What is it:

A drop-in replacement for the CLIP text encoders of SDXL models that allows them to achieve better prompt adherence and understanding.

Key features:

  • State-of-the-art prompt adherence and natural-language prompt understanding among SDXL anime models
  • Supports both long and short prompts, with no 75-token-per-chunk limit
  • Preserves the original knowledge of styles and characters while allowing amazing flexibility in prompting
  • Supports structured prompts that describe individual features for characters, parts, elements, etc.
  • Maintains full compatibility with booru tags (alone or combined with NL), allowing easy and convenient prompting

How to run

  1. Install/update the custom nodes for ComfyUI
  • Option a: Go to ComfyUI/custom_nodes and run git clone https://github.com/NeuroSenko/ComfyUI_LLM_SDXL_Adapter

  • Option b: Open the example workflow, go to ComfyUI Manager, and press the Install Missing Custom Nodes button.

  2. Make sure Transformers is updated to a version that supports t5gemma (4.53 and above): activate the ComfyUI venv and run pip install transformers -U

  3. Download the adapter and put it into ComfyUI/models/llm_adapters
  4. Download T5Gemma
  • Option a: After activating the ComfyUI venv, run hf download Minthy/RouWei-Gemma --include "t5gemma-2b-2b-ul2_*" --local-dir "./models/LLM" (correct the path if needed).
  • Option b: Download the safetensors file and put it into ComfyUI/models/text_encoders
  5. Download a Rouwei-0.8 (vpred, epsilon, or base-epsilon) checkpoint if you don't have one yet. You can also use any Illustrious-based checkpoint, but performance may be limited.

  6. Use this workflow as a reference, and feel free to experiment

Current performance:

This version stands above any CLIP text encoder from various models in terms of prompt understanding. It lets you specify more details and individual parts for each character/object that work more or less consistently instead of purely at random, make a simple comic (stability varies), and define positions and more complex compositions.

However, it is still at an early stage: there can be difficulties with rare concepts (especially artist styles) and some biases. It also works with a rather old and small UNet that needs proper training (and possibly modifications), so don't expect it to perform like top-tier open-source image generation models such as Flux and QwenImage.

Usage and Prompting with examples:

The model is quite versatile and can accept various formats, including multilingual inputs or even base64.

But it is better to stick to one of several prompting styles:

Natural language:

img

masterpiece, by kantoku,
Three cubes stacked on each other: red, green and blue. On top of highest one sits a cute black-haired maid.

img

kikyou (blue archive) a cat girl with black hair and two cat tails in side-tie bikini swimsuit is standing on all fours balancing on top of swim ring. She is scared with tail raised and afraid of water around.

Just plain text. It is better to avoid very short and very long prompts.

Booru tags

Regular booru tags.

Until emphasis support is added to the nodes, avoid adding \ before brackets. Also, unlike with CLIP, misspellings may lead to wrong results.

Combination of tags and NL:

img

masterpiece, best quality, by muk (monsieur).
1girl, kokona (blue archive), grey hair, animal ears, brown eyes, smile, wariza,
holding a yellow ball that resembles crying emoji

The easiest and most convenient approach for most cases.

Structured prompting:

img

bold line, masterpiece, classroom.
## Asuka:
Souryuu Asuka Langley in school uniform with tired expression sitting at a school desk, head tilt.
## Zero two:
Zero two (darling in the franxx) in red bodysuit is standing behind and making her a shoulder massage.

It understands Markdown (## for separation), JSON, XML, or simple separation with new lines and :. Structuring the prompt improves results when prompting several characters with individual features. Depending on the specific case, it can work very stably, work above random level most of the time, or require some rerolls while letting you achieve things otherwise impossible due to biases or complexity.
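For illustration, the two-character classroom scene above could also be written as JSON instead of Markdown headers (a hypothetical rendering; the keys are free-form, not a fixed schema):

```json
{
  "style": "bold line, masterpiece, classroom",
  "Asuka": "Souryuu Asuka Langley in school uniform with tired expression sitting at a school desk, head tilt",
  "Zero two": "Zero two (darling in the franxx) in red bodysuit is standing behind and making her a shoulder massage"
}
```

Whichever structure you pick, the point is the same: keep each character's features grouped under their own key or header so they don't bleed into each other.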

All together:

img

1girl, sussurro (arknights), fox girl, blue eyes, grey hair, oripathy lesion (arknights), she holds a black hole with bright silver glow around, smile, fang, star-shaped puppils.
Her outfit consists of:
race queen outfit with purple crop top, green pleated skirt, bike shorts, orange sunglasses on head.
On one of her fox ears there is a large golden bow.
She is posing from her side, leaning against yellow sport car, midriff.
Background:
The sky has a gradient from green to dark purple.
outdoors,street, road, horizon line, night, red moon.
black clouds, low brightness, epic landscape of mountain range, best quality

Any combinations of above. Recommended for most complex cases.

More examples: SFW, NSFW

Quality tags:

masterpiece or best quality for positive.

worst quality or low quality for negative.

It is better to avoid spamming quality tags, as it can cause unwanted biases.

Knowledge and Train Dataset:

The training dataset uses about 2.7M pictures from Minthy/Anime-Art-Multicaptions-v5.0 and a few other sources. Still quite a small number.

Training and code:

Forward:

From T5Gemma hidden states, the adapter creates text embeddings and pooled states that are directly compatible with the SDXL UNet. Support for it can therefore be implemented easily by replacing the CLIP forward pass.

Main class and inference example: the adapter consists of 3 wide + 3 small transformer blocks with cross-attention compression between them. It supports up to 512 input tokens and converts hidden states from the T5Gemma encoder with shape [512, 2304] into a text embedding of shape [308, 2048] plus a pooled embedding vector of shape [1280]. 308 is the equivalent of 4 concatenated CLIP chunks of 77 tokens (300 without BOS and EOS).
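The shape contract can be sketched as follows. This is illustrative only: random linear projections stand in for the actual transformer blocks and cross-attention compression, so it reproduces the input/output shapes, not the real computation.

```python
import numpy as np

# Shapes taken from the description above; the projection matrices
# here are random stand-ins, NOT the trained adapter weights.
T5_TOKENS, T5_DIM = 512, 2304   # T5Gemma encoder hidden states
OUT_TOKENS, OUT_DIM = 308, 2048  # 4 concatenated CLIP chunks of 77 tokens
POOLED_DIM = 1280                # pooled vector expected by SDXL

rng = np.random.default_rng(0)
W_compress = rng.standard_normal((T5_TOKENS, OUT_TOKENS)) * 0.01  # token-axis compression
W_proj = rng.standard_normal((T5_DIM, OUT_DIM)) * 0.01            # channel projection
W_pool = rng.standard_normal((T5_DIM, POOLED_DIM)) * 0.01         # pooled head

def adapter_forward(hidden_states):
    """[512, 2304] T5Gemma states -> ([308, 2048] embedding, [1280] pooled)."""
    text_embed = W_compress.T @ hidden_states @ W_proj  # [308, 2048]
    pooled = hidden_states.mean(axis=0) @ W_pool        # [1280]
    return text_embed, pooled

h = rng.standard_normal((T5_TOKENS, T5_DIM))
emb, pooled = adapter_forward(h)
print(emb.shape, pooled.shape)  # (308, 2048) (1280,)
```

Because the output matches what SDXL already expects from four CLIP chunks, the adapter can be dropped in without touching the UNet.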

Such an adapter can be used with any other LLM or encoder.

Obtaining hidden states, with a simple example of batch processing. This is just regular use of the T5Gemma encoder part; it can be connected directly to the adapter.

Pretraining:

This part is required only for the initial pretraining, if you are initializing new weights and want to train your own version (for example, for a different text encoder model). Fine-tuning of the pretrained model is described below.

For the early stages, feature-based training against direct CLIP outputs has been used to reduce costs. Here is example training code that works with cached states and reference CLIP results. Consider it only a starting point; adjust dataloaders and formats according to your preferences.

Main training:

Backpropagation through the frozen UNet.

T5gemma (Frozen atm, can be cached) -> Adapter (Trained) -> Unet (Frozen atm) -> Loss -> Backward
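The pipeline above can be sketched as a toy PyTorch loop, with tiny nn.Linear layers standing in for the real (much larger) modules; only the adapter receives gradient updates, while gradients still flow through the frozen UNet back to it:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Tiny stand-ins for the real modules (hypothetical sizes).
t5_encoder = nn.Linear(8, 8)  # frozen T5Gemma encoder
adapter = nn.Linear(8, 8)     # the only trained part
unet = nn.Linear(8, 8)        # frozen SDXL UNet

for p in t5_encoder.parameters():
    p.requires_grad_(False)
for p in unet.parameters():
    p.requires_grad_(False)

opt = torch.optim.AdamW(adapter.parameters(), lr=1e-2)

x, target = torch.randn(4, 8), torch.randn(4, 8)
before = adapter.weight.clone()

# forward: frozen encoder -> adapter -> frozen unet -> loss -> backward
pred = unet(adapter(t5_encoder(x)))
loss = nn.functional.mse_loss(pred, target)
loss.backward()
opt.step()

adapter_updated = not torch.equal(adapter.weight, before)
unet_frozen = unet.weight.grad is None
print(adapter_updated, unet_frozen)  # True True
```

Freezing the UNet keeps the loss signal grounded in what the existing model already generates, so the adapter learns to produce embeddings the UNet can consume; since T5Gemma is also frozen, its hidden states can be precomputed and cached.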

Sd-scripts fork for LoRA training.

Sd-scripts (dev branch) fork for full training; supports fine-tuning of each part (t5gemma, adapter, unet).

Compatibility:

Designed to work with Rouwei; also works with most Illustrious-based checkpoints, including NoobAI and popular merges.

Near-term plans:

  • Custom node improvements, including emphasis support

Another version will be trained on a larger dataset to estimate capacity and to decide between joint training with the encoder or leaving it untouched. If no flaws are found, it will be used as the text encoder for the large training run of the next version of the Rouwei checkpoint.

I'm willing to help/cooperate:

Join the Discord server, where you can share your thoughts, give proposals, make requests, etc. You can also write to me directly here, on Civitai, or via DM on Discord.

Thanks:

Part of the training was performed on Google TPUs and sponsored by OpenRoot-Compute.

Personal: NeuroSenko (code), Rimuru (idea, discussions), Lord (testing), DraconicDragon (fixes, testing), Remix (nodes code), and all fellow brothers who supported me before.

Donations:

BTC bc1qwv83ggq8rvv07uk6dv4njs0j3yygj3aax4wg6c

ETH/USDT(e) 0x04C8a749F49aE8a56CB84cF0C99CD9E92eDB17db

XMR 47F7JAyKP8tMBtzwxpoZsUVB8wzg2VrbtDKBice9FAS1FikbHEXXPof4PAb42CQ5ch8p8Hs4RvJuzPHDtaVSdQzD6ZbA5TZ

License:

This repo contains original or fine-tuned google/t5gemma-2b-2b-ul2 models. Gemma is provided under and subject to the Gemma Terms of Use found at ai.google.dev/gemma/terms.

MIT license for adapter models.
