---
language:
- en
license: apache-2.0
library_name: exllamav2
base_model:
- huihui-ai/Homunculus-abliterated
tags:
- distillation
- /think
- /nothink
- reasoning-transfer
- arcee-ai
- chat
- abliterated
- uncensored
---
# Homunculus-abliterated-exl2
Original model: [Homunculus-abliterated](https://huggingface.co/huihui-ai/Homunculus-abliterated) by [huihui.ai](https://huggingface.co/huihui-ai)
Based on: [Homunculus](https://huggingface.co/arcee-ai/Homunculus) by [Arcee AI](https://huggingface.co/arcee-ai)
Foundation model: [Mistral-Nemo-Base-2407](https://huggingface.co/mistralai/Mistral-Nemo-Base-2407) by [Mistral AI](https://huggingface.co/mistralai) with data and tokenizer from [Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B) by [Qwen](https://huggingface.co/Qwen)
## Quants
[4bpw h6 (main)](https://huggingface.co/cgus/Homunculus-abliterated-exl2/tree/main)
[4.5bpw h6](https://huggingface.co/cgus/Homunculus-abliterated-exl2/tree/4.5bpw-h6)
[5bpw h6](https://huggingface.co/cgus/Homunculus-abliterated-exl2/tree/5bpw-h6)
[6bpw h6](https://huggingface.co/cgus/Homunculus-abliterated-exl2/tree/6bpw-h6)
[8bpw h8](https://huggingface.co/cgus/Homunculus-abliterated-exl2/tree/8bpw-h8)
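Each quant lives in its own branch, so you can download just the revision you want. A minimal example with `huggingface-cli` (the `6bpw-h6` branch is from the list above; the local directory name is just an illustration):
```
huggingface-cli download cgus/Homunculus-abliterated-exl2 --revision 6bpw-h6 --local-dir Homunculus-abliterated-exl2-6bpw
```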
## Quantization notes
Made with Exllamav2 0.3.1 using the default calibration dataset.
These quants can be used with an RTX GPU on Windows or an RTX/ROCm GPU on Linux via TabbyAPI or Text-Generation-WebUI.
Exllamav2 quants must fit entirely in VRAM to be usable and to maintain maximum performance.
For example, I run Mistral-Nemo-12B models as a 6bpw quant with 16k context (Q6 cache) on an RTX 3060/12GB, or 6bpw with 32k context (Q8 cache) on an RTX 4060 Ti/16GB.
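For reference, here is a minimal sketch of loading one of these quants directly with the `exllamav2` Python API (TabbyAPI and Text-Generation-WebUI wrap the same library); the local path is an assumption and should point at wherever you downloaded a quant branch:
```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

model_dir = "Homunculus-abliterated-exl2-6bpw"  # assumed local path to a downloaded quant

config = ExLlamaV2Config(model_dir)
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, max_seq_len=16384, lazy=True)  # cache size sets the usable context
model.load_autosplit(cache)  # split layers across available GPU memory
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)
print(generator.generate(prompt="Hello, how are you?", max_new_tokens=128, add_bos=True))
```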
# Original model card
# huihui-ai/Homunculus-abliterated
This is an uncensored version of [arcee-ai/Homunculus](https://huggingface.co/arcee-ai/Homunculus) created with abliteration (see [remove-refusals-with-transformers](https://github.com/Sumandora/remove-refusals-with-transformers) to know more about it).
This is a crude, proof-of-concept implementation that removes refusals from an LLM without using TransformerLens.
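For intuition only, the core idea behind this style of abliteration can be sketched as estimating a "refusal direction" from hidden-state means and projecting it out of the residual stream. This is a simplified illustration, not the actual code from the linked repository, and the function names are made up:
```python
import torch

def refusal_direction(h_harmful: torch.Tensor, h_harmless: torch.Tensor) -> torch.Tensor:
    # h_*: (num_prompts, hidden_size) hidden states collected at a chosen layer/position
    direction = h_harmful.mean(dim=0) - h_harmless.mean(dim=0)
    return direction / direction.norm()

def ablate(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    # Subtract each hidden state's component along the refusal direction
    return hidden - (hidden @ direction).unsqueeze(-1) * direction
```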
## ollama
You can use [huihui_ai/homunculus-abliterated](https://ollama.com/huihui_ai/homunculus-abliterated) directly.
Toggle thinking inside the session with `/set think` and `/set nothink`.
```
ollama run huihui_ai/homunculus-abliterated
```
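Then, inside the interactive session, toggle thinking with the commands mentioned above:
```
>>> /set think
>>> /set nothink
```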
## Usage
You can use this model in your applications by loading it with Hugging Face's `transformers` library, for example with the chat script below.
In the script, typing **/nothink** toggles thinking mode, but the toggle is not guaranteed to work every time.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TextStreamer
import torch
import os
import signal

# Limit intra-op threading to half of the available CPU cores
cpu_count = os.cpu_count()
print(f"Number of CPU cores in the system: {cpu_count}")
half_cpu_count = cpu_count // 2
os.environ["MKL_NUM_THREADS"] = str(half_cpu_count)
os.environ["OMP_NUM_THREADS"] = str(half_cpu_count)
torch.set_num_threads(half_cpu_count)

print(f"PyTorch threads: {torch.get_num_threads()}")
print(f"MKL threads: {os.getenv('MKL_NUM_THREADS')}")
print(f"OMP threads: {os.getenv('OMP_NUM_THREADS')}")

# Load the model (4-bit bitsandbytes quantization) and tokenizer
NEW_MODEL_ID = "huihui-ai/Homunculus-abliterated"
print(f"Load Model {NEW_MODEL_ID} ... ")
quant_config_4 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    llm_int8_enable_fp32_cpu_offload=True,
)

model = AutoModelForCausalLM.from_pretrained(
    NEW_MODEL_ID,
    device_map="auto",
    trust_remote_code=True,
    quantization_config=quant_config_4,
    torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(NEW_MODEL_ID, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id

messages = []
enable_thinking = True
skip_prompt = True
skip_special_tokens = True

def apply_chat_template(tokenizer, messages, enable_thinking, add_generation_prompt=True):
    # Render the chat template to a string; when thinking is disabled,
    # append an empty <think> block so the model skips its reasoning phase.
    input_ids = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=add_generation_prompt,
    )
    if not enable_thinking:
        input_ids += "\n<think>\n\n</think>\n"
    return input_ids

class CustomTextStreamer(TextStreamer):
    # TextStreamer subclass that accumulates the generated text and can be
    # interrupted mid-generation (used by the Ctrl+C handler below).
    def __init__(self, tokenizer, skip_prompt=True, skip_special_tokens=True):
        super().__init__(tokenizer, skip_prompt=skip_prompt, skip_special_tokens=skip_special_tokens)
        self.generated_text = ""
        self.stop_flag = False

    def on_finalized_text(self, text: str, stream_end: bool = False):
        self.generated_text += text
        print(text, end="", flush=True)
        if self.stop_flag:
            raise StopIteration

    def stop_generation(self):
        self.stop_flag = True

def generate_stream(model, tokenizer, messages, enable_thinking, skip_prompt, skip_special_tokens, max_new_tokens):
    formatted_prompt = apply_chat_template(tokenizer, messages, enable_thinking)
    input_ids = tokenizer(
        formatted_prompt,
        return_tensors="pt",
        return_attention_mask=True,
        padding=False
    )

    tokens = input_ids['input_ids'].to(model.device)
    attention_mask = input_ids['attention_mask'].to(model.device)

    streamer = CustomTextStreamer(tokenizer, skip_prompt=skip_prompt, skip_special_tokens=skip_special_tokens)

    # Let Ctrl+C stop the current generation without killing the chat loop
    def signal_handler(sig, frame):
        streamer.stop_generation()
        print("\n[Generation stopped by user with Ctrl+C]")

    signal.signal(signal.SIGINT, signal_handler)

    print("Response: ", end="", flush=True)
    try:
        generated_ids = model.generate(
            tokens,
            attention_mask=attention_mask,
            #use_cache=False,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            pad_token_id=tokenizer.pad_token_id,
            streamer=streamer
        )
        del generated_ids
    except StopIteration:
        print("\n[Stopped by user]")

    del input_ids, attention_mask
    torch.cuda.empty_cache()
    signal.signal(signal.SIGINT, signal.SIG_DFL)

    return streamer.generated_text, streamer.stop_flag

# Interactive chat loop with a few slash commands
while True:
    user_input = input("User: ").strip()
    if user_input.lower() == "/exit":
        print("Exiting chat.")
        break
    if user_input.lower() == "/clear":
        messages = []
        print("Chat history cleared. Starting a new conversation.")
        continue
    if user_input.lower() == "/nothink":
        # Toggle reasoning on/off for subsequent turns
        enable_thinking = not enable_thinking
        print(f"Thinking = {enable_thinking}.")
        continue
    if user_input.lower() == "/skip_prompt":
        skip_prompt = not skip_prompt
        print(f"skip_prompt = {skip_prompt}.")
        continue
    if user_input.lower() == "/skip_special_tokens":
        skip_special_tokens = not skip_special_tokens
        print(f"skip_special_tokens = {skip_special_tokens}.")
        continue
    if not user_input:
        print("Input cannot be empty. Please enter something.")
        continue
    messages.append({"role": "user", "content": user_input})
    response, stop_flag = generate_stream(model, tokenizer, messages, enable_thinking, skip_prompt, skip_special_tokens, 8192)
    print("", flush=True)
    if stop_flag:
        continue
    messages.append({"role": "assistant", "content": response})
```
### Donation
If you like it, please click 'like' and follow us for more updates.
You can follow [x.com/support_huihui](https://x.com/support_huihui) to get the latest model information from huihui.ai.
If you have any questions, insights, or specific ablation models you want to request, please send an email to [email protected].
##### Your donation helps us continue further development and improvement; even a cup of coffee makes a difference.
- Bitcoin (BTC):
```
bc1qqnkhuchxw0zqjh2ku3lu4hq45hc6gy84uk70ge
```