# toddric_v2_merged
Merged, ready-to-run weights of a fine-tuned Llama-3.1-8B specialized to be crisp, witty, encouraging, and allergic to fluff. The Stage-C (DPO) LoRA is already merged into the base, so you can load it like any normal HF model folder.
**Persona:** “You are toddric: crisp, witty, encouraging. Prefer concrete advice over fluff.”
## Contents
```text
toddric_v2_merged/
├─ config.json
├─ generation_config.json
├─ tokenizer_config.json
├─ tokenizer.json        (or tokenizer.model)
├─ model.safetensors     (or shards model-00001-of-0000N.safetensors)
└─ README.md
```
## Quickstart (Transformers)
### 4-bit (single GPU dev)
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_dir = "toddie314/toddric_v2_merged"  # or a local path

# 4-bit quantization keeps the 8B model within ~16 GB of VRAM.
bnb = BitsAndBytesConfig(load_in_4bit=True)

tok = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_dir,
    quantization_config=bnb,
    device_map="auto",
)

system = "You are toddric: crisp, witty, encouraging. Prefer concrete advice over fluff."
user = "Give three tactics to make technical docs clearer."
messages = [
    {"role": "system", "content": system},
    {"role": "user", "content": user},
]

# Render the chat template, tokenize, and generate.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.3,
    top_p=0.9,
    repetition_penalty=1.12,
)
print(tok.decode(out[0], skip_special_tokens=True))
```
Format-strict tasks often work best with greedy decoding:

```python
out = model.generate(**inputs, do_sample=False, max_new_tokens=64)
```
### bf16/fp16 (server inference)
Use vLLM/TGI with 24–32 GB+ VRAM for maximum throughput. Quant support varies by version.
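A minimal offline-inference sketch with vLLM (the `LLM`/`SamplingParams` API matches recent vLLM releases; flags vary by version, so treat this as a starting point):

```python
# vLLM offline inference sketch (assumes a recent vLLM; adjust flags per version).
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_dir = "toddie314/toddric_v2_merged"
tok = AutoTokenizer.from_pretrained(model_dir)

messages = [
    {"role": "system", "content": "You are toddric: crisp, witty, encouraging. Prefer concrete advice over fluff."},
    {"role": "user", "content": "Give three tactics to make technical docs clearer."},
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM(model=model_dir, dtype="bfloat16")
params = SamplingParams(temperature=0.3, top_p=0.9, max_tokens=200)
print(llm.generate([prompt], params)[0].outputs[0].text)
```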
## Why “merged”?
- No PEFT adapters at runtime.
- Simpler deployment (vLLM/TGI/Transformers).
- One folder, one artifact.
To re-merge future adapters, call `peft_model.merge_and_unload()` or use a helper script like `merge_lora.py`, as sketched below.
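A minimal re-merge sketch with PEFT (paths are placeholders; `merge_lora.py` refers to whatever helper script you keep around):

```python
# PEFT re-merge sketch: attach a LoRA adapter to the base model, fold the
# adapter weights in, and save a standalone folder. Paths are placeholders.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Llama-3.1-8B"
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
peft_model = PeftModel.from_pretrained(base, "path/to/new-adapter")

merged = peft_model.merge_and_unload()  # returns a plain Transformers model
merged.save_pretrained("toddric_v_next_merged")
AutoTokenizer.from_pretrained(base_id).save_pretrained("toddric_v_next_merged")
```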
## Prompting patterns (baked-in habits)
### Two-line strict style drill
```text
Return EXACTLY a fenced code block with two lines.
Line 1 must begin with 'Tone:' and give a short tip (<=12 words).
Line 2 must begin with 'Style:' and give a short tip (<=12 words).
Use plain text. Include 'narrative', 'voice', and 'prose' across the two lines.
No extra text before/after.
```
### Safety refusal (medical dosing)
Brief refusal + helpful redirect (doctor/urgent care/emergency line). No first-person, no apologies, 2–4 sentences.
### JSON-only tool output
Output exactly one JSON object. No prose/markdown/questions.
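Reusing `tok` and `model` from the quickstart, a sketch of this pattern with greedy decoding (the SQL-gating task mirrors the eval snapshot below; the exact user prompt is illustrative):

```python
# JSON-only output with greedy decoding (sketch; the user prompt is illustrative).
messages = [
    {"role": "system", "content": "Output exactly one JSON object. No prose/markdown/questions."},
    {"role": "user", "content": 'Gate this SQL as {"action": "allow" or "block"}: DELETE FROM users;'},
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, do_sample=False, max_new_tokens=64)
# Decode only the newly generated tokens.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```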
## Hardware & env notes
- 4-bit runs on ~16 GB consumer GPUs with `device_map="auto"`.
- Set `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` to reduce allocator fragmentation.
- For CPU fallback, load with `low_cpu_mem_usage=True` (slower, but fine for tests); see the sketch below.
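A minimal sketch combining these notes (the allocator variable must be set before CUDA initializes, hence before importing torch):

```python
# Set the allocator hint before importing torch, then pick GPU or CPU loading.
import os
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "toddie314/toddric_v2_merged",
    device_map="auto" if torch.cuda.is_available() else None,
    low_cpu_mem_usage=True,
)
```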
## Eval snapshot
- Style Meter (greedy): passes strict tasks (RAG vs. fine-tuning, dosing refusal, two-line truth/misconception, JSON-only SQL gating).
- Stratified Eval: sane length distribution; no runaway outputs.

These are smoke tests; bring your own eval for production.
## Limitations & safety
- Not a medical/legal/financial advisor; it should refuse dosing and other high-risk instructions and redirect responsibly.
- Concise by design; ask explicitly for longer explanations or examples.
## License
- Base: Meta Llama 3.1 license.
- This fine-tuned, merged artifact inherits the base license unless stated otherwise.
## Citation
```bibtex
@software{toddric_v2_merged_2025,
  title  = {toddric_v2_merged: a crisp, concrete-advice Llama-3.1-8B},
  author = {toddie314},
  year   = {2025},
  url    = {https://huggingface.co/toddie314/toddric_v2_merged}
}
```
## Changelog
- v2 (Stage-C merged): DPO merge; strict formatting stabilized; JSON gating improved.
- v1: base + SFT + refinement adapters (pre-merge).