# Trouter-20B Quick Start Guide

Get up and running with Trouter-20B in minutes.

## Installation

```bash
pip install transformers torch accelerate bitsandbytes
```

## Basic Usage

### Option 1: Full Precision (Requires ~40GB VRAM)

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Trouter-Library/Trouter-20B",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Trouter-Library/Trouter-20B")

prompt = "Explain machine learning:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Option 2: 4-bit Quantization (Requires ~10GB VRAM) ⭐ Recommended

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "Trouter-Library/Trouter-20B",
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Trouter-Library/Trouter-20B")

prompt = "Explain machine learning:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Chat Interface

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Load model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    "Trouter-Library/Trouter-20B",
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Trouter-Library/Trouter-20B")

# Create conversation
messages = [
    {"role": "user", "content": "What is quantum computing?"}
]

# Apply chat template
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generate response
outputs = model.generate(
    **inputs,
    max_new_tokens=300,
    temperature=0.7,
    top_p=0.95,
    do_sample=True
)
response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)

# Continue conversation
messages.append({"role": "assistant", "content": response})
messages.append({"role": "user", "content": "Can you explain it more simply?"})
```

## Generation Parameters

Adjust these for different use cases:

### Creative Writing (More Random)

```python
outputs = model.generate(
    **inputs,
    max_new_tokens=500,
    temperature=0.9,  # Higher = more creative
    top_p=0.95,
    top_k=50,
    do_sample=True
)
```

### Factual/Technical (More Deterministic)

```python
outputs = model.generate(
    **inputs,
    max_new_tokens=300,
    temperature=0.3,  # Lower = more focused
    top_p=0.9,
    do_sample=True
)
```

### Code Generation (Precise)

```python
outputs = model.generate(
    **inputs,
    max_new_tokens=400,
    temperature=0.2,
    top_p=0.95,
    repetition_penalty=1.1,
    do_sample=True
)
```

## Memory Requirements

| Configuration | VRAM Required | Setup |
|---------------|---------------|-------|
| **Full (BF16)** | ~40GB | `torch_dtype=torch.bfloat16` |
| **8-bit** | ~20GB | `load_in_8bit=True` |
| **4-bit** | ~10GB | 4-bit quantization config |

## Common Issues

### Out of Memory

- Use 4-bit (or 8-bit) quantization — see the sketch below
- Reduce `max_new_tokens`
- Clear the GPU cache with `torch.cuda.empty_cache()`
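To make these tips concrete, here is a minimal sketch (not taken from the model card; it reuses the model ID and libraries from the examples above) that loads the model in 8-bit, matching the ~20GB row of the memory table, and frees cached GPU memory after a generation:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 8-bit loading (~20GB VRAM per the table above); swap in the 4-bit config
# from Option 2 if you need to go lower.
model = AutoModelForCausalLM.from_pretrained(
    "Trouter-Library/Trouter-20B",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Trouter-Library/Trouter-20B")

inputs = tokenizer("Explain machine learning:", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)  # smaller budget = smaller KV cache
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Release cached allocations before the next large generation
del outputs
torch.cuda.empty_cache()
```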
### Slow Generation

- Use a smaller `max_new_tokens`
- Set `do_sample=False` for greedy decoding
- Reduce the batch size

### Poor Quality

- Adjust `temperature` (0.7-0.9 works well for most tasks)
- Increase `max_new_tokens`
- Try different prompts

## Next Steps

- See [USAGE_GUIDE.md](./USAGE_GUIDE.md) for advanced examples
- Check [examples.py](./examples.py) for code samples
- Read [EVALUATION.md](./EVALUATION.md) for benchmark results

## Simple Copy-Paste Example

```python
# Install first: pip install transformers torch accelerate bitsandbytes
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Load model (4-bit for efficiency)
model = AutoModelForCausalLM.from_pretrained(
    "Trouter-Library/Trouter-20B",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16
    ),
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("Trouter-Library/Trouter-20B")

# Generate text
prompt = "Write a Python function to calculate factorial:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# do_sample=True so the temperature setting actually takes effect
outputs = model.generate(**inputs, max_new_tokens=200, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

That's it! You're ready to use Trouter-20B.
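As an optional extra, the "Continue conversation" pattern from the Chat Interface section can be wrapped in a small interactive loop. The sketch below is one way to do it (the loop structure is illustrative, not part of the model card, and assumes `model` and `tokenizer` are already loaded as in Option 2):

```python
# Minimal interactive chat loop built on the chat-template pattern above.
# Assumes `model` and `tokenizer` are already loaded (e.g., 4-bit as in Option 2).
messages = []
while True:
    user_input = input("You: ")
    if user_input.strip().lower() in {"exit", "quit"}:
        break
    messages.append({"role": "user", "content": user_input})
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs, max_new_tokens=300, temperature=0.7, top_p=0.95, do_sample=True
    )
    # Decode only the newly generated tokens
    reply = tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    print(f"Model: {reply}")
    messages.append({"role": "assistant", "content": reply})
```

Type `exit` or `quit` to end the session; the full `messages` history is re-sent each turn, so very long conversations will eventually hit the context limit.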