Spaces · Commit 8db88dd · Parent: 4386026

Update Large model for long context
Files changed:
- AGENT_ASNM.md +110 -85
- routes/search.py +16 -16
- utils/api/router.py +125 -24
- utils/service/summarizer.py +33 -3
AGENT_ASNM.md
CHANGED
# Task Assignment Review - Three-Tier Model System

## Overview
This document summarizes the three-tier model selection system that optimizes API usage based on task complexity and reasoning requirements:
- **Easy tasks** (immediate execution, simple) → **NVIDIA Small** (Llama-8b-instruct)
- **Reasoning tasks** (thinking, decision-making, context selection) → **NVIDIA Medium** (Qwen-3-next-80b-a3b-thinking)
- **Hard/long context tasks** (content processing, analysis, generation) → **NVIDIA Large** (GPT-OSS-120b)
- **Very complex tasks** (research, comprehensive analysis) → **Gemini Pro**

## Three-Tier Task Assignments

### ✅ **Easy Tasks - NVIDIA Small (Llama-8b-instruct)**
**Purpose**: Immediate execution, simple operations
**Current Assignments**:
- `llama_chat()` - Basic chat completion
- `nvidia_small_summarize()` - Simple text summarization (≤1500 chars)
- `summarize_qa()` - Basic Q&A summarization
- `naive_fallback()` - Simple text processing fallback

### ✅ **Reasoning Tasks - NVIDIA Medium (Qwen-3-next-80b-a3b-thinking)**
**Purpose**: Thinking, decision-making, context selection
**Current Assignments**:

#### **Memory Operations** (`memo/`)
- `files_relevance()` - File relevance classification with reasoning
- `related_recent_context()` - Context selection with reasoning
- `_ai_intent_detection()` - User intent detection with reasoning
- `_ai_select_qa_memories()` - Memory selection with reasoning
- `_should_enhance_with_context()` - Context enhancement decision
- `_enhance_question_with_context()` - Question enhancement with reasoning
- `_enhance_instructions_with_context()` - Instruction enhancement with reasoning
- `consolidate_similar_memories()` - Memory consolidation with reasoning

#### **Content Processing** (`utils/service/summarizer.py`)
- `clean_chunk_text()` - Content cleaning with reasoning
- `qwen_summarize()` - Reasoning-based summarization

#### **Chat Operations** (`routes/chats.py`)
- `generate_query_variations()` - Query variation generation with reasoning

### ✅ **Hard/Long Context Tasks - NVIDIA Large (GPT-OSS-120b)**
**Purpose**: Content processing, analysis, generation, long context
**Current Assignments**:

#### **Search Operations** (`routes/search.py`)
- `extract_search_keywords()` - Keyword extraction for long queries
- `generate_search_strategies()` - Search strategy generation
- `extract_relevant_content()` - Content relevance filtering for long content
- `assess_content_quality()` - Quality assessment for complex content
- `cross_validate_information()` - Fact-checking and validation
- `generate_content_summary()` - Content summarization for long content

#### **Content Processing** (`utils/service/summarizer.py`)
- `nvidia_large_summarize()` - Long context summarization (>1500 chars)
- `llama_summarize()` - Flexible summarization (auto-selects model based on length)

### ✅ **Very Complex Tasks - Gemini Pro**
**Purpose**: Research, comprehensive analysis, advanced reasoning
**Current Assignments**:
- `generate_cot_plan()` - Chain of Thought report planning
- `analyze_subtask_comprehensive()` - Comprehensive analysis
- `synthesize_section_analysis()` - Complex synthesis
- `generate_final_report()` - Long-form report generation
- All complex report generation tasks requiring advanced reasoning

## Key Improvements Made

### 1. **Three-Tier Model Selection**
- **Before**: Two-tier system (Llama + Gemini)
- **After**: Four-tier system (NVIDIA Small + NVIDIA Medium + NVIDIA Large + Gemini Pro)
- **Reason**: Better optimization of model capabilities for different task types

### 2. **Reasoning vs. Processing Separation**
- **Before**: Mixed reasoning and processing tasks
- **After**: Clear separation - Qwen for reasoning, NVIDIA Large for processing
- **Reason**: Qwen excels at thinking, NVIDIA Large excels at content processing

### 3. **Flexible Summarization** (`utils/service/summarizer.py`)
- **Before**: Fixed model selection for summarization
- **After**: Dynamic model selection based on context length (>1500 chars → NVIDIA Large)
- **Reason**: Better handling of long context with appropriate model

### 4. **Search Operations Optimization** (`routes/search.py`)
- **Before**: Used Qwen for all search operations
- **After**: Uses NVIDIA Large for content processing tasks
- **Reason**: Better handling of long content and complex analysis

### 5. **Memory Operations Enhancement** (`memo/`)
- **Before**: Mixed model usage for memory operations
- **After**: Consistent use of Qwen for reasoning-based memory tasks
- **Reason**: Better reasoning capabilities for context selection and enhancement

## Enhanced Model Selection Logic

### **Four-Tier Complexity Heuristics**
```python
# Very complex tasks (Gemini Pro)
- Keywords: "prove", "derivation", "complexity", "algorithm", "optimize", "theorem", "rigorous", "step-by-step", "policy critique", "ambiguity", "counterfactual", "comprehensive", "detailed analysis", "synthesis", "evaluation", "research", "investigation", "comprehensive study"
- Length: > 120 words or > 4000 context words
- Content: "comprehensive", "detailed", or "research" in question

# Hard/long context tasks (NVIDIA Large)
- Keywords: "analyze", "explain", "compare", "evaluate", "summarize", "extract", "classify", "identify", "describe", "discuss", "synthesis", "consolidate", "process", "generate", "create", "develop", "build", "construct"
- Length: > 50 words or > 1500 context words
- Content: "synthesis", "generate", or "create" in question

# Reasoning tasks (NVIDIA Medium - Qwen)
- Keywords: "reasoning", "context", "enhance", "select", "decide", "choose", "determine", "assess", "judge", "consider", "think", "reason", "logic", "inference", "deduction", "analysis", "interpretation"
- Length: > 20 words or > 800 context words
- Content: "enhance", "context", "select", or "decide" in question

# Simple tasks (NVIDIA Small - Llama)
- Keywords: "what", "how", "when", "where", "who", "yes", "no", "count", "list", "find", "search", "lookup"
- Length: ≤ 10 words or ≤ 200 context words
```

### **Flexible Summarization Logic**
```python
# Dynamic model selection for summarization
if len(text) > 1500:
    use_nvidia_large()  # Better for long context
else:
    use_nvidia_small()  # Cost-effective for short text
```

## Benefits of Three-Tier System

### **Performance Improvements**
- **Better reasoning** for thinking tasks with Qwen's thinking mode
- **Enhanced processing** for long context with NVIDIA Large
- **Faster execution** for simple tasks with NVIDIA Small
- **Higher quality** for very complex tasks with Gemini Pro

### **Cost Optimization**
- **Reduced Gemini usage** for tasks that don't need advanced reasoning
- **Better task distribution** across model capabilities
- **Flexible summarization** using appropriate models based on context length
- **Maintained efficiency** for simple tasks

### **Quality Improvements**
- **Better reasoning capabilities** with Qwen for decision-making tasks
- **Improved content processing** with NVIDIA Large for long context
- **Enhanced memory operations** with better context understanding
- **More accurate search operations** with specialized models
- **Dynamic model selection** for optimal performance

## Verification Checklist

- ✅ All easy tasks use NVIDIA Small (Llama-8b-instruct)
- ✅ All reasoning tasks use NVIDIA Medium (Qwen-3-next-80b-a3b-thinking)
- ✅ All hard/long context tasks use NVIDIA Large (GPT-OSS-120b)
- ✅ All very complex tasks use Gemini Pro
- ✅ Flexible summarization implemented with dynamic model selection
- ✅ Model selection logic properly categorizes tasks by complexity and reasoning requirements
- ✅ No linting errors in modified files
- ✅ All functions have proper fallback mechanisms
- ✅ Error handling is maintained for all changes

## Configuration

The system is ready to use with the environment variables:
```bash
NVIDIA_SMALL=meta/llama-3.1-8b-instruct
NVIDIA_MEDIUM=qwen/qwen3-next-80b-a3b-thinking
NVIDIA_LARGE=openai/gpt-oss-120b
```
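For reference, this is how the variables are consumed at startup (a minimal sketch mirroring the `os.getenv` defaults in `utils/api/router.py`; unset variables fall back to these defaults):

```python
import os

# Defaults mirror utils/api/router.py; the exported values above override them.
NVIDIA_SMALL = os.getenv("NVIDIA_SMALL", "meta/llama-3.1-8b-instruct")
NVIDIA_MEDIUM = os.getenv("NVIDIA_MEDIUM", "qwen/qwen3-next-80b-a3b-thinking")
NVIDIA_LARGE = os.getenv("NVIDIA_LARGE", "openai/gpt-oss-120b")
```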
All changes maintain backward compatibility and include proper error handling with fallback mechanisms.
routes/search.py
CHANGED
Imports and `extract_search_keywords()`:

```python
import re, asyncio, time, json
from typing import List, Dict, Any, Tuple
from helpers.setup import logger, embedder, gemini_rotator, nvidia_rotator
from utils.api.router import select_model, generate_answer_with_model, qwen_chat_completion, nvidia_large_chat_completion
from utils.service.summarizer import llama_summarize


async def extract_search_keywords(user_query: str, nvidia_rotator) -> List[str]:
    """Extract intelligent search keywords from user query using NVIDIA Large agent."""
    if not nvidia_rotator:
        # Fallback: simple keyword extraction
        words = re.findall(r'\b\w+\b', user_query.lower())
```

```python
    user_prompt = f"User query: {user_query}\n\nExtract search keywords:"

    # Use NVIDIA Large for better keyword extraction
    response = await nvidia_large_chat_completion(sys_prompt, user_prompt, nvidia_rotator)

    keywords = [kw.strip() for kw in response.split() if kw.strip()]
    return keywords[:5] if keywords else [user_query]
```

`generate_search_strategies()`:

```python
    user_prompt = f"User query: {user_query}\n\nGenerate search strategies:"

    # Use NVIDIA Large for better strategy generation
    response = await nvidia_large_chat_completion(sys_prompt, user_prompt, nvidia_rotator)

    try:
        strategies = json.loads(response)
```

`extract_relevant_content()`:

```python
async def extract_relevant_content(content: str, user_query: str, nvidia_rotator) -> str:
    """Use NVIDIA Large agent to extract only the content relevant to the user query."""
    if not nvidia_rotator:
        # Fallback: return first 2000 chars
        return content[:2000]
```

```python
    user_prompt = f"User Query: {user_query}\n\nWeb Content:\n{content}\n\nExtract relevant information:"

    # Use NVIDIA Large for better content extraction
    response = await nvidia_large_chat_completion(sys_prompt, user_prompt, nvidia_rotator)

    return response.strip() if response.strip() else ""
```

`assess_content_quality()`:

```python
async def assess_content_quality(content: str, nvidia_rotator) -> Dict[str, Any]:
    """Assess content quality using NVIDIA Large agent."""
    if not nvidia_rotator or not content:
        return {"quality_score": 0.5, "issues": [], "strengths": []}
```

```python
    user_prompt = f"Assess this content quality:\n\n{content[:2000]}"

    # Use NVIDIA Large for better quality assessment
    response = await nvidia_large_chat_completion(sys_prompt, user_prompt, nvidia_rotator)

    try:
        # Try to parse JSON response
```

`cross_validate_information()`:

```python
    user_prompt = f"Main content:\n{content[:1000]}\n\nOther sources:\n{comparison_text[:2000]}\n\nAnalyze consistency:"

    # Use NVIDIA Large for better cross-validation
    response = await nvidia_large_chat_completion(sys_prompt, user_prompt, nvidia_rotator)

    try:
        validation = json.loads(response)
```

`generate_content_summary()`:

```python
    user_prompt = f"Summarize this content:\n\n{content}"

    # Use NVIDIA Large for better summarization
    response = await nvidia_large_chat_completion(sys_prompt, user_prompt, nvidia_rotator)

    return response.strip() if response.strip() else content[:200] + "..."
```
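For orientation, a minimal driver for the updated keyword extractor (a hypothetical sketch; it assumes the `nvidia_rotator` exported by `helpers.setup`, as in the imports above):

```python
import asyncio

from helpers.setup import nvidia_rotator
from routes.search import extract_search_keywords

async def main():
    # With a rotator present this routes through nvidia_large_chat_completion;
    # without one it falls back to simple regex keyword extraction.
    keywords = await extract_search_keywords(
        "Compare retrieval strategies for long-context summarization",
        nvidia_rotator,
    )
    print(keywords)  # at most 5 keywords, or the original query as fallback

asyncio.run(main())
```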
utils/api/router.py
CHANGED
Model constants and `select_model()`:

```python
GEMINI_SMALL = os.getenv("GEMINI_SMALL", "gemini-2.5-flash-lite")
GEMINI_MED = os.getenv("GEMINI_MED", "gemini-2.5-flash")
GEMINI_PRO = os.getenv("GEMINI_PRO", "gemini-2.5-pro")

# NVIDIA model hierarchy (can be overridden via env)
NVIDIA_SMALL = os.getenv("NVIDIA_SMALL", "meta/llama-3.1-8b-instruct")  # Llama model for easy complexity tasks
NVIDIA_MEDIUM = os.getenv("NVIDIA_MEDIUM", "qwen/qwen3-next-80b-a3b-thinking")  # Qwen model for reasoning tasks
NVIDIA_LARGE = os.getenv("NVIDIA_LARGE", "openai/gpt-oss-120b")  # GPT-OSS model for hard/long context tasks

def select_model(question: str, context: str) -> Dict[str, Any]:
    """
    Enhanced three-tier model selection system:
    - Easy tasks (immediate execution, simple) -> Llama (NVIDIA small)
    - Reasoning tasks (analysis, decision-making, JSON parsing) -> Qwen (NVIDIA medium)
    - Hard/long context tasks (complex synthesis, long-form) -> GPT-OSS (NVIDIA large)
    - Very complex tasks (research, comprehensive analysis) -> Gemini Pro
    """
    qlen = len(question.split())
    clen = len(context.split())

    # Very hard task keywords - require Gemini Pro (research, comprehensive analysis)
    very_hard_keywords = ("prove", "derivation", "complexity", "algorithm", "optimize", "theorem", "rigorous", "step-by-step", "policy critique", "ambiguity", "counterfactual", "comprehensive", "detailed analysis", "synthesis", "evaluation", "research", "investigation", "comprehensive study")

    # Hard/long context keywords - require NVIDIA Large (GPT-OSS)
    hard_keywords = ("analyze", "explain", "compare", "evaluate", "summarize", "extract", "classify", "identify", "describe", "discuss", "synthesis", "consolidate", "process", "generate", "create", "develop", "build", "construct")

    # Reasoning task keywords - require Qwen (thinking/reasoning)
    reasoning_keywords = ("reasoning", "context", "enhance", "select", "decide", "choose", "determine", "assess", "judge", "consider", "think", "reason", "logic", "inference", "deduction", "analysis", "interpretation")

    # Simple task keywords - immediate execution
    simple_keywords = ("what", "how", "when", "where", "who", "yes", "no", "count", "list", "find", "search", "lookup")

    # Determine complexity level
    is_very_hard = (
        any(k in question.lower() for k in very_hard_keywords) or
        qlen > 120 or
        clen > 4000 or
        "comprehensive" in question.lower() or
        "detailed" in question.lower() or
        "research" in question.lower()
    )

    is_hard = (
        any(k in question.lower() for k in hard_keywords) or
        qlen > 50 or
        clen > 1500 or
        "synthesis" in question.lower() or
        "generate" in question.lower() or
        "create" in question.lower()
    )

    is_reasoning = (
        any(k in question.lower() for k in reasoning_keywords) or
        qlen > 20 or
        clen > 800 or
        "enhance" in question.lower() or
        "context" in question.lower() or
        "select" in question.lower() or
        "decide" in question.lower()
    )

    is_simple = (
```

The routing branches:

```python
    if is_very_hard:
        # Use Gemini Pro for very complex tasks requiring advanced reasoning
        return {"provider": "gemini", "model": GEMINI_PRO}
    elif is_hard:
        # Use NVIDIA Large (GPT-OSS) for hard/long context tasks
        return {"provider": "nvidia_large", "model": NVIDIA_LARGE}
    elif is_reasoning:
        # Use Qwen for reasoning tasks requiring thinking
        return {"provider": "qwen", "model": NVIDIA_MEDIUM}
    else:
        # Use NVIDIA small (Llama) for simple tasks requiring immediate execution
```
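A few illustrative calls, assuming the keyword tuples and thresholds above (expected results are inferred from the branch order, not from test runs):

```python
# "comprehensive"/"research" hit the very-hard keywords -> Gemini Pro
select_model("Write a comprehensive research evaluation", "")
# -> {"provider": "gemini", "model": GEMINI_PRO}

# "summarize"/"compare" hit the hard keywords -> NVIDIA Large
select_model("Summarize and compare these two papers", "")
# -> {"provider": "nvidia_large", "model": NVIDIA_LARGE}

# "decide"/"context" hit the reasoning keywords -> NVIDIA Medium
select_model("Decide which context chunks to keep", "")
# -> {"provider": "qwen", "model": NVIDIA_MEDIUM}

# No keyword or length trigger -> falls through to the NVIDIA Small (Llama) branch
select_model("What is RAG?", "")
```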
Provider dispatch in `generate_answer_with_model()`:

```python
        return "I couldn't parse the model response."

    elif provider == "qwen":
        # Use Qwen for reasoning tasks
        return await qwen_chat_completion(system_prompt, user_prompt, nvidia_rotator)
    elif provider == "nvidia_large":
        # Use NVIDIA Large (GPT-OSS) for hard/long context tasks
        return await nvidia_large_chat_completion(system_prompt, user_prompt, nvidia_rotator)

    return "Unsupported provider."
```
End of `qwen_chat_completion()` and the new `nvidia_large_chat_completion()`:

```python
    except Exception as e:
        logger.warning(f"Qwen API error: {e}")
        return "I couldn't process the request with Qwen model."


async def nvidia_large_chat_completion(system_prompt: str, user_prompt: str, nvidia_rotator: APIKeyRotator) -> str:
    """
    NVIDIA Large (GPT-OSS) chat completion for hard/long context tasks.
    Uses the NVIDIA API rotator for key management.
    """
    key = nvidia_rotator.get_key() or ""
    url = "https://integrate.api.nvidia.com/v1/chat/completions"

    payload = {
        "model": NVIDIA_LARGE,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt}
        ],
        "temperature": 1.0,
        "top_p": 1.0,
        "max_tokens": 4096,
        "stream": True
    }

    headers = {"Content-Type": "application/json", "Authorization": f"Bearer {key}"}

    logger.info(f"[NVIDIA_LARGE] API call - Model: {NVIDIA_LARGE}, Key present: {bool(key)}")
    logger.info(f"[NVIDIA_LARGE] System prompt length: {len(system_prompt)}, User prompt length: {len(user_prompt)}")

    try:
        # For streaming, we need to handle the response differently
        import httpx
        async with httpx.AsyncClient(timeout=60) as client:
            response = await client.post(url, headers=headers, json=payload)

            if response.status_code in (401, 403, 429) or (500 <= response.status_code < 600):
                logger.warning(f"HTTP {response.status_code} from NVIDIA Large provider. Rotating key and retrying")
                nvidia_rotator.rotate()
                # Retry once with new key
                key = nvidia_rotator.get_key() or ""
                headers = {"Content-Type": "application/json", "Authorization": f"Bearer {key}"}
                response = await client.post(url, headers=headers, json=payload)

            response.raise_for_status()

            # Handle streaming response
            content = ""
            async for line in response.aiter_lines():
                if line.startswith("data: "):
                    data = line[6:]  # Remove "data: " prefix
                    if data.strip() == "[DONE]":
                        break

                    try:
                        import json
                        chunk_data = json.loads(data)
                        if "choices" in chunk_data and len(chunk_data["choices"]) > 0:
                            delta = chunk_data["choices"][0].get("delta", {})

                            # Handle reasoning content (thinking)
                            reasoning = delta.get("reasoning_content")
                            if reasoning:
                                logger.debug(f"[NVIDIA_LARGE] Reasoning: {reasoning}")

                            # Handle regular content
                            chunk_content = delta.get("content")
                            if chunk_content:
                                content += chunk_content
                    except json.JSONDecodeError:
                        continue

            if not content or content.strip() == "":
                logger.warning("Empty content from NVIDIA Large model")
                return "I received an empty response from the model."

            return content.strip()

    except Exception as e:
        logger.warning(f"NVIDIA Large API error: {e}")
        return "I couldn't process the request with NVIDIA Large model."
```
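A minimal usage sketch for the new helper (hypothetical; it assumes an `APIKeyRotator` loaded with NVIDIA keys, such as the `nvidia_rotator` from `helpers.setup`):

```python
import asyncio

from helpers.setup import nvidia_rotator
from utils.api.router import nvidia_large_chat_completion

async def demo():
    # Streams from the NVIDIA endpoint, accumulates delta content,
    # and returns the final text (or a fallback message on error).
    answer = await nvidia_large_chat_completion(
        "You are a concise assistant.",
        "Summarize the trade-offs between streaming and buffered responses.",
        nvidia_rotator,
    )
    print(answer)

asyncio.run(demo())
```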
utils/service/summarizer.py
CHANGED
Imports:

```python
import asyncio
from typing import List
from utils.logger import get_logger
from utils.api.rotator import robust_post_json, APIKeyRotator
from utils.api.router import qwen_chat_completion, nvidia_large_chat_completion

logger = get_logger("SUM", __name__)
```

The flexible dispatch and the small/large summarizers:

```python
async def llama_summarize(text: str, max_sentences: int = 3) -> str:
    """Flexible summarization using NVIDIA Small (Llama) for short text, NVIDIA Large for long context."""
    text = (text or "").strip()
    if not text:
        return ""

    # Use NVIDIA Large for long context (>1500 chars), NVIDIA Small for short context
    if len(text) > 1500:
        logger.info(f"[SUMMARIZER] Using NVIDIA Large for long context ({len(text)} chars)")
        return await nvidia_large_summarize(text, max_sentences)
    else:
        logger.info(f"[SUMMARIZER] Using NVIDIA Small for short context ({len(text)} chars)")
        return await nvidia_small_summarize(text, max_sentences)

async def nvidia_small_summarize(text: str, max_sentences: int = 3) -> str:
    """Summarization using NVIDIA Small (Llama) for short text."""
    system = (
        "You are a precise summarizer. Produce a clear, faithful summary of the user's text. "
        f"Return ~{max_sentences} sentences, no comments, no preface, no markdown."
```

```python
            {"role": "user", "content": user},
        ])
    except Exception as e:
        logger.warning(f"NVIDIA Small summarization failed: {e}; using fallback")
        return naive_fallback(text, max_sentences)

async def nvidia_large_summarize(text: str, max_sentences: int = 3) -> str:
    """Summarization using NVIDIA Large (GPT-OSS) for long context."""
    system = (
        "You are a precise summarizer. Produce a clear, faithful summary of the user's text. "
        f"Return ~{max_sentences} sentences, no comments, no preface, no markdown."
    )
    user = f"Summarize this text:\n\n{text}"
    try:
        return await nvidia_large_chat_completion(system, user, ROTATOR)
    except Exception as e:
        logger.warning(f"NVIDIA Large summarization failed: {e}; using fallback")
        return naive_fallback(text, max_sentences)
```

Hierarchical summarization:

```python
async def summarize_text(text: str, max_sentences: int = 6, chunk_size: int = 2500) -> str:
    """Hierarchical summarization for long texts using flexible model selection."""
    if not text:
        return ""
    if len(text) <= chunk_size:
        return await llama_summarize(text, max_sentences=max_sentences)

    # Split into chunks on paragraph boundaries if possible
    paragraphs = text.split('\n\n')
    chunks: List[str] = []
```

```python
    if buf:
        chunks.append('\n\n'.join(buf))

    # Process chunks with flexible model selection
    partials = []
    for ch in chunks:
        partials.append(await llama_summarize(ch, max_sentences=3))
        await asyncio.sleep(0)

    # Combine and summarize with flexible model selection
    combined = '\n'.join(partials)
    return await llama_summarize(combined, max_sentences=max_sentences)
```

Backward-compatible entry point:

```python
# Backward-compatible name used by app.py
async def cheap_summarize(text: str, max_sentences: int = 3) -> str:
    """Backward-compatible summarization with flexible model selection."""
    return await llama_summarize(text, max_sentences)
```
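Taken together, the flexible summarizer can be exercised like this (a hypothetical demo; the 1500-character threshold and the delegation follow the code above):

```python
import asyncio

from utils.service.summarizer import cheap_summarize, llama_summarize

async def demo():
    short = await llama_summarize("A short note about the release.")  # <=1500 chars -> NVIDIA Small
    long_doc = await llama_summarize("lorem ipsum " * 200, max_sentences=5)  # >1500 chars -> NVIDIA Large
    legacy = await cheap_summarize("Backward-compatible entry point.")  # delegates to llama_summarize
    print(short, long_doc, legacy)

asyncio.run(demo())
```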