LiamKhoaLe committed on
Commit 8db88dd · 1 Parent(s): 4386026

Upd Large model for longcontext

Files changed (4):
  1. AGENT_ASNM.md +110 -85
  2. routes/search.py +16 -16
  3. utils/api/router.py +125 -24
  4. utils/service/summarizer.py +33 -3
AGENT_ASNM.md CHANGED
@@ -1,143 +1,168 @@
- # Task Assignment Review - Corrected Model Hierarchy

  ## Overview
- This document summarizes the corrected task assignments to ensure proper model hierarchy:
- - **Easy tasks** (immediate execution, simple) → **Llama** (NVIDIA small)
- - **Medium tasks** (accurate, reasoning, not too time-consuming) → **Qwen**
- - **Hard tasks** (complex analysis, synthesis, long-form) → **Gemini Pro**

- ## Corrected Task Assignments

- ### ✅ **Easy Tasks - Llama (NVIDIA Small)**
  **Purpose**: Immediate execution, simple operations
  **Current Assignments**:
  - `llama_chat()` - Basic chat completion
- - `llama_summarize()` - Simple text summarization
  - `summarize_qa()` - Basic Q&A summarization
  - `naive_fallback()` - Simple text processing fallback

- ### ✅ **Medium Tasks - Qwen**
- **Purpose**: Accurate reasoning, not too time-consuming
- **Corrected Assignments**:
-
- #### **Search Operations** (`routes/search.py`)
- - `extract_search_keywords()` - Keyword extraction with reasoning
- - `generate_search_strategies()` - Search strategy generation
- - `extract_relevant_content()` - Content relevance filtering
- - `assess_content_quality()` - Quality assessment with reasoning
- - `cross_validate_information()` - Fact-checking and validation
- - `generate_content_summary()` - Content summarization

  #### **Memory Operations** (`memo/`)
- - `files_relevance()` - File relevance classification
  - `related_recent_context()` - Context selection with reasoning
- - `_ai_intent_detection()` - User intent detection (CORRECTED)
- - `_ai_select_qa_memories()` - Memory selection with reasoning (CORRECTED)
- - `_should_enhance_with_context()` - Context enhancement decision (CORRECTED)
- - `_enhance_question_with_context()` - Question enhancement (CORRECTED)
- - `_enhance_instructions_with_context()` - Instruction enhancement (CORRECTED)
- - `consolidate_similar_memories()` - Memory consolidation (CORRECTED)

  #### **Content Processing** (`utils/service/summarizer.py`)
  - `clean_chunk_text()` - Content cleaning with reasoning
- - `qwen_summarize()` - Medium complexity summarization

  #### **Chat Operations** (`routes/chats.py`)
- - `generate_query_variations()` - Query variation generation (CORRECTED)

- ### ✅ **Hard Tasks - Gemini Pro**
- **Purpose**: Complex analysis, synthesis, long-form content
  **Current Assignments**:
  - `generate_cot_plan()` - Chain of Thought report planning
  - `analyze_subtask_comprehensive()` - Comprehensive analysis
  - `synthesize_section_analysis()` - Complex synthesis
  - `generate_final_report()` - Long-form report generation
- - All complex report generation tasks

- ## Key Corrections Made

- ### 1. **Intent Detection** (`memo/plan/intent.py`)
- - **Before**: Used Llama for simple classification
- - **After**: Uses Qwen for better reasoning about user intent
- - **Reason**: Requires understanding context and nuance

- ### 2. **Memory Selection** (`memo/plan/execution.py`)
- - **Before**: Used Llama for memory selection
- - **After**: Uses Qwen for better reasoning about relevance
- - **Reason**: Requires understanding context relationships

- ### 3. **Context Enhancement** (`memo/retrieval.py`)
- - **Before**: Used Llama for enhancement decisions
- - **After**: Uses Qwen for better reasoning about context value
- - **Reason**: Requires understanding question-context relationships

- ### 4. **Question Enhancement** (`memo/retrieval.py`)
- - **Before**: Used Llama for question enhancement
- - **After**: Uses Qwen for better reasoning about enhancement
- - **Reason**: Requires understanding conversation flow and context

- ### 5. **Memory Consolidation** (`memo/consolidation.py`)
- - **Before**: Used Llama for memory consolidation
- - **After**: Uses Qwen for better reasoning about similarity
- - **Reason**: Requires understanding content relationships
-
- ### 6. **Query Variation Generation** (`routes/chats.py`)
- - **Before**: Used Llama for query variations
- - **After**: Uses Qwen for better reasoning about variations
- - **Reason**: Requires understanding question intent and context

  ## Enhanced Model Selection Logic

- ### **Complexity Heuristics**
  ```python
- # Hard tasks (Gemini Pro)
- - Keywords: "prove", "derivation", "complexity", "algorithm", "optimize", "theorem", "rigorous", "step-by-step", "policy critique", "ambiguity", "counterfactual", "comprehensive", "detailed analysis", "synthesis", "evaluation"
- - Length: > 100 words or > 3000 context words
- - Content: "comprehensive" or "detailed" in question
-
- # Medium tasks (Qwen)
- - Keywords: "analyze", "explain", "compare", "evaluate", "summarize", "extract", "classify", "identify", "describe", "discuss", "reasoning", "context", "enhance", "select", "consolidate"
- - Length: 10-100 words or 200-3000 context words
- - Content: "reasoning" or "context" in question
-
- # Simple tasks (Llama)
- - Keywords: "what", "how", "when", "where", "who", "yes", "no", "count", "list", "find"
  - Length: ≤ 10 words or ≤ 200 context words
  ```

- ## Benefits of Corrected Assignments

  ### **Performance Improvements**
- - **Better reasoning** for medium complexity tasks with Qwen
- - **Faster execution** for simple tasks with Llama
- - **Higher quality** for complex tasks with Gemini Pro

  ### **Cost Optimization**
- - **Reduced Gemini usage** for tasks that don't need its full capabilities
  - **Better task distribution** across model capabilities
  - **Maintained efficiency** for simple tasks

  ### **Quality Improvements**
- - **Better intent detection** with Qwen's reasoning
- - **Improved memory operations** with better context understanding
- - **Enhanced search operations** with better relevance filtering
- - **More accurate content processing** with reasoning capabilities

  ## Verification Checklist

- - ✅ All easy tasks use Llama (NVIDIA small)
- - ✅ All medium tasks use Qwen
- - ✅ All hard tasks use Gemini Pro
- - ✅ Model selection logic properly categorizes tasks
  - ✅ No linting errors in modified files
  - ✅ All functions have proper fallback mechanisms
  - ✅ Error handling is maintained for all changes

  ## Configuration

- The system is ready to use with the environment variable:
  ```bash
  NVIDIA_MEDIUM=qwen/qwen3-next-80b-a3b-thinking
  ```

- All changes maintain backward compatibility and include proper error handling.
+ # Task Assignment Review - Three-Tier Model System

  ## Overview
+ This document summarizes the three-tier model selection system that optimizes API usage based on task complexity and reasoning requirements:
+ - **Easy tasks** (immediate execution, simple) → **NVIDIA Small** (Llama-8b-instruct)
+ - **Reasoning tasks** (thinking, decision-making, context selection) → **NVIDIA Medium** (Qwen-3-next-80b-a3b-thinking)
+ - **Hard/long context tasks** (content processing, analysis, generation) → **NVIDIA Large** (GPT-OSS-120b)
+ - **Very complex tasks** (research, comprehensive analysis) → **Gemini Pro**
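The tier list above boils down to a small lookup from tier label to a `{"provider", "model"}` pair, the same return shape `select_model()` uses in `utils/api/router.py`. A minimal sketch — the tier labels, the `route()` helper, and the `"nvidia"` provider string for the small tier are illustrative assumptions; the model names are the router's documented defaults:

```python
# Hypothetical tier-to-model table; model names mirror the defaults in
# utils/api/router.py, tier labels and route() are illustrative only.
TIER_TO_MODEL = {
    "simple": {"provider": "nvidia", "model": "meta/llama-3.1-8b-instruct"},
    "reasoning": {"provider": "qwen", "model": "qwen/qwen3-next-80b-a3b-thinking"},
    "hard": {"provider": "nvidia_large", "model": "openai/gpt-oss-120b"},
    "very_hard": {"provider": "gemini", "model": "gemini-2.5-pro"},
}

def route(tier: str) -> dict:
    # Unknown tiers fall back to the cheapest model
    return TIER_TO_MODEL.get(tier, TIER_TO_MODEL["simple"])
```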
+ ## Three-Tier Task Assignments

+ ### ✅ **Easy Tasks - NVIDIA Small (Llama-8b-instruct)**
  **Purpose**: Immediate execution, simple operations
  **Current Assignments**:
  - `llama_chat()` - Basic chat completion
+ - `nvidia_small_summarize()` - Simple text summarization (≤1500 chars)
  - `summarize_qa()` - Basic Q&A summarization
  - `naive_fallback()` - Simple text processing fallback

+ ### ✅ **Reasoning Tasks - NVIDIA Medium (Qwen-3-next-80b-a3b-thinking)**
+ **Purpose**: Thinking, decision-making, context selection
+ **Current Assignments**:

  #### **Memory Operations** (`memo/`)
+ - `files_relevance()` - File relevance classification with reasoning
  - `related_recent_context()` - Context selection with reasoning
+ - `_ai_intent_detection()` - User intent detection with reasoning
+ - `_ai_select_qa_memories()` - Memory selection with reasoning
+ - `_should_enhance_with_context()` - Context enhancement decision
+ - `_enhance_question_with_context()` - Question enhancement with reasoning
+ - `_enhance_instructions_with_context()` - Instruction enhancement with reasoning
+ - `consolidate_similar_memories()` - Memory consolidation with reasoning

  #### **Content Processing** (`utils/service/summarizer.py`)
  - `clean_chunk_text()` - Content cleaning with reasoning
+ - `qwen_summarize()` - Reasoning-based summarization

  #### **Chat Operations** (`routes/chats.py`)
+ - `generate_query_variations()` - Query variation generation with reasoning

+ ### ✅ **Hard/Long Context Tasks - NVIDIA Large (GPT-OSS-120b)**
+ **Purpose**: Content processing, analysis, generation, long context
+ **Current Assignments**:
+
+ #### **Search Operations** (`routes/search.py`)
+ - `extract_search_keywords()` - Keyword extraction for long queries
+ - `generate_search_strategies()` - Search strategy generation
+ - `extract_relevant_content()` - Content relevance filtering for long content
+ - `assess_content_quality()` - Quality assessment for complex content
+ - `cross_validate_information()` - Fact-checking and validation
+ - `generate_content_summary()` - Content summarization for long content
+
+ #### **Content Processing** (`utils/service/summarizer.py`)
+ - `nvidia_large_summarize()` - Long context summarization (>1500 chars)
+ - `llama_summarize()` - Flexible summarization (auto-selects model based on length)
+
+ ### ✅ **Very Complex Tasks - Gemini Pro**
+ **Purpose**: Research, comprehensive analysis, advanced reasoning
  **Current Assignments**:
  - `generate_cot_plan()` - Chain of Thought report planning
  - `analyze_subtask_comprehensive()` - Comprehensive analysis
  - `synthesize_section_analysis()` - Complex synthesis
  - `generate_final_report()` - Long-form report generation
+ - All complex report generation tasks requiring advanced reasoning

+ ## Key Improvements Made

+ ### 1. **Three-Tier Model Selection**
+ - **Before**: Two-tier system (Llama + Gemini)
+ - **After**: Four-tier system (NVIDIA Small + NVIDIA Medium + NVIDIA Large + Gemini Pro)
+ - **Reason**: Better optimization of model capabilities for different task types

+ ### 2. **Reasoning vs. Processing Separation**
+ - **Before**: Mixed reasoning and processing tasks
+ - **After**: Clear separation - Qwen for reasoning, NVIDIA Large for processing
+ - **Reason**: Qwen excels at thinking, NVIDIA Large excels at content processing

+ ### 3. **Flexible Summarization** (`utils/service/summarizer.py`)
+ - **Before**: Fixed model selection for summarization
+ - **After**: Dynamic model selection based on context length (>1500 chars → NVIDIA Large)
+ - **Reason**: Better handling of long context with appropriate model

+ ### 4. **Search Operations Optimization** (`routes/search.py`)
+ - **Before**: Used Qwen for all search operations
+ - **After**: Uses NVIDIA Large for content processing tasks
+ - **Reason**: Better handling of long content and complex analysis

+ ### 5. **Memory Operations Enhancement** (`memo/`)
+ - **Before**: Mixed model usage for memory operations
+ - **After**: Consistent use of Qwen for reasoning-based memory tasks
+ - **Reason**: Better reasoning capabilities for context selection and enhancement

  ## Enhanced Model Selection Logic

+ ### **Four-Tier Complexity Heuristics**
  ```python
+ # Very complex tasks (Gemini Pro)
+ - Keywords: "prove", "derivation", "complexity", "algorithm", "optimize", "theorem", "rigorous", "step-by-step", "policy critique", "ambiguity", "counterfactual", "comprehensive", "detailed analysis", "synthesis", "evaluation", "research", "investigation", "comprehensive study"
+ - Length: > 120 words or > 4000 context words
+ - Content: "comprehensive", "detailed", or "research" in question
+
+ # Hard/long context tasks (NVIDIA Large)
+ - Keywords: "analyze", "explain", "compare", "evaluate", "summarize", "extract", "classify", "identify", "describe", "discuss", "synthesis", "consolidate", "process", "generate", "create", "develop", "build", "construct"
+ - Length: > 50 words or > 1500 context words
+ - Content: "synthesis", "generate", or "create" in question
+
+ # Reasoning tasks (NVIDIA Medium - Qwen)
+ - Keywords: "reasoning", "context", "enhance", "select", "decide", "choose", "determine", "assess", "judge", "consider", "think", "reason", "logic", "inference", "deduction", "analysis", "interpretation"
+ - Length: > 20 words or > 800 context words
+ - Content: "enhance", "context", "select", or "decide" in question
+
+ # Simple tasks (NVIDIA Small - Llama)
+ - Keywords: "what", "how", "when", "where", "who", "yes", "no", "count", "list", "find", "search", "lookup"
  - Length: ≤ 10 words or ≤ 200 context words
  ```
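As actual code, the heuristics above are ordered keyword/length checks, most complex tier first, matching the `is_very_hard`/`is_hard`/`is_reasoning` cascade in `utils/api/router.py`. A runnable sketch — the keyword tuples are abbreviated here (the full tuples live in the router), while the word-count thresholds are copied from the block above:

```python
# Abbreviated keyword tuples; see utils/api/router.py for the full sets.
VERY_HARD = ("prove", "theorem", "comprehensive", "research", "investigation")
HARD = ("analyze", "summarize", "synthesis", "generate", "create")
REASONING = ("reasoning", "context", "enhance", "select", "decide")

def classify(question: str, context: str = "") -> str:
    q = question.lower()
    qlen, clen = len(question.split()), len(context.split())
    # Check tiers from most to least complex, as the router does
    if any(k in q for k in VERY_HARD) or qlen > 120 or clen > 4000:
        return "very_hard"   # Gemini Pro
    if any(k in q for k in HARD) or qlen > 50 or clen > 1500:
        return "hard"        # NVIDIA Large (GPT-OSS)
    if any(k in q for k in REASONING) or qlen > 20 or clen > 800:
        return "reasoning"   # NVIDIA Medium (Qwen)
    return "simple"          # NVIDIA Small (Llama)
```

Ordering matters: a question containing both "summarize" and "context" must land on the hard tier, so the more expensive tier is tested first.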

+ ### **Flexible Summarization Logic**
+ ```python
+ # Dynamic model selection for summarization
+ if len(text) > 1500:
+     use_nvidia_large()  # Better for long context
+ else:
+     use_nvidia_small()  # Cost-effective for short text
+ ```
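The pseudocode above can be made concrete as a tiny dispatcher. A sketch under stated assumptions: `pick_summarizer()` is a hypothetical helper (the real dispatch sits inside `llama_summarize()` in `utils/service/summarizer.py`), and only the 1500-character threshold is taken from the document:

```python
# Hypothetical helper mirroring the length-based dispatch described above.
THRESHOLD_CHARS = 1500  # threshold documented for NVIDIA Large hand-off

def pick_summarizer(text: str) -> str:
    """Return which model tier would summarize this text."""
    return "nvidia_large" if len(text) > THRESHOLD_CHARS else "nvidia_small"
```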
+
+ ## Benefits of Three-Tier System

  ### **Performance Improvements**
+ - **Better reasoning** for thinking tasks with Qwen's thinking mode
+ - **Enhanced processing** for long context with NVIDIA Large
+ - **Faster execution** for simple tasks with NVIDIA Small
+ - **Higher quality** for very complex tasks with Gemini Pro

  ### **Cost Optimization**
+ - **Reduced Gemini usage** for tasks that don't need advanced reasoning
  - **Better task distribution** across model capabilities
+ - **Flexible summarization** using appropriate models based on context length
  - **Maintained efficiency** for simple tasks

  ### **Quality Improvements**
+ - **Better reasoning capabilities** with Qwen for decision-making tasks
+ - **Improved content processing** with NVIDIA Large for long context
+ - **Enhanced memory operations** with better context understanding
+ - **More accurate search operations** with specialized models
+ - **Dynamic model selection** for optimal performance

  ## Verification Checklist

+ - ✅ All easy tasks use NVIDIA Small (Llama-8b-instruct)
+ - ✅ All reasoning tasks use NVIDIA Medium (Qwen-3-next-80b-a3b-thinking)
+ - ✅ All hard/long context tasks use NVIDIA Large (GPT-OSS-120b)
+ - ✅ All very complex tasks use Gemini Pro
+ - ✅ Flexible summarization implemented with dynamic model selection
+ - ✅ Model selection logic properly categorizes tasks by complexity and reasoning requirements
  - ✅ No linting errors in modified files
  - ✅ All functions have proper fallback mechanisms
  - ✅ Error handling is maintained for all changes

  ## Configuration

+ The system is ready to use with the environment variables:
  ```bash
+ NVIDIA_SMALL=meta/llama-3.1-8b-instruct
  NVIDIA_MEDIUM=qwen/qwen3-next-80b-a3b-thinking
+ NVIDIA_LARGE=openai/gpt-oss-120b
  ```
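These variables are read at import time in `utils/api/router.py` via `os.getenv` with the same strings as defaults, so the system still runs when they are unset:

```python
import os

# Same defaults as the env block above, so unset variables still resolve
NVIDIA_SMALL = os.getenv("NVIDIA_SMALL", "meta/llama-3.1-8b-instruct")
NVIDIA_MEDIUM = os.getenv("NVIDIA_MEDIUM", "qwen/qwen3-next-80b-a3b-thinking")
NVIDIA_LARGE = os.getenv("NVIDIA_LARGE", "openai/gpt-oss-120b")
```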

+ All changes maintain backward compatibility and include proper error handling with fallback mechanisms.
routes/search.py CHANGED
@@ -2,12 +2,12 @@
  import re, asyncio, time, json
  from typing import List, Dict, Any, Tuple
  from helpers.setup import logger, embedder, gemini_rotator, nvidia_rotator
- from utils.api.router import select_model, generate_answer_with_model, qwen_chat_completion
  from utils.service.summarizer import llama_summarize


  async def extract_search_keywords(user_query: str, nvidia_rotator) -> List[str]:
-     """Extract intelligent search keywords from user query using Qwen agent."""
      if not nvidia_rotator:
          # Fallback: simple keyword extraction
          words = re.findall(r'\b\w+\b', user_query.lower())
@@ -28,8 +28,8 @@ Return only the keywords, separated by spaces, no other text."""

      user_prompt = f"User query: {user_query}\n\nExtract search keywords:"

-     # Use Qwen for better keyword extraction
-     response = await qwen_chat_completion(sys_prompt, user_prompt, nvidia_rotator)

      keywords = [kw.strip() for kw in response.split() if kw.strip()]
      return keywords[:5] if keywords else [user_query]
@@ -64,8 +64,8 @@ Return as JSON array of objects."""

      user_prompt = f"User query: {user_query}\n\nGenerate search strategies:"

-     # Use Qwen for better strategy generation
-     response = await qwen_chat_completion(sys_prompt, user_prompt, nvidia_rotator)

      try:
          strategies = json.loads(response)
@@ -290,7 +290,7 @@ async def fetch_and_process_content(url: str, title: str, user_query: str, nvidi


  async def extract_relevant_content(content: str, user_query: str, nvidia_rotator) -> str:
-     """Use Qwen agent to extract only the content relevant to the user query."""
      if not nvidia_rotator:
          # Fallback: return first 2000 chars
          return content[:2000]
@@ -324,8 +324,8 @@ Return only the relevant content, no additional commentary."""

      user_prompt = f"User Query: {user_query}\n\nWeb Content:\n{content}\n\nExtract relevant information:"

-     # Use Qwen for better content extraction
-     response = await qwen_chat_completion(sys_prompt, user_prompt, nvidia_rotator)

      return response.strip() if response.strip() else ""

@@ -335,7 +335,7 @@ Return only the relevant content, no additional commentary."""


  async def assess_content_quality(content: str, nvidia_rotator) -> Dict[str, Any]:
-     """Assess content quality using Qwen agent."""
      if not nvidia_rotator or not content:
          return {"quality_score": 0.5, "issues": [], "strengths": []}

@@ -349,8 +349,8 @@ Consider: accuracy, completeness, clarity, authority, recency, bias, factual cla

      user_prompt = f"Assess this content quality:\n\n{content[:2000]}"

-     # Use Qwen for better quality assessment
-     response = await qwen_chat_completion(sys_prompt, user_prompt, nvidia_rotator)

      try:
          # Try to parse JSON response
@@ -413,8 +413,8 @@ Focus on factual claims, statistics, and verifiable information."""

      user_prompt = f"Main content:\n{content[:1000]}\n\nOther sources:\n{comparison_text[:2000]}\n\nAnalyze consistency:"

-     # Use Qwen for better cross-validation
-     response = await qwen_chat_completion(sys_prompt, user_prompt, nvidia_rotator)

      try:
          validation = json.loads(response)
@@ -480,8 +480,8 @@ Be clear and direct."""

      user_prompt = f"Summarize this content:\n\n{content}"

-     # Use Qwen for better summarization
-     response = await qwen_chat_completion(sys_prompt, user_prompt, nvidia_rotator)

      return response.strip() if response.strip() else content[:200] + "..."

  import re, asyncio, time, json
  from typing import List, Dict, Any, Tuple
  from helpers.setup import logger, embedder, gemini_rotator, nvidia_rotator
+ from utils.api.router import select_model, generate_answer_with_model, qwen_chat_completion, nvidia_large_chat_completion
  from utils.service.summarizer import llama_summarize


  async def extract_search_keywords(user_query: str, nvidia_rotator) -> List[str]:
+     """Extract intelligent search keywords from user query using NVIDIA Large agent."""
      if not nvidia_rotator:
          # Fallback: simple keyword extraction
          words = re.findall(r'\b\w+\b', user_query.lower())

      user_prompt = f"User query: {user_query}\n\nExtract search keywords:"

+     # Use NVIDIA Large for better keyword extraction
+     response = await nvidia_large_chat_completion(sys_prompt, user_prompt, nvidia_rotator)

      keywords = [kw.strip() for kw in response.split() if kw.strip()]
      return keywords[:5] if keywords else [user_query]

      user_prompt = f"User query: {user_query}\n\nGenerate search strategies:"

+     # Use NVIDIA Large for better strategy generation
+     response = await nvidia_large_chat_completion(sys_prompt, user_prompt, nvidia_rotator)

      try:
          strategies = json.loads(response)


  async def extract_relevant_content(content: str, user_query: str, nvidia_rotator) -> str:
+     """Use NVIDIA Large agent to extract only the content relevant to the user query."""
      if not nvidia_rotator:
          # Fallback: return first 2000 chars
          return content[:2000]

      user_prompt = f"User Query: {user_query}\n\nWeb Content:\n{content}\n\nExtract relevant information:"

+     # Use NVIDIA Large for better content extraction
+     response = await nvidia_large_chat_completion(sys_prompt, user_prompt, nvidia_rotator)

      return response.strip() if response.strip() else ""


  async def assess_content_quality(content: str, nvidia_rotator) -> Dict[str, Any]:
+     """Assess content quality using NVIDIA Large agent."""
      if not nvidia_rotator or not content:
          return {"quality_score": 0.5, "issues": [], "strengths": []}

      user_prompt = f"Assess this content quality:\n\n{content[:2000]}"

+     # Use NVIDIA Large for better quality assessment
+     response = await nvidia_large_chat_completion(sys_prompt, user_prompt, nvidia_rotator)

      try:
          # Try to parse JSON response

      user_prompt = f"Main content:\n{content[:1000]}\n\nOther sources:\n{comparison_text[:2000]}\n\nAnalyze consistency:"

+     # Use NVIDIA Large for better cross-validation
+     response = await nvidia_large_chat_completion(sys_prompt, user_prompt, nvidia_rotator)

      try:
          validation = json.loads(response)

      user_prompt = f"Summarize this content:\n\n{content}"

+     # Use NVIDIA Large for better summarization
+     response = await nvidia_large_chat_completion(sys_prompt, user_prompt, nvidia_rotator)

      return response.strip() if response.strip() else content[:200] + "..."
utils/api/router.py CHANGED
@@ -11,44 +11,61 @@ GEMINI_SMALL = os.getenv("GEMINI_SMALL", "gemini-2.5-flash-lite")
11
  GEMINI_MED = os.getenv("GEMINI_MED", "gemini-2.5-flash")
12
  GEMINI_PRO = os.getenv("GEMINI_PRO", "gemini-2.5-pro")
13
 
14
- # NVIDIA small default (can be override)
15
  NVIDIA_SMALL = os.getenv("NVIDIA_SMALL", "meta/llama-3.1-8b-instruct") # Llama model for easy complexity tasks
16
- NVIDIA_MEDIUM = os.getenv("NVIDIA_MEDIUM", "qwen/qwen3-next-80b-a3b-thinking") # Qwen model for medium complexity tasks
 
17
 
18
  def select_model(question: str, context: str) -> Dict[str, Any]:
19
  """
20
- Enhanced complexity heuristic with proper model hierarchy:
21
  - Easy tasks (immediate execution, simple) -> Llama (NVIDIA small)
22
- - Medium tasks (accurate, reasoning, not too time-consuming) -> Qwen
23
- - Hard tasks (complex analysis, synthesis, long-form) -> Gemini Pro
 
24
  """
25
  qlen = len(question.split())
26
  clen = len(context.split())
27
 
28
- # Hard task keywords - require complex reasoning and analysis
29
- hard_keywords = ("prove", "derivation", "complexity", "algorithm", "optimize", "theorem", "rigorous", "step-by-step", "policy critique", "ambiguity", "counterfactual", "comprehensive", "detailed analysis", "synthesis", "evaluation")
30
 
31
- # Medium task keywords - require reasoning but not too complex
32
- medium_keywords = ("analyze", "explain", "compare", "evaluate", "summarize", "extract", "classify", "identify", "describe", "discuss", "reasoning", "context", "enhance", "select", "consolidate")
 
 
 
33
 
34
  # Simple task keywords - immediate execution
35
- simple_keywords = ("what", "how", "when", "where", "who", "yes", "no", "count", "list", "find")
36
 
37
  # Determine complexity level
38
  is_very_hard = (
39
- any(k in question.lower() for k in hard_keywords) or
40
- qlen > 100 or
41
- clen > 3000 or
42
  "comprehensive" in question.lower() or
43
- "detailed" in question.lower()
 
 
 
 
 
 
 
 
 
 
44
  )
45
 
46
- is_medium = (
47
- any(k in question.lower() for k in medium_keywords) or
48
- (qlen > 10 and qlen <= 100) or
49
- (clen > 200 and clen <= 3000) or
50
- "reasoning" in question.lower() or
51
- "context" in question.lower()
 
 
52
  )
53
 
54
  is_simple = (
@@ -60,8 +77,11 @@ def select_model(question: str, context: str) -> Dict[str, Any]:
60
  if is_very_hard:
61
  # Use Gemini Pro for very complex tasks requiring advanced reasoning
62
  return {"provider": "gemini", "model": GEMINI_PRO}
63
- elif is_medium:
64
- # Use Qwen for medium complexity tasks requiring reasoning but not too time-consuming
 
 
 
65
  return {"provider": "qwen", "model": NVIDIA_MEDIUM}
66
  else:
67
  # Use NVIDIA small (Llama) for simple tasks requiring immediate execution
@@ -125,8 +145,11 @@ async def generate_answer_with_model(selection: Dict[str, Any], system_prompt: s
125
  return "I couldn't parse the model response."
126
 
127
  elif provider == "qwen":
128
- # Use Qwen for medium complexity tasks
129
  return await qwen_chat_completion(system_prompt, user_prompt, nvidia_rotator)
 
 
 
130
 
131
  return "Unsupported provider."
132
 
@@ -206,4 +229,82 @@ async def qwen_chat_completion(system_prompt: str, user_prompt: str, nvidia_rota
206
 
207
  except Exception as e:
208
  logger.warning(f"Qwen API error: {e}")
209
- return "I couldn't process the request with Qwen model."
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
11
  GEMINI_MED = os.getenv("GEMINI_MED", "gemini-2.5-flash")
12
  GEMINI_PRO = os.getenv("GEMINI_PRO", "gemini-2.5-pro")
13
 
14
+ # NVIDIA model hierarchy (can be overridden via env)
15
  NVIDIA_SMALL = os.getenv("NVIDIA_SMALL", "meta/llama-3.1-8b-instruct") # Llama model for easy complexity tasks
16
+ NVIDIA_MEDIUM = os.getenv("NVIDIA_MEDIUM", "qwen/qwen3-next-80b-a3b-thinking") # Qwen model for reasoning tasks
17
+ NVIDIA_LARGE = os.getenv("NVIDIA_LARGE", "openai/gpt-oss-120b") # GPT-OSS model for hard/long context tasks
18
 
19
  def select_model(question: str, context: str) -> Dict[str, Any]:
20
  """
21
+ Enhanced three-tier model selection system:
22
  - Easy tasks (immediate execution, simple) -> Llama (NVIDIA small)
23
+ - Reasoning tasks (analysis, decision-making, JSON parsing) -> Qwen (NVIDIA medium)
24
+ - Hard/long context tasks (complex synthesis, long-form) -> GPT-OSS (NVIDIA large)
25
+ - Very complex tasks (research, comprehensive analysis) -> Gemini Pro
26
  """
27
  qlen = len(question.split())
28
  clen = len(context.split())
29
 
30
+ # Very hard task keywords - require Gemini Pro (research, comprehensive analysis)
31
+ very_hard_keywords = ("prove", "derivation", "complexity", "algorithm", "optimize", "theorem", "rigorous", "step-by-step", "policy critique", "ambiguity", "counterfactual", "comprehensive", "detailed analysis", "synthesis", "evaluation", "research", "investigation", "comprehensive study")
32
 
33
+ # Hard/long context keywords - require NVIDIA Large (GPT-OSS)
34
+ hard_keywords = ("analyze", "explain", "compare", "evaluate", "summarize", "extract", "classify", "identify", "describe", "discuss", "synthesis", "consolidate", "process", "generate", "create", "develop", "build", "construct")
35
+
36
+ # Reasoning task keywords - require Qwen (thinking/reasoning)
37
+ reasoning_keywords = ("reasoning", "context", "enhance", "select", "decide", "choose", "determine", "assess", "judge", "consider", "think", "reason", "logic", "inference", "deduction", "analysis", "interpretation")
38
 
39
  # Simple task keywords - immediate execution
40
+ simple_keywords = ("what", "how", "when", "where", "who", "yes", "no", "count", "list", "find", "search", "lookup")
41
 
42
  # Determine complexity level
43
  is_very_hard = (
44
+ any(k in question.lower() for k in very_hard_keywords) or
45
+ qlen > 120 or
46
+ clen > 4000 or
47
  "comprehensive" in question.lower() or
48
+ "detailed" in question.lower() or
49
+ "research" in question.lower()
50
+ )
51
+
52
+ is_hard = (
53
+ any(k in question.lower() for k in hard_keywords) or
54
+ qlen > 50 or
55
+ clen > 1500 or
56
+ "synthesis" in question.lower() or
57
+ "generate" in question.lower() or
58
+ "create" in question.lower()
59
  )
60
 
61
+ is_reasoning = (
62
+ any(k in question.lower() for k in reasoning_keywords) or
63
+ qlen > 20 or
64
+ clen > 800 or
65
+ "enhance" in question.lower() or
66
+ "context" in question.lower() or
67
+ "select" in question.lower() or
68
+ "decide" in question.lower()
69
  )
70
 
71
  is_simple = (
 
77
  if is_very_hard:
78
  # Use Gemini Pro for very complex tasks requiring advanced reasoning
79
  return {"provider": "gemini", "model": GEMINI_PRO}
80
+ elif is_hard:
81
+ # Use NVIDIA Large (GPT-OSS) for hard/long context tasks
82
+ return {"provider": "nvidia_large", "model": NVIDIA_LARGE}
83
+ elif is_reasoning:
84
+ # Use Qwen for reasoning tasks requiring thinking
85
  return {"provider": "qwen", "model": NVIDIA_MEDIUM}
86
     else:
         # Use NVIDIA small (Llama) for simple tasks requiring immediate execution
 ...
         return "I couldn't parse the model response."

     elif provider == "qwen":
+        # Use Qwen for reasoning tasks
         return await qwen_chat_completion(system_prompt, user_prompt, nvidia_rotator)
+    elif provider == "nvidia_large":
+        # Use NVIDIA Large (GPT-OSS) for hard/long context tasks
+        return await nvidia_large_chat_completion(system_prompt, user_prompt, nvidia_rotator)

     return "Unsupported provider."

 ...
     except Exception as e:
         logger.warning(f"Qwen API error: {e}")
+        return "I couldn't process the request with Qwen model."
+
+
+async def nvidia_large_chat_completion(system_prompt: str, user_prompt: str, nvidia_rotator: APIKeyRotator) -> str:
+    """
+    NVIDIA Large (GPT-OSS) chat completion for hard/long context tasks.
+    Uses the NVIDIA API rotator for key management.
+    """
+    key = nvidia_rotator.get_key() or ""
+    url = "https://integrate.api.nvidia.com/v1/chat/completions"
+
+    payload = {
+        "model": NVIDIA_LARGE,
+        "messages": [
+            {"role": "system", "content": system_prompt},
+            {"role": "user", "content": user_prompt}
+        ],
+        "temperature": 1.0,
+        "top_p": 1.0,
+        "max_tokens": 4096,
+        "stream": True
+    }
+
+    headers = {"Content-Type": "application/json", "Authorization": f"Bearer {key}"}
+
+    logger.info(f"[NVIDIA_LARGE] API call - Model: {NVIDIA_LARGE}, Key present: {bool(key)}")
+    logger.info(f"[NVIDIA_LARGE] System prompt length: {len(system_prompt)}, User prompt length: {len(user_prompt)}")
+
+    try:
+        # For streaming, we need to handle the response differently
+        import httpx
+        async with httpx.AsyncClient(timeout=60) as client:
+            response = await client.post(url, headers=headers, json=payload)
+
+            if response.status_code in (401, 403, 429) or (500 <= response.status_code < 600):
+                logger.warning(f"HTTP {response.status_code} from NVIDIA Large provider. Rotating key and retrying")
+                nvidia_rotator.rotate()
+                # Retry once with new key
+                key = nvidia_rotator.get_key() or ""
+                headers = {"Content-Type": "application/json", "Authorization": f"Bearer {key}"}
+                response = await client.post(url, headers=headers, json=payload)
+
+            response.raise_for_status()
+
+            # Handle streaming response
+            content = ""
+            async for line in response.aiter_lines():
+                if line.startswith("data: "):
+                    data = line[6:]  # Remove "data: " prefix
+                    if data.strip() == "[DONE]":
+                        break
+
+                    try:
+                        import json
+                        chunk_data = json.loads(data)
+                        if "choices" in chunk_data and len(chunk_data["choices"]) > 0:
+                            delta = chunk_data["choices"][0].get("delta", {})
+
+                            # Handle reasoning content (thinking)
+                            reasoning = delta.get("reasoning_content")
+                            if reasoning:
+                                logger.debug(f"[NVIDIA_LARGE] Reasoning: {reasoning}")
+
+                            # Handle regular content
+                            chunk_content = delta.get("content")
+                            if chunk_content:
+                                content += chunk_content
+                    except json.JSONDecodeError:
+                        continue
+
+            if not content or content.strip() == "":
+                logger.warning("Empty content from NVIDIA Large model")
+                return "I received an empty response from the model."
+
+            return content.strip()
+
+    except Exception as e:
+        logger.warning(f"NVIDIA Large API error: {e}")
+        return "I couldn't process the request with NVIDIA Large model."
utils/service/summarizer.py CHANGED
@@ -3,7 +3,7 @@ import asyncio
 from typing import List
 from utils.logger import get_logger
 from utils.api.rotator import robust_post_json, APIKeyRotator
-from utils.api.router import qwen_chat_completion
+from utils.api.router import qwen_chat_completion, nvidia_large_chat_completion

 logger = get_logger("SUM", __name__)

@@ -25,9 +25,21 @@ async def llama_chat(messages, temperature: float = 0.2) -> str:


 async def llama_summarize(text: str, max_sentences: int = 3) -> str:
+    """Flexible summarization using NVIDIA Small (Llama) for short text, NVIDIA Large for long context."""
     text = (text or "").strip()
     if not text:
         return ""
+
+    # Use NVIDIA Large for long context (>1500 chars), NVIDIA Small for short context
+    if len(text) > 1500:
+        logger.info(f"[SUMMARIZER] Using NVIDIA Large for long context ({len(text)} chars)")
+        return await nvidia_large_summarize(text, max_sentences)
+    else:
+        logger.info(f"[SUMMARIZER] Using NVIDIA Small for short context ({len(text)} chars)")
+        return await nvidia_small_summarize(text, max_sentences)
+
+async def nvidia_small_summarize(text: str, max_sentences: int = 3) -> str:
+    """Summarization using NVIDIA Small (Llama) for short text."""
     system = (
         "You are a precise summarizer. Produce a clear, faithful summary of the user's text. "
         f"Return ~{max_sentences} sentences, no comments, no preface, no markdown."
@@ -39,7 +51,20 @@ async def llama_summarize(text: str, max_sentences: int = 3) -> str:
             {"role": "user", "content": user},
         ])
     except Exception as e:
-        logger.warning(f"LLAMA summarization failed: {e}; using fallback")
+        logger.warning(f"NVIDIA Small summarization failed: {e}; using fallback")
+        return naive_fallback(text, max_sentences)
+
+async def nvidia_large_summarize(text: str, max_sentences: int = 3) -> str:
+    """Summarization using NVIDIA Large (GPT-OSS) for long context."""
+    system = (
+        "You are a precise summarizer. Produce a clear, faithful summary of the user's text. "
+        f"Return ~{max_sentences} sentences, no comments, no preface, no markdown."
+    )
+    user = f"Summarize this text:\n\n{text}"
+    try:
+        return await nvidia_large_chat_completion(system, user, ROTATOR)
+    except Exception as e:
+        logger.warning(f"NVIDIA Large summarization failed: {e}; using fallback")
         return naive_fallback(text, max_sentences)


@@ -49,11 +74,12 @@ def naive_fallback(text: str, max_sentences: int = 3) -> str:


 async def summarize_text(text: str, max_sentences: int = 6, chunk_size: int = 2500) -> str:
-    """Hierarchical summarization for long texts using NVIDIA Llama."""
+    """Hierarchical summarization for long texts using flexible model selection."""
     if not text:
         return ""
     if len(text) <= chunk_size:
         return await llama_summarize(text, max_sentences=max_sentences)
+
     # Split into chunks on paragraph boundaries if possible
     paragraphs = text.split('\n\n')
     chunks: List[str] = []
@@ -68,10 +94,13 @@ async def summarize_text(text: str, max_sentences: int = 6, chunk_size: int = 25
     if buf:
         chunks.append('\n\n'.join(buf))

+    # Process chunks with flexible model selection
     partials = []
     for ch in chunks:
         partials.append(await llama_summarize(ch, max_sentences=3))
         await asyncio.sleep(0)
+
+    # Combine and summarize with flexible model selection
     combined = '\n'.join(partials)
     return await llama_summarize(combined, max_sentences=max_sentences)

@@ -115,4 +144,5 @@ async def qwen_summarize(text: str, max_sentences: int = 3) -> str:

 # Backward-compatible name used by app.py
 async def cheap_summarize(text: str, max_sentences: int = 3) -> str:
+    """Backward-compatible summarization with flexible model selection."""
     return await llama_summarize(text, max_sentences)
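The dispatch this commit adds to `llama_summarize` is a simple length threshold. A minimal sketch of that rule in isolation (the function name `pick_summarizer` is hypothetical; the 1500-character cutoff mirrors the diff above):

```python
THRESHOLD = 1500  # chars; mirrors the cutoff in llama_summarize above

def pick_summarizer(text: str) -> str:
    """Return which model tier the length-based dispatch above would choose."""
    return "nvidia_large" if len(text) > THRESHOLD else "nvidia_small"

print(pick_summarizer("short note"))   # -> nvidia_small
print(pick_summarizer("x" * 2000))     # -> nvidia_large
```

One consequence of routing by raw character count: in `summarize_text`, chunks capped at `chunk_size=2500` will still exceed the 1500-char threshold, so chunk-level summaries go to NVIDIA Large while the final pass over the shorter combined partials may fall back to NVIDIA Small.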