LiamKhoaLe committed on
Commit 8db88dd · 1 Parent(s): 4386026

Upd Large model for longcontext

Files changed (4):
  1. AGENT_ASNM.md +110 -85
  2. routes/search.py +16 -16
  3. utils/api/router.py +125 -24
  4. utils/service/summarizer.py +33 -3
AGENT_ASNM.md CHANGED
@@ -1,143 +1,168 @@
- # Task Assignment Review - Corrected Model Hierarchy

  ## Overview
- This document summarizes the corrected task assignments to ensure proper model hierarchy:
- - **Easy tasks** (immediate execution, simple) → **Llama** (NVIDIA small)
- - **Medium tasks** (accurate, reasoning, not too time-consuming) → **Qwen**
- - **Hard tasks** (complex analysis, synthesis, long-form) → **Gemini Pro**

- ## Corrected Task Assignments

- ### ✅ **Easy Tasks - Llama (NVIDIA Small)**
  **Purpose**: Immediate execution, simple operations
  **Current Assignments**:
  - `llama_chat()` - Basic chat completion
- - `llama_summarize()` - Simple text summarization
  - `summarize_qa()` - Basic Q&A summarization
  - `naive_fallback()` - Simple text processing fallback

- ### ✅ **Medium Tasks - Qwen**
- **Purpose**: Accurate reasoning, not too time-consuming
- **Corrected Assignments**:
-
- #### **Search Operations** (`routes/search.py`)
- - `extract_search_keywords()` - Keyword extraction with reasoning
- - `generate_search_strategies()` - Search strategy generation
- - `extract_relevant_content()` - Content relevance filtering
- - `assess_content_quality()` - Quality assessment with reasoning
- - `cross_validate_information()` - Fact-checking and validation
- - `generate_content_summary()` - Content summarization

  #### **Memory Operations** (`memo/`)
- - `files_relevance()` - File relevance classification
  - `related_recent_context()` - Context selection with reasoning
- - `_ai_intent_detection()` - User intent detection (CORRECTED)
- - `_ai_select_qa_memories()` - Memory selection with reasoning (CORRECTED)
- - `_should_enhance_with_context()` - Context enhancement decision (CORRECTED)
- - `_enhance_question_with_context()` - Question enhancement (CORRECTED)
- - `_enhance_instructions_with_context()` - Instruction enhancement (CORRECTED)
- - `consolidate_similar_memories()` - Memory consolidation (CORRECTED)

  #### **Content Processing** (`utils/service/summarizer.py`)
  - `clean_chunk_text()` - Content cleaning with reasoning
- - `qwen_summarize()` - Medium complexity summarization

  #### **Chat Operations** (`routes/chats.py`)
- - `generate_query_variations()` - Query variation generation (CORRECTED)

- ### ✅ **Hard Tasks - Gemini Pro**
- **Purpose**: Complex analysis, synthesis, long-form content
  **Current Assignments**:
  - `generate_cot_plan()` - Chain of Thought report planning
  - `analyze_subtask_comprehensive()` - Comprehensive analysis
  - `synthesize_section_analysis()` - Complex synthesis
  - `generate_final_report()` - Long-form report generation
- - All complex report generation tasks

- ## Key Corrections Made

- ### 1. **Intent Detection** (`memo/plan/intent.py`)
- - **Before**: Used Llama for simple classification
- - **After**: Uses Qwen for better reasoning about user intent
- - **Reason**: Requires understanding context and nuance

- ### 2. **Memory Selection** (`memo/plan/execution.py`)
- - **Before**: Used Llama for memory selection
- - **After**: Uses Qwen for better reasoning about relevance
- - **Reason**: Requires understanding context relationships

- ### 3. **Context Enhancement** (`memo/retrieval.py`)
- - **Before**: Used Llama for enhancement decisions
- - **After**: Uses Qwen for better reasoning about context value
- - **Reason**: Requires understanding question-context relationships

- ### 4. **Question Enhancement** (`memo/retrieval.py`)
- - **Before**: Used Llama for question enhancement
- - **After**: Uses Qwen for better reasoning about enhancement
- - **Reason**: Requires understanding conversation flow and context

- ### 5. **Memory Consolidation** (`memo/consolidation.py`)
- - **Before**: Used Llama for memory consolidation
- - **After**: Uses Qwen for better reasoning about similarity
- - **Reason**: Requires understanding content relationships
-
- ### 6. **Query Variation Generation** (`routes/chats.py`)
- - **Before**: Used Llama for query variations
- - **After**: Uses Qwen for better reasoning about variations
- - **Reason**: Requires understanding question intent and context

  ## Enhanced Model Selection Logic

- ### **Complexity Heuristics**
  ```python
- # Hard tasks (Gemini Pro)
- - Keywords: "prove", "derivation", "complexity", "algorithm", "optimize", "theorem", "rigorous", "step-by-step", "policy critique", "ambiguity", "counterfactual", "comprehensive", "detailed analysis", "synthesis", "evaluation"
- - Length: > 100 words or > 3000 context words
- - Content: "comprehensive" or "detailed" in question
-
- # Medium tasks (Qwen)
- - Keywords: "analyze", "explain", "compare", "evaluate", "summarize", "extract", "classify", "identify", "describe", "discuss", "reasoning", "context", "enhance", "select", "consolidate"
- - Length: 10-100 words or 200-3000 context words
- - Content: "reasoning" or "context" in question
-
- # Simple tasks (Llama)
- - Keywords: "what", "how", "when", "where", "who", "yes", "no", "count", "list", "find"
  - Length: ≤ 10 words or ≤ 200 context words
  ```

- ## Benefits of Corrected Assignments

  ### **Performance Improvements**
- - **Better reasoning** for medium complexity tasks with Qwen
- - **Faster execution** for simple tasks with Llama
- - **Higher quality** for complex tasks with Gemini Pro

  ### **Cost Optimization**
- - **Reduced Gemini usage** for tasks that don't need its full capabilities
  - **Better task distribution** across model capabilities
  - **Maintained efficiency** for simple tasks

  ### **Quality Improvements**
- - **Better intent detection** with Qwen's reasoning
- - **Improved memory operations** with better context understanding
- - **Enhanced search operations** with better relevance filtering
- - **More accurate content processing** with reasoning capabilities

  ## Verification Checklist

- - ✅ All easy tasks use Llama (NVIDIA small)
- - ✅ All medium tasks use Qwen
- - ✅ All hard tasks use Gemini Pro
- - ✅ Model selection logic properly categorizes tasks
  - ✅ No linting errors in modified files
  - ✅ All functions have proper fallback mechanisms
  - ✅ Error handling is maintained for all changes

  ## Configuration

- The system is ready to use with the environment variable:
  ```bash
  NVIDIA_MEDIUM=qwen/qwen3-next-80b-a3b-thinking
  ```

- All changes maintain backward compatibility and include proper error handling.
+ # Task Assignment Review - Three-Tier Model System

  ## Overview
+ This document summarizes the three-tier model selection system that optimizes API usage based on task complexity and reasoning requirements:
+ - **Easy tasks** (immediate execution, simple) → **NVIDIA Small** (Llama-8b-instruct)
+ - **Reasoning tasks** (thinking, decision-making, context selection) → **NVIDIA Medium** (Qwen-3-next-80b-a3b-thinking)
+ - **Hard/long context tasks** (content processing, analysis, generation) → **NVIDIA Large** (GPT-OSS-120b)
+ - **Very complex tasks** (research, comprehensive analysis) → **Gemini Pro**
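The tier list above boils down to a small lookup from tier label to a `{"provider", "model"}` pair, the same return shape `select_model()` uses in `utils/api/router.py`. A minimal sketch — the tier labels, the `route()` helper, and the `"nvidia"` provider string for the small tier are illustrative assumptions; the model names are the router's documented defaults:

```python
# Hypothetical tier-to-model table; model names mirror the defaults in
# utils/api/router.py, tier labels and route() are illustrative only.
TIER_TO_MODEL = {
    "simple": {"provider": "nvidia", "model": "meta/llama-3.1-8b-instruct"},
    "reasoning": {"provider": "qwen", "model": "qwen/qwen3-next-80b-a3b-thinking"},
    "hard": {"provider": "nvidia_large", "model": "openai/gpt-oss-120b"},
    "very_hard": {"provider": "gemini", "model": "gemini-2.5-pro"},
}

def route(tier: str) -> dict:
    # Unknown tiers fall back to the cheapest model
    return TIER_TO_MODEL.get(tier, TIER_TO_MODEL["simple"])
```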
+ ## Three-Tier Task Assignments

+ ### ✅ **Easy Tasks - NVIDIA Small (Llama-8b-instruct)**
  **Purpose**: Immediate execution, simple operations
  **Current Assignments**:
  - `llama_chat()` - Basic chat completion
+ - `nvidia_small_summarize()` - Simple text summarization (≤1500 chars)
  - `summarize_qa()` - Basic Q&A summarization
  - `naive_fallback()` - Simple text processing fallback

+ ### ✅ **Reasoning Tasks - NVIDIA Medium (Qwen-3-next-80b-a3b-thinking)**
+ **Purpose**: Thinking, decision-making, context selection
+ **Current Assignments**:

  #### **Memory Operations** (`memo/`)
+ - `files_relevance()` - File relevance classification with reasoning
  - `related_recent_context()` - Context selection with reasoning
+ - `_ai_intent_detection()` - User intent detection with reasoning
+ - `_ai_select_qa_memories()` - Memory selection with reasoning
+ - `_should_enhance_with_context()` - Context enhancement decision
+ - `_enhance_question_with_context()` - Question enhancement with reasoning
+ - `_enhance_instructions_with_context()` - Instruction enhancement with reasoning
+ - `consolidate_similar_memories()` - Memory consolidation with reasoning

  #### **Content Processing** (`utils/service/summarizer.py`)
  - `clean_chunk_text()` - Content cleaning with reasoning
+ - `qwen_summarize()` - Reasoning-based summarization

  #### **Chat Operations** (`routes/chats.py`)
+ - `generate_query_variations()` - Query variation generation with reasoning

+ ### ✅ **Hard/Long Context Tasks - NVIDIA Large (GPT-OSS-120b)**
+ **Purpose**: Content processing, analysis, generation, long context
+ **Current Assignments**:
+
+ #### **Search Operations** (`routes/search.py`)
+ - `extract_search_keywords()` - Keyword extraction for long queries
+ - `generate_search_strategies()` - Search strategy generation
+ - `extract_relevant_content()` - Content relevance filtering for long content
+ - `assess_content_quality()` - Quality assessment for complex content
+ - `cross_validate_information()` - Fact-checking and validation
+ - `generate_content_summary()` - Content summarization for long content
+
+ #### **Content Processing** (`utils/service/summarizer.py`)
+ - `nvidia_large_summarize()` - Long context summarization (>1500 chars)
+ - `llama_summarize()` - Flexible summarization (auto-selects model based on length)
+
+ ### ✅ **Very Complex Tasks - Gemini Pro**
+ **Purpose**: Research, comprehensive analysis, advanced reasoning
  **Current Assignments**:
  - `generate_cot_plan()` - Chain of Thought report planning
  - `analyze_subtask_comprehensive()` - Comprehensive analysis
  - `synthesize_section_analysis()` - Complex synthesis
  - `generate_final_report()` - Long-form report generation
+ - All complex report generation tasks requiring advanced reasoning

+ ## Key Improvements Made

+ ### 1. **Three-Tier Model Selection**
+ - **Before**: Two-tier system (Llama + Gemini)
+ - **After**: Four-tier system (NVIDIA Small + NVIDIA Medium + NVIDIA Large + Gemini Pro)
+ - **Reason**: Better optimization of model capabilities for different task types

+ ### 2. **Reasoning vs. Processing Separation**
+ - **Before**: Mixed reasoning and processing tasks
+ - **After**: Clear separation - Qwen for reasoning, NVIDIA Large for processing
+ - **Reason**: Qwen excels at thinking, NVIDIA Large excels at content processing

+ ### 3. **Flexible Summarization** (`utils/service/summarizer.py`)
+ - **Before**: Fixed model selection for summarization
+ - **After**: Dynamic model selection based on context length (>1500 chars → NVIDIA Large)
+ - **Reason**: Better handling of long context with appropriate model

+ ### 4. **Search Operations Optimization** (`routes/search.py`)
+ - **Before**: Used Qwen for all search operations
+ - **After**: Uses NVIDIA Large for content processing tasks
+ - **Reason**: Better handling of long content and complex analysis

+ ### 5. **Memory Operations Enhancement** (`memo/`)
+ - **Before**: Mixed model usage for memory operations
+ - **After**: Consistent use of Qwen for reasoning-based memory tasks
+ - **Reason**: Better reasoning capabilities for context selection and enhancement

  ## Enhanced Model Selection Logic

+ ### **Four-Tier Complexity Heuristics**
  ```python
+ # Very complex tasks (Gemini Pro)
+ - Keywords: "prove", "derivation", "complexity", "algorithm", "optimize", "theorem", "rigorous", "step-by-step", "policy critique", "ambiguity", "counterfactual", "comprehensive", "detailed analysis", "synthesis", "evaluation", "research", "investigation", "comprehensive study"
+ - Length: > 120 words or > 4000 context words
+ - Content: "comprehensive", "detailed", or "research" in question
+
+ # Hard/long context tasks (NVIDIA Large)
+ - Keywords: "analyze", "explain", "compare", "evaluate", "summarize", "extract", "classify", "identify", "describe", "discuss", "synthesis", "consolidate", "process", "generate", "create", "develop", "build", "construct"
+ - Length: > 50 words or > 1500 context words
+ - Content: "synthesis", "generate", or "create" in question
+
+ # Reasoning tasks (NVIDIA Medium - Qwen)
+ - Keywords: "reasoning", "context", "enhance", "select", "decide", "choose", "determine", "assess", "judge", "consider", "think", "reason", "logic", "inference", "deduction", "analysis", "interpretation"
+ - Length: > 20 words or > 800 context words
+ - Content: "enhance", "context", "select", or "decide" in question
+
+ # Simple tasks (NVIDIA Small - Llama)
+ - Keywords: "what", "how", "when", "where", "who", "yes", "no", "count", "list", "find", "search", "lookup"
  - Length: ≤ 10 words or ≤ 200 context words
  ```
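As actual code, the heuristics above are ordered keyword/length checks, most complex tier first, matching the `is_very_hard`/`is_hard`/`is_reasoning` cascade in `utils/api/router.py`. A runnable sketch — the keyword tuples are abbreviated here (the full tuples live in the router), while the word-count thresholds are copied from the block above:

```python
# Abbreviated keyword tuples; see utils/api/router.py for the full sets.
VERY_HARD = ("prove", "theorem", "comprehensive", "research", "investigation")
HARD = ("analyze", "summarize", "synthesis", "generate", "create")
REASONING = ("reasoning", "context", "enhance", "select", "decide")

def classify(question: str, context: str = "") -> str:
    q = question.lower()
    qlen, clen = len(question.split()), len(context.split())
    # Check tiers from most to least complex, as the router does
    if any(k in q for k in VERY_HARD) or qlen > 120 or clen > 4000:
        return "very_hard"   # Gemini Pro
    if any(k in q for k in HARD) or qlen > 50 or clen > 1500:
        return "hard"        # NVIDIA Large (GPT-OSS)
    if any(k in q for k in REASONING) or qlen > 20 or clen > 800:
        return "reasoning"   # NVIDIA Medium (Qwen)
    return "simple"          # NVIDIA Small (Llama)
```

Ordering matters: a question containing both "summarize" and "context" must land on the hard tier, so the more expensive tier is tested first.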

+ ### **Flexible Summarization Logic**
+ ```python
+ # Dynamic model selection for summarization
+ if len(text) > 1500:
+     use_nvidia_large()  # Better for long context
+ else:
+     use_nvidia_small()  # Cost-effective for short text
+ ```
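The pseudocode above can be made concrete as a tiny dispatcher. A sketch under stated assumptions: `pick_summarizer()` is a hypothetical helper (the real dispatch sits inside `llama_summarize()` in `utils/service/summarizer.py`), and only the 1500-character threshold is taken from the document:

```python
# Hypothetical helper mirroring the length-based dispatch described above.
THRESHOLD_CHARS = 1500  # threshold documented for NVIDIA Large hand-off

def pick_summarizer(text: str) -> str:
    """Return which model tier would summarize this text."""
    return "nvidia_large" if len(text) > THRESHOLD_CHARS else "nvidia_small"
```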
+
+ ## Benefits of Three-Tier System

  ### **Performance Improvements**
+ - **Better reasoning** for thinking tasks with Qwen's thinking mode
+ - **Enhanced processing** for long context with NVIDIA Large
+ - **Faster execution** for simple tasks with NVIDIA Small
+ - **Higher quality** for very complex tasks with Gemini Pro

  ### **Cost Optimization**
+ - **Reduced Gemini usage** for tasks that don't need advanced reasoning
  - **Better task distribution** across model capabilities
+ - **Flexible summarization** using appropriate models based on context length
  - **Maintained efficiency** for simple tasks

  ### **Quality Improvements**
+ - **Better reasoning capabilities** with Qwen for decision-making tasks
+ - **Improved content processing** with NVIDIA Large for long context
+ - **Enhanced memory operations** with better context understanding
+ - **More accurate search operations** with specialized models
+ - **Dynamic model selection** for optimal performance

  ## Verification Checklist

+ - ✅ All easy tasks use NVIDIA Small (Llama-8b-instruct)
+ - ✅ All reasoning tasks use NVIDIA Medium (Qwen-3-next-80b-a3b-thinking)
+ - ✅ All hard/long context tasks use NVIDIA Large (GPT-OSS-120b)
+ - ✅ All very complex tasks use Gemini Pro
+ - ✅ Flexible summarization implemented with dynamic model selection
+ - ✅ Model selection logic properly categorizes tasks by complexity and reasoning requirements
  - ✅ No linting errors in modified files
  - ✅ All functions have proper fallback mechanisms
  - ✅ Error handling is maintained for all changes

  ## Configuration

+ The system is ready to use with the environment variables:
  ```bash
+ NVIDIA_SMALL=meta/llama-3.1-8b-instruct
  NVIDIA_MEDIUM=qwen/qwen3-next-80b-a3b-thinking
+ NVIDIA_LARGE=openai/gpt-oss-120b
  ```
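These variables are read at import time in `utils/api/router.py` via `os.getenv` with the same strings as defaults, so the system still runs when they are unset:

```python
import os

# Same defaults as the env block above, so unset variables still resolve
NVIDIA_SMALL = os.getenv("NVIDIA_SMALL", "meta/llama-3.1-8b-instruct")
NVIDIA_MEDIUM = os.getenv("NVIDIA_MEDIUM", "qwen/qwen3-next-80b-a3b-thinking")
NVIDIA_LARGE = os.getenv("NVIDIA_LARGE", "openai/gpt-oss-120b")
```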

+ All changes maintain backward compatibility and include proper error handling with fallback mechanisms.
routes/search.py CHANGED
@@ -2,12 +2,12 @@
  import re, asyncio, time, json
  from typing import List, Dict, Any, Tuple
  from helpers.setup import logger, embedder, gemini_rotator, nvidia_rotator
- from utils.api.router import select_model, generate_answer_with_model, qwen_chat_completion
  from utils.service.summarizer import llama_summarize


  async def extract_search_keywords(user_query: str, nvidia_rotator) -> List[str]:
-     """Extract intelligent search keywords from user query using Qwen agent."""
      if not nvidia_rotator:
          # Fallback: simple keyword extraction
          words = re.findall(r'\b\w+\b', user_query.lower())
@@ -28,8 +28,8 @@ Return only the keywords, separated by spaces, no other text."""

      user_prompt = f"User query: {user_query}\n\nExtract search keywords:"

-     # Use Qwen for better keyword extraction
-     response = await qwen_chat_completion(sys_prompt, user_prompt, nvidia_rotator)

      keywords = [kw.strip() for kw in response.split() if kw.strip()]
      return keywords[:5] if keywords else [user_query]
@@ -64,8 +64,8 @@ Return as JSON array of objects."""

      user_prompt = f"User query: {user_query}\n\nGenerate search strategies:"

-     # Use Qwen for better strategy generation
-     response = await qwen_chat_completion(sys_prompt, user_prompt, nvidia_rotator)

      try:
          strategies = json.loads(response)
@@ -290,7 +290,7 @@ async def fetch_and_process_content(url: str, title: str, user_query: str, nvidi


  async def extract_relevant_content(content: str, user_query: str, nvidia_rotator) -> str:
-     """Use Qwen agent to extract only the content relevant to the user query."""
      if not nvidia_rotator:
          # Fallback: return first 2000 chars
          return content[:2000]
@@ -324,8 +324,8 @@ Return only the relevant content, no additional commentary."""

      user_prompt = f"User Query: {user_query}\n\nWeb Content:\n{content}\n\nExtract relevant information:"

-     # Use Qwen for better content extraction
-     response = await qwen_chat_completion(sys_prompt, user_prompt, nvidia_rotator)

      return response.strip() if response.strip() else ""

@@ -335,7 +335,7 @@ Return only the relevant content, no additional commentary."""


  async def assess_content_quality(content: str, nvidia_rotator) -> Dict[str, Any]:
-     """Assess content quality using Qwen agent."""
      if not nvidia_rotator or not content:
          return {"quality_score": 0.5, "issues": [], "strengths": []}

@@ -349,8 +349,8 @@ Consider: accuracy, completeness, clarity, authority, recency, bias, factual cla

      user_prompt = f"Assess this content quality:\n\n{content[:2000]}"

-     # Use Qwen for better quality assessment
-     response = await qwen_chat_completion(sys_prompt, user_prompt, nvidia_rotator)

      try:
          # Try to parse JSON response
@@ -413,8 +413,8 @@ Focus on factual claims, statistics, and verifiable information."""

      user_prompt = f"Main content:\n{content[:1000]}\n\nOther sources:\n{comparison_text[:2000]}\n\nAnalyze consistency:"

-     # Use Qwen for better cross-validation
-     response = await qwen_chat_completion(sys_prompt, user_prompt, nvidia_rotator)

      try:
          validation = json.loads(response)
@@ -480,8 +480,8 @@ Be clear and direct."""

      user_prompt = f"Summarize this content:\n\n{content}"

-     # Use Qwen for better summarization
-     response = await qwen_chat_completion(sys_prompt, user_prompt, nvidia_rotator)

      return response.strip() if response.strip() else content[:200] + "..."

  import re, asyncio, time, json
  from typing import List, Dict, Any, Tuple
  from helpers.setup import logger, embedder, gemini_rotator, nvidia_rotator
+ from utils.api.router import select_model, generate_answer_with_model, qwen_chat_completion, nvidia_large_chat_completion
  from utils.service.summarizer import llama_summarize


  async def extract_search_keywords(user_query: str, nvidia_rotator) -> List[str]:
+     """Extract intelligent search keywords from user query using NVIDIA Large agent."""
      if not nvidia_rotator:
          # Fallback: simple keyword extraction
          words = re.findall(r'\b\w+\b', user_query.lower())

      user_prompt = f"User query: {user_query}\n\nExtract search keywords:"

+     # Use NVIDIA Large for better keyword extraction
+     response = await nvidia_large_chat_completion(sys_prompt, user_prompt, nvidia_rotator)

      keywords = [kw.strip() for kw in response.split() if kw.strip()]
      return keywords[:5] if keywords else [user_query]

      user_prompt = f"User query: {user_query}\n\nGenerate search strategies:"

+     # Use NVIDIA Large for better strategy generation
+     response = await nvidia_large_chat_completion(sys_prompt, user_prompt, nvidia_rotator)

      try:
          strategies = json.loads(response)


  async def extract_relevant_content(content: str, user_query: str, nvidia_rotator) -> str:
+     """Use NVIDIA Large agent to extract only the content relevant to the user query."""
      if not nvidia_rotator:
          # Fallback: return first 2000 chars
          return content[:2000]

      user_prompt = f"User Query: {user_query}\n\nWeb Content:\n{content}\n\nExtract relevant information:"

+     # Use NVIDIA Large for better content extraction
+     response = await nvidia_large_chat_completion(sys_prompt, user_prompt, nvidia_rotator)

      return response.strip() if response.strip() else ""


  async def assess_content_quality(content: str, nvidia_rotator) -> Dict[str, Any]:
+     """Assess content quality using NVIDIA Large agent."""
      if not nvidia_rotator or not content:
          return {"quality_score": 0.5, "issues": [], "strengths": []}

      user_prompt = f"Assess this content quality:\n\n{content[:2000]}"

+     # Use NVIDIA Large for better quality assessment
+     response = await nvidia_large_chat_completion(sys_prompt, user_prompt, nvidia_rotator)

      try:
          # Try to parse JSON response

      user_prompt = f"Main content:\n{content[:1000]}\n\nOther sources:\n{comparison_text[:2000]}\n\nAnalyze consistency:"

+     # Use NVIDIA Large for better cross-validation
+     response = await nvidia_large_chat_completion(sys_prompt, user_prompt, nvidia_rotator)

      try:
          validation = json.loads(response)

      user_prompt = f"Summarize this content:\n\n{content}"

+     # Use NVIDIA Large for better summarization
+     response = await nvidia_large_chat_completion(sys_prompt, user_prompt, nvidia_rotator)

      return response.strip() if response.strip() else content[:200] + "..."
utils/api/router.py CHANGED
@@ -11,44 +11,61 @@ GEMINI_SMALL = os.getenv("GEMINI_SMALL", "gemini-2.5-flash-lite")
11
  GEMINI_MED = os.getenv("GEMINI_MED", "gemini-2.5-flash")
12
  GEMINI_PRO = os.getenv("GEMINI_PRO", "gemini-2.5-pro")
13
 
14
- # NVIDIA small default (can be override)
15
  NVIDIA_SMALL = os.getenv("NVIDIA_SMALL", "meta/llama-3.1-8b-instruct") # Llama model for easy complexity tasks
16
- NVIDIA_MEDIUM = os.getenv("NVIDIA_MEDIUM", "qwen/qwen3-next-80b-a3b-thinking") # Qwen model for medium complexity tasks
 
17
 
18
  def select_model(question: str, context: str) -> Dict[str, Any]:
19
  """
20
- Enhanced complexity heuristic with proper model hierarchy:
21
  - Easy tasks (immediate execution, simple) -> Llama (NVIDIA small)
22
- - Medium tasks (accurate, reasoning, not too time-consuming) -> Qwen
23
- - Hard tasks (complex analysis, synthesis, long-form) -> Gemini Pro
 
24
  """
25
  qlen = len(question.split())
26
  clen = len(context.split())
27
 
28
- # Hard task keywords - require complex reasoning and analysis
29
- hard_keywords = ("prove", "derivation", "complexity", "algorithm", "optimize", "theorem", "rigorous", "step-by-step", "policy critique", "ambiguity", "counterfactual", "comprehensive", "detailed analysis", "synthesis", "evaluation")
30
 
31
- # Medium task keywords - require reasoning but not too complex
32
- medium_keywords = ("analyze", "explain", "compare", "evaluate", "summarize", "extract", "classify", "identify", "describe", "discuss", "reasoning", "context", "enhance", "select", "consolidate")
 
 
 
33
 
34
  # Simple task keywords - immediate execution
35
- simple_keywords = ("what", "how", "when", "where", "who", "yes", "no", "count", "list", "find")
36
 
37
  # Determine complexity level
38
  is_very_hard = (
39
- any(k in question.lower() for k in hard_keywords) or
40
- qlen > 100 or
41
- clen > 3000 or
42
  "comprehensive" in question.lower() or
43
- "detailed" in question.lower()
 
 
 
 
 
 
 
 
 
 
44
  )
45
 
46
- is_medium = (
47
- any(k in question.lower() for k in medium_keywords) or
48
- (qlen > 10 and qlen <= 100) or
49
- (clen > 200 and clen <= 3000) or
50
- "reasoning" in question.lower() or
51
- "context" in question.lower()
 
 
52
  )
53
 
54
  is_simple = (
@@ -60,8 +77,11 @@ def select_model(question: str, context: str) -> Dict[str, Any]:
60
  if is_very_hard:
61
  # Use Gemini Pro for very complex tasks requiring advanced reasoning
62
  return {"provider": "gemini", "model": GEMINI_PRO}
63
- elif is_medium:
64
- # Use Qwen for medium complexity tasks requiring reasoning but not too time-consuming
 
 
 
65
  return {"provider": "qwen", "model": NVIDIA_MEDIUM}
66
  else:
67
  # Use NVIDIA small (Llama) for simple tasks requiring immediate execution
@@ -125,8 +145,11 @@ async def generate_answer_with_model(selection: Dict[str, Any], system_prompt: s
125
  return "I couldn't parse the model response."
126
 
127
  elif provider == "qwen":
128
- # Use Qwen for medium complexity tasks
129
  return await qwen_chat_completion(system_prompt, user_prompt, nvidia_rotator)
 
 
 
130
 
131
  return "Unsupported provider."
132
 
@@ -206,4 +229,82 @@ async def qwen_chat_completion(system_prompt: str, user_prompt: str, nvidia_rota
206
 
207
  except Exception as e:
208
  logger.warning(f"Qwen API error: {e}")
209
- return "I couldn't process the request with Qwen model."
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
11
  GEMINI_MED = os.getenv("GEMINI_MED", "gemini-2.5-flash")
12
  GEMINI_PRO = os.getenv("GEMINI_PRO", "gemini-2.5-pro")
13
 
14
+ # NVIDIA model hierarchy (can be overridden via env)
15
  NVIDIA_SMALL = os.getenv("NVIDIA_SMALL", "meta/llama-3.1-8b-instruct") # Llama model for easy complexity tasks
16
+ NVIDIA_MEDIUM = os.getenv("NVIDIA_MEDIUM", "qwen/qwen3-next-80b-a3b-thinking") # Qwen model for reasoning tasks
17
+ NVIDIA_LARGE = os.getenv("NVIDIA_LARGE", "openai/gpt-oss-120b") # GPT-OSS model for hard/long context tasks
18
 
19
  def select_model(question: str, context: str) -> Dict[str, Any]:
20
  """
21
+ Enhanced three-tier model selection system:
22
  - Easy tasks (immediate execution, simple) -> Llama (NVIDIA small)
23
+ - Reasoning tasks (analysis, decision-making, JSON parsing) -> Qwen (NVIDIA medium)
24
+ - Hard/long context tasks (complex synthesis, long-form) -> GPT-OSS (NVIDIA large)
25
+ - Very complex tasks (research, comprehensive analysis) -> Gemini Pro
26
  """
27
  qlen = len(question.split())
28
  clen = len(context.split())
29
 
30
+ # Very hard task keywords - require Gemini Pro (research, comprehensive analysis)
31
+ very_hard_keywords = ("prove", "derivation", "complexity", "algorithm", "optimize", "theorem", "rigorous", "step-by-step", "policy critique", "ambiguity", "counterfactual", "comprehensive", "detailed analysis", "synthesis", "evaluation", "research", "investigation", "comprehensive study")
32
 
33
+ # Hard/long context keywords - require NVIDIA Large (GPT-OSS)
34
+ hard_keywords = ("analyze", "explain", "compare", "evaluate", "summarize", "extract", "classify", "identify", "describe", "discuss", "synthesis", "consolidate", "process", "generate", "create", "develop", "build", "construct")
35
+
36
+ # Reasoning task keywords - require Qwen (thinking/reasoning)
37
+ reasoning_keywords = ("reasoning", "context", "enhance", "select", "decide", "choose", "determine", "assess", "judge", "consider", "think", "reason", "logic", "inference", "deduction", "analysis", "interpretation")
38
 
39
  # Simple task keywords - immediate execution
40
+ simple_keywords = ("what", "how", "when", "where", "who", "yes", "no", "count", "list", "find", "search", "lookup")
41
 
42
  # Determine complexity level
43
  is_very_hard = (
44
+ any(k in question.lower() for k in very_hard_keywords) or
45
+ qlen > 120 or
46
+ clen > 4000 or
47
  "comprehensive" in question.lower() or
48
+ "detailed" in question.lower() or
49
+ "research" in question.lower()
50
+ )
51
+
52
+ is_hard = (
53
+ any(k in question.lower() for k in hard_keywords) or
54
+ qlen > 50 or
55
+ clen > 1500 or
56
+ "synthesis" in question.lower() or
57
+ "generate" in question.lower() or
58
+ "create" in question.lower()
59
  )
60
 
61
+ is_reasoning = (
62
+ any(k in question.lower() for k in reasoning_keywords) or
63
+ qlen > 20 or
64
+ clen > 800 or
65
+ "enhance" in question.lower() or
66
+ "context" in question.lower() or
67
+ "select" in question.lower() or
68
+ "decide" in question.lower()
69
  )
70
 
71
  is_simple = (
 
77
  if is_very_hard:
78
  # Use Gemini Pro for very complex tasks requiring advanced reasoning
79
  return {"provider": "gemini", "model": GEMINI_PRO}
80
+ elif is_hard:
81
+ # Use NVIDIA Large (GPT-OSS) for hard/long context tasks
82
+ return {"provider": "nvidia_large", "model": NVIDIA_LARGE}
83
+ elif is_reasoning:
84
+ # Use Qwen for reasoning tasks requiring thinking
85
  return {"provider": "qwen", "model": NVIDIA_MEDIUM}
86
     else:
         # Use NVIDIA small (Llama) for simple tasks requiring immediate execution
 ...
         return "I couldn't parse the model response."

     elif provider == "qwen":
+        # Use Qwen for reasoning tasks
         return await qwen_chat_completion(system_prompt, user_prompt, nvidia_rotator)
+    elif provider == "nvidia_large":
+        # Use NVIDIA Large (GPT-OSS) for hard/long context tasks
+        return await nvidia_large_chat_completion(system_prompt, user_prompt, nvidia_rotator)

     return "Unsupported provider."

 ...
     except Exception as e:
         logger.warning(f"Qwen API error: {e}")
+        return "I couldn't process the request with Qwen model."
+
+
+async def nvidia_large_chat_completion(system_prompt: str, user_prompt: str, nvidia_rotator: APIKeyRotator) -> str:
+    """
+    NVIDIA Large (GPT-OSS) chat completion for hard/long context tasks.
+    Uses the NVIDIA API rotator for key management.
+    """
+    key = nvidia_rotator.get_key() or ""
+    url = "https://integrate.api.nvidia.com/v1/chat/completions"
+
+    payload = {
+        "model": NVIDIA_LARGE,
+        "messages": [
+            {"role": "system", "content": system_prompt},
+            {"role": "user", "content": user_prompt}
+        ],
+        "temperature": 1.0,
+        "top_p": 1.0,
+        "max_tokens": 4096,
+        "stream": True
+    }
+
+    headers = {"Content-Type": "application/json", "Authorization": f"Bearer {key}"}
+
+    logger.info(f"[NVIDIA_LARGE] API call - Model: {NVIDIA_LARGE}, Key present: {bool(key)}")
+    logger.info(f"[NVIDIA_LARGE] System prompt length: {len(system_prompt)}, User prompt length: {len(user_prompt)}")
+
+    try:
+        # For streaming, we need to handle the response differently
+        import httpx
+        async with httpx.AsyncClient(timeout=60) as client:
+            response = await client.post(url, headers=headers, json=payload)
+
+            if response.status_code in (401, 403, 429) or (500 <= response.status_code < 600):
+                logger.warning(f"HTTP {response.status_code} from NVIDIA Large provider. Rotating key and retrying")
+                nvidia_rotator.rotate()
+                # Retry once with new key
+                key = nvidia_rotator.get_key() or ""
+                headers = {"Content-Type": "application/json", "Authorization": f"Bearer {key}"}
+                response = await client.post(url, headers=headers, json=payload)
+
+            response.raise_for_status()
+
+            # Handle streaming response
+            content = ""
+            async for line in response.aiter_lines():
+                if line.startswith("data: "):
+                    data = line[6:]  # Remove "data: " prefix
+                    if data.strip() == "[DONE]":
+                        break
+
+                    try:
+                        import json
+                        chunk_data = json.loads(data)
+                        if "choices" in chunk_data and len(chunk_data["choices"]) > 0:
+                            delta = chunk_data["choices"][0].get("delta", {})
+
+                            # Handle reasoning content (thinking)
+                            reasoning = delta.get("reasoning_content")
+                            if reasoning:
+                                logger.debug(f"[NVIDIA_LARGE] Reasoning: {reasoning}")
+
+                            # Handle regular content
+                            chunk_content = delta.get("content")
+                            if chunk_content:
+                                content += chunk_content
+                    except json.JSONDecodeError:
+                        continue
+
+            if not content or content.strip() == "":
+                logger.warning("Empty content from NVIDIA Large model")
+                return "I received an empty response from the model."
+
+            return content.strip()
+
+    except Exception as e:
+        logger.warning(f"NVIDIA Large API error: {e}")
+        return "I couldn't process the request with NVIDIA Large model."
utils/service/summarizer.py CHANGED
@@ -3,7 +3,7 @@ import asyncio
 from typing import List
 from utils.logger import get_logger
 from utils.api.rotator import robust_post_json, APIKeyRotator
-from utils.api.router import qwen_chat_completion
+from utils.api.router import qwen_chat_completion, nvidia_large_chat_completion

 logger = get_logger("SUM", __name__)

@@ -25,9 +25,21 @@ async def llama_chat(messages, temperature: float = 0.2) -> str:


 async def llama_summarize(text: str, max_sentences: int = 3) -> str:
+    """Flexible summarization using NVIDIA Small (Llama) for short text, NVIDIA Large for long context."""
     text = (text or "").strip()
     if not text:
         return ""
+
+    # Use NVIDIA Large for long context (>1500 chars), NVIDIA Small for short context
+    if len(text) > 1500:
+        logger.info(f"[SUMMARIZER] Using NVIDIA Large for long context ({len(text)} chars)")
+        return await nvidia_large_summarize(text, max_sentences)
+    else:
+        logger.info(f"[SUMMARIZER] Using NVIDIA Small for short context ({len(text)} chars)")
+        return await nvidia_small_summarize(text, max_sentences)
+
+async def nvidia_small_summarize(text: str, max_sentences: int = 3) -> str:
+    """Summarization using NVIDIA Small (Llama) for short text."""
     system = (
         "You are a precise summarizer. Produce a clear, faithful summary of the user's text. "
         f"Return ~{max_sentences} sentences, no comments, no preface, no markdown."
@@ -39,7 +51,20 @@ async def llama_summarize(text: str, max_sentences: int = 3) -> str:
             {"role": "user", "content": user},
         ])
     except Exception as e:
-        logger.warning(f"LLAMA summarization failed: {e}; using fallback")
+        logger.warning(f"NVIDIA Small summarization failed: {e}; using fallback")
+        return naive_fallback(text, max_sentences)
+
+async def nvidia_large_summarize(text: str, max_sentences: int = 3) -> str:
+    """Summarization using NVIDIA Large (GPT-OSS) for long context."""
+    system = (
+        "You are a precise summarizer. Produce a clear, faithful summary of the user's text. "
+        f"Return ~{max_sentences} sentences, no comments, no preface, no markdown."
+    )
+    user = f"Summarize this text:\n\n{text}"
+    try:
+        return await nvidia_large_chat_completion(system, user, ROTATOR)
+    except Exception as e:
+        logger.warning(f"NVIDIA Large summarization failed: {e}; using fallback")
         return naive_fallback(text, max_sentences)


@@ -49,11 +74,12 @@ def naive_fallback(text: str, max_sentences: int = 3) -> str:


 async def summarize_text(text: str, max_sentences: int = 6, chunk_size: int = 2500) -> str:
-    """Hierarchical summarization for long texts using NVIDIA Llama."""
+    """Hierarchical summarization for long texts using flexible model selection."""
     if not text:
         return ""
     if len(text) <= chunk_size:
         return await llama_summarize(text, max_sentences=max_sentences)
+
     # Split into chunks on paragraph boundaries if possible
     paragraphs = text.split('\n\n')
     chunks: List[str] = []
@@ -68,10 +94,13 @@ async def summarize_text(text: str, max_sentences: int = 6, chunk_size: int = 25
     if buf:
         chunks.append('\n\n'.join(buf))

+    # Process chunks with flexible model selection
     partials = []
     for ch in chunks:
         partials.append(await llama_summarize(ch, max_sentences=3))
         await asyncio.sleep(0)
+
+    # Combine and summarize with flexible model selection
     combined = '\n'.join(partials)
     return await llama_summarize(combined, max_sentences=max_sentences)

@@ -115,4 +144,5 @@ async def qwen_summarize(text: str, max_sentences: int = 3) -> str:

 # Backward-compatible name used by app.py
 async def cheap_summarize(text: str, max_sentences: int = 3) -> str:
+    """Backward-compatible summarization with flexible model selection."""
     return await llama_summarize(text, max_sentences)
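The dispatch this commit adds to `llama_summarize` is a simple length threshold. A minimal sketch of that rule in isolation (the function name `pick_summarizer` is hypothetical; the 1500-character cutoff mirrors the diff above):

```python
THRESHOLD = 1500  # chars; mirrors the cutoff in llama_summarize above

def pick_summarizer(text: str) -> str:
    """Return which model tier the length-based dispatch above would choose."""
    return "nvidia_large" if len(text) > THRESHOLD else "nvidia_small"

print(pick_summarizer("short note"))   # -> nvidia_small
print(pick_summarizer("x" * 2000))     # -> nvidia_large
```

One consequence of routing by raw character count: in `summarize_text`, chunks capped at `chunk_size=2500` will still exceed the 1500-char threshold, so chunk-level summaries go to NVIDIA Large while the final pass over the shorter combined partials may fall back to NVIDIA Small.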