---
|
|
license: apache-2.0 |
|
|
language: |
|
|
- en |
|
|
tags: |
|
|
- agent |
|
|
- deepresearch |
|
|
- llm |
|
|
- rl |
|
|
- reinforcementlearning |
|
|
datasets: |
|
|
- miromind-ai/MiroRL-GenQA |
|
|
base_model: |
|
|
- Qwen/Qwen2.5-7B-Instruct |
|
|
--- |
|
|
|
|
|
# Model Card for PokeeResearch |
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
|
|
|
**PokeeResearch-7B** is a **7-billion-parameter deep research agent** developed by **Pokee AI** to advance reliable, aligned, and scalable research-grade reasoning in tool-augmented LLMs. |
|
|
The model integrates **Reinforcement Learning from AI Feedback (RLAIF)** with a **robust reasoning scaffold**, enabling it to conduct complex, multi-step research workflows that include self-correction, verification, and synthesis across multiple independent research threads. |
|
|
|
|
|
- **Developed by:** Pokee AI |
|
|
- **Model type:** Tool-augmented large language model (LLM) research agent |
|
|
- **Language(s):** Primarily English; Chinese and other languages are inherited from the Qwen2.5 backbone
|
|
- **License:** Apache 2.0 |
|
|
- **Finetuned from model:** Qwen2.5-7B-Instruct |
|
|
|
|
|
### Model Sources |
|
|
|
|
|
- **Repository:** [https://github.com/Pokee-AI/PokeeResearchOSS](https://github.com/Pokee-AI/PokeeResearchOSS) |
|
|
- **Paper:** [*PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold*](https://arxiv.org/pdf/2510.15862), Pokee AI, October 2025 |
|
|
- **API Access:** [https://pokee.ai/deepresearch-preview](https://pokee.ai/deepresearch-preview) |
|
|
|
|
|
--- |
|
|
|
|
|
## Uses |
|
|
|
|
|
### Direct Use |
|
|
PokeeResearch-7B is designed for **deep research automation**, where the model autonomously: |
|
|
- Decomposes complex user queries |
|
|
- Retrieves and reads from external sources |
|
|
- Synthesizes factual, verifiable, and grounded answers |
|
|
|
|
|
It can be used as a **standalone research assistant** or integrated into **multi-agent systems** to support academic, enterprise, or product-level research tasks. |
|
|
|
|
|
### Downstream Use |
|
|
PokeeResearch-7B can be **fine-tuned** or **extended** for: |
|
|
- Domain-specific scientific discovery |
|
|
- Autonomous document retrieval and synthesis |
|
|
- Multi-source verification and summarization pipelines |
|
|
- Integration into reinforcement learning research agents (RLHF/RLAIF frameworks) |
|
|
|
|
|
### Out-of-Scope Use |
|
|
The model should **not** be used for: |
|
|
- Generating unverified or speculative claims |
|
|
- Automated decision-making in high-stakes domains (medical, legal, or financial) |
|
|
- Applications requiring strict factual precision without external verification |
|
|
- Generating content without citation or evidence tracing |
|
|
|
|
|
--- |
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
|
|
|
PokeeResearch-7B is optimized for factual grounding and robustness, but limitations include: |
|
|
- Dependence on **external data quality** and **retrieval accuracy** |
|
|
- Potential **semantic bias** introduced by AI-based feedback signals |
|
|
- Limited coverage for **non-English** or **multi-modal** reasoning tasks |
|
|
- Risk of **hallucinated synthesis** when sources conflict or lack clarity |
|
|
|
|
|
### Recommendations |
|
|
Users should: |
|
|
- Cross-verify answers, especially in multi-hop reasoning cases |
|
|
- Monitor output for citation accuracy and alignment with source data |
|
|
- Refrain from using outputs as sole evidence in decision-critical contexts |
|
|
|
|
|
--- |
|
|
|
|
|
## How to Get Started with the Model |
|
|
Please refer to the [PokeeResearchOSS README](https://github.com/Pokee-AI/PokeeResearchOSS/blob/main/README.md) for full setup and usage instructions, including the tool-augmented research workflow.
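
For a quick sanity check outside the full agent loop, a minimal loading sketch with Hugging Face Transformers is shown below. The repository id `PokeeAI/pokee_research_7b` is an assumption; confirm the exact checkpoint name in the linked README, and note that deep-research behavior requires the tool-augmented workflow from the codebase rather than a single bare generation call.

```python
# Minimal sketch (assumed repo id; not the full research agent loop).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PokeeAI/pokee_research_7b"  # assumption: check the README for the exact name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="bfloat16", device_map="auto"
)

messages = [{"role": "user", "content": "What is RLOO and why is it useful for agent training?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```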
|
|
|
|
|
--- |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
- **Dataset:** MiroRL-GenQA dataset (MiroMind AI, 2025) |
|
|
- **Data characteristics:** Complex, multi-turn question–answer pairs requiring multi-step reasoning |
|
|
- **Data filtering:** No data from the evaluation benchmarks was included in training; the model was trained only on open-domain text Q&A samples
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
#### Preprocessing |
|
|
- Normalization and tokenization aligned with Qwen2.5 tokenizer |
|
|
- Structured prompt–response pairs in research/verification format (`<tool_call>`, `<answer>`, `<verification>`) |
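
As an illustration of that tag format, the hypothetical helper below (not part of the released codebase) extracts the tagged spans from a single model turn:

```python
# Hypothetical helper, for illustration only: pull <tool_call>, <answer>, and
# <verification> spans out of one model turn.
import re

def extract_tagged_spans(turn: str) -> dict[str, str]:
    spans = {}
    for tag in ("tool_call", "answer", "verification"):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", turn, flags=re.DOTALL)
        if match:
            spans[tag] = match.group(1).strip()
    return spans

turn = '<tool_call>{"name": "web_search", "arguments": {"query": "RLOO estimator"}}</tool_call>'
print(extract_tagged_spans(turn))  # {'tool_call': '{"name": "web_search", ...}'}
```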
|
|
|
|
|
#### Training Hyperparameters |
|
|
- **Algorithm:** RLOO (REINFORCE Leave-One-Out) |
|
|
- **Batch size:** 64 |
|
|
- **Research threads per prompt:** 8 |
|
|
- **Learning rate:** 3e-6 |
|
|
- **Context limit:** 32,768 tokens |
|
|
- **Steps:** 140 fine-tuning iterations |
|
|
- **Regularization:** None (no entropy or KL regularization) |
|
|
- **Precision regime:** bf16 mixed precision |
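
For intuition on the RLOO update, the sketch below computes leave-one-out advantages for the eight research threads sampled per prompt: each thread's baseline is the mean reward of the other seven threads, giving a critic-free, unbiased advantage estimate. Variable names are illustrative, not taken from the training code.

```python
# Illustrative RLOO advantage computation (not the actual training code).
import numpy as np

def rloo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Advantage of each thread = its reward minus the mean reward of the others."""
    k = rewards.shape[0]
    leave_one_out_mean = (rewards.sum() - rewards) / (k - 1)
    return rewards - leave_one_out_mean

rewards = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0])  # 8 threads for one prompt
print(rloo_advantages(rewards))
```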
|
|
|
|
|
#### Reward Design |
|
|
- Combined reward signal from:
  - **AI feedback** (semantic equivalence via an external LLM judge)
  - **Format adherence reward** (ensures correct agent behavior)
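
A minimal sketch of how such a combined signal might be assembled is given below; the gating and the reward values are assumptions for illustration, not the report's exact formulation:

```python
# Assumed formulation, for illustration only: format adherence gates the
# AI-feedback reward, so malformed rollouts earn nothing regardless of answer quality.
def combined_reward(well_formatted: bool, judged_semantically_equivalent: bool) -> float:
    if not well_formatted:
        return 0.0  # format-adherence component
    return 1.0 if judged_semantically_equivalent else 0.0  # AI-feedback component

print(combined_reward(True, True), combined_reward(True, False), combined_reward(False, True))
# 1.0 0.0 0.0
```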
|
|
|
|
|
#### Speeds, Sizes, Times |
|
|
- **Model size:** 7 billion parameters |
|
|
- **Training duration:** ~5 days on 8 × A100 80G GPUs |
|
|
- **Checkpoint size:** ~13 GB |
|
|
|
|
|
--- |
|
|
|
|
|
## Evaluation |
|
|
|
|
|
### Testing Data, Factors & Metrics |
|
|
|
|
|
#### Testing Data |
|
|
10 open-domain research and QA benchmarks: |
|
|
- NQ, TriviaQA, PopQA, HotpotQA, 2WikiMultiHopQA, Musique, Bamboogle, GAIA, BrowseComp, Humanity’s Last Exam |
|
|
|
|
|
#### Factors |
|
|
- Benchmarks differ by reasoning depth, retrieval dependence, and factual precision requirements. |
|
|
- Evaluations disaggregate by dataset difficulty and task type (single-hop vs multi-hop). |
|
|
|
|
|
#### Metrics |
|
|
- Mean accuracy (mean@4): per-question accuracy averaged over four independent research threads
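
As a hypothetical illustration of that aggregation, each question is answered by four independent research threads, the per-question score is the fraction of threads judged correct, and the benchmark score averages over questions:

```python
# Hypothetical illustration of mean@4 aggregation.
def mean_at_k(judgments_per_question: list[list[bool]]) -> float:
    per_question = [sum(j) / len(j) for j in judgments_per_question]
    return sum(per_question) / len(per_question)

# Two questions, four threads each: 3/4 correct and 1/4 correct -> 0.5 overall.
print(mean_at_k([[True, True, False, True], [False, False, True, False]]))
```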
|
|
|
|
|
### Results |
|
|
|
|
|
**PokeeResearch-7B** (PR in the table below) and its **RTS variant** (PR+) outperform all baselines at the 7B scale across the 10 benchmarks.
|
|
Highlights (mean@4 accuracy): |
|
|
| **Method** | **HLE** | **GAIA** | **BrowseComp** | **BAMB** | **2WIKI** | **TQ** | **NQ** | **POPQA** | **MUSIQUE** | **HOTPOTQA** | |
|
|
|-------------|----------|-----------|----------------|-----------|-----------|----------|----------|-------------|---------------|----------------| |
|
|
| R1searcher | 5.4 | 8.3 | 1.0 | 63.2 | 61.4 | 77.2 | 59.6 | 51.8 | 35.8 | 62.4 | |
|
|
| SearchR1 | 13.0 | 18.7 | 0.4 | 67.8 | 62.8 | 81.0 | 67.6 | 59.6 | 33.2 | 63.2 | |
|
|
| ZeroSearch | 8.6 | 9.9 | 1.4 | 51.4 | 33.6 | 61.6 | 48.2 | 38.0 | 19.0 | 32.4 | |
|
|
| ASearcher | 13.8 | 22.1 | 3.2 | 68.8 | 69.2 | 85.2 | 71.2 | 58.2 | 35.8 | 71.0 | |
|
|
| DeepResearcher | 6.0 | 24.03 | 1.8 | 71.0 | 58.8 | 82.2 | 60.2 | 55.2 | 26.8 | 56.6 | |
|
|
| **PR** | **15.2** | **36.9** | **5.4** | **74.5** | **74.0** | **91.3** | **75.1** | **59.8** | **39.8** | **71.2** | |
|
|
| **PR+** | **17.6** | **41.3** | **8.4** | **75.0** | **75.0** | **91.8** | **75.0** | **60.0** | **41.4** | **71.6** | |
|
|
|
|
|
#### Summary |
|
|
The PokeeResearch-7B variants achieve **state-of-the-art performance among 7B-scale open deep research agents**, validating the RLAIF and reasoning-scaffold design for robust, verifiable research workflows.
|
|
|
|
|
--- |
|
|
|
|
|
## Technical Specifications |
|
|
|
|
|
### Model Architecture and Objective |
|
|
- **Base Architecture:** Transformer decoder (Qwen2.5-7B-Instruct backbone) |
|
|
- **Objective:** Reinforcement learning with AI feedback to maximize semantic correctness and alignment with human-style reasoning |
|
|
|
|
|
### Compute Infrastructure |
|
|
#### Hardware |
|
|
- 8 × NVIDIA A100 80 GB GPUs for training; 1 × A100 80 GB GPU for inference
|
|
--- |
|
|
|
|
|
## Citation |
|
|
|
|
|
**BibTeX:** |
|
|
```bibtex
@article{pokee2025deepresearch,
  title={PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold},
  author={Yi Wan and Jiuqi Wang and Liam Li and Jinsong Liu and Ruihao Zhu and Zheqing Zhu},
  journal={Pokee AI Technical Report},
  year={2025},
  url={https://arxiv.org/pdf/2510.15862}
}
```
|
|
|
|
|
**APA:** |
|
|
Wan, Y., Wang, J., Li, L., Liu, J., Zhu, R., & Zhu, Z. (2025). *PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold.* Pokee AI. |
|
|
|
|
|
--- |
|
|
|
|
|
## Glossary |
|
|
|
|
|
- **RLAIF:** Reinforcement Learning from AI Feedback – optimization using LLM-based reward signals. |
|
|
- **RLOO:** REINFORCE Leave-One-Out – unbiased policy gradient variant for on-policy learning. |
|
|
- **RTS:** Research Threads Synthesis – synthesis of multiple independent reasoning threads at inference time. |
|
|
|
|
|
--- |
|
|
|
|
|
## More Information |
|
|
For technical details, visit: [https://github.com/Pokee-AI/PokeeResearchOSS](https://github.com/Pokee-AI/PokeeResearchOSS) |
|
|
For inquiries, contact: [email protected] |
|
|
|
|
|
--- |
|
|
|
|
|
## Model Card Authors |
|
|
**Yi Wan**, **Jiuqi Wang**, Liam Li, Jinsong Liu, Ruihao Zhu, and Zheqing Zhu — Pokee AI Research Team |
|
|
|
|
|
## Model Card Contact |
|
|
Pokee AI Team — [email protected] |
|
|
|