---
|
|
license: apache-2.0 |
|
|
language: |
|
|
- en |
|
|
tags: |
|
|
- agent |
|
|
- deepresearch |
|
|
- llm |
|
|
- rl |
|
|
- reinforcementlearning |
|
|
datasets: |
|
|
- miromind-ai/MiroRL-GenQA |
|
|
base_model: |
|
|
- Qwen/Qwen2.5-7B-Instruct |
|
|
--- |
|
|
|
|
|
# Model Card for PokeeResearch |
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
|
|
|
**PokeeResearch-7B** is a **7-billion-parameter deep research agent** developed by **Pokee AI** to advance reliable, aligned, and scalable research-grade reasoning in tool-augmented LLMs. |
|
|
The model integrates **Reinforcement Learning from AI Feedback (RLAIF)** with a **robust reasoning scaffold**, enabling it to conduct complex, multi-step research workflows that include self-correction, verification, and synthesis across multiple independent research threads. |
|
|
|
|
|
- **Developed by:** Pokee AI |
|
|
- **Model type:** Tool-augmented large language model (LLM) research agent |
|
|
- **Language(s):** Primarily English; Chinese and other languages are inherited from the Qwen2.5 backbone
|
|
- **License:** Apache 2.0 |
|
|
- **Finetuned from model:** Qwen2.5-7B-Instruct |
|
|
|
|
|
### Model Sources |
|
|
|
|
|
- **Repository:** [https://github.com/Pokee-AI/PokeeResearchOSS](https://github.com/Pokee-AI/PokeeResearchOSS) |
|
|
- **Paper:** [*PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold*](https://arxiv.org/pdf/2510.15862), Pokee AI, October 2025 |
|
|
- **API Access:** [https://pokee.ai/deepresearch-preview](https://pokee.ai/deepresearch-preview) |
|
|
|
|
|
--- |
|
|
|
|
|
## Uses |
|
|
|
|
|
### Direct Use |
|
|
PokeeResearch-7B is designed for **deep research automation**, where the model autonomously: |
|
|
- Decomposes complex user queries |
|
|
- Retrieves and reads from external sources |
|
|
- Synthesizes factual, verifiable, and grounded answers |
|
|
|
|
|
It can be used as a **standalone research assistant** or integrated into **multi-agent systems** to support academic, enterprise, or product-level research tasks. |
|
|
|
|
|
### Downstream Use |
|
|
PokeeResearch-7B can be **fine-tuned** or **extended** for: |
|
|
- Domain-specific scientific discovery |
|
|
- Autonomous document retrieval and synthesis |
|
|
- Multi-source verification and summarization pipelines |
|
|
- Integration into reinforcement learning research agents (RLHF/RLAIF frameworks) |
|
|
|
|
|
### Out-of-Scope Use |
|
|
The model should **not** be used for: |
|
|
- Generating unverified or speculative claims |
|
|
- Automated decision-making in high-stakes domains (medical, legal, or financial) |
|
|
- Applications requiring strict factual precision without external verification |
|
|
- Generating content without citation or evidence tracing |
|
|
|
|
|
--- |
|
|
|
|
|
## Bias, Risks, and Limitations |
|
|
|
|
|
PokeeResearch-7B is optimized for factual grounding and robustness, but limitations include: |
|
|
- Dependence on **external data quality** and **retrieval accuracy** |
|
|
- Potential **semantic bias** introduced by AI-based feedback signals |
|
|
- Limited coverage for **non-English** or **multi-modal** reasoning tasks |
|
|
- Risk of **hallucinated synthesis** when sources conflict or lack clarity |
|
|
|
|
|
### Recommendations |
|
|
Users should: |
|
|
- Cross-verify answers, especially in multi-hop reasoning cases |
|
|
- Monitor output for citation accuracy and alignment with source data |
|
|
- Refrain from using outputs as sole evidence in decision-critical contexts |
|
|
|
|
|
--- |
|
|
|
|
|
## How to Get Started with the Model |
|
|
Please refer to the [PokeeResearchOSS README](https://github.com/Pokee-AI/PokeeResearchOSS/blob/main/README.md) for full setup and usage instructions, including the tool-augmented research workflow.
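
For a quick sanity check outside the full agent loop, a minimal loading sketch with Hugging Face Transformers is shown below. The repository id `PokeeAI/pokee_research_7b` is an assumption; confirm the exact checkpoint name in the linked README, and note that deep-research behavior requires the tool-augmented workflow from the codebase rather than a single bare generation call.

```python
# Minimal sketch (assumed repo id; not the full research agent loop).
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PokeeAI/pokee_research_7b"  # assumption: check the README for the exact name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="bfloat16", device_map="auto"
)

messages = [{"role": "user", "content": "What is RLOO and why is it useful for agent training?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```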
|
|
|
|
|
--- |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
- **Dataset:** MiroRL-GenQA dataset (MiroMind AI, 2025) |
|
|
- **Data characteristics:** Complex, multi-turn question–answer pairs requiring multi-step reasoning |
|
|
- **Data filtering:** No data from the evaluation benchmarks was included in training; the model was trained only on open-domain text Q&A samples
|
|
|
|
|
### Training Procedure |
|
|
|
|
|
#### Preprocessing |
|
|
- Normalization and tokenization aligned with Qwen2.5 tokenizer |
|
|
- Structured prompt–response pairs in research/verification format (`<tool_call>`, `<answer>`, `<verification>`) |
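
As an illustration of that tag format, the hypothetical helper below (not part of the released codebase) extracts the tagged spans from a single model turn:

```python
# Hypothetical helper, for illustration only: pull <tool_call>, <answer>, and
# <verification> spans out of one model turn.
import re

def extract_tagged_spans(turn: str) -> dict[str, str]:
    spans = {}
    for tag in ("tool_call", "answer", "verification"):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", turn, flags=re.DOTALL)
        if match:
            spans[tag] = match.group(1).strip()
    return spans

turn = '<tool_call>{"name": "web_search", "arguments": {"query": "RLOO estimator"}}</tool_call>'
print(extract_tagged_spans(turn))  # {'tool_call': '{"name": "web_search", ...}'}
```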
|
|
|
|
|
#### Training Hyperparameters |
|
|
- **Algorithm:** RLOO (REINFORCE Leave-One-Out) |
|
|
- **Batch size:** 64 |
|
|
- **Research threads per prompt:** 8 |
|
|
- **Learning rate:** 3e-6 |
|
|
- **Context limit:** 32,768 tokens |
|
|
- **Steps:** 140 fine-tuning iterations |
|
|
- **Regularization:** None (no entropy or KL regularization) |
|
|
- **Precision regime:** bf16 mixed precision |
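
For intuition on the RLOO update, the sketch below computes leave-one-out advantages for the eight research threads sampled per prompt: each thread's baseline is the mean reward of the other seven threads, giving a critic-free, unbiased advantage estimate. Variable names are illustrative, not taken from the training code.

```python
# Illustrative RLOO advantage computation (not the actual training code).
import numpy as np

def rloo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Advantage of each thread = its reward minus the mean reward of the others."""
    k = rewards.shape[0]
    leave_one_out_mean = (rewards.sum() - rewards) / (k - 1)
    return rewards - leave_one_out_mean

rewards = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0])  # 8 threads for one prompt
print(rloo_advantages(rewards))
```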
|
|
|
|
|
#### Reward Design |
|
|
- Combined reward signal from:
  - **AI feedback** (semantic equivalence via an external LLM judge)
  - **Format adherence reward** (ensures correct agent behavior)
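
A minimal sketch of how such a combined signal might be assembled is given below; the gating and the reward values are assumptions for illustration, not the report's exact formulation:

```python
# Assumed formulation, for illustration only: format adherence gates the
# AI-feedback reward, so malformed rollouts earn nothing regardless of answer quality.
def combined_reward(well_formatted: bool, judged_semantically_equivalent: bool) -> float:
    if not well_formatted:
        return 0.0  # format-adherence component
    return 1.0 if judged_semantically_equivalent else 0.0  # AI-feedback component

print(combined_reward(True, True), combined_reward(True, False), combined_reward(False, True))
# 1.0 0.0 0.0
```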
|
|
|
|
|
#### Speeds, Sizes, Times |
|
|
- **Model size:** 7 billion parameters |
|
|
- **Training duration:** ~5 days on 8 × A100 80G GPUs |
|
|
- **Checkpoint size:** ~13 GB |
|
|
|
|
|
--- |
|
|
|
|
|
## Evaluation |
|
|
|
|
|
### Testing Data, Factors & Metrics |
|
|
|
|
|
#### Testing Data |
|
|
10 open-domain research and QA benchmarks: |
|
|
- NQ, TriviaQA, PopQA, HotpotQA, 2WikiMultiHopQA, Musique, Bamboogle, GAIA, BrowseComp, Humanity’s Last Exam |
|
|
|
|
|
#### Factors |
|
|
- Benchmarks differ by reasoning depth, retrieval dependence, and factual precision requirements. |
|
|
- Evaluations disaggregate by dataset difficulty and task type (single-hop vs multi-hop). |
|
|
|
|
|
#### Metrics |
|
|
- Mean accuracy (mean@4): per-question accuracy averaged over four independent research threads
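
As a hypothetical illustration of that aggregation, each question is answered by four independent research threads, the per-question score is the fraction of threads judged correct, and the benchmark score averages over questions:

```python
# Hypothetical illustration of mean@4 aggregation.
def mean_at_k(judgments_per_question: list[list[bool]]) -> float:
    per_question = [sum(j) / len(j) for j in judgments_per_question]
    return sum(per_question) / len(per_question)

# Two questions, four threads each: 3/4 correct and 1/4 correct -> 0.5 overall.
print(mean_at_k([[True, True, False, True], [False, False, True, False]]))
```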
|
|
|
|
|
### Results |
|
|
|
|
|
**PokeeResearch-7B** (PR in the table below) and its **RTS variant** (PR+) outperform all baselines at the 7B scale across the 10 benchmarks.
|
|
Highlights (mean@4 accuracy): |
|
|
| **Method** | **HLE** | **GAIA** | **BrowseComp** | **BAMB** | **2WIKI** | **TQ** | **NQ** | **POPQA** | **MUSIQUE** | **HOTPOTQA** | |
|
|
|-------------|----------|-----------|----------------|-----------|-----------|----------|----------|-------------|---------------|----------------| |
|
|
| R1searcher | 5.4 | 8.3 | 1.0 | 63.2 | 61.4 | 77.2 | 59.6 | 51.8 | 35.8 | 62.4 | |
|
|
| SearchR1 | 13.0 | 18.7 | 0.4 | 67.8 | 62.8 | 81.0 | 67.6 | 59.6 | 33.2 | 63.2 | |
|
|
| ZeroSearch | 8.6 | 9.9 | 1.4 | 51.4 | 33.6 | 61.6 | 48.2 | 38.0 | 19.0 | 32.4 | |
|
|
| ASearcher | 13.8 | 22.1 | 3.2 | 68.8 | 69.2 | 85.2 | 71.2 | 58.2 | 35.8 | 71.0 | |
|
|
| DeepResearcher | 6.0 | 24.03 | 1.8 | 71.0 | 58.8 | 82.2 | 60.2 | 55.2 | 26.8 | 56.6 | |
|
|
| **PR** | **15.2** | **36.9** | **5.4** | **74.5** | **74.0** | **91.3** | **75.1** | **59.8** | **39.8** | **71.2** | |
|
|
| **PR+** | **17.6** | **41.3** | **8.4** | **75.0** | **75.0** | **91.8** | **75.0** | **60.0** | **41.4** | **71.6** | |
|
|
|
|
|
#### Summary |
|
|
The PokeeResearch-7B variants achieve **state-of-the-art performance among 7B-scale open deep research agents**, validating the RLAIF and reasoning-scaffold design for robust, verifiable research workflows.
|
|
|
|
|
--- |
|
|
|
|
|
## Technical Specifications |
|
|
|
|
|
### Model Architecture and Objective |
|
|
- **Base Architecture:** Transformer decoder (Qwen2.5-7B-Instruct backbone) |
|
|
- **Objective:** Reinforcement learning with AI feedback to maximize semantic correctness and alignment with human-style reasoning |
|
|
|
|
|
### Compute Infrastructure |
|
|
#### Hardware |
|
|
- 8 × NVIDIA A100 80 GB GPUs for training; 1 × A100 80 GB GPU for inference
|
|
--- |
|
|
|
|
|
## Citation |
|
|
|
|
|
**BibTeX:** |
|
|
```bibtex
@article{pokee2025deepresearch,
  title={PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold},
  author={Yi Wan and Jiuqi Wang and Liam Li and Jinsong Liu and Ruihao Zhu and Zheqing Zhu},
  journal={Pokee AI Technical Report},
  year={2025},
  url={https://arxiv.org/pdf/2510.15862}
}
```
|
|
|
|
|
**APA:** |
|
|
Wan, Y., Wang, J., Li, L., Liu, J., Zhu, R., & Zhu, Z. (2025). *PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold.* Pokee AI. |
|
|
|
|
|
--- |
|
|
|
|
|
## Glossary |
|
|
|
|
|
- **RLAIF:** Reinforcement Learning from AI Feedback – optimization using LLM-based reward signals. |
|
|
- **RLOO:** REINFORCE Leave-One-Out – unbiased policy gradient variant for on-policy learning. |
|
|
- **RTS:** Research Threads Synthesis – synthesis of multiple independent reasoning threads at inference time. |
|
|
|
|
|
--- |
|
|
|
|
|
## More Information |
|
|
For technical details, visit: [https://github.com/Pokee-AI/PokeeResearchOSS](https://github.com/Pokee-AI/PokeeResearchOSS) |
|
|
For inquiries, contact: [email protected] |
|
|
|
|
|
--- |
|
|
|
|
|
## Model Card Authors |
|
|
**Yi Wan**, **Jiuqi Wang**, Liam Li, Jinsong Liu, Ruihao Zhu, and Zheqing Zhu — Pokee AI Research Team |
|
|
|
|
|
## Model Card Contact |
|
|
Pokee AI Team — [email protected] |
|
|
|