MIRIAD


MIRIAD is a curated, million-scale Medical Instruction and RetrIeval Dataset. It contains 5.8 | 4.4 million medical question-answer pairs, distilled from peer-reviewed biomedical literature using LLMs. MIRIAD provides structured, high-quality QA pairs that support diverse downstream tasks such as RAG, medical retrieval, hallucination detection, and instruction tuning. Any follow-up work will also be hosted here. We hope you find it helpful. Have fun building!

The dataset was introduced in our arXiv preprint.
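As a rough illustration of the retrieval and RAG use cases mentioned above, the sketch below ranks a handful of question-answer pairs against a query using bag-of-words cosine similarity and builds a prompt from the top hits. Everything in it is an assumption made for illustration: the `question` and `answer` field names, the toy records (invented here, not taken from MIRIAD), and the helper names `retrieve` and `build_prompt`. A real pipeline would likely load the released data and use a learned embedding model rather than token overlap.

```python
import math
import re
from collections import Counter

# Toy records invented for this sketch; they are NOT drawn from MIRIAD.
# The "question"/"answer" field names are assumptions, not the dataset schema.
QA_PAIRS = [
    {"question": "What does insulin regulate?",
     "answer": "Insulin regulates blood glucose levels."},
    {"question": "What organ produces bile?",
     "answer": "The liver produces bile."},
    {"question": "What do red blood cells carry?",
     "answer": "Red blood cells carry oxygen."},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, pairs, k=1):
    """Return up to k pairs with non-zero similarity to the query, best first."""
    query_vec = Counter(tokenize(query))
    scored = []
    for pair in pairs:
        doc_vec = Counter(tokenize(pair["question"] + " " + pair["answer"]))
        scored.append((cosine(query_vec, doc_vec), pair))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pair for score, pair in scored[:k] if score > 0]


def build_prompt(query, pairs, k=2):
    """Assemble a simple RAG-style prompt from the retrieved pairs."""
    context = "\n".join(
        f"Q: {p['question']}\nA: {p['answer']}" for p in retrieve(query, pairs, k)
    )
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


if __name__ == "__main__":
    print(build_prompt("Which organ produces bile?", QA_PAIRS))
```

Token overlap is used here only because it keeps the sketch dependency-free; swapping `cosine` for a dense-embedding similarity would leave the rest of the flow unchanged.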

Licensing

In this paper, we use the Semantic Scholar Open Research Corpus (S2ORC) as the source of documents to generate our dataset. These documents are made available under the Open Data Commons Attribution License (ODC-By) v1.0 (https://opendatacommons.org/licenses/by/1-0/), which permits reuse and modification of the dataset, including for commercial use, provided that proper attribution is given. To construct our dataset, we used S2ORC documents as input to OpenAI's language models. The resulting model-generated outputs are owned by us, as per OpenAI's Terms & Policies (https://openai.com/policies/), which also address medical use of outputs. Since our outputs are generated using both S2ORC documents and OpenAI's models, MIRIAD is released under the ODC-By v1.0 license, and subject to the restrictions in OpenAI's Terms & Policies to the extent that they may be applicable.

Intended use

At this stage, the outputs of this study and the provided assets are supplied exclusively for academic research and educational exploration. They have not been reviewed or cleared by any regulatory body, and accordingly must not be used for clinical decision-making or considered a certified medical device.

📖 Cite

@misc{zheng2025miriadaugmentingllmsmillions,
      title={MIRIAD: Augmenting LLMs with millions of medical query-response pairs}, 
      author={Qinyue Zheng and Salman Abdullah and Sam Rawal and Cyril Zakka and Sophie Ostmeier and Maximilian Purk and Eduardo Reis and Eric J. Topol and Jure Leskovec and Michael Moor},
      year={2025},
      eprint={2506.06091},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2506.06091}, 
}

Dataset feedback

To report factual errors or other issues in this dataset, please email us with the keyword "MIRIAD Edit" in the subject line. We will collect the reports and apply updates in batches. Thank you!
