Medical Finetuning Datasets
Viewer • Updated • 134M • 11 • 1
Note: Huge (130M). Level 1.
FremyCompany/AGCT-Dataset
Viewer • Updated • 421k • 23 • 17
Note: This dataset contains 422,070 short, computer-generated definitions for SnomedCT concepts, covering domains such as diseases, procedures, drugs, and anatomy. The definitions were generated by prompting the OpenAI Turbo model, a variant of GPT 3.5, with a high-quality verbalization of the SnomedCT relationships of the concept being defined. Good for continued pretraining (CP). Level 1.
uiyunkim-hub/pubmed-abstract
Viewer • Updated • 27.7M • 62 • 6
Note: Base-vocabulary pretraining. Level 1.
dmariko/clinical-trials-xml-2018-2024
Viewer • Updated • 628k • 15 • 4
Note: Good long texts for continued pretraining (CP) on clinical trials. Level 1.
darknight054/pubmed_clean
Viewer • Updated • 4.36M • 42 • 1
Note: Best one for continued pretraining.
Jonas7/pubmed_full
Viewer • Updated • 6.62M • 91
Note: Anchor, positive, negative triplets; might be useful in downstream RL tasks.
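Triplets of this shape are most often consumed by a margin-based contrastive objective. The sketch below shows that idea only; it is not taken from the dataset card. The function names, the example strings, and the toy bag-of-words cosine similarity (standing in for a real embedding model) are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Toy similarity: cosine over whitespace-token counts (stand-in for embeddings)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def triplet_margin_loss(anchor: str, positive: str, negative: str,
                        margin: float = 0.2) -> float:
    """Hinge loss that is zero once the positive beats the negative by `margin`."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))


# Hypothetical triplet; real rows would come from the dataset's own columns.
loss = triplet_margin_loss(
    "aspirin reduces fever",
    "aspirin reduces fever",
    "knee surgery recovery",
)
```

A loss of zero means the triplet is already well separated under the chosen similarity; harder negatives give a positive loss and therefore a training signal.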
rntc/mm-icd-notes
Viewer • Updated • 122k • 11 • 6
Note: Clean 122k notes, but much is missing from the original 330k. Good for SFT and for later stages of ICD code prediction.
yanjx21/PubMed
Viewer • Updated • 19.5M • 11 • 1
Note: 81% of entries are under 2,000 characters and 99% are under 4,000. A good list of abstracts. 19M? Unclear how they were collected.
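Length figures like these can be checked on a sample of the abstracts. The following is a minimal sketch of one way to do it, not the method used for the note; the function name `share_under` and the sample strings are hypothetical, and `texts` is assumed to be any iterable of strings pulled from the dataset.

```python
from bisect import bisect_left


def share_under(texts, limits=(2000, 4000)):
    """Return {limit: fraction of texts with fewer than `limit` characters}."""
    lengths = sorted(len(t) for t in texts)
    if not lengths:
        return {limit: 0.0 for limit in limits}
    # bisect_left gives the count of lengths strictly below each limit.
    return {limit: bisect_left(lengths, limit) / len(lengths) for limit in limits}


# Synthetic sample standing in for real abstracts.
sample = ["a" * 10, "b" * 2000, "c" * 3999, "d" * 5000]
shares = share_under(sample)
```

The same function works on a streamed subset, which avoids downloading all 19.5M rows just to pick a maximum sequence length.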
TomTBT/pmc_open_access_section
Viewer • Updated • 7.22M • 605 • 3
Note: Full PubMed texts with commercial, non-commercial, and other subsets; too big to handle.
EleutherAI/pile
Updated • 8.96k • 459
Note: The Pile, roughly 800 GB of general-domain text (not only books). Suitable for LLM training, but not specific to the medical domain.
starmpcc/Asclepius-Synthetic-Clinical-Notes
Viewer • Updated • 158k • 396 • 98
Note: Good for SFT and also for continued pretraining (CP), but data quality needs a manual check; I have doubts.
ravistech/clinical-trial-llm-Open_condition_Cleaned_dup_NCT_ID
Viewer • Updated • 277k • 49 • 1
Note: Good-quality clinical trial data with inclusion and exclusion criteria.
laion/medrXiv-pdf
Viewer • Updated • 57.6k • 231 • 5
Note: Needs a PDF-to-text conversion script, but 80 GB of text is too much. Could perhaps be converted to a knowledge graph for insights. RAG over this would be exceptional.
FreedomIntelligence/medical-o1-reasoning-SFT
Viewer • Updated • 90.1k • 7.79k • 924
FreedomIntelligence/DxBench
Viewer • Updated • 2.79k • 47 • 8
FreedomIntelligence/Disease_Database
Viewer • Updated • 19.2k • 281 • 28
FreedomIntelligence/CoD-PatientSymDisease
Viewer • Updated • 76.4k • 193 • 14
FreedomIntelligence/XMedbench
Viewer • Updated • 21.3k • 112 • 12
stellalisy/MediQ_AskDocs_preference
Viewer • Updated • 193k • 25 • 2
empirischtech/med-qa-orpo-dpo
Viewer • Updated • 357k • 35 • 5
hw-hwei/MedThoughts-8K
Viewer • Updated • 7.72k • 31 • 3
FremyCompany/BioLORD-Dataset
Viewer • Updated • 25M • 30 • 18
FreedomIntelligence/ApolloMoEDataset
Viewer • Updated • 293k • 162 • 5
rntc/open-clinical-cases-pubmed
Viewer • Updated • 913k • 35 • 2
blue-blues/medical_cot
Viewer • Updated • 403k • 28 • 6
xz97/MedInstruct
Viewer • Updated • 216 • 108 • 20
dmis-lab/meerkat-instructions
Viewer • Updated • 440k • 160 • 7
dvssr/umls
Viewer • Updated • 150M • 93 • 4
adlbh/umls-concepts
Viewer • Updated • 475k • 12 • 4
Jaafer/cleaned_umls_corpus
Viewer • Updated • 1.46M • 6 • 1
Jaafer/biomedical_question_one_disease_word
Viewer • Updated • 1.19k • 6 • 1
Jaafer/ICD_ontology_dataset
Viewer • Updated • 15.1k • 8 • 1
Jaafer/disease_ontology_dataset
Viewer • Updated • 11.5k • 8 • 1
epfl-llm/guidelines
Viewer • Updated • 38k • 954 • 138
MedRAG/textbooks
Viewer • Updated • 126k • 385 • 51
RecurvAI/Recurv-Clinical-Dataset
Viewer • Updated • 12.6k • 28 • 5
RJT1990/GeneralThoughtArchive
Viewer • Updated • 431k • 14.5k • 58
FreedomIntelligence/Medical-R1-Distill-Data
Viewer • Updated • 22k • 258 • 63
UCSC-VLAA/m23k-tokenized
Viewer • Updated • 23.5k • 159 • 5
UCSC-VLAA/MedReason
Viewer • Updated • 32.7k • 389 • 75
openlifescienceai/medmcqa
Viewer • Updated • 193k • 13.4k • 181
bigbio/med_qa
Updated • 3.72k • 115
leowei31/MIMIC_IV_lab_test_individual
Viewer • Updated • 633k • 5 • 1