HarmBench Classifiers
Classifiers for red teaming evaluation in HarmBench
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal • Paper • 2402.04249 • Published Feb 6, 2024 • 6
cais/HarmBench-Llama-2-13b-cls • Text Generation • 13B • Updated Mar 17, 2024 • 22.8k • 24
cais/HarmBench-Llama-2-13b-cls-multimodal-behaviors • Text Generation • 13B • Updated Apr 11, 2024 • 162
cais/HarmBench-Mistral-7b-val-cls • Text Generation • 7B • Updated Mar 17, 2024 • 22.6k • 6

WMDP Benchmark
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning • Paper • 2403.03218 • Published Mar 5, 2024 • 1
cais/wmdp • Viewer • Updated Apr 27, 2024 • 3.67k • 9.4k • 21
cais/wmdp-bio-forget-corpus • Viewer • Updated May 29 • 24.5k • 962 • 1
cais/wmdp-cyber-forget-corpus • Viewer • Updated May 29 • 1k • 314 • 3