rogue-security/prompt-injections-benchmark
Viewer • Updated • 5k • 923 • 30
This is a simple classifier meant to filter out common attack vectors for LLMs.
The main usecase for this in AI agents. This model is best used as a gate between a outside input (via email, text, etc) and the inner model (Opus, Codex, etc) that actually will run the prompts. This is not a catchall for all of the attacks, but it akin to making sure the doors are locked to your house.
Base model
distilbert/distilbert-base-cased