Method Card: Superhero Classification Prompting (0/1/5-shot)
Summary
Added zero-shot, adaptive one-shot (similarity-based example selection), and variable-K few-shot pipelines (evaluated at K=5). Example selection is reproducible via TF-IDF + cosine similarity. Evaluation utilities compute accuracy / precision / recall / F1 and produce a per-example results DataFrame. These pipelines were used to identify the relevant comic universe from a short text describing a specific superhero.
Data
- Dataset: rlogh/superhero-texts (Hugging Face)
- Task: multiclass classification (DC, Marvel, Image, Dark Horse)
- Splits: sklearn train_test_split, 0.80 train / 0.20 test
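The split above can be sketched as follows. The texts and labels here are placeholders standing in for rows of rlogh/superhero-texts (loading the actual dataset, e.g. via the `datasets` library, is assumed to happen upstream):

```python
from sklearn.model_selection import train_test_split

# Placeholder corpus standing in for rlogh/superhero-texts rows.
texts = [
    "Batman patrols Gotham City at night.",
    "Spider-Man swings through New York.",
    "Spawn returns from the dead to fight evil.",
    "Hellboy investigates paranormal threats.",
    "Superman defends Metropolis.",
    "Iron Man builds a new suit of armor.",
    "Invincible battles alien invaders.",
    "A paranormal agent strikes at dawn.",
    "Wonder Woman trains on Themyscira.",
    "Captain America leads the Avengers.",
]
labels = ["DC", "Marvel", "Image", "Dark Horse", "DC",
          "Marvel", "Image", "Dark Horse", "DC", "Marvel"]

# 0.80/0.20 split with the fixed seed used throughout this card.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.20, random_state=42
)
```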
Models / APIs
- LLM used: engine created via llama-cpp-python from the following Hugging Face repo: repo_id="bartowski/Qwen_Qwen3-4B-Instruct-2507-GGUF", filename="Qwen_Qwen3-4B-Instruct-2507-Q4_K_M.gguf"
- Similarity backend: sklearn TF-IDF + cosine similarity
Prompting Strategy
- Zero-shot: Direct classification prompt containing no examples.
- Adaptive one-shot: selects the single most similar example from the training pool via TF-IDF + cosine similarity and inserts it as an in-context example.
- Few-shot (K=5): selects the top-K most similar examples by cosine similarity over TF-IDF vectors.
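A minimal sketch of the selection and prompt-assembly steps described above. The function names and the exact prompt wording are illustrative assumptions, not the repo's actual code; the TF-IDF + cosine mechanics match the strategy stated here:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_examples(query, pool_texts, pool_labels, k=5):
    """Return the k training examples most similar to `query` under
    TF-IDF + cosine similarity. Fully deterministic, so selection is
    reproducible across runs."""
    vectorizer = TfidfVectorizer()
    pool_vecs = vectorizer.fit_transform(pool_texts)
    query_vec = vectorizer.transform([query])
    sims = cosine_similarity(query_vec, pool_vecs).ravel()
    top = sims.argsort()[::-1][:k]  # indices of the k highest similarities
    return [(pool_texts[i], pool_labels[i]) for i in top]

def build_prompt(query, examples):
    """Assemble the classification prompt. With examples=[] this is the
    zero-shot prompt; with one example, the adaptive one-shot prompt."""
    lines = ["Classify the superhero description into one of: "
             "DC, Marvel, Image, Dark Horse.", ""]
    for text, label in examples:
        lines += [f"Description: {text}", f"Universe: {label}", ""]
    lines += [f"Description: {query}", "Universe:"]
    return "\n".join(lines)
```

With `k=1` this reduces to the adaptive one-shot pipeline; the three strategies differ only in how many examples are passed to `build_prompt`.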
Evaluation Protocol
- Metrics: accuracy, precision, recall, F1; confusion matrix
- Latency: avg wall-clock per example
- Seed: 42
- Reproducibility: prompts/selection/eval code in this repo
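The evaluation utilities can be sketched as below. The macro averaging choice and the `evaluate` signature are assumptions (the card does not state which multiclass average is used):

```python
import pandas as pd
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

def evaluate(y_true, y_pred, texts, latencies):
    """Compute aggregate metrics (macro-averaged over the four classes)
    and a per-example results DataFrame."""
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    cm = confusion_matrix(y_true, y_pred)
    per_example = pd.DataFrame({
        "text": texts,
        "true": y_true,
        "pred": y_pred,
        "correct": [t == p for t, p in zip(y_true, y_pred)],
        "latency_s": latencies,
    })
    metrics = {"accuracy": acc, "precision": prec, "recall": rec,
               "f1": f1, "confusion_matrix": cm}
    return metrics, per_example
```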
Results (Test)
- Zero-shot: acc 0.0, f1 0.0, ~216.700s/ex
- One-shot: acc 0.84, f1 0.89605, ~129.000s/ex
- 5-shot: acc 0.85, f1 0.90739, ~156.470s/ex
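The per-example latencies above were presumably obtained as total wall-clock time over the split divided by the number of examples. A minimal sketch (the `classify` callable, one LLM call per example, is a stand-in):

```python
import time

def avg_latency(classify, examples):
    """Run `classify` over all examples and return the predictions
    plus average wall-clock seconds per example."""
    start = time.perf_counter()
    preds = [classify(x) for x in examples]
    elapsed = time.perf_counter() - start
    return preds, elapsed / len(examples)
```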
Tradeoffs
- Zero-shot: no context tokens from examples and the lowest expected quality. For this multiclass task (rather than a binary one), zero-shot proved unsuitable for this model (0.0 test accuracy).
- One-shot (adaptive): small added context cost (one extra example's tokens) but a potentially large quality boost when the nearest example is highly relevant. Lower observed latency than zero-shot, as the example appears to help the model narrow the candidate classes.
- Few-shot (K=5): higher context-token cost and latency; substantially improves quality for this task, as the examples help the model define the classes. Cost scales roughly linearly with K for token-based APIs; latency also increases with prompt size and with model-call throughput limits.
- Selection cost: computing TF-IDF on the training set is cheap and local while remaining reproducible (consistent example selection).
Limits & Risks
- Label leakage: if any selected example reveals the target label distribution, or if there is target leakage between training and test texts, evaluation can be optimistic.
- Bias: selection favors examples textually similar to query; if training data is biased or has label imbalance, selection amplifies biases for similar inputs.
- Semantic mismatch: TF-IDF selection may mis-rank examples for queries with novel wording.
Reproducibility
- Code: prompts/, selection.py, evaluate_prompting.py
- Seed: 42
- Python ≥ 3.10
Usage Disclosure
- Pipelines are intended for research and educational comparison.
- Outputs should not be deployed in production without further validation for robustness, fairness, and safety.