Method Card: Superhero Classification Prompting (0/1/5-shot)
Summary
Added zero-shot, adaptive one-shot (similarity-based example selection), and variable-K few-shot pipelines (evaluated at K=5). Example selection is reproducible via TF-IDF + cosine similarity. Evaluation utilities compute accuracy / precision / recall / F1 and produce a per-example results DataFrame. These pipelines were used to identify the relevant comic universe from a short text describing a specific superhero.
Data
- Dataset: rlogh/superhero-texts (Hugging Face)
- Task: multiclass classification (DC, Marvel, Image, Dark Horse)
- Splits: sklearn train_test_split, 0.80 train / 0.20 test
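The split above can be sketched as follows. The texts and labels here are placeholders standing in for rows of rlogh/superhero-texts (loading the actual dataset, e.g. via the `datasets` library, is assumed to happen upstream):

```python
from sklearn.model_selection import train_test_split

# Placeholder corpus standing in for rlogh/superhero-texts rows.
texts = [
    "Batman patrols Gotham City at night.",
    "Spider-Man swings through New York.",
    "Spawn returns from the dead to fight evil.",
    "Hellboy investigates paranormal threats.",
    "Superman defends Metropolis.",
    "Iron Man builds a new suit of armor.",
    "Invincible battles alien invaders.",
    "A paranormal agent strikes at dawn.",
    "Wonder Woman trains on Themyscira.",
    "Captain America leads the Avengers.",
]
labels = ["DC", "Marvel", "Image", "Dark Horse", "DC",
          "Marvel", "Image", "Dark Horse", "DC", "Marvel"]

# 0.80/0.20 split with the fixed seed used throughout this card.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.20, random_state=42
)
```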
Models / APIs
- LLM used: engine created via llama-cpp-python from the following Hugging Face repo: repo_id="bartowski/Qwen_Qwen3-4B-Instruct-2507-GGUF", filename="Qwen_Qwen3-4B-Instruct-2507-Q4_K_M.gguf"
- Similarity backend: sklearn TF-IDF + cosine similarity
Prompting Strategy
- Zero-shot: Direct classification prompt containing no examples.
- Adaptive one-shot: selects the single most similar example from the training pool via TF-IDF + cosine similarity and inserts it as an in-context example.
- Few-shot (K=5): selects the top-K most similar examples by cosine similarity over TF-IDF vectors.
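A minimal sketch of the selection and prompt-assembly steps described above. The function names and the exact prompt wording are illustrative assumptions, not the repo's actual code; the TF-IDF + cosine mechanics match the strategy stated here:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_examples(query, pool_texts, pool_labels, k=5):
    """Return the k training examples most similar to `query` under
    TF-IDF + cosine similarity. Fully deterministic, so selection is
    reproducible across runs."""
    vectorizer = TfidfVectorizer()
    pool_vecs = vectorizer.fit_transform(pool_texts)
    query_vec = vectorizer.transform([query])
    sims = cosine_similarity(query_vec, pool_vecs).ravel()
    top = sims.argsort()[::-1][:k]  # indices of the k highest similarities
    return [(pool_texts[i], pool_labels[i]) for i in top]

def build_prompt(query, examples):
    """Assemble the classification prompt. With examples=[] this is the
    zero-shot prompt; with one example, the adaptive one-shot prompt."""
    lines = ["Classify the superhero description into one of: "
             "DC, Marvel, Image, Dark Horse.", ""]
    for text, label in examples:
        lines += [f"Description: {text}", f"Universe: {label}", ""]
    lines += [f"Description: {query}", "Universe:"]
    return "\n".join(lines)
```

With `k=1` this reduces to the adaptive one-shot pipeline; the three strategies differ only in how many examples are passed to `build_prompt`.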
Evaluation Protocol
- Metrics: accuracy, precision, recall, F1; confusion matrix
- Latency: avg wall-clock per example
- Seed: 42
- Reproducibility: prompts/selection/eval code in this repo
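The evaluation utilities can be sketched as below. The macro averaging choice and the `evaluate` signature are assumptions (the card does not state which multiclass average is used):

```python
import pandas as pd
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

def evaluate(y_true, y_pred, texts, latencies):
    """Compute aggregate metrics (macro-averaged over the four classes)
    and a per-example results DataFrame."""
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0
    )
    cm = confusion_matrix(y_true, y_pred)
    per_example = pd.DataFrame({
        "text": texts,
        "true": y_true,
        "pred": y_pred,
        "correct": [t == p for t, p in zip(y_true, y_pred)],
        "latency_s": latencies,
    })
    metrics = {"accuracy": acc, "precision": prec, "recall": rec,
               "f1": f1, "confusion_matrix": cm}
    return metrics, per_example
```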
Results (Test)
- Zero-shot: acc 0.0, f1 0.0, ~216.700s/ex
- One-shot: acc 0.84, f1 0.89605, ~129.000s/ex
- 5-shot: acc 0.85, f1 0.90739, ~156.470s/ex
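The per-example latencies above were presumably obtained as total wall-clock time over the split divided by the number of examples. A minimal sketch (the `classify` callable, one LLM call per example, is a stand-in):

```python
import time

def avg_latency(classify, examples):
    """Run `classify` over all examples and return the predictions
    plus average wall-clock seconds per example."""
    start = time.perf_counter()
    preds = [classify(x) for x in examples]
    elapsed = time.perf_counter() - start
    return preds, elapsed / len(examples)
```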
Tradeoffs
- Zero-shot: no context tokens from examples and the lowest expected quality. For this multiclass task (rather than a binary one), zero-shot proved unsuitable for this model (0.0 test accuracy).
- One-shot (adaptive): small added context cost (one extra example's tokens) but a potentially large quality boost when the nearest example is highly relevant. Lower observed latency than zero-shot, as the example appears to help the model narrow the candidate classes.
- Few-shot (K=5): higher context-token cost and latency; substantially improves quality for this task, as the examples help the model define the classes. Cost scales roughly linearly with K for token-based APIs; latency also increases with prompt size and with model-call throughput limits.
- Selection cost: computing TF-IDF on the training set is cheap and local while remaining reproducible (consistent example selection).
Limits & Risks
- Label leakage: if any selected example reveals the target label distribution, or if there is target leakage between training and test texts, evaluation can be optimistic.
- Bias: selection favors examples textually similar to query; if training data is biased or has label imbalance, selection amplifies biases for similar inputs.
- Semantic mismatch: TF-IDF selection may mis-rank examples for queries with novel wording.
Reproducibility
- Code: prompts/, selection.py, evaluate_prompting.py
- Seed: 42
- Python ≥ 3.10
Usage Disclosure
- Pipelines are intended for research and educational comparison.
- Outputs should not be deployed in production without further validation for robustness, fairness, and safety.