Method Card β€” Superhero Classification Prompting (0/1/5-shot)

Summary

Added zero-shot, adaptive one-shot (similarity-based example selection), and variable-K few-shot pipelines (evaluated at K=5). Example selection is reproducible via TF-IDF + cosine similarity. Evaluation utilities compute accuracy / precision / recall / F1 and produce a per-example results DataFrame. Ultimately, these pipelines were used to identify the relevant comic universe from a short text describing a specific superhero.

Data

  • Dataset: rlogh/superhero-texts (Hugging Face)
  • Task: multiclass classification over four universes (DC, Marvel, Image, Dark Horse)
  • Splits: sklearn train/test 0.80/0.20 (seed 42; a loading/splitting sketch follows)
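
A minimal sketch of the load/split step, assuming the dataset exposes text and label columns (check the actual schema on Hugging Face):

```python
# Sketch: load rlogh/superhero-texts and create the 0.80/0.20 split (seed 42).
# Column names "text" and "label" are assumptions; adjust to the real schema.
from datasets import load_dataset
from sklearn.model_selection import train_test_split

ds = load_dataset("rlogh/superhero-texts", split="train")
train_texts, test_texts, train_labels, test_labels = train_test_split(
    list(ds["text"]), list(ds["label"]), test_size=0.20, random_state=42
)
```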

Models / APIs

  • LLM: engine created via llama-cpp-python from the Hugging Face repo repo_id="bartowski/Qwen_Qwen3-4B-Instruct-2507-GGUF", filename="Qwen_Qwen3-4B-Instruct-2507-Q4_K_M.gguf" (loading sketch below)
  • Similarity backend: sklearn TF-IDF + cosine similarity
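
A minimal loading sketch via llama-cpp-python's Hub integration (n_ctx is an assumed setting, sized to leave room for 5-shot prompts):

```python
# Sketch: create the LLM engine from the GGUF weights on the Hugging Face Hub.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="bartowski/Qwen_Qwen3-4B-Instruct-2507-GGUF",
    filename="Qwen_Qwen3-4B-Instruct-2507-Q4_K_M.gguf",
    n_ctx=4096,      # assumed; few-shot prompts need headroom for examples
    verbose=False,
)
```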

Prompting Strategy

  • Zero-shot: direct classification prompt containing no examples.
  • Adaptive one-shot: selects the single most similar example from the training pool via TF-IDF + cosine similarity and inserts it into the prompt as a worked example.
  • Few-shot (K=5): selects the top-K most similar examples by cosine similarity over TF-IDF vectors (see the selection sketch after this list).
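
A sketch of the shared selection step plus a hypothetical prompt template (select_examples and build_prompt are illustrative names, not necessarily the repo's selection.py API; train_texts/train_labels come from the split sketch above):

```python
# Sketch: TF-IDF + cosine-similarity example selection. k=1 reproduces the
# adaptive one-shot setting, k=5 the few-shot setting, k=0 zero-shot.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

vectorizer = TfidfVectorizer()
train_vecs = vectorizer.fit_transform(train_texts)  # fit on the training pool only

def select_examples(query: str, k: int = 5) -> list[int]:
    """Indices of the top-k training examples most similar to the query."""
    sims = cosine_similarity(vectorizer.transform([query]), train_vecs).ravel()
    return sims.argsort()[::-1][:k].tolist()

def build_prompt(query: str, k: int) -> str:
    """Hypothetical prompt template; k=0 yields the zero-shot prompt."""
    shots = "".join(
        f"Text: {train_texts[i]}\nUniverse: {train_labels[i]}\n\n"
        for i in (select_examples(query, k) if k else [])
    )
    return ("Classify the superhero description into one of: "
            "DC, Marvel, Image, Dark Horse.\n\n"
            f"{shots}Text: {query}\nUniverse:")
```

A completion call such as llm(build_prompt(query, 5), max_tokens=8) then yields the predicted universe string to be parsed and scored.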

Evaluation Protocol

  • Metrics: accuracy, precision, recall, F1; confusion matrix (an evaluation sketch follows this list)
  • Latency: avg wall-clock per example
  • Seed: 42
  • Reproducibility: prompts/selection/eval code in this repo
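
A sketch of the evaluation utilities described above (macro averaging is an assumption; the repo may average differently):

```python
# Sketch: aggregate metrics plus the per-example results DataFrame.
import pandas as pd
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_recall_fscore_support)

def evaluate(y_true, y_pred, latencies):
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="macro", zero_division=0  # averaging assumed
    )
    per_example = pd.DataFrame({
        "true": y_true, "pred": y_pred, "latency_s": latencies,
        "correct": [t == p for t, p in zip(y_true, y_pred)],
    })
    metrics = {"accuracy": accuracy_score(y_true, y_pred),
               "precision": prec, "recall": rec, "f1": f1}
    return metrics, confusion_matrix(y_true, y_pred), per_example
```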

Results (Val/Test)

  • Test:
    • Zero-shot: acc 0.0, f1 0.0, ~216.700s/ex
    • One-shot: acc 0.84, f1 0.89605, ~129.000s/ex
    • 5-shot: acc 0.85, f1 0.90739, ~156.470s/ex

Tradeoffs

  • Zero-shot: no context tokens from examples and the lowest expected quality. Because the task is multiclass (not binary), zero-shot prompting proved unsuitable for this model (0.0 accuracy on the test set).
  • One-shot (adaptive): small added context cost (one extra example's tokens) but a potentially large quality boost when the nearest example is highly relevant. In our runs it was also faster than zero-shot, as the example helps the model narrow the candidate classes.
  • Few-shot (K=5): higher context token cost and latency; substantially improves quality for this task, as the model benefits from examples that define the classes. Cost scales roughly linearly with K for token-based APIs; latency also grows with prompt size and with model-call throughput limits (a token-count sketch follows this list).
  • Selection cost: computing TF-IDF on the training set is cheap and local while retaining reproducibility (consistent example selection).
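
The K-scaling claim above can be checked directly by tokenizing the assembled prompts with the loaded model's tokenizer (a sketch; llm and build_prompt come from the sketches above):

```python
# Sketch: measure context-token growth as K increases.
def prompt_token_count(llm, prompt: str) -> int:
    return len(llm.tokenize(prompt.encode("utf-8")))

for k in (0, 1, 5):  # "Caped vigilante of Gotham" is a made-up query
    print(k, prompt_token_count(llm, build_prompt("Caped vigilante of Gotham", k)))
```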

Limits & Risks

  • Label leakage: if any selected example reveals the target label or the label distribution, evaluation can be optimistic.
  • Bias: selection favors examples textually similar to the query; if the training data is biased or label-imbalanced, selection amplifies those biases for similar inputs.
  • Semantic mismatch: TF-IDF selection may mis-rank examples for queries with novel wording.

Reproducibility

  • Code: prompts/, selection.py, evaluate_prompting.py
  • Seed: 42
  • Python β‰₯ 3.10

Usage Disclosure

  • Pipelines are intended for research and educational comparison.
  • Outputs should not be deployed in production without further validation for robustness, fairness, and safety.