# Aparecium v2 – Pooled MPNet Reverser (S1 Baseline)

## Summary
- Task: Reconstruct natural-language crypto social-media posts from a single pooled MPNet embedding (reverse embedding).
- Focus: Crypto domain (social-media posts / short-form content).
- Checkpoint: `aparecium_v2_s1.pt` — S1 supervised baseline, trained on synthetic crypto social-media posts.
- Input contract: a pooled `all-mpnet-base-v2` vector of shape `(768,)`, not a token-level `(seq_len, 768)` matrix.
- Code: this repo only hosts weights; loading & decoding are implemented in the Aparecium codebase (the v2 training repo and service are analogous in spirit to the v1 project `SentiChain/aparecium-seq2seq-reverser`).
This is a pooled-embedding variant of Aparecium, distinct from the original token-level seq2seq reverser described in `SentiChain/aparecium-seq2seq-reverser`.

## Intended use
- Research / engineering:
  - Study how much crypto-domain information is recoverable from a single pooled embedding.
  - Prototype tools around embedding interpretability, diagnostics, and “gist reconstruction” from vectors.
- Not intended for:
  - Reconstructing private, user-identifying, or sensitive content.
  - Any de‑anonymization of embedding corpora.
Reconstruction quality depends heavily on:
- The upstream encoder (`sentence-transformers/all-mpnet-base-v2`),
- Domain match (crypto social-media posts vs. your data),
- Decode settings (beam vs. sampling, constraints, reranking).

## Model architecture
On the encoder side, we assume a pooled MPNet encoder:
- Recommended: `sentence-transformers/all-mpnet-base-v2` (768‑D pooled output).
On the decoder side, v2 uses the Aparecium components:
- EmbAdapter:
  - Input: pooled vector `e ∈ R^768`.
  - Output: pseudo‑sequence memory `H ∈ R^{B × S × D}` suitable for a transformer decoder (multi‑scale).
- Sketcher:
  - Lightweight network producing a “plan” and simple control flags (e.g., URL presence) from `e`.
  - In the S1 baseline checkpoint, it is trained but only lightly used at inference.
- RealizerDecoder:
  - Transformer decoder (GPT‑style) with:
    - `d_model = 768`, `n_layer = 12`, `n_head = 8`, `d_ff = 3072`
    - Dropout ≈ 0.1
  - Consumes `H` as cross‑attention memory and generates text tokens.
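For intuition about the shape contract, here is a minimal illustrative sketch of an adapter that expands a pooled vector into a fixed-length cross‑attention memory. The real `EmbAdapter` is multi‑scale and lives in the Aparecium codebase; the class name and `mem_len` below are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class PooledToMemoryAdapter(nn.Module):
    """Illustrative stand-in for EmbAdapter: pooled e (B, 768) -> pseudo-sequence memory H (B, S, 768)."""

    def __init__(self, d_model: int = 768, mem_len: int = 16):
        super().__init__()
        self.d_model = d_model
        self.mem_len = mem_len
        self.proj = nn.Linear(d_model, mem_len * d_model)  # expand the single vector into S memory slots
        self.norm = nn.LayerNorm(d_model)

    def forward(self, e: torch.Tensor) -> torch.Tensor:
        # e: (B, 768) pooled, L2-normalized MPNet embedding
        h = self.proj(e).view(e.size(0), self.mem_len, self.d_model)
        return self.norm(h)  # (B, S, D), consumed as cross-attention memory by the decoder

adapter = PooledToMemoryAdapter()
e = torch.nn.functional.normalize(torch.randn(2, 768), dim=-1)
memory = adapter(e)
print(memory.shape)  # torch.Size([2, 16, 768])
```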
Decoding:
- Deterministic beam search or sampling, with optional:
  - Constraints (e.g., require certain tickers/hashtags/amounts based on a plan).
  - Surrogate similarity scorer `r(x, e)` for reranking candidates.
  - Final MPNet cosine rerank across top‑K candidates.
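As a sketch of the final rerank step (candidate generation, constraints, and the surrogate scorer live in the Aparecium codebase and are not shown), one can re‑embed the top‑K candidates with MPNet and keep the one closest to the target vector:

```python
import torch
from sentence_transformers import SentenceTransformer

def rerank_by_mpnet_cosine(candidates, target_embedding, mpnet):
    """Return the candidate whose MPNet re-embedding has the highest cosine with the target vector."""
    cand = mpnet.encode(candidates, convert_to_tensor=True, normalize_embeddings=True)  # (K, 768)
    target = torch.as_tensor(target_embedding, dtype=cand.dtype, device=cand.device)    # (768,)
    scores = cand @ target  # dot product == cosine, since both sides are L2-normalized
    best = int(torch.argmax(scores))
    return candidates[best], float(scores[best])

# `candidates` would come from beam search / sampling in the v2 decoder.
mpnet = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
```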
The `aparecium_v2_s1.pt` checkpoint contains the adapter, sketcher, decoder, and tokenizer name, matching the training repo layout.

## Training data and provenance
- Source: synthetic crypto social-media posts generated via OpenAI models into a DB (e.g., `tweets.db`).
- Domain:
  - Crypto markets, DeFi, L2s, MEV, governance, NFTs, etc.
- Preparation (v2 pipeline):
  - Extract raw text from the DB into JSONL.
  - Embed each tweet with `sentence-transformers/all-mpnet-base-v2`: `embedding ∈ R^768` (pooled), L2‑normalized.
  - Optionally store a simple “plan” (tickers, hashtags, amounts, addresses).
  - Split into train/val/test and shard into JSONL files.
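A minimal sketch of the embed-and-shard step under the schema described above. The record layout follows the `{"text","embedding","plan"}` fields mentioned in this card; the toy `simple_plan` extractor and the shard filename are assumptions, not the training repo's implementation.

```python
import json
import re
from sentence_transformers import SentenceTransformer

mpnet = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

def simple_plan(text: str) -> dict:
    """Toy plan extraction: cashtag-style tickers, hashtags, and numeric amounts."""
    return {
        "tickers": re.findall(r"\$[A-Za-z]{2,6}\b", text),
        "hashtags": re.findall(r"#\w+", text),
        "amounts": re.findall(r"\b\d+(?:\.\d+)?[kKmM]?\b", text),
    }

def write_shard(texts, path):
    """Embed each tweet (pooled, L2-normalized) and write one JSON object per line."""
    embeddings = mpnet.encode(texts, convert_to_numpy=True, normalize_embeddings=True)  # (N, 768)
    with open(path, "w", encoding="utf-8") as f:
        for text, emb in zip(texts, embeddings):
            record = {"text": text, "embedding": emb.tolist(), "plan": simple_plan(text)}
            f.write(json.dumps(record) + "\n")

write_shard(["$ETH gas fell 30% after the blobs upgrade #DeFi"], "train-000.jsonl")
```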
No real social‑media content is used; all posts are synthetic, similar in spirit to the v1 project `SentiChain/aparecium-seq2seq-reverser`.

## Training procedure (S1 baseline regimen)
This checkpoint corresponds to S1 supervised training only (no SCST/RL):
- Objective: teacher‑forcing cross‑entropy over the crypto tweet text, given the pooled embedding.
- Optimizer: AdamW
- Typical hyperparameters (baseline run):
  - Batch size: 64
  - Max length: 96 tokens (tweets)
  - Learning rate: 3e‑4 (cosine decay), warmup ~1k steps
  - Weight decay: 0.01
  - Grad clip: 1.0
  - Dropout: 0.1
- Data:
  - ~100k synthetic crypto tweets (train/val split).
  - Embeddings precomputed via `all-mpnet-base-v2` and normalized.
- Checkpointing:
  - Save final weights as `aparecium_v2_s1.pt` once training plateaus on validation cross‑entropy.
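A minimal sketch of the S1 objective and optimizer/scheduler setup under the hyperparameters above, assuming a `model` that maps (pooled embedding, shifted tokens) to next‑token logits and a `train_loader` yielding `(pooled_emb, token_ids)` batches; both come from the Aparecium v2 training code and are only placeholders here, as are the step counts.

```python
import math
import torch
import torch.nn.functional as F

def train_s1(model, train_loader, pad_id, total_steps=50_000, warmup_steps=1_000):
    """One-pass illustration of S1: teacher-forcing cross-entropy with AdamW + warmup + cosine decay."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)              # linear warmup
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))  # cosine decay

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

    for pooled_emb, token_ids in train_loader:           # token_ids: (B, T), T <= 96
        logits = model(pooled_emb, token_ids[:, :-1])     # predict token t from tokens < t plus the embedding
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            token_ids[:, 1:].reshape(-1),
            ignore_index=pad_id,
        )
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # grad clip = 1.0
        optimizer.step()
        scheduler.step()
```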
Future work (not in this checkpoint):
- SCST RL (S2) with a reward combining MPNet cosine, surrogate `r`, repetition penalty, and entity coverage.
- Stronger constraints and rerank policies as described in the training plan.

## Evaluation protocol (baseline qualitative)
This repo does not include a full eval harness. The S1 baseline was validated qualitatively:
- Sample 10–20 crypto sentences (held‑out).
- For each:
  - Embed text with `all-mpnet-base-v2` (pooled, normalized).
  - Invert with Aparecium v2 S1 (beam search + rerank).
  - Re‑embed the generated text with MPNet and compute cosine with the original embedding.
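A minimal sketch of this round-trip check, assuming an `invert(embedding) -> str` helper that wraps the v2 decoding (beam search + rerank) from the Aparecium codebase:

```python
from sentence_transformers import SentenceTransformer

mpnet = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

def roundtrip_report(texts, invert):
    """Embed each held-out text, invert the embedding, re-embed the output, and report the cosine."""
    originals = mpnet.encode(texts, convert_to_numpy=True, normalize_embeddings=True)   # (N, 768)
    generated = [invert(e) for e in originals]
    regenerated = mpnet.encode(generated, convert_to_numpy=True, normalize_embeddings=True)
    for text, hyp, o, r in zip(texts, generated, originals, regenerated):
        cosine = float((o * r).sum())  # both vectors are unit length
        print(f"cos={cosine:.3f} | {text} -> {hyp}")
```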
For a v1‑style, large‑scale evaluation (crypto/equities split, cosine statistics, degeneracy rate, domain drift), refer to the v1 model card: `SentiChain/aparecium-seq2seq-reverser`.

## Input contract and usage
Input (v2, S1 baseline):
- A single pooled MPNet embedding (crypto tweet) of shape `(768,)`, L2‑normalized.
- Recommended encoder: `sentence-transformers/all-mpnet-base-v2` from `sentence-transformers`.
Do not pass a token‑level `(seq_len, 768)` matrix – that is the contract for the v1 seq2seq model `SentiChain/aparecium-seq2seq-reverser`, not this checkpoint.
Usage pattern (high level, pseudocode):
```python
import torch
from sentence_transformers import SentenceTransformer

# 1) Pooled MPNet embedding
mpnet = SentenceTransformer("sentence-transformers/all-mpnet-base-v2",
                            device="cuda" if torch.cuda.is_available() else "cpu")
text = "Ethereum L2 blob fees spiked after EIP-4844; MEV still shapes order flow."
e = mpnet.encode([text], convert_to_numpy=True, normalize_embeddings=True)[0]  # (768,)

# 2) Load Aparecium v2 S1 checkpoint
ckpt = torch.load("aparecium_v2_s1.pt", map_location="cpu")

# 3) Recreate models from the Aparecium codebase (not included in this HF repo)
# from aparecium.aparecium.models.emb_adapter import EmbAdapter
# from aparecium.aparecium.models.decoder import RealizerDecoder
# from aparecium.aparecium.models.sketcher import Sketcher
# from aparecium.aparecium.utils.tokens import build_tokenizer
# and run the same decoding logic as in `aparecium/infer/service.py` or
# `aparecium/scripts/invert_once.py`.

# 4) Use beam search / constraints / reranking as in the training repo.
```
To actually use the model, you need the Aparecium codebase (training repo) where the EmbAdapter, Sketcher, RealizerDecoder, constraints, and decoding functions are defined.
## Limitations and responsible use
- Outputs are approximations of the original text under the MPNet embedding and LM prior:
  - They aim to preserve semantic gist and domain entities,
  - They are not exact reconstructions.
- The model can:
  - Produce generic phrasing,
  - Over‑use crypto buzzwords/hashtags,
  - Occasionally show noisy punctuation/emoji.
- Data are synthetic; domain semantics might differ from real social‑media distributions.
- Do not use this model to attempt to reconstruct sensitive or private user content from embeddings.
## Reproducibility (high‑level)
To reproduce or extend this checkpoint:
- Prepare data:
  - Generate synthetic crypto tweets (or your own domain) into a DB (e.g., SQLite).
  - Extract raw text to `train/val/test` JSONL.
  - Embed with `all-mpnet-base-v2` (pooled 768‑D) and save as JSONL with `{"text","embedding","plan"}` fields.
- Train S1:
  - Use the Aparecium v2 trainer (S1 supervised) with `batch_size ≈ 64`, `max_len ≈ 96`, `lr ≈ 3e-4`, cosine scheduler, warmup steps.
  - Train until validation cross‑entropy and cosine proxy metrics plateau.
- Optional:
  - Train surrogate similarity scorer `r` for reranking.
  - Add SCST RL (S2) if you implement the safe reward/decoding policies.
- Evaluate:
  - Build a small evaluation harness (as in the v1 project) to measure cosine, degeneracy, and domain drift; a simple degeneracy heuristic is sketched below.
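As one concrete starting point for such a harness, here is a simple degeneracy heuristic that flags outputs dominated by a single repeated bigram. This is only one possible definition and may differ from the metric used in the v1 evaluation.

```python
from collections import Counter

def is_degenerate(text: str, max_bigram_share: float = 0.3) -> bool:
    """Flag text where one bigram accounts for more than `max_bigram_share` of all bigrams."""
    tokens = text.split()
    if len(tokens) < 4:
        return False
    bigrams = list(zip(tokens, tokens[1:]))
    top_count = Counter(bigrams).most_common(1)[0][1]
    return top_count / len(bigrams) > max_bigram_share

def degeneracy_rate(outputs) -> float:
    """Fraction of generated texts flagged as degenerate."""
    outputs = list(outputs)
    return sum(is_degenerate(x) for x in outputs) / max(1, len(outputs))
```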
## License
- Code: MIT (per Aparecium repositories).
- Weights: MIT, same as the code, unless explicitly overridden.
## Citation
If you use this model or the Aparecium codebase, please cite:
Aparecium v2: Pooled MPNet Embedding Reversal for Crypto Tweets
SentiChain (Aparecium project)
You may also reference the v1 baseline model card: `SentiChain/aparecium-seq2seq-reverser`.