# Aparecium v2 – Pooled MPNet Reverser (S1 Baseline)

## Summary
- Task: Reconstruct natural-language crypto social-media posts from a single pooled MPNet embedding (reverse embedding).
- Focus: Crypto domain (social-media posts / short-form content).
- Checkpoint: `aparecium_v2_s1.pt` — S1 supervised baseline, trained on synthetic crypto social-media posts.
- Input contract: a pooled `all-mpnet-base-v2` vector of shape `(768,)`, not a token-level `(seq_len, 768)` matrix.
- Code: this repo only hosts weights; loading & decoding are implemented in the Aparecium codebase (the v2 training repo and service are analogous in spirit to the v1 project `SentiChain/aparecium-seq2seq-reverser`).
This is a pooled-embedding variant of Aparecium, distinct from the original token-level seq2seq reverser described in `SentiChain/aparecium-seq2seq-reverser`.

## Intended use
- Research / engineering:
  - Study how much crypto-domain information is recoverable from a single pooled embedding.
  - Prototype tools around embedding interpretability, diagnostics, and “gist reconstruction” from vectors.
- Not intended for:
  - Reconstructing private, user-identifying, or sensitive content.
  - Any de‑anonymization of embedding corpora.
Reconstruction quality depends heavily on:
- The upstream encoder (`sentence-transformers/all-mpnet-base-v2`),
- Domain match (crypto social-media posts vs. your data),
- Decode settings (beam vs. sampling, constraints, reranking).

## Model architecture
On the encoder side, we assume a pooled MPNet encoder:
- Recommended: `sentence-transformers/all-mpnet-base-v2` (768‑D pooled output).
On the decoder side, v2 uses the Aparecium components:
- EmbAdapter:
  - Input: pooled vector `e ∈ R^768`.
  - Output: pseudo‑sequence memory `H ∈ R^{B × S × D}` suitable for a transformer decoder (multi‑scale).
- Sketcher:
  - Lightweight network producing a “plan” and simple control flags (e.g., URL presence) from `e`.
  - In the S1 baseline checkpoint, it is trained but only lightly used at inference.
- RealizerDecoder:
  - Transformer decoder (GPT‑style) with:
    - `d_model = 768`, `n_layer = 12`, `n_head = 8`, `d_ff = 3072`
    - Dropout ≈ 0.1
  - Consumes `H` as cross‑attention memory and generates text tokens.
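For intuition about the shape contract, here is a minimal illustrative sketch of an adapter that expands a pooled vector into a fixed-length cross‑attention memory. The real `EmbAdapter` is multi‑scale and lives in the Aparecium codebase; the class name and `mem_len` below are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class PooledToMemoryAdapter(nn.Module):
    """Illustrative stand-in for EmbAdapter: pooled e (B, 768) -> pseudo-sequence memory H (B, S, 768)."""

    def __init__(self, d_model: int = 768, mem_len: int = 16):
        super().__init__()
        self.d_model = d_model
        self.mem_len = mem_len
        self.proj = nn.Linear(d_model, mem_len * d_model)  # expand the single vector into S memory slots
        self.norm = nn.LayerNorm(d_model)

    def forward(self, e: torch.Tensor) -> torch.Tensor:
        # e: (B, 768) pooled, L2-normalized MPNet embedding
        h = self.proj(e).view(e.size(0), self.mem_len, self.d_model)
        return self.norm(h)  # (B, S, D), consumed as cross-attention memory by the decoder

adapter = PooledToMemoryAdapter()
e = torch.nn.functional.normalize(torch.randn(2, 768), dim=-1)
memory = adapter(e)
print(memory.shape)  # torch.Size([2, 16, 768])
```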
Decoding:
- Deterministic beam search or sampling, with optional:
  - Constraints (e.g., require certain tickers/hashtags/amounts based on a plan).
  - Surrogate similarity scorer `r(x, e)` for reranking candidates.
  - Final MPNet cosine rerank across top‑K candidates.
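As a sketch of the final rerank step (candidate generation, constraints, and the surrogate scorer live in the Aparecium codebase and are not shown), one can re‑embed the top‑K candidates with MPNet and keep the one closest to the target vector:

```python
import torch
from sentence_transformers import SentenceTransformer

def rerank_by_mpnet_cosine(candidates, target_embedding, mpnet):
    """Return the candidate whose MPNet re-embedding has the highest cosine with the target vector."""
    cand = mpnet.encode(candidates, convert_to_tensor=True, normalize_embeddings=True)  # (K, 768)
    target = torch.as_tensor(target_embedding, dtype=cand.dtype, device=cand.device)    # (768,)
    scores = cand @ target  # dot product == cosine, since both sides are L2-normalized
    best = int(torch.argmax(scores))
    return candidates[best], float(scores[best])

# `candidates` would come from beam search / sampling in the v2 decoder.
mpnet = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
```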
The `aparecium_v2_s1.pt` checkpoint contains the adapter, sketcher, decoder, and tokenizer name, matching the training repo layout.

## Training data and provenance
- Source: synthetic crypto social-media posts generated via OpenAI models into a DB (e.g., `tweets.db`).
- Domain:
  - Crypto markets, DeFi, L2s, MEV, governance, NFTs, etc.
- Preparation (v2 pipeline):
  - Extract raw text from the DB into JSONL.
  - Embed each tweet with `sentence-transformers/all-mpnet-base-v2`: `embedding ∈ R^768` (pooled), L2‑normalized.
  - Optionally store a simple “plan” (tickers, hashtags, amounts, addresses).
  - Split into train/val/test and shard into JSONL files.
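A minimal sketch of the embed-and-shard step under the schema described above. The record layout follows the `{"text","embedding","plan"}` fields mentioned in this card; the toy `simple_plan` extractor and the shard filename are assumptions, not the training repo's implementation.

```python
import json
import re
from sentence_transformers import SentenceTransformer

mpnet = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

def simple_plan(text: str) -> dict:
    """Toy plan extraction: cashtag-style tickers, hashtags, and numeric amounts."""
    return {
        "tickers": re.findall(r"\$[A-Za-z]{2,6}\b", text),
        "hashtags": re.findall(r"#\w+", text),
        "amounts": re.findall(r"\b\d+(?:\.\d+)?[kKmM]?\b", text),
    }

def write_shard(texts, path):
    """Embed each tweet (pooled, L2-normalized) and write one JSON object per line."""
    embeddings = mpnet.encode(texts, convert_to_numpy=True, normalize_embeddings=True)  # (N, 768)
    with open(path, "w", encoding="utf-8") as f:
        for text, emb in zip(texts, embeddings):
            record = {"text": text, "embedding": emb.tolist(), "plan": simple_plan(text)}
            f.write(json.dumps(record) + "\n")

write_shard(["$ETH gas fell 30% after the blobs upgrade #DeFi"], "train-000.jsonl")
```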
No real social‑media content is used; all posts are synthetic, similar in spirit to the v1 project `SentiChain/aparecium-seq2seq-reverser`.

## Training procedure (S1 baseline regimen)
This checkpoint corresponds to S1 supervised training only (no SCST/RL):
- Objective: teacher‑forcing cross‑entropy over the crypto tweet text, given the pooled embedding.
- Optimizer: AdamW
- Typical hyperparameters (baseline run):
  - Batch size: 64
  - Max length: 96 tokens (tweets)
  - Learning rate: 3e‑4 (cosine decay), warmup ~1k steps
  - Weight decay: 0.01
  - Grad clip: 1.0
  - Dropout: 0.1
- Data:
  - ~100k synthetic crypto tweets (train/val split).
  - Embeddings precomputed via `all-mpnet-base-v2` and normalized.
- Checkpointing:
  - Save final weights as `aparecium_v2_s1.pt` once training plateaus on validation cross‑entropy.
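A minimal sketch of the S1 objective and optimizer/scheduler setup under the hyperparameters above, assuming a `model` that maps (pooled embedding, shifted tokens) to next‑token logits and a `train_loader` yielding `(pooled_emb, token_ids)` batches; both come from the Aparecium v2 training code and are only placeholders here, as are the step counts.

```python
import math
import torch
import torch.nn.functional as F

def train_s1(model, train_loader, pad_id, total_steps=50_000, warmup_steps=1_000):
    """One-pass illustration of S1: teacher-forcing cross-entropy with AdamW + warmup + cosine decay."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / max(1, warmup_steps)              # linear warmup
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))  # cosine decay

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

    for pooled_emb, token_ids in train_loader:           # token_ids: (B, T), T <= 96
        logits = model(pooled_emb, token_ids[:, :-1])     # predict token t from tokens < t plus the embedding
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            token_ids[:, 1:].reshape(-1),
            ignore_index=pad_id,
        )
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # grad clip = 1.0
        optimizer.step()
        scheduler.step()
```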
Future work (not in this checkpoint):
- SCST RL (S2) with a reward combining MPNet cosine, surrogate `r`, repetition penalty, and entity coverage.
- Stronger constraints and rerank policies as described in the training plan.

## Evaluation protocol (baseline qualitative)
This repo does not include a full eval harness. The S1 baseline was validated qualitatively:
- Sample 10–20 crypto sentences (held‑out).
- For each:
  - Embed text with `all-mpnet-base-v2` (pooled, normalized).
  - Invert with Aparecium v2 S1 (beam search + rerank).
  - Re‑embed the generated text with MPNet and compute cosine with the original embedding.
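A minimal sketch of this round-trip check, assuming an `invert(embedding) -> str` helper that wraps the v2 decoding (beam search + rerank) from the Aparecium codebase:

```python
from sentence_transformers import SentenceTransformer

mpnet = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

def roundtrip_report(texts, invert):
    """Embed each held-out text, invert the embedding, re-embed the output, and report the cosine."""
    originals = mpnet.encode(texts, convert_to_numpy=True, normalize_embeddings=True)   # (N, 768)
    generated = [invert(e) for e in originals]
    regenerated = mpnet.encode(generated, convert_to_numpy=True, normalize_embeddings=True)
    for text, hyp, o, r in zip(texts, generated, originals, regenerated):
        cosine = float((o * r).sum())  # both vectors are unit length
        print(f"cos={cosine:.3f} | {text} -> {hyp}")
```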
For a v1‑style, large‑scale evaluation (crypto/equities split, cosine statistics, degeneracy rate, domain drift), refer to the v1 model card: `SentiChain/aparecium-seq2seq-reverser`.

## Input contract and usage
Input (v2, S1 baseline):
- A single pooled MPNet embedding (crypto tweet) of shape `(768,)`, L2‑normalized.
- Recommended encoder: `sentence-transformers/all-mpnet-base-v2` from `sentence-transformers`.
Do not pass a token‑level `(seq_len, 768)` matrix – that is the contract for the v1 seq2seq model `SentiChain/aparecium-seq2seq-reverser`, not this checkpoint.
Usage pattern (high level, pseudocode):
```python
import torch
from sentence_transformers import SentenceTransformer

# 1) Pooled MPNet embedding
mpnet = SentenceTransformer("sentence-transformers/all-mpnet-base-v2",
                            device="cuda" if torch.cuda.is_available() else "cpu")
text = "Ethereum L2 blob fees spiked after EIP-4844; MEV still shapes order flow."
e = mpnet.encode([text], convert_to_numpy=True, normalize_embeddings=True)[0]  # (768,)

# 2) Load Aparecium v2 S1 checkpoint
ckpt = torch.load("aparecium_v2_s1.pt", map_location="cpu")

# 3) Recreate models from the Aparecium codebase (not included in this HF repo)
# from aparecium.aparecium.models.emb_adapter import EmbAdapter
# from aparecium.aparecium.models.decoder import RealizerDecoder
# from aparecium.aparecium.models.sketcher import Sketcher
# from aparecium.aparecium.utils.tokens import build_tokenizer
# and run the same decoding logic as in `aparecium/infer/service.py` or
# `aparecium/scripts/invert_once.py`.

# 4) Use beam search / constraints / reranking as in the training repo.
```
To actually use the model, you need the Aparecium codebase (training repo) where the EmbAdapter, Sketcher, RealizerDecoder, constraints, and decoding functions are defined.
## Limitations and responsible use
- Outputs are approximations of the original text under the MPNet embedding and LM prior:
  - They aim to preserve semantic gist and domain entities,
  - They are not exact reconstructions.
- The model can:
  - Produce generic phrasing,
  - Over‑use crypto buzzwords/hashtags,
  - Occasionally show noisy punctuation/emoji.
- Data are synthetic; domain semantics might differ from real social‑media distributions.
- Do not use this model to attempt to reconstruct sensitive or private user content from embeddings.
## Reproducibility (high‑level)
To reproduce or extend this checkpoint:
- Prepare data:
  - Generate synthetic crypto tweets (or your own domain) into a DB (e.g., SQLite).
  - Extract raw text to `train/val/test` JSONL.
  - Embed with `all-mpnet-base-v2` (pooled 768‑D) and save as JSONL with `{"text","embedding","plan"}` fields.
- Train S1:
  - Use the Aparecium v2 trainer (S1 supervised) with `batch_size ≈ 64`, `max_len ≈ 96`, `lr ≈ 3e-4`, cosine scheduler, warmup steps.
  - Train until validation cross‑entropy and cosine proxy metrics plateau.
- Optional:
  - Train surrogate similarity scorer `r` for reranking.
  - Add SCST RL (S2) if you implement the safe reward/decoding policies.
- Evaluate:
  - Build a small evaluation harness (as in the v1 project) to measure cosine, degeneracy, and domain drift; a simple degeneracy heuristic is sketched below.
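As one concrete starting point for such a harness, here is a simple degeneracy heuristic that flags outputs dominated by a single repeated bigram. This is only one possible definition and may differ from the metric used in the v1 evaluation.

```python
from collections import Counter

def is_degenerate(text: str, max_bigram_share: float = 0.3) -> bool:
    """Flag text where one bigram accounts for more than `max_bigram_share` of all bigrams."""
    tokens = text.split()
    if len(tokens) < 4:
        return False
    bigrams = list(zip(tokens, tokens[1:]))
    top_count = Counter(bigrams).most_common(1)[0][1]
    return top_count / len(bigrams) > max_bigram_share

def degeneracy_rate(outputs) -> float:
    """Fraction of generated texts flagged as degenerate."""
    outputs = list(outputs)
    return sum(is_degenerate(x) for x in outputs) / max(1, len(outputs))
```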
## License
- Code: MIT (per Aparecium repositories).
- Weights: MIT, same as the code, unless explicitly overridden.
## Citation
If you use this model or the Aparecium codebase, please cite:
Aparecium v2: Pooled MPNet Embedding Reversal for Crypto Tweets
SentiChain (Aparecium project)
You may also reference the v1 baseline model card: `SentiChain/aparecium-seq2seq-reverser`.