# Geometric Formula Catalog
## Token Topology & Loss System · AbstractPhil + Claude

*ROSE loss discarded. These are the active formulas.*

---

## 1. Multi-Scale Crystal Loss

Classification through learnable crystal prototypes at multiple projection dimensions. Each class has a crystal centroid at each scale. No softmax: geometric distance IS the classifier.

**Scales:** `[64, 128, 256, 512, 1024]` (each is a projection dimension, not spatial)

### 1.1 Per-Scale Crystal Similarity

```
sim(x, c_k) = (x̂ · ĉ_k) / τ

where:
  x̂   = normalize(proj_k(features))   # [B, scale_dim]
  ĉ_k = normalize(crystals_k)          # [num_classes, scale_dim]
  τ   = temperature (default 0.07)
```

### 1.2 Per-Scale Coherence Loss

Pull features toward their correct class crystal:

```
L_coherence = -mean(log(exp(sim(x, c_y)) / Σ_j exp(sim(x, c_j))))

where y = true class label
```

### 1.3 Per-Scale Separation Loss

Push class crystals apart with a margin:

```
L_separation = Σ_{i≠j} max(0, margin - ||ĉ_i - ĉ_j||₂)² / (C(C-1))

where C = num_classes, margin = 1.0
```

### 1.4 Per-Scale Discretization Loss (Cantor Targets)

Cluster crystal Cantor values toward `{0.0, 0.5, 1.0}`:

```
L_discretization = mean(min_t(||cantor(c_i) - t||²))

where t ∈ {0.0, 0.5, 1.0}
```

### 1.5 Per-Scale Crystal Geometry Loss

Maintain a target distance from features to class prototypes:

```
L_geometry = mean((||x - c_y||₂ - d_target)²)

where d_target = 1.0
```

### 1.6 Total Multi-Scale Crystal Loss

```
L_crystal = (1/S) Σ_{k=1}^{S} w_k · (
    w_coh  · L_coherence_k +
    w_sep  · L_separation_k +
    w_disc · L_discretization_k +
    w_geom · L_geometry_k
)

Proven weights: w_coh=1.0, w_sep=0.5, w_disc=1.0, w_geom=0.5
```
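
For concreteness, a minimal PyTorch sketch of one scale's four terms. `CrystalScaleLoss`, `proj`, and the `cantor` stub are illustrative, not the project's actual API; the real Cantor measure is §5.1's staircase.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrystalScaleLoss(nn.Module):
    """One scale of the multi-scale crystal loss (illustrative sketch)."""

    def __init__(self, feat_dim, scale_dim, num_classes,
                 tau=0.07, margin=1.0, d_target=1.0):
        super().__init__()
        self.proj = nn.Linear(feat_dim, scale_dim)
        self.crystals = nn.Parameter(torch.randn(num_classes, scale_dim))
        self.tau, self.margin, self.d_target = tau, margin, d_target

    def forward(self, features, labels):
        x_hat = F.normalize(self.proj(features), dim=-1)     # [B, scale_dim]
        c_hat = F.normalize(self.crystals, dim=-1)           # [C, scale_dim]
        sim = x_hat @ c_hat.T / self.tau                     # §1.1: [B, C]

        # §1.2 coherence: cross_entropy == -log softmax at the true class
        l_coh = F.cross_entropy(sim, labels)

        # §1.3 separation: squared hinge on pairwise crystal distances
        d = torch.cdist(c_hat, c_hat)
        off_diag = ~torch.eye(len(c_hat), dtype=torch.bool, device=d.device)
        l_sep = (F.relu(self.margin - d[off_diag]) ** 2).mean()

        # §1.4 discretization: cantor(c) stubbed here as a scalar in [0, 1]
        cantor = torch.sigmoid(self.crystals.mean(dim=-1))   # [C]
        targets = torch.tensor([0.0, 0.5, 1.0], device=cantor.device)
        l_disc = ((cantor[:, None] - targets) ** 2).min(dim=-1).values.mean()

        # §1.5 geometry: hold feature-to-prototype distance near d_target
        d_y = (self.proj(features) - self.crystals[labels]).norm(dim=-1)
        l_geom = ((d_y - self.d_target) ** 2).mean()

        # §1.6 proven weights
        return l_coh + 0.5 * l_sep + l_disc + 0.5 * l_geom
```

Averaging such modules over the scale list with per-scale weights `w_k` reproduces `L_crystal` above.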

### 1.7 Crystal Prediction (No Softmax Head)

```
logits = Σ_k w_k · (α · cos_sim_k + β · cantor_coherence_k + γ · crystal_geometry_k)

prediction = argmax(logits)
```

**Results:** 86% ImageNet (CLIP bigG features), 74.87% CIFAR-100 (393K params), ~92% CIFAR-100 (78KB model)

---

## 2. Geometric Basin Compatibility Loss

Classification through geometric formula satisfaction. Four structural checks produce compatibility scores ∈ [0,1]. No cross-entropy needed.

### 2.1 Triadic Compatibility

```
T(x, c) = exp(-||proj(x) - c||₂² / (2σ²))

where c = class centroid, σ = learned bandwidth
```

### 2.2 Self-Similarity Check

```
S(x) = exp(-Var(cantor_levels(x)))

where cantor_levels extracts per-level Cantor measures
High self-similarity → low variance across levels → high score
```

### 2.3 Cantor Coherence Check

```
C(x, p_y) = exp(-||cantor(x) - p_y||₂²)

where p_y = class Cantor prototype
```

### 2.4 Hierarchical Check

```
H(x) = Σ_{k=1}^{L} 0.5^k · match(level_k(x), expected_k)
```

### 2.5 Combined Compatibility Score

```
compat(x, class_j) = T(x, c_j) · S(x) · C(x, p_j) · H(x)

Product of four factors ∈ [0,1] → output ∈ [0,1]
```

### 2.6 Basin Loss (Three-Term, No Cross-Entropy)

```
L_correct     = -mean(log(compat(x, y) + ε))
L_incorrect   = -mean(log(1 - compat(x, j≠y) + ε))
L_contrastive = NLL(log_softmax(compat / τ), y)

L_basin = L_correct + 0.5 · L_incorrect + 0.5 · L_contrastive
```
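
A sketch of the three-term loss, assuming the `[B, C]` compatibility matrix from §2.5 is already computed; the function name and signature are illustrative.

```python
import torch
import torch.nn.functional as F

def basin_loss(compat, labels, tau=0.07, eps=1e-6):
    """Three-term basin loss over precomputed compatibility scores.

    compat: [B, C] compat(x, class_j) values in [0, 1] (§2.5)
    labels: [B] true class indices
    """
    B, C = compat.shape
    correct = compat[torch.arange(B), labels]          # compat(x, y)

    # Mask out the true class to gather incorrect-class scores
    true_mask = F.one_hot(labels, C).bool()
    incorrect = compat[~true_mask].view(B, C - 1)

    l_correct = -(correct + eps).log().mean()
    l_incorrect = -((1 - incorrect) + eps).log().mean()
    l_contrastive = F.nll_loss(F.log_softmax(compat / tau, dim=-1), labels)

    return l_correct + 0.5 * l_incorrect + 0.5 * l_contrastive
```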

**Results:** 67.69% CIFAR-100 with NO attention, NO cross-entropy, NO transformers (geo-beatrix). Beat ViT-beatrix (66.0%).

---

## 3. K-Simplex Channel Formulas

Tokens represented as k-simplices with Cayley-Menger-validated geometry. Shape `[B, T, K+1, F]`, where K+1 = number of vertices.

### 3.1 Template + Deformation

```
v_i = v_i^{template} + α · Δv_i

where:
  v_i^{template} = regular k-simplex vertices (frozen)
  α    = deformation scale (0.05 base, per-k scaled)
  Δv_i = learned offset from neural network
```

### 3.2 K-Scaled Deformation

Volume scales as `edge^k`, so higher k needs smaller deformation:

```
α_k = α_base / √(k + 1)

k=1: α × 0.71    k=3: α × 0.50
k=2: α × 0.58    k=4: α × 0.45
```

### 3.3 Per-Token Simplex Coordinates

```
coords         = proj(token_embedding)            # [B, T, edim]
vertex_weights = softmax(route(token_embedding))  # [B, T, K+1]
simplex_state  = vertex_weights @ vertices        # [B, T, edim]
```

### 3.4 K-Simplex Attention (Proven Superior to K-Simplex Classification)

```
For each token pair (i, j):
  d²_ij   = ||simplex_i - simplex_j||²   # pairwise simplex distance
  attn_ij = softmax(-d²_ij / τ)          # geometric attention weights

Output = attn @ V                        # standard value projection
```
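
A minimal sketch of the distance-based attention, assuming `simplex_state` comes from §3.3; the function name and signature are illustrative.

```python
import torch
import torch.nn.functional as F

def simplex_attention(simplex_state, v, tau=1.0):
    """Distance-softmax geometric attention (sketch of §3.4).

    simplex_state: [B, T, edim] per-token simplex coordinates (§3.3)
    v:             [B, T, dv]   standard value projection
    Scores are negative squared pairwise distances, not dot products.
    """
    d2 = torch.cdist(simplex_state, simplex_state) ** 2   # [B, T, T]
    attn = F.softmax(-d2 / tau, dim=-1)                   # rows sum to 1
    return attn @ v                                       # [B, T, dv]
```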

**Results:** 89.13% FMNIST, 84.59% CIFAR-10, 69.08% CIFAR-100 as attention. Entropy decreases through layers (sharpening). Fewer tokens = sharper attention (25 patches > 64 patches).

---

## 4. Cayley-Menger Formulas

The structural invariant. If CM fails, geometry is invalid. Non-negotiable.

### 4.1 Cayley-Menger Matrix

```
CM = | 0    1     1     ...  1     |
     | 1    0     d₁₂²  ...  d₁ₙ²  |
     | 1    d₂₁²  0     ...  d₂ₙ²  |
     | ⋮    ⋮     ⋮     ⋱    ⋮     |
     | 1    dₙ₁²  dₙ₂²  ...  0     |

Size: (K+2) × (K+2) for a K-simplex
```

### 4.2 Volume Formula (Corrected)

```
Vol² = (-1)^(K+1) / (2^K · (K!)²) · det(CM)

Validity: Vol² > 0 indicates a non-degenerate simplex
```

### 4.3 Gram Determinant Alternative (More Stable)

```
X_translated = X[:, 1:, :] - X[:, 0:1, :]            # [B, K, D]
G = X_translated @ X_translated.transpose(-1, -2)    # [B, K, K]
Vol = √(det(G)) / K!
```
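
A sketch of the Gram route plus the §4.4 penalty, assuming vertices `X` shaped `[B, K+1, D]`. Note that the Gram determinant is non-negative up to floating-point error, so the ReLU penalty mainly guards the signed CM-determinant path of §4.2.

```python
import math
import torch

def simplex_volume_sq(X):
    """Squared K-simplex volume via the Gram determinant (sketch of §4.3).

    X: [B, K+1, D] vertex coordinates with D >= K.
    Vol² = det(G) / (K!)², with G built from edge vectors rooted at vertex 0.
    """
    K = X.shape[1] - 1
    edges = X[:, 1:, :] - X[:, 0:1, :]           # [B, K, D]
    G = edges @ edges.transpose(-1, -2)          # [B, K, K] Gram matrix
    return torch.linalg.det(G) / math.factorial(K) ** 2

def validity_loss(vol_sq):
    """§4.4: penalize collapsed/invalid simplices (Vol² < 0)."""
    return torch.relu(-vol_sq).mean()
```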

### 4.4 Validity Loss

```
L_validity = mean(ReLU(-Vol²))

Penalizes collapsed simplices (Vol² < 0)
```

### 4.5 Volume Consistency Loss

```
L_vol_consistency = Var(Vol²) across batch

Encourages uniform geometric structure
```

### 4.6 Hierarchical Cell Loss (k=4 pentachoron)

```
5 cells (tetrahedra), each with 4 vertices, 6 edges:

L_cell = mean(ReLU(ε - Vol²_cell_i))

for i = 1..5 cells of the pentachoron
```

### 4.7 Vol² Scaling Reference

```
k=1: Vol² ~ 1e+0   (edge length squared)
k=2: Vol² ~ 1e-1   (triangle area squared)
k=3: Vol² ~ 1e-2   (tetrahedron volume squared)
k=4: Vol² ~ 1e-3   (5-cell hypervolume squared)
```

---

## 5. Cantor Lens Formulas

The Devil's Staircase as a hierarchical lens for viewing token relationships.

### 5.1 Devil's Staircase (Beatrix Staircase)

```
C(x) = Σ_{k=1}^{levels} bit_k × 0.5^k

where:
  y_k   = x × 3^k                                   # scale to level k
  p     = softmax(-d²/τ) over centers [0.5, 1.5, 2.5]
  bit_k = p_right + α × p_middle                    # soft ternary assignment
  α     = learnable middle-third fill (default 0.5)
  τ     = softmax temperature (default 0.25)
```
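
A sketch of the soft staircase. The wrap of `y_k` back into `[0, 3)` is an assumption (the catalog only states `y_k = x × 3^k`), and α is a plain argument here rather than a learned parameter.

```python
import torch
import torch.nn.functional as F

def beatrix_staircase(x, levels=5, alpha=0.5, tau=0.25):
    """Soft Devil's Staircase C(x) (sketch of §5.1).

    x: tensor of values in [0, 1]. At each level, the position within the
    current triadic interval is softly assigned to the {left, middle, right}
    third via a distance softmax.
    """
    centers = torch.tensor([0.5, 1.5, 2.5])        # third midpoints
    out = torch.zeros_like(x)
    for k in range(1, levels + 1):
        y = (x * 3 ** k) % 3.0                     # assumed mod-3 wrap
        d2 = (y.unsqueeze(-1) - centers) ** 2      # [..., 3]
        p = F.softmax(-d2 / tau, dim=-1)           # soft ternary digit
        bit = p[..., 2] + alpha * p[..., 1]        # p_right + α·p_middle
        out = out + bit * 0.5 ** k
    return out
```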

### 5.2 Branch Path Extraction

```
branch_path(x) = [argmax(p_1), argmax(p_2), ..., argmax(p_L)]

Each level: L (left third), M (middle third), R (right third)
```

### 5.3 Hierarchical Alignment (NOT Distance)

**CRITICAL: Distance is meaningless on the Cantor set.**

```
alignment(i, j) = Σ_{k=1}^{L} 0.5^k · 𝟙(path_i[k] == path_j[k])

Level weights: [0.5, 0.25, 0.125, 0.0625, 0.03125]
```

Coarse matches = routing highways (wormholes).
Fine matches = local structure only.

### 5.4 Euclidean Bridge (Lossy but Necessary)

```
distance(i, j) = |C(x_i) - C(x_j)|

Use ONLY when interfacing with Euclidean systems (optimizers, standard losses).
Alignment is the Cantor-native metric.
```

### 5.5 Cantor Routing Bias (for Attention)

```
bias[i,j] = alignment(i, j)              # precomputed [S, S] matrix

attn_scores = (Q @ K.T / √d) + λ · bias

where λ = learnable routing weight
```
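
A sketch of branch-path extraction (§5.2), alignment (§5.3), and its use as an attention bias (§5.5); the mod-3 wrap mirrors the staircase sketch above and is likewise an assumption.

```python
import torch
import torch.nn.functional as F

def branch_paths(x, levels=5, tau=0.25):
    """Hard ternary branch path per position (sketch of §5.2)."""
    centers = torch.tensor([0.5, 1.5, 2.5])
    digits = []
    for k in range(1, levels + 1):
        y = (x * 3 ** k) % 3.0                         # assumed mod-3 wrap
        d2 = (y.unsqueeze(-1) - centers) ** 2
        digits.append(F.softmax(-d2 / tau, dim=-1).argmax(-1))  # 0=L, 1=M, 2=R
    return torch.stack(digits, dim=-1)                 # [S, levels]

def cantor_alignment(paths):
    """§5.3 hierarchical alignment: level-weighted digit agreement."""
    S, L = paths.shape
    same = paths[:, None, :] == paths[None, :, :]      # [S, S, L]
    weights = 0.5 ** torch.arange(1, L + 1, dtype=torch.float)
    return (same.float() * weights).sum(-1)            # [S, S], in (0, 1)

# §5.5 usage: attn_scores = q @ k.transpose(-1, -2) / d ** 0.5 \
#             + lam * cantor_alignment(paths)
```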

### 5.6 Alpha Modulation

```
α → 0.0: Pure ternary (Cantor dust, maximally disconnected)
α → 0.5: Triadic equilibrium (proven stable zone: 0.44-0.50)
α → 1.0: Filled (continuous, no fractal structure)
```

---

## 6. Cantor Topological Ropes

Position encodings that encode structural hierarchy, not just sequence order.

### 6.1 Standard RoPE (Baseline)

```
θ_i = 10000^(-2i/d)
R(m) = [cos(mθ_i), -sin(mθ_i); sin(mθ_i), cos(mθ_i)]

for dimension pair (2i, 2i+1) at position m
```

### 6.2 BeatrixRoPE (Devil's Staircase Warping)

```
pos_beatrix(m) = C(m / seq_len)    # Cantor function of normalized position

R_beatrix(m) = R(pos_beatrix(m) × seq_len)
```

Tokens in the same ternary branch get **similar** positions → attend easily.
Creates hierarchical plateaus.
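
A sketch of the warp, reusing `beatrix_staircase` from the §5.1 sketch; the output feeds the standard rotation `R(·)` of §6.1.

```python
import torch

def beatrix_positions(seq_len, levels=5, alpha=0.5, tau=0.25):
    """Warped positions for BeatrixRoPE (sketch of §6.2).

    Positions in the same ternary branch land on the same staircase
    plateau, so their rotary angles nearly coincide.
    """
    m = torch.arange(seq_len, dtype=torch.float)
    c = beatrix_staircase(m / seq_len, levels=levels, alpha=alpha, tau=tau)
    return c * seq_len    # pos_beatrix(m) × seq_len, fed into R(·)
```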

### 6.3 CantorRoPE (Wormhole Shortcuts)

```
pos_cantor(m) = trend × m + deviation × wormhole(m)

where:
  trend       = 1.0 (aligns macro slope with standard RoPE)
  deviation   = learnable perturbation scale
  wormhole(m) = branch_path_alignment signal
```

Tokens with aligned branch paths can shortcut regardless of sequential distance.

### 6.4 Aligned Triad (Proven Configuration)

```
Standard: linear baseline        "this comes after that"
Beatrix:  hierarchical plateaus  "these belong together"
Cantor:   wormhole perturbations "these can shortcut"

All share the same macro slope (trend=1.0), different micro structure.
```

### 6.5 Tower Assignment

```
Tower_positive = BeatrixRoPE(...)   # hierarchical reasoning
Tower_negative = CantorRoPE(...)    # wormhole reasoning

Signed pairs create differential forces in oscillator fusion.
```

---

## 7. Beatrix Oscillation Formulas (GeoFractal Router)

Physics-based fusion replacing static weighted sums. Tower outputs are force fields, not opinions to average.

### 7.1 Covariant Dynamics

```
dx/dt = v
dv/dt = -2β(t)·v - ω²·Log_x(x_ref) + κ(t)·u_towers + γ(t)·ξ_guide

where:
  x        = position on manifold
  v        = velocity in tangent space
  β(t)     = damping schedule
  ω        = spring frequency
  x_ref    = conditioning anchor
  κ(t)     = tower coupling strength
  u_towers = force from tower opinions
  γ(t)     = guidance strength
  ξ_guide  = external guidance (DINO, text, etc.)
```

### 7.2 Manifold Operations

```
Log_x(y)    = y - x    # tangent vector from x toward y
Exp_x(v)    = x + v    # move along tangent vector
PT_{x→y}(v) = v        # parallel transport (flat approximation)
```
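
A flat-manifold sketch of the integrator, assuming §7.2's approximations and writing the spring as attractive toward `x_ref` (consistent with the potential energy in §7.8). The guidance term ξ is omitted, and the schedule callables are placeholders for §7.5.

```python
import torch

def oscillate(x0, x_ref, u_towers, omega=1.0, steps=8, dt=0.1,
              beta_fn=lambda t: 0.5, kappa_fn=lambda t: 1.0):
    """Semi-implicit Euler integration of the §7.1 dynamics (sketch).

    x0, x_ref, u_towers: [B, D] tensors. Flat approximations from §7.2:
    Log_x(y) = y - x, Exp_x(v) = x + v.
    """
    x, v = x0.clone(), torch.zeros_like(x0)
    for i in range(steps):
        t = i / max(steps - 1, 1)
        spring = -omega ** 2 * (x - x_ref)           # pull toward the anchor
        a = -2.0 * beta_fn(t) * v + spring + kappa_fn(t) * u_towers
        v = v + dt * a                               # velocity first,
        x = x + dt * v                               # then position (Exp_x)
    return x
```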

### 7.3 Tower Force Generation

```
For N towers with signed pairs:
  force_i  = proj_i(tower_output_i)   # [B, manifold_dim]
  u_towers = Σ_i w_i · force_i        # weighted combination

Positive towers push toward structure.
Negative towers push away from collapse.
```

### 7.4 Tesla 3-6-9 Schedule

```
β(t) = β_base + resonance(t)

resonance(t) = 0.1·sin(3πt) + 0.05·sin(6πt) + 0.025·sin(9πt)

Resonant peaks at t = 1/3, 2/3, 1.0
Energy doesn't flow linearly; it oscillates.
```

### 7.5 Schedule Types

| Schedule | Formula |
|----------|---------|
| Constant | `s(t) = start` |
| Linear | `s(t) = start + (end - start) · t` |
| Cosine | `s(t) = end + (start - end) · 0.5(1 + cos(πt))` |
| Sigmoid | `s(t) = start + (end - start) · σ(12(t - 0.5))` |
| Tesla 3-6-9 | `s(t) = linear(t) + resonance(t)` |
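
A sketch of the table as plain functions; the `start`/`end` defaults are illustrative.

```python
import math

def linear(t, start=0.0, end=1.0):
    return start + (end - start) * t

def cosine(t, start=1.0, end=0.0):
    return end + (start - end) * 0.5 * (1 + math.cos(math.pi * t))

def tesla_369(t, start=0.0, end=1.0):
    """Linear ramp plus the §7.4 triadic resonance."""
    resonance = (0.1 * math.sin(3 * math.pi * t)
                 + 0.05 * math.sin(6 * math.pi * t)
                 + 0.025 * math.sin(9 * math.pi * t))
    return linear(t, start, end) + resonance
```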

### 7.6 Intrinsic Tension ρ

```
ρ = σ(gain · (Σ_i w_i · invariant_i - equilibrium))

where:
  invariant_i = geometric invariants (Vol², edge stats, etc.)
  w_i         = learned per-invariant weights
  gain        = steepness of sigmoid response
  equilibrium = learned bias

ρ → 0: Pure spring (geometric constraint dominates)
ρ → 1: Pure control (tower forces dominate)
```

### 7.7 Stability Criterion

```
Eigenvalues of the linearized system:
λ = -β ± √(β² - (1-ρ)ω²)

Overdamped:  β² > (1-ρ)ω²   (stable, no oscillation)
Underdamped: β² < (1-ρ)ω²   (oscillatory)
Critical:    β² = (1-ρ)ω²   (fastest convergence)
```
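
A direct transcription of the criterion as a regime check, handy for logging during fusion.

```python
def damping_regime(beta, rho, omega):
    """Classify the linearized §7.7 system from damping β, tension ρ, and ω."""
    disc = beta ** 2 - (1 - rho) * omega ** 2    # discriminant under the root
    if disc > 0:
        return "overdamped"      # stable, no oscillation
    if disc < 0:
        return "underdamped"     # oscillatory
    return "critical"            # fastest convergence
```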

### 7.8 Energy Tracking

```
E_kinetic   = 0.5 · ||v||²
E_potential = 0.5 · ω² · ||Log_x(x_ref)||²
E_total     = E_kinetic + E_potential

Healthy training: E_total decreases over integration steps.
```

---

## 8. K-Simplex Linear (Near-Zero Params)

Replaces `nn.Linear` with geometric routing through simplex structure.

### 8.1 Architecture

```
Input (B, input_dim)
  → chunk into (B, num_simplices, K+1) groups
  → per-scalar entry into vertex (K+1 options)
  → private hidden projection per vertex (depth = K+1)
  → pairwise signal passages between all vertex pairs
  → attenuation gates on pairwise influence
  → exit: weighted sum of vertex states
Output (B, output_dim)
```

### 8.2 Parameter Count

```
Per simplex (K+1 inputs):
  Entry:     (K+1) × (K+1) × hidden
  Vertex:    (K+1) × hidden
  Pairwise:  C(K+1, 2) × 3 × hidden
  Attenuate: C(K+1, 2) × 2
  Exit:      (K+1) × hidden + (K+1)

For K=4, input_dim=512:
  103 simplices × 300 params = 30,900
  vs nn.Linear: 262,656
  Ratio: 0.118× (11.8% of linear params)
```
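
A sketch of the tally as a function; the per-vertex `hidden` width is an assumption (the catalog quotes ~300 params/simplex for K=4 but does not state `hidden`).

```python
import math

def ksimplex_linear_params(input_dim, hidden, k=4):
    """Per-layer parameter tally for K-Simplex Linear (sketch of §8.2)."""
    kp1 = k + 1
    pairs = math.comb(kp1, 2)                # C(K+1, 2) vertex pairs
    per_simplex = (kp1 * kp1 * hidden        # entry
                   + kp1 * hidden            # vertex
                   + pairs * 3 * hidden      # pairwise passages
                   + pairs * 2               # attenuation gates
                   + kp1 * hidden + kp1)     # exit
    num_simplices = math.ceil(input_dim / kp1)   # 103 for input_dim=512, K=4
    return num_simplices * per_simplex
```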

### 8.3 Structural Comparison

```
Structure size per simplex: (K+1) × (K+1) × C(K+1,2)

K=2: 3×3×3  = 27
K=4: 5×5×10 = 250
K=6: 7×7×21 = 1029
```

### 8.4 Results

```
Fashion-MNIST:
  KSimplex-k4:  85.94% with 8,511 params
  MLP baseline: 89.00% with 101,770 params
  Ratio: 11.5× more parameter-efficient

Epoch 1:  84.28% test (instant useful signal)
Epoch 19: 85.94% test (stable convergence)
```

---

## 9. K-Simplex Deformation Limitations

Critical stability boundaries from extensive geometric explorer experiments.

### 9.1 Stability Zones by Configuration

| Configuration | Differentiation Zone | Collapse Threshold |
|---------------|----------------------|--------------------|
| k=1-4, edim=16 | 0.15 - 0.35 | ~0.50 |
| k=1-4, edim=32 | 0.15 - 0.50 | >2.0 |
| k=1-6, edim=16 | 0.35 - 0.45 | ~0.50 |
| k=1-6, edim=32 | 0.25 - 0.60 | >2.0 |

### 9.2 Embedding Dimension Safety Ratio

```
stability_ratio = edim / k_max

ratio ≥ 8× → Very stable, deform up to 2.0
ratio ≥ 4× → Comfortable margin
ratio ≥ 2× → Tight but functional
ratio < 2× → Dangerous, frequent invalidity
```

### 9.3 Deformation Behavior

```
Low deform (0 - 0.15):
  Clear k-level hierarchy
  Vol² decreases exponentially with k
  Conservative but safe

Medium deform (0.15 - 0.35):   ← OPTIMAL ZONE
  Distinct geometric signatures per k
  Maximum useful differentiation
  Training should target this range

High deform (> 0.5):
  Noise dominates
  k-levels converge (lose meaning)
  Geometric structure destroyed
```

### 9.4 Late-Stage K-Simplex Invalidity

```
As k increases:
  - CM determinant computation becomes numerically unstable
  - More edge configurations become geometrically impossible
  - Deeper layers produce invalid simplex configurations

k=4 in 32D: stable with wide margin
k=5 in 32D: functional but tighter
k=6 in 32D: approaching invalidity ceiling

Recommendation: k=4 (pentachoron) as primary, k≤3 for tight budgets
```

### 9.5 Cross-Entropy Degeneracy Problem

```
Cross-entropy applied directly to simplex features:
  → Vertices converge (minimizing distance to class boundary)
  → Volume → 0 (simplex collapses)
  → α diverges from triadic equilibrium
  → Geometric structure destroyed after sufficient epochs

Solution: Use crystal loss or basin loss, NOT cross-entropy on geometric features.
```

---

## 10. Cross-Contrast Capacity Tests

Validating that geometric structure survives training and provides a meaningful classification signal.

### 10.1 Geometric Cross-Contrastive Loss

```
sim_matrix = (x̂ @ x̂.T) / τ    # [B, B] embedding similarity

cantor_positives = (|C(i) - C(j)| < θ_cantor) AND (|Vol(i) - Vol(j)| < θ_vol)

L_cross = -log(Σ_{j∈positives} exp(sim_ij) / Σ_{j∈all} exp(sim_ij))

where positives are defined by geometric proximity, not class labels
```
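
A sketch with geometry-defined positives; the threshold defaults and function name are illustrative.

```python
import torch
import torch.nn.functional as F

def geometric_cross_contrast(z, cantor_vals, vol_sq, tau=0.07,
                             theta_cantor=0.05, theta_vol=0.05, eps=1e-8):
    """Cross-contrastive loss with geometric positives (sketch of §10.1).

    z: [B, D] embeddings; cantor_vals: [B] C(x) values (§5.1);
    vol_sq: [B] simplex Vol² (§4).
    """
    z_hat = F.normalize(z, dim=-1)
    sim = z_hat @ z_hat.T / tau                              # [B, B]

    close_c = (cantor_vals[:, None] - cantor_vals[None, :]).abs() < theta_cantor
    close_v = (vol_sq[:, None] - vol_sq[None, :]).abs() < theta_vol
    eye = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos = close_c & close_v & ~eye                           # geometric positives

    exp_sim = sim.exp()
    num = (exp_sim * pos).sum(dim=-1)
    den = (exp_sim * ~eye).sum(dim=-1)
    loss = -torch.log(num / den + eps)
    return loss[pos.any(dim=-1)].mean()      # skip rows with no positives
```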

### 10.2 Capacity Invariants to Monitor

```
1. Vol² > 0 for all simplices               (validity)
2. α ∈ [0.44, 0.50]                         (triadic equilibrium)
3. Edge length variance < threshold         (structural uniformity)
4. Cantor prototype separation > margin     (class distinctness)
5. Crystal distance to prototype ~ d_target (geometric alignment)
```

### 10.3 Differential Cross-Contrast (Tower Pairs)

```
For positive/negative tower pairs:
  Δ_force = force_positive - force_negative

L_differential = -log(σ(Δ_force · direction_to_correct_class))
               +  log(σ(Δ_force · direction_to_incorrect_class))

Signed pairs create differential forces, not just different opinions.
```

### 10.4 Cross-Scale Consistency

```
For scales s₁, s₂:
  features_s1 = proj_s1(backbone_features)
  features_s2 = proj_s2(backbone_features)

L_consistency = ||rank_order(sim_s1) - rank_order(sim_s2)||₁

Ensures geometric relationships are preserved across crystal scales.
```

### 10.5 OOD Detection via Geometric Violation

```
In-distribution:     Vol² > 0, α stable, Cantor coherent
Out-of-distribution: violations of the above

OOD_score = (1 - σ(Vol² · 10⁶)) + |α - 0.5| + (1 - compat_max)
```
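
The score as a function; each term grows with one class of violation.

```python
import torch

def ood_score(vol_sq, alpha, compat):
    """Geometric OOD score (sketch of §10.5).

    vol_sq: [B] simplex Vol²; alpha: [B] middle-third fill; compat: [B, C].
    """
    validity = 1 - torch.sigmoid(vol_sq * 1e6)       # ≈1 when Vol² ≤ 0
    drift = (alpha - 0.5).abs()                      # distance from equilibrium
    unfamiliar = 1 - compat.max(dim=-1).values       # no basin claims the sample
    return validity + drift + unfamiliar
```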

### 10.6 Scaling Limitation (Known)

```
Cross-contrastive loss across the full vocabulary:
  O(V²) pairwise comparisons

V=100   (CIFAR-100): 10K pairs  → feasible
V=1000  (ImageNet):  1M pairs   → expensive
V=50000 (tokenizer): 2.5B pairs → infeasible

Solution: Hierarchical contrastive within Cantor branches.
Only contrast within the same coarse branch (routing highways).
Fine branches → local contrast only.
```

---

## Appendix A: Proven Results Summary

| Model | Task | Accuracy | Params | Key Innovation |
|-------|------|----------|--------|----------------|
| David | ImageNet (CLIP bigG) | 86% | ~120K | Multi-scale crystal |
| David | CIFAR-100 | 74.87% | 393K | Crystal prototypes |
| David | CIFAR-100 | ~92% | 78KB | Extreme compression |
| geo-beatrix | CIFAR-100 | 67.69% | n/a | NO attention, NO CE |
| KSimplex Attention | FMNIST | 89.13% | n/a | Geometric attention |
| KSimplex Attention | CIFAR-10 | 84.59% | n/a | Conv stem + geo attn |
| KSimplex Attention | CIFAR-100 | 69.08% | n/a | Multi-layer sharpening |
| KSimplex Linear | FMNIST | 85.94% | 8,511 | 11.5× efficiency |
| KSimplex LLM | Shakespeare | PPL 113 | 54M | 100% geo validity |
| Beeper v5 | Ethics | Coherent | Random | Architecture IS intelligence |

## Appendix B: Formula Dependencies

```
            ┌───────────────┐
            │ Cayley-Menger │ ← structural invariant
            └───────┬───────┘
                    │
       ┌────────────┼────────────┐
       ▼            ▼            ▼
 ┌───────────┐ ┌───────────┐ ┌───────────┐
 │ K-Simplex │ │  Crystal  │ │   Basin   │
 │  Channel  │ │   Loss    │ │  Compat   │
 └─────┬─────┘ └─────┬─────┘ └─────┬─────┘
       │             │             │
       ▼             ▼             ▼
 ┌─────────────────────────────────────┐
 │            Cantor Lens              │
 │  (Staircase + Alignment + Bias)     │
 └─────────────────┬───────────────────┘
                   │
        ┌──────────┼──────────┐
        ▼          ▼          ▼
  ┌──────────┐ ┌────────┐ ┌───────────┐
  │   Topo   │ │  Osc   │ │ KSimplex  │
  │  Ropes   │ │  Fuse  │ │  Linear   │
  └──────────┘ └────────┘ └───────────┘
```

## Appendix C: What Kills Geometry (Known Failure Modes)

1. **Cross-entropy on geometric features** → simplex collapse
2. **Distance on the Cantor set** → meaningless (use alignment)
3. **Deformation > 0.35 at edim/k < 4** → invalidity
4. **k > 4 without edim ≥ 8k** → numerical instability
5. **Uniform Cantor level weights** → hides the 8× routing-significance difference
6. **Resizing crystal anchors across scales** → destroys pentachoron geometry (use separate init per scale)
7. **Dropout scaling with √dim** → inconsistent information flow across scales