---
license: mit
base_model: google/switch-base-8
library_name: transformers
pipeline_tag: summarization
tags:
- bwsk
- combinator-analysis
- moe
- reversible-backprop
- convergence-training
datasets:
- wikitext
metrics:
- perplexity
model-index:
- name: bwsk-switch-base-8
  results:
  - task:
      type: summarization
      name: Fine-tune (Conventional)
    dataset:
      name: wikitext
      type: wikitext
    metrics:
    - name: perplexity
      type: perplexity
      value: 27.7215
      verified: false
  - task:
      type: summarization
      name: Fine-tune (BWSK Analyzed)
    dataset:
      name: wikitext
      type: wikitext
    metrics:
    - name: perplexity
      type: perplexity
      value: 28.6584
      verified: false
  - task:
      type: summarization
      name: Fine-tune (BWSK Reversible)
    dataset:
      name: wikitext
      type: wikitext
    metrics:
    - name: perplexity
      type: perplexity
      value: 27.9624
      verified: false
  - task:
      type: summarization
      name: From Scratch (Conventional)
    dataset:
      name: wikitext
      type: wikitext
    metrics:
    - name: perplexity
      type: perplexity
      value: 290.6109
      verified: false
  - task:
      type: summarization
      name: From Scratch (BWSK Analyzed)
    dataset:
      name: wikitext
      type: wikitext
    metrics:
    - name: perplexity
      type: perplexity
      value: 288.1153
      verified: false
  - task:
      type: summarization
      name: From Scratch (BWSK Reversible)
    dataset:
      name: wikitext
      type: wikitext
    metrics:
    - name: perplexity
      type: perplexity
      value: 299.3535
      verified: false
---

# BWSK Switch-Base-8

**Switch-Base-8** (220M params) trained in **6 variants** (3 training modes × 2 experiments: fine-tuning and from scratch) on WikiText-2. Each variant was trained to convergence with early stopping.

This repo contains all model weights, configs, and training results in a single consolidated repository.

## What is BWSK?

BWSK is a framework that uses combinator logic to classify every neural network operation as **S-type** (information-preserving, reversible, coordination-free) or **K-type** (information-erasing, synchronization point). This classification enables two things: reversible backpropagation through S-phases, which is intended to save memory, and CALM-based parallelism analysis.

## Model Overview

| Property | Value |
|----------|-------|
| **Base Model** | [google/switch-base-8](https://huggingface.co/google/switch-base-8) |
| **Architecture** | MoE (seq2seq) |
| **Parameters** | 220M |
| **Dataset** | WikiText-2 |
| **Eval Metric** | Perplexity |

## S/K Classification

| Type | Ratio |
|------|-------|
| **S-type** (information-preserving) | 52.6% |
| **K-type** (information-erasing) | 38.7% |
| **Gray** (context-dependent) | 8.6% |

## Fine-tune Results

| Mode | Final Loss | Val Perplexity | Test Perplexity | Peak Memory | Time | Epochs |
|------|------------|----------------|-----------------|-------------|------|--------|
| Conventional | 2.9923 | 29.02 | 27.72 | 15.2 GB | 1.5h | 5 |
| BWSK Analyzed | 3.1352 | 29.99 | 28.66 | 15.2 GB | 1.8h | 4 |
| BWSK Reversible | 3.2770 | 29.24 | 27.96 | 15.2 GB | 2.5h | 5 |

**Memory savings (reversible vs conventional):** 0.0%

## From Scratch Results

| Mode | Final Loss | Val Perplexity | Test Perplexity | Peak Memory | Time | Epochs |
|------|------------|----------------|-----------------|-------------|------|--------|
| Conventional | 5.5342 | 289.26 | 290.61 | 14.2 GB | 1.8h | 5 |
| BWSK Analyzed | 5.2518 | 288.67 | 288.12 | 14.2 GB | 1.8h | 5 |
| BWSK Reversible | 5.0745 | 297.67 | 299.35 | 14.1 GB | 1.8h | 5 |

**Memory savings (reversible vs conventional):** 0.5%

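
To make the two results tables easier to compare, the sketch below recomputes each BWSK mode's test perplexity change and peak-memory saving relative to the Conventional run. The numbers are copied from the tables; the dictionary layout and the function names (`relative_change`, `memory_saved`, `compare_to_conventional`) are illustrative assumptions, not part of the BWSK codebase. Because the tables show rounded values, the memory saving computed here for the from-scratch reversible run comes out near 0.7% rather than the reported 0.5%, which presumably was derived from unrounded measurements.

```python
# Test perplexity and peak memory (GB) copied from the two results tables.
RESULTS = {
    "fine-tune": {
        "Conventional": {"test_ppl": 27.72, "peak_gb": 15.2},
        "BWSK Analyzed": {"test_ppl": 28.66, "peak_gb": 15.2},
        "BWSK Reversible": {"test_ppl": 27.96, "peak_gb": 15.2},
    },
    "from-scratch": {
        "Conventional": {"test_ppl": 290.61, "peak_gb": 14.2},
        "BWSK Analyzed": {"test_ppl": 288.12, "peak_gb": 14.2},
        "BWSK Reversible": {"test_ppl": 299.35, "peak_gb": 14.1},
    },
}


def relative_change(value, baseline):
    """Percent change of value relative to baseline (negative means lower)."""
    return 100.0 * (value - baseline) / baseline


def memory_saved(value, baseline):
    """Percent of baseline peak memory saved (positive means less memory used)."""
    return 100.0 * (baseline - value) / baseline


def compare_to_conventional(results):
    """Compare every non-Conventional mode against Conventional, per experiment."""
    summary = {}
    for experiment, modes in results.items():
        base = modes["Conventional"]
        summary[experiment] = {
            mode: {
                "ppl_change_pct": relative_change(row["test_ppl"], base["test_ppl"]),
                "memory_saved_pct": memory_saved(row["peak_gb"], base["peak_gb"]),
            }
            for mode, row in modes.items()
            if mode != "Conventional"
        }
    return summary


if __name__ == "__main__":
    for experiment, modes in compare_to_conventional(RESULTS).items():
        for mode, stats in modes.items():
            print(
                f"{experiment:12s} {mode:16s} "
                f"test ppl {stats['ppl_change_pct']:+.2f}%  "
                f"memory saved {stats['memory_saved_pct']:.1f}%"
            )
```

Lower perplexity is better, so a negative `ppl_change_pct` means the mode beat the Conventional baseline on the test set.
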
## Repository Structure

```
├── README.md
├── results.json
├── finetune-conventional/
│   ├── model.safetensors
│   ├── config.json
│   └── training_results.json
├── finetune-bwsk-analyzed/
│   ├── model.safetensors
│   ├── config.json
│   └── training_results.json
├── finetune-bwsk-reversible/
│   ├── model.safetensors
│   ├── config.json
│   └── training_results.json
├── scratch-conventional/
│   ├── model.safetensors
│   ├── config.json
│   └── training_results.json
├── scratch-bwsk-analyzed/
│   ├── model.safetensors
│   ├── config.json
│   └── training_results.json
└── scratch-bwsk-reversible/
    ├── model.safetensors
    ├── config.json
    └── training_results.json
```

## Usage

Each variant lives in its own subfolder, so pass `subfolder` to load a specific one:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

REPO_ID = "tzervas/bwsk-switch-base-8"

# Load the fine-tuned conventional variant and its tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained(
    REPO_ID, subfolder="finetune-conventional"
)
tokenizer = AutoTokenizer.from_pretrained(
    REPO_ID, subfolder="finetune-conventional"
)

# Load the from-scratch BWSK reversible variant under a separate name
# so it does not overwrite the model loaded above
scratch_model = AutoModelForSeq2SeqLM.from_pretrained(
    REPO_ID, subfolder="scratch-bwsk-reversible"
)
```

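
When switching between the six variants, it can help to build the subfolder name from the experiment and mode instead of typing it by hand. The following is a minimal sketch under that assumption: `variant_subfolder`, `all_subfolders`, and `load_variant` are hypothetical helper names, and the subfolder names come from the repository structure listed in this card. The `transformers` import is deferred into `load_variant` so the name helpers work without the library installed.

```python
REPO_ID = "tzervas/bwsk-switch-base-8"

# Experiment and mode names as they appear in the repository's subfolders.
EXPERIMENTS = ("finetune", "scratch")
MODES = ("conventional", "bwsk-analyzed", "bwsk-reversible")


def variant_subfolder(experiment, mode):
    """Return the subfolder name for one variant, e.g. 'finetune-conventional'."""
    if experiment not in EXPERIMENTS:
        raise ValueError(f"unknown experiment {experiment!r}; expected one of {EXPERIMENTS}")
    if mode not in MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {MODES}")
    return f"{experiment}-{mode}"


def all_subfolders():
    """List the subfolder names of all six variants."""
    return [variant_subfolder(e, m) for e in EXPERIMENTS for m in MODES]


def load_variant(experiment, mode):
    """Load the model and tokenizer for one variant (downloads from the Hub)."""
    # Imported here so the helpers above do not require transformers.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    subfolder = variant_subfolder(experiment, mode)
    model = AutoModelForSeq2SeqLM.from_pretrained(REPO_ID, subfolder=subfolder)
    tokenizer = AutoTokenizer.from_pretrained(REPO_ID, subfolder=subfolder)
    return model, tokenizer


if __name__ == "__main__":
    for name in all_subfolders():
        print(name)
```

Calling `load_variant("scratch", "bwsk-reversible")` would then fetch the same weights as the explicit `subfolder="scratch-bwsk-reversible"` call.
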
## Training Configuration

| Setting | Value |
|---------|-------|
| **Optimizer** | AdamW |
| **LR (fine-tune)** | 3e-05 |
| **LR (from-scratch)** | 2e-04 |
| **LR Schedule** | Cosine with warmup |
| **Max Grad Norm** | 1.0 |
| **Mixed Precision** | AMP (float16) |
| **Early Stopping** | Patience 3 |
| **Batch Size** | 1 |
| **Sequence Length** | 256 |

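
Two of the settings above, the cosine schedule with warmup and early stopping with patience 3, can be illustrated in plain Python. This is a sketch under stated assumptions rather than the training code used for these models: the warmup is assumed to be linear, the warmup and total step counts in the example are made up, and early stopping is assumed to monitor a validation metric where lower is better (such as perplexity). `cosine_lr_with_warmup` and `EarlyStopping` are hypothetical names.

```python
import math


def cosine_lr_with_warmup(step, base_lr, warmup_steps, total_steps):
    """Learning rate at `step`: linear warmup to base_lr, then cosine decay to 0.

    Assumption: linear warmup; the card only says "Cosine with warmup".
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # hold at 0 if training runs past total_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


class EarlyStopping:
    """Stop once the monitored metric has not improved for `patience` epochs."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = math.inf
        self.bad_epochs = 0

    def should_stop(self, val_metric):
        """Record one epoch's validation metric (lower is better)."""
        if val_metric < self.best:
            self.best = val_metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


if __name__ == "__main__":
    # Fine-tune learning rate from the table; step counts are illustrative only.
    for step in (0, 50, 100, 550, 1000):
        lr = cosine_lr_with_warmup(step, base_lr=3e-5, warmup_steps=100, total_steps=1000)
        print(f"step {step:4d}: lr = {lr:.2e}")
```

In a training loop, `should_stop` would be called once per epoch with the validation perplexity, and training would end the first time it returns `True`.
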
## Links

- [GitHub Repository](https://github.com/tzervas/ai-s-combinator)
- [Whitepaper](https://github.com/tzervas/ai-s-combinator/blob/main/docs/WHITEPAPER.md)
- [Full Training Report](https://github.com/tzervas/ai-s-combinator/blob/main/docs/FULL_TRAINING_REPORT.md)

## Citation

```bibtex
@software{zervas2026bwsk,
  author = {Zervas, Tyler},
  title = {BWSK: Combinator-Typed Neural Network Analysis},
  year = {2026},
  url = {https://github.com/tzervas/ai-s-combinator},
}
```

## License

MIT