---
license: mit
base_model: google/switch-base-8
library_name: transformers
pipeline_tag: summarization
tags:
- bwsk
- combinator-analysis
- moe
- reversible-backprop
- convergence-training
datasets:
- wikitext
metrics:
- perplexity
model-index:
- name: bwsk-switch-base-8
  results:
  - task:
      type: summarization
      name: Fine-tune (Conventional)
    dataset:
      name: wikitext
      type: wikitext
    metrics:
    - name: perplexity
      type: perplexity
      value: 27.7215
      verified: false
  - task:
      type: summarization
      name: Fine-tune (BWSK Analyzed)
    dataset:
      name: wikitext
      type: wikitext
    metrics:
    - name: perplexity
      type: perplexity
      value: 28.6584
      verified: false
  - task:
      type: summarization
      name: Fine-tune (BWSK Reversible)
    dataset:
      name: wikitext
      type: wikitext
    metrics:
    - name: perplexity
      type: perplexity
      value: 27.9624
      verified: false
  - task:
      type: summarization
      name: From Scratch (Conventional)
    dataset:
      name: wikitext
      type: wikitext
    metrics:
    - name: perplexity
      type: perplexity
      value: 290.6109
      verified: false
  - task:
      type: summarization
      name: From Scratch (BWSK Analyzed)
    dataset:
      name: wikitext
      type: wikitext
    metrics:
    - name: perplexity
      type: perplexity
      value: 288.1153
      verified: false
  - task:
      type: summarization
      name: From Scratch (BWSK Reversible)
    dataset:
      name: wikitext
      type: wikitext
    metrics:
    - name: perplexity
      type: perplexity
      value: 299.3535
      verified: false
---

	# BWSK Switch-Base-8

Switch-Base-8 (220M parameters) trained in six variants (3 BWSK modes × 2 experiments: fine-tuning and training from scratch) on WikiText-2. Each variant was trained to convergence with early stopping.

This repository consolidates the model weights, configs, and training results for all six variants.

	## What is BWSK?

BWSK is a framework that uses combinator logic to classify every neural network operation as either S-type (information-preserving, reversible, coordination-free) or K-type (information-erasing, a synchronization point). The classification supports two things: reversible backpropagation through S-phases to save memory, and CALM-based parallelism analysis.
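
To make the S/K distinction concrete, here is a toy illustration in plain Python. It is not the BWSK classifier, and the function names are hypothetical. It only shows what "information-preserving" and "information-erasing" mean for two familiar operations: an invertible scaling, whose input can be recovered exactly, and ReLU, which collapses distinct negative inputs to the same output. Which type BWSK actually assigns to a given layer is decided by the framework itself.

```python
def scale(xs, factor=2.0):
    """Invertible elementwise scaling: no information is lost."""
    return [x * factor for x in xs]


def unscale(ys, factor=2.0):
    """Exact inverse of scale(): the input is recomputed from the output."""
    return [y / factor for y in ys]


def relu(xs):
    """ReLU maps every negative input to 0, so it cannot be inverted."""
    return [max(0.0, x) for x in xs]


a = [-1.5, 0.5, 3.0]
b = [-4.0, 0.5, 3.0]  # differs from `a` only in the negative entry

# Information-preserving: the original input is recovered from the output.
recovered = unscale(scale(a))

# Information-erasing: two different inputs produce the same output.
collide = relu(a) == relu(b)

print(recovered == a, collide)
```

This is the property a reversible backward pass relies on: across information-preserving operations, inputs can in principle be recomputed from outputs instead of being stored, whereas the input to an information-erasing operation has to be kept.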

	## Model Overview

| Property | Value |
|----------|-------|
| Base Model | [google/switch-base-8](https://huggingface.co/google/switch-base-8) |
| Architecture | MoE (seq2seq) |
| Parameters | 220M |
| Dataset | WikiText-2 |
| Eval Metric | Perplexity |

	## S/K Classification

| Type | Ratio |
|------|-------|
| S-type (information-preserving) | 52.6% |
| K-type (information-erasing) | 38.7% |
| Gray (context-dependent) | 8.6% |

	## Fine-tune Results

| Mode | Final Loss | Val Perplexity | Test Perplexity | Peak Memory | Time | Epochs |
|------|------------|----------------|-----------------|-------------|------|--------|
| Conventional | 2.9923 | 29.02 | 27.72 | 15.2 GB | 1.5h | 5 |
| BWSK Analyzed | 3.1352 | 29.99 | 28.66 | 15.2 GB | 1.8h | 4 |
| BWSK Reversible | 3.2770 | 29.24 | 27.96 | 15.2 GB | 2.5h | 5 |

	Memory savings (reversible vs conventional): 0.0%

	## From Scratch Results

| Mode | Final Loss | Val Perplexity | Test Perplexity | Peak Memory | Time | Epochs |
|------|------------|----------------|-----------------|-------------|------|--------|
| Conventional | 5.5342 | 289.26 | 290.61 | 14.2 GB | 1.8h | 5 |
| BWSK Analyzed | 5.2518 | 288.67 | 288.12 | 14.2 GB | 1.8h | 5 |
| BWSK Reversible | 5.0745 | 297.67 | 299.35 | 14.1 GB | 1.8h | 5 |

	Memory savings (reversible vs conventional): 0.5%
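
The two results tables can be compared against the conventional baseline with a few lines of arithmetic. The sketch below copies the rounded test perplexity and peak memory values from the tables. The helper names are illustrative, and because the inputs are rounded, the output can differ slightly from figures computed on the raw measurements.

```python
# (test perplexity, peak memory in GB), copied from the results tables.
RESULTS = {
    "fine-tune": {
        "Conventional": (27.72, 15.2),
        "BWSK Analyzed": (28.66, 15.2),
        "BWSK Reversible": (27.96, 15.2),
    },
    "from-scratch": {
        "Conventional": (290.61, 14.2),
        "BWSK Analyzed": (288.12, 14.2),
        "BWSK Reversible": (299.35, 14.1),
    },
}


def compare_to_baseline(experiment):
    """Return {mode: (perplexity change %, memory savings %)} vs Conventional."""
    base_ppl, base_mem = RESULTS[experiment]["Conventional"]
    rows = {}
    for mode, (ppl, mem) in RESULTS[experiment].items():
        if mode == "Conventional":
            continue
        ppl_change = 100.0 * (ppl - base_ppl) / base_ppl  # positive = worse
        mem_saving = 100.0 * (base_mem - mem) / base_mem  # positive = less memory
        rows[mode] = (round(ppl_change, 2), round(mem_saving, 1))
    return rows


for experiment in RESULTS:
    for mode, (ppl_change, mem_saving) in compare_to_baseline(experiment).items():
        print(f"{experiment:12s} {mode:16s} ppl {ppl_change:+.2f}%  mem saved {mem_saving:.1f}%")
```

On the rounded values, every BWSK variant lands within roughly 3.5% of the conventional test perplexity in either direction. For the from-scratch reversible run, the rounded memory figures give a saving of about 0.7%, close to the 0.5% reported above, which was presumably computed from unrounded measurements.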

	## Repository Structure

```
├── README.md
├── results.json
├── finetune-conventional/
│   ├── model.safetensors
│   ├── config.json
│   └── training_results.json
├── finetune-bwsk-analyzed/
│   ├── model.safetensors
│   ├── config.json
│   └── training_results.json
├── finetune-bwsk-reversible/
│   ├── model.safetensors
│   ├── config.json
│   └── training_results.json
├── scratch-conventional/
│   ├── model.safetensors
│   ├── config.json
│   └── training_results.json
├── scratch-bwsk-analyzed/
│   ├── model.safetensors
│   ├── config.json
│   └── training_results.json
└── scratch-bwsk-reversible/
    ├── model.safetensors
    ├── config.json
    └── training_results.json
```
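
Because the subfolder names follow an `<experiment>-<mode>` pattern, they can be generated rather than typed out. The sketch below lists the six subfolders and fetches one variant's `training_results.json` with `hf_hub_download` from `huggingface_hub`. The helper names are hypothetical, and the JSON schema is not documented here, so the file is returned as parsed JSON without assuming any particular keys.

```python
import json
from itertools import product

REPO_ID = "tzervas/bwsk-switch-base-8"
EXPERIMENTS = ("finetune", "scratch")
MODES = ("conventional", "bwsk-analyzed", "bwsk-reversible")


def variant_subfolders():
    """Return the six variant subfolder names shown in the tree above."""
    return [f"{experiment}-{mode}" for experiment, mode in product(EXPERIMENTS, MODES)]


def load_training_results(subfolder, repo_id=REPO_ID):
    """Download and parse one variant's training_results.json (needs network)."""
    # Imported lazily so variant_subfolders() works without huggingface_hub.
    from huggingface_hub import hf_hub_download

    path = hf_hub_download(
        repo_id=repo_id,
        filename="training_results.json",
        subfolder=subfolder,
    )
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


for name in variant_subfolders():
    print(name)
```

Calling `load_training_results("finetune-conventional")` downloads only that one JSON file, which avoids pulling the much larger `model.safetensors` files when only the metrics are needed.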

	## Usage

	Load a specific variant:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load fine-tuned conventional variant
model = AutoModelForSeq2SeqLM.from_pretrained(
    "tzervas/bwsk-switch-base-8", subfolder="finetune-conventional"
)
tokenizer = AutoTokenizer.from_pretrained(
    "tzervas/bwsk-switch-base-8", subfolder="finetune-conventional"
)

# Load from-scratch BWSK reversible variant (replaces the model loaded above)
model = AutoModelForSeq2SeqLM.from_pretrained(
    "tzervas/bwsk-switch-base-8", subfolder="scratch-bwsk-reversible"
)
```
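
Once a variant is loaded, it can be run like any other `transformers` seq2seq model. The sketch below wraps loading in a hypothetical `load_variant` helper and generates from a sentinel-token prompt in the base model's span-corruption format. Two things here are assumptions: the tokenizer falls back to `google/switch-base-8` if the subfolder has no tokenizer files (the repository tree lists only weights, config, and results), and it is not established that sentinel prompts suit these WikiText-2 variants.

```python
REPO_ID = "tzervas/bwsk-switch-base-8"
BASE_MODEL = "google/switch-base-8"


def load_variant(subfolder, repo_id=REPO_ID, base_model=BASE_MODEL):
    """Load one variant and a matching tokenizer (needs network access)."""
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    model = AutoModelForSeq2SeqLM.from_pretrained(repo_id, subfolder=subfolder)
    try:
        tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder=subfolder)
    except OSError:
        # Assumption: the variant reuses the base model's vocabulary.
        tokenizer = AutoTokenizer.from_pretrained(base_model)
    return model.eval(), tokenizer


def fill_spans(model, tokenizer, text, max_new_tokens=20):
    """Generate a continuation for a prompt containing <extra_id_N> sentinels."""
    import torch

    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=False)


# Example usage (downloads the weights on first call):
# model, tokenizer = load_variant("finetune-conventional")
# print(fill_spans(model, tokenizer, "The <extra_id_0> walks in <extra_id_1> park."))
```

The imports sit inside the functions so the module can be imported on a machine without `torch` installed. Moving them to the top of the file is equally valid in a normal training environment.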

	## Training Configuration

| Setting | Value |
|---------|-------|
| Optimizer | AdamW |
| LR (fine-tune) | 3e-05 |
| LR (from-scratch) | 2e-04 |
| LR Schedule | Cosine with warmup |
| Max Grad Norm | 1.0 |
| Mixed Precision | AMP (float16) |
| Early Stopping | Patience 3 |
| Batch Size | 1 |
| Sequence Length | 256 |
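
Two of the settings above, the cosine schedule with warmup and early stopping with patience 3, can be sketched in plain Python. This is an illustration, not the training code from the repository. The warmup length, the total step count, decay to exactly zero, and the choice of validation perplexity as the monitored metric are all assumptions. Only the peak learning rate and the patience come from the table.

```python
import math


def cosine_with_warmup(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup from 0 to peak_lr, then cosine decay towards 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


class EarlyStopping:
    """Signal a stop after `patience` consecutive evaluations without improvement."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = math.inf
        self.bad_evals = 0

    def should_stop(self, val_metric):
        if val_metric < self.best:  # lower is better (e.g. perplexity)
            self.best = val_metric
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


# Hypothetical numbers: 1000 steps with 100 warmup steps at the fine-tune LR.
for step in (0, 50, 100, 550, 1000):
    print(step, cosine_with_warmup(step, 1000, 100, 3e-05))

# Hypothetical validation perplexities, one per epoch.
stopper = EarlyStopping(patience=3)
decisions = [stopper.should_stop(v) for v in (29.0, 29.5, 29.4, 29.6)]
print(decisions)
```

With patience 3, a run stops only after three evaluations in a row fail to beat the best value seen so far, so a single noisy epoch does not end training.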

	## Links

	- [GitHub Repository](https://github.com/tzervas/ai-s-combinator)
	- [Whitepaper](https://github.com/tzervas/ai-s-combinator/blob/main/docs/WHITEPAPER.md)
	- [Full Training Report](https://github.com/tzervas/ai-s-combinator/blob/main/docs/FULL_TRAINING_REPORT.md)

	## Citation

```bibtex
@software{zervas2026bwsk,
  author = {Zervas, Tyler},
  title = {BWSK: Combinator-Typed Neural Network Analysis},
  year = {2026},
  url = {https://github.com/tzervas/ai-s-combinator},
}
```

	## License

	MIT